Fault Tolerance at 400,000 Kilometers: The Systems Engineering Behind Artemis II's Computer
Source: hackernews
When people hear “fault-tolerant computer” in the context of a crewed spacecraft, the mental model tends toward simple redundancy: keep a spare, switch to it if the primary fails. The actual engineering behind Artemis II’s flight computers is considerably more interesting than that, and the reasons why reveal something fundamental about what it takes to build computing systems that have to be right the first time.
Artemis II is NASA’s first crewed Artemis mission, a lunar flyby that carries the crew several thousand kilometers beyond the Moon’s far side before returning to Earth. That trajectory matters enormously for the computing architecture, because it puts the spacecraft well outside the Van Allen radiation belts for an extended period. The inner Van Allen belt sits between about 1,000 and 12,000 kilometers altitude; the outer extends to around 60,000 kilometers. Beyond that, Earth’s magnetosphere provides only partial shielding against galactic cosmic rays and solar energetic particles. The radiation flux that Artemis II’s computers will experience is qualitatively different from what you’d encounter on the International Space Station in low Earth orbit (LEO).
What Radiation Actually Does to Computers
The failure modes that radiation causes in semiconductor devices are specific and worth understanding concretely. A Single Event Upset (SEU) occurs when a high-energy particle passes through a memory cell and deposits enough charge to flip a stored bit. SEUs are soft errors: the hardware isn’t permanently damaged, but the data is silently wrong. A Single Event Latchup (SEL) is more serious, causing a parasitic transistor path to activate and create a short circuit that can destroy the device if current isn’t cut quickly. Total Ionizing Dose (TID) effects accumulate over time, gradually degrading transistor characteristics until the device fails.
In LEO, Earth’s magnetic field deflects a significant fraction of incoming particles. Beyond the magnetosphere, galactic cosmic ray flux increases substantially, and solar particle events, which can spike radiation levels dramatically over short timescales, are unshielded. The SEU rate a processor experiences in deep space can be orders of magnitude higher than in LEO.
The response to this isn’t just shielding. Aluminum shielding has diminishing returns and adds mass; at some energies, secondary radiation from shielding material is worse than the primary particles. The real answer is radiation-hardened silicon and architectural redundancy.
BAE Systems’ RAD5545 processor, used in modern space avionics including Orion, is built on a 45nm Silicon-on-Insulator process. The SOI substrate physically separates transistors in ways that make latchup nearly impossible and reduce SEU cross-sections. Each core runs at 333 MHz with roughly 2,400 DMIPS of throughput. The processor includes hardware error correction on internal caches and supports the PowerPC architecture, which has a long heritage in space applications. Its predecessor, the RAD750 (derived from the PowerPC 750 and running at around 200 MHz for about 266 MIPS), flew on Mars Curiosity, Mars Reconnaissance Orbiter, and dozens of other missions. The RAD5545 represents a significant performance jump while maintaining the rad-hard design philosophy.
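To make “hardware error correction” concrete, here is a toy Hamming(7,4) code in Python that corrects any single flipped bit in a 4-bit value. It is an illustrative sketch only: the scheme actually used in the RAD5545 caches is not described here, real memories usually protect much wider words, and the function names are hypothetical.

```python
def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit codeword (bit i holds position i+1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    # Codeword positions 1..7 hold: p1, p2, d0, p4, d1, d2, d3
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))


def hamming74_decode(word: int) -> tuple[int, int]:
    """Return (corrected nibble, syndrome). Syndrome 0 means no error seen."""
    bits = [(word >> i) & 1 for i in range(7)]
    # XOR of the positions of all set bits is 0 for a valid codeword,
    # and equals the flipped position after a single-bit upset.
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos - 1]:
            syndrome ^= pos
    if syndrome:
        bits[syndrome - 1] ^= 1  # repair the upset bit
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, syndrome


stored = hamming74_encode(0b1011)
upset = stored ^ (1 << 4)  # simulate an SEU flipping position 5
value, position = hamming74_decode(upset)
print(value == 0b1011, position)
```

A code like this corrects one upset per word. Two upsets in the same word would be miscorrected, which is why practical memory ECC adds an extra parity bit to detect double errors.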
Six Decades of Space Fault Tolerance
Understanding the Artemis II architecture is easier with historical context, because each generation of space computers has added a layer to what fault tolerance means.
The Apollo Guidance Computer, running at 2.048 MHz with 4 kilobytes of erasable core memory and 72 kilobytes of fixed rope memory, achieved its reliability through simplicity and software design rather than hardware redundancy. Margaret Hamilton’s team at MIT built an operating system with priority-based task scheduling that could shed lower-priority work under load. When the 1201 and 1202 program alarms fired during Apollo 11’s landing, the AGC was doing exactly what it was designed to do: recognizing executive overload and protecting the tasks that mattered. The hardware itself wasn’t redundant in the modern sense; the reliability came from thorough verification and a software architecture that degraded gracefully.
The Space Shuttle moved to explicit hardware voting. Four IBM AP-101S computers ran identical flight software in lockstep, synchronized to execute the same instructions at the same time. A fifth computer ran entirely separate software, written by a different team, as a Backup Flight System. The four primary computers compared their outputs continuously; if one disagreed with the other three, it was voted out and isolated. This scheme could tolerate a single failure in the primary set with continued normal operation, and could fall back to the BFS on a second failure. The AP-101S ran at 25 MHz and produced about 1.2 MFLOPS per processor. The flight software was written in HAL/S, a language designed specifically for real-time aerospace applications.
The key insight the Shuttle architecture established was that fault tolerance requires not just redundant hardware but synchronized execution and deterministic comparison. Redundant computers running the same algorithm will produce the same output only if they’re executing identically, which requires careful clock synchronization and deterministic scheduling.
The Consensus Problem in Spacecraft
Modern spacecraft avionics, including Orion’s, use Triple Modular Redundancy (TMR) as the base architecture: three computers run the same computation, and a voter examines their outputs and selects the majority result. One faulty computer gets outvoted; the system continues correctly. This sounds straightforward until you think about the voter itself. If the voter is a single point of failure, you’ve solved nothing. The voter must also be redundant or must be implemented in a way that its failure is detectable.
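A majority voter over three copies of a word is small enough to write as a single bitwise expression, which is roughly how it would look in logic gates. The Python below is a generic sketch of that textbook voter, not a description of Orion’s implementation; `tmr_vote` is a hypothetical name.

```python
def tmr_vote(a: int, b: int, c: int) -> tuple[int, int]:
    """Bitwise 2-of-3 majority vote.

    Returns (voted word, mask of bit positions where the copies disagreed).
    """
    voted = (a & b) | (a & c) | (b & c)
    disagree = (a ^ b) | (a ^ c)  # nonzero wherever the three are not identical
    return voted, disagree


good = 0b1010
voted, mask = tmr_vote(good, good, good ^ 0b1000)  # third copy has one bit flipped
print(bin(voted), bin(mask))
```

The disagreement mask matters as much as the voted value: it is the signal that lets the rest of the system notice a channel is drifting before a second fault arrives.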
There’s a deeper problem here that anyone familiar with distributed systems will recognize: the Byzantine Generals problem. A faulty processor might not simply produce wrong outputs; it might produce inconsistent outputs, telling different voters different things. A classical TMR voter with three inputs and majority logic handles random bit errors well but can be fooled by a processor that behaves inconsistently. Byzantine fault tolerance, which requires 3f+1 components to tolerate f Byzantine failures, is harder to implement in hardware but necessary for a system that needs to handle not just random failures but potentially erratic or oscillating fault behavior.
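The 3f+1 bound and the “telling different voters different things” failure can be illustrated with a small Python sketch. It models only one case of the classic oral-messages exchange: a possibly faulty source and three honest receivers that relay what they heard before deciding. The names `min_units`, `om1_faulty_source`, and the `DEFAULT` fallback value are assumptions made for illustration.

```python
from collections import Counter

DEFAULT = 0  # agreed fallback when no strict majority exists (assumed value)


def min_units(f: int) -> int:
    """Minimum number of units needed to tolerate f Byzantine faults."""
    return 3 * f + 1


def majority(values: list[int]) -> int:
    value, count = Counter(values).most_common(1)[0]
    return value if count * 2 > len(values) else DEFAULT


def om1_faulty_source(sent: list[int]) -> list[int]:
    """sent[i] is what the source told receiver i; receivers are honest.

    Each receiver combines its own copy with the copies relayed by the
    others, so every receiver votes over the same set of values.
    """
    decisions = []
    for i in range(len(sent)):
        relayed = [sent[j] for j in range(len(sent)) if j != i]
        decisions.append(majority([sent[i]] + relayed))
    return decisions


# The source tells receivers 1, 2, 1. Without the exchange they would act
# on different values; with it, all three decide on the same value.
print(min_units(1), om1_faulty_source([1, 2, 1]))
```

The exchange round does not recover the “right” value from a lying source; it guarantees that the honest units all make the same decision, which is what keeps redundant channels from diverging.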
The synchronization problem is similarly non-trivial. For three processors to produce comparable outputs at the same time, they need to be executing the same instruction at the same clock cycle, or at least close enough that their outputs can be meaningfully compared. Clock drift, instruction pipeline differences, and interrupt timing can all cause divergence that looks like a fault when it isn’t. Spacecraft avionics teams spend significant effort on the synchronization protocol, and the protocol itself must be fault-tolerant.
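One way to picture “close enough to compare” is a skew window around the median arrival time: outputs inside the window are compared by value, and anything outside it is treated as a timing problem rather than a wrong answer. The Python sketch below assumes microsecond timestamps and a 10-microsecond window purely for illustration; neither figure, nor the function name, comes from a real avionics specification.

```python
import statistics


def split_by_skew(
    arrivals_us: dict[str, int], max_skew_us: int
) -> tuple[list[str], list[str]]:
    """Split channels into (inside window, outside window).

    The window is centered on the median arrival time, so one badly
    skewed channel cannot drag the reference point with it.
    """
    median = statistics.median(arrivals_us.values())
    inside = sorted(
        ch for ch, t in arrivals_us.items() if abs(t - median) <= max_skew_us
    )
    outside = sorted(ch for ch in arrivals_us if ch not in inside)
    return inside, outside


# Channel c arrives 86 microseconds after the median and is set aside.
inside, outside = split_by_skew({"a": 1000, "b": 1004, "c": 1090}, max_skew_us=10)
print(inside, outside)
```

Using the median rather than the mean is the design choice that matters here: a single runaway clock shifts the mean but not the median of three.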
Software as the Other Half
Hardware redundancy handles component failures, but software failures are a different category. A bug in the flight software runs on all three redundant computers simultaneously and produces the wrong answer on all three. TMR doesn’t help if the fault is in the algorithm rather than the hardware executing it.
The software response to this is formal verification, exhaustive testing, and strict coding standards. Flight software for crewed spacecraft is developed under processes analogous to DO-178C (the aviation software standard), with every requirement traced to code and every branch covered by tests. The RTOS is typically a proven, certifiable real-time OS; VxWorks has flown on Curiosity, Opportunity, the Phoenix lander, and numerous other missions, and its deterministic scheduling behavior is well-characterized.
Fault Detection, Isolation, and Recovery (FDIR) is implemented as a software layer that monitors system health, identifies anomalies, and executes predefined recovery procedures. The hierarchy typically runs from the lowest level, individual device health monitoring, up through subsystem management to spacecraft-level responses. FDIR logic is itself safety-critical software and undergoes the same verification scrutiny as the flight algorithms it monitors.
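The escalation ladder described above can be sketched as an ordered list of handlers, each of which either recovers the fault or passes it up. Everything in this Python example is hypothetical: the fault names, the three level names, and the final fallback message are placeholders, not Orion’s FDIR design.

```python
from typing import Callable

Handler = Callable[[str], bool]  # returns True if this level recovered the fault


def handles(*faults: str) -> Handler:
    """Build a handler that recovers only the listed fault types."""
    known = set(faults)
    return lambda fault: fault in known


# Lowest level first; each higher level has broader authority.
LEVELS: list[tuple[str, Handler]] = [
    ("device", handles("sensor_glitch")),
    ("subsystem", handles("sensor_glitch", "string_failure")),
    ("spacecraft", handles("sensor_glitch", "string_failure", "bus_failure")),
]


def run_fdir(fault: str, levels: list[tuple[str, Handler]] = LEVELS) -> list[str]:
    """Try each level in order and return a log of what happened."""
    log = []
    for name, handler in levels:
        if handler(fault):
            log.append(f"{name}: recovered")
            return log
        log.append(f"{name}: escalate")
    log.append("all levels exhausted")
    return log


print(run_fdir("string_failure"))
```

Keeping the ladder as plain data makes the recovery order itself something that can be reviewed and tested, which fits the point that FDIR logic needs the same scrutiny as the code it monitors.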
The autonomy requirement sharpens all of this. The round-trip light time to the Moon is about 2.6 seconds; to the farthest point of Artemis II’s trajectory, slightly longer. A failure that requires human intervention can wait that long for guidance from mission control in most scenarios. But many failures, particularly in propulsion or life support, require responses in milliseconds or seconds. The computers must detect, diagnose, and respond to critical failures faster than any human in the loop could react, regardless of communication delay. This means the fault response logic must be comprehensive, pre-planned, and thoroughly tested before launch.
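The light-time figure is simple arithmetic, and working it through shows how many control cycles pass before the ground could even see a problem and reply. The Python below uses the mean Earth-Moon distance and the round 400,000 km from the title; the 10-millisecond control period is an assumed value for illustration only.

```python
C_KM_PER_S = 299_792.458       # speed of light in vacuum
MEAN_MOON_KM = 384_400         # mean Earth-Moon distance
TITLE_DISTANCE_KM = 400_000    # round figure from the article title
CONTROL_PERIOD_S = 0.010       # assumed control loop period (hypothetical)


def round_trip_s(distance_km: float) -> float:
    """Round-trip light time in seconds for a given one-way distance."""
    return 2 * distance_km / C_KM_PER_S


def cycles_elapsed(distance_km: float, period_s: float = CONTROL_PERIOD_S) -> int:
    """Whole control cycles that run during one round trip."""
    return int(round_trip_s(distance_km) / period_s)


print(f"{round_trip_s(MEAN_MOON_KM):.2f} s at mean lunar distance")
print(f"{round_trip_s(TITLE_DISTANCE_KM):.2f} s at 400,000 km")
print(cycles_elapsed(MEAN_MOON_KM), "control cycles per round trip")
```

Under these assumptions, a few hundred control cycles run during a single round trip, before counting any human reaction time on the ground.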
What This Reveals About Hard Systems Engineering
The ACM article on Artemis II’s computer is worth reading for the specific implementation decisions NASA and its contractors made. What the broader picture shows is that fault-tolerant computing for deep space isn’t a single technology choice. It’s a stack of decisions: radiation-hardened silicon at the bottom, synchronization protocols above that, voting architecture on top of that, FDIR software above that, and formal verification processes throughout.
Each layer compensates for the limitations of the one below it. Hardware hardening reduces SEU rates but doesn’t eliminate them. TMR tolerates the SEUs that get through but doesn’t handle software bugs. FDIR catches anomalous behavior that static testing missed but depends on the FDIR logic itself being correct. No single layer is sufficient; the system is the combination.
From a software engineering perspective, the most striking constraint is determinism. Modern software development treats non-determinism as a nuisance at worst; race conditions are bugs to fix. In spacecraft flight software, non-determinism is an existential threat to the entire fault tolerance architecture, because a voter that can’t predict when outputs will arrive can’t determine whether a discrepancy is a fault or just a scheduling artifact. Every millisecond of jitter in the execution schedule has to be accounted for.
Building systems that have to work correctly without any possibility of a patch, operating in an environment that will actively try to corrupt their state, making decisions faster than humans can supervise, represents a genuinely different engineering discipline from most software development. The choices made for Artemis II’s computer are the accumulated result of six decades of learning what that discipline actually requires.