Redundancy All the Way Down: Inside Artemis II's Fault-Tolerant Computer

Source: hackernews

When engineers at NASA and their contractors sit down to design a flight computer for a crewed lunar mission, the requirements document looks nothing like what a product team at a cloud company faces. There is no gradual rollout. There is no hotfix pipeline. There is no on-call rotation that can reach the hardware. There are four people 240,000 miles from Earth, and the computer either works or it does not.

The Communications of the ACM article on Artemis II’s fault-tolerant computer has been circulating widely, and it’s worth going deeper than the summary. The engineering choices NASA makes here are not arbitrary conservatism. Each one is a response to a specific class of failure that has either killed people or come close to it.

The Threat Model Is Literally the Universe

Space hardware fails for reasons that don’t exist in data centers. The most common and insidious is the single-event upset (SEU): a high-energy particle, typically from galactic cosmic rays or solar energetic particles, passes through a transistor and flips a stored bit. On the ground, memory errors are rare enough that ECC is a sufficient mitigation. In deep space, particularly beyond the Van Allen belts where Artemis II will fly, particle flux is orders of magnitude higher and the particles carry far more energy.

An SEU might flip a bit in a control variable. It might corrupt an instruction pointer. It might silently alter a sensor reading that a guidance algorithm treats as ground truth. These events are not detectable by the CPU itself. There is no exception, no page fault, no segfault. The processor continues executing with corrupted state as if nothing happened, which is exactly what makes them so dangerous.
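To make the "no exception, no fault" point concrete, here is a minimal sketch that simulates a single bit flip in a 64-bit floating-point value. The helper `flip_bit` and the variable `altitude_m` are hypothetical names chosen for illustration; nothing here models real flight hardware.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 double encoding inverted."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the double as a 64-bit unsigned integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # Invert exactly one bit, then reinterpret the result as a double again.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


altitude_m = 1000.0  # hypothetical sensor-derived value
print(flip_bit(altitude_m, 0))   # lowest mantissa bit: a tiny error
print(flip_bit(altitude_m, 62))  # highest exponent bit: a wildly different magnitude
print(flip_bit(altitude_m, 63))  # sign bit: same magnitude, opposite sign
```

Every call returns an ordinary float and raises nothing. Downstream code has no way to tell a corrupted value from a legitimate one unless something outside the computation checks it.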

Beyond SEUs, there is single-event latch-up (SEL), where a particle triggers a parasitic thyristor structure in the chip, creating a low-resistance path between power and ground that can permanently destroy the device unless it is quickly power-cycled. Total ionizing dose (TID), meanwhile, accumulates over the mission, gradually shifting transistor characteristics until timing margins erode or logic fails outright.

Radiation hardening addresses some of this at the silicon level. Radiation-hardened processors, such as the BAE Systems RAD750 that has flown on dozens of missions including the Curiosity and Perseverance Mars rovers, are built on fabrication processes that sacrifice raw clock speed for tolerance to these effects. A RAD750 tops out at around 200 MHz with roughly 400 MIPS of throughput. A modern consumer CPU running at 4 GHz would outperform it by two to three orders of magnitude, but a consumer CPU would likely latch up or corrupt state on its first day past the magnetopause.

Triple Modular Redundancy and the Voting Problem

Radiation hardening gets you resilience to gradual effects and reduces SEU susceptibility, but it does not eliminate it. The remaining risk is managed architecturally through triple modular redundancy (TMR).

The concept is simple to state: run the same computation on three independent hardware channels, then take a majority vote on the outputs. If one channel produces a different answer than the other two, it is either experiencing a fault or receiving corrupted sensor data, and it is outvoted. The system continues operating on the consensus output while flagging the disagreeing channel for diagnostic attention.
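The voting step can be sketched in a few lines. This is an illustrative software model, not the Orion voter: it assumes each channel's output is an integer word, and the function names `majority_vote` and `vote_and_flag` are made up for the example.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each result bit takes the value held by at least two inputs."""
    return (a & b) | (b & c) | (a & c)


def vote_and_flag(outputs):
    """Vote on three channel outputs.

    Returns (consensus, suspects), where suspects lists the indices of
    channels whose output differs from the consensus.
    """
    a, b, c = outputs
    consensus = majority_vote(a, b, c)
    suspects = [i for i, value in enumerate(outputs) if value != consensus]
    return consensus, suspects


# Channel 1 has suffered a single bit flip in its output word.
consensus, suspects = vote_and_flag((0x1234, 0x1234 ^ 0x0010, 0x1234))
print(hex(consensus), suspects)
```

One suspect means a single channel was outvoted and can be flagged for diagnostics. Two or more suspects means no two channels agreed on the full word, which a real system would have to treat as a more serious condition than a routine single fault.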

The implementation is considerably less simple. For voting to work, the three channels must be synchronized: they need to be executing the same instruction at the same logical time, receiving the same sensor inputs at the same moment, and producing outputs that can be compared before any of them acts on those outputs. Achieving this synchronization without introducing a single point of failure in the synchronization mechanism itself requires careful design. The voting logic itself must be fault-tolerant, or you have solved nothing.

A deeper issue is what happens when the channels agree on a wrong answer. If a sensor provides a corrupted reading that all three channels receive identically, they will all compute the same wrong result and vote unanimously for it. Voting catches faults inside a channel; it cannot catch a fault that is common to all of them. This is why sensor redundancy and cross-channel sensor comparison are a parallel concern to compute redundancy. You cannot protect computation without also protecting the inputs to that computation.
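One common way to protect inputs is mid-value selection across three redundant sensors: take the median reading, so a single wild value can never be the one that is used. The sketch below shows that general technique, not necessarily what Orion does. The function names and the tolerance value are assumptions for the example.

```python
def mid_value_select(r1: float, r2: float, r3: float) -> float:
    """Return the median of three redundant readings (assumes finite values)."""
    return sorted((r1, r2, r3))[1]


def flag_outliers(readings, tolerance: float):
    """Return indices of readings further than `tolerance` from the median."""
    mid = mid_value_select(*readings)
    return [i for i, r in enumerate(readings) if abs(r - mid) > tolerance]


# Sensor 2 has failed high; the selected value ignores it.
readings = (101.2, 100.9, 9999.0)
print(mid_value_select(*readings))
print(flag_outliers(readings, tolerance=5.0))
```

Unlike averaging, the median is not pulled toward the failed sensor at all, which is why it suits a system that must keep flying on the two good readings.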

There is also the question of Byzantine faults: cases where a faulty channel does not simply produce a wrong answer, but produces different wrong answers to different parts of the system, actively undermining the voting process. Byzantine fault tolerance in distributed systems literature requires four nodes to tolerate one Byzantine fault, not three. NASA’s designs address this through careful channel isolation: a faulty channel should not be able to inject different data into different channels. Physical isolation of the channel buses and dedicated point-to-point connections between specific components prevent the cross-contamination that Byzantine scenarios require.
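A toy model shows why inconsistent messages are the dangerous case. Assume, purely for illustration, that healthy channels A and B legitimately hold different values (say, a discrete input sampled on either side of a transition), and that each decides by majority over its own value, its peer's value, and whatever faulty channel C sent it. All names below are hypothetical.

```python
from collections import Counter


def local_majority(values):
    """Return the most common value among those one channel has seen."""
    return Counter(values).most_common(1)[0][0]


def decisions(a_own: int, b_own: int, c_to_a: int, c_to_b: int):
    """Return what healthy channels A and B each decide after one exchange."""
    a_decides = local_majority([a_own, b_own, c_to_a])
    b_decides = local_majority([b_own, a_own, c_to_b])
    return a_decides, b_decides


# C is wrong but consistent: A and B still reach the same decision.
print(decisions(a_own=0, b_own=1, c_to_a=1, c_to_b=1))

# C tells A one thing and B another: A and B diverge.
print(decisions(a_own=0, b_own=1, c_to_a=0, c_to_b=1))
```

In this model, any mechanism that forces C to deliver the same data to both peers turns the second case back into the first. That is the intuition behind isolating channel buses so that a faulty channel cannot send different data to different listeners.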

Heritage vs. Modern: A Calculated Conservatism

The Orion avionics architecture carries significant heritage from earlier programs, and this is a feature, not an oversight. The Apollo Guidance Computer, the Space Shuttle Main Engine Controller, and the programs in between each contributed design patterns that became baseline assumptions for subsequent programs. When a design pattern has flown 135 shuttle missions without a computer-caused fatality, there is a very strong argument for continuing to use it.

This stands in contrast to SpaceX’s approach with Falcon 9 and Dragon. SpaceX uses commercial-off-the-shelf (COTS) x86 processors, running Linux, relying on software-defined redundancy and the statistical argument that modern COTS hardware is reliable enough that triple voting plus rapid detection and reset handles the residual risk. The Dragon spacecraft uses triple-redundant flight computers running custom real-time software on what is functionally server-grade hardware. This approach yields far greater compute performance at lower cost and mass, with the bet that software resilience and rapid fault recovery compensate for less radiation-hardened silicon.

Neither approach is wrong. They reflect different risk models, different organizational cultures, and different missions. Dragon flies to low Earth orbit, below the Van Allen belts and well inside Earth’s magnetosphere, where radiation flux is significantly lower. Artemis II will spend days in deep space, outside that protection, which shifts the calculus toward hardened silicon and conservative heritage design.

The Software Verification Cost

The hardware story is only half of it. The software running on Artemis II’s flight computers must meet certification standards that make DO-178C, the civil aviation software standard, look approachable. Every requirement must be traced to code. Every branch of every function must be covered by tests. The full verification record runs to millions of documents.

NASA’s flight software often uses model-based design: engineers define behavior in MATLAB/Simulink models, and code generation tools produce the C code that runs on the hardware. This approach moves verification effort earlier in the process, where mistakes are cheaper to fix, and provides a higher-level specification that can be formally checked in ways that hand-written C cannot.

The discipline of keeping flight software deterministic is worth appreciating. No dynamic memory allocation after initialization. No floating-point operations where integer arithmetic suffices. Bounded loop iterations with statically provable termination. Stack depths that are analyzable at compile time. These constraints exist because unpredictable behavior in a C program, whether undefined behavior, heap exhaustion, or an unbounded loop, is merely annoying in a web service and potentially fatal on a spacecraft.
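The shape of two of these rules, allocate once at startup and bound every loop, can be sketched as follows. Real flight code would be written in a compiled language such as C; Python is used here only to show the pattern, and `FixedQueue`, `drain`, and `MAX_PER_CYCLE` are hypothetical names.

```python
class FixedQueue:
    """Ring buffer whose storage is allocated once, at construction."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [0] * capacity  # storage is sized here and never grows
        self._capacity = capacity
        self._head = 0
        self._count = 0

    def push(self, item: int) -> bool:
        """Store an item; return False instead of growing when full."""
        if self._count == self._capacity:
            return False
        self._slots[(self._head + self._count) % self._capacity] = item
        self._count += 1
        return True

    def pop(self):
        """Remove and return the oldest item, or None when empty."""
        if self._count == 0:
            return None
        item = self._slots[self._head]
        self._head = (self._head + 1) % self._capacity
        self._count -= 1
        return item


MAX_PER_CYCLE = 4  # hard upper bound on work done in one control cycle


def drain(queue: FixedQueue, handler) -> int:
    """Handle at most MAX_PER_CYCLE queued items; return how many were handled."""
    handled = 0
    for _ in range(MAX_PER_CYCLE):  # the loop bound is a constant, not data-dependent
        item = queue.pop()
        if item is None:
            break
        handler(item)
        handled += 1
    return handled
```

The point of both choices is that worst-case memory use and worst-case time per cycle are known before the software ever runs, rather than discovered in flight.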

What This Costs and Why It Matters

Fault-tolerant computing for crewed spaceflight is expensive in every currency: mass, power, development time, and money. A triple-redundant avionics system is roughly three times the hardware of a simplex system, plus the added complexity of voting logic, cross-channel interconnects, and isolation mechanisms. The software verification burden compounds this. NASA’s approach to these costs reflects the nature of the mission: the probability of losing crew must remain below a defined threshold across the entire flight, and the computer system is one of the largest contributors to that probability budget.
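What the tripled hardware buys can be estimated with the textbook 2-of-3 reliability formula. The sketch assumes a perfect voter and fully independent channel failures, neither of which holds exactly in practice, and the per-channel figure of 0.99 is an arbitrary illustration rather than a number from the article.

```python
def tmr_reliability(r: float) -> float:
    """Probability that at least two of three independent channels work.

    r is the per-channel reliability. The result is r**3 (all three work)
    plus 3 * r**2 * (1 - r) (exactly two work), which simplifies to
    3*r**2 - 2*r**3.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be a probability between 0 and 1")
    return 3 * r**2 - 2 * r**3


single = 0.99  # hypothetical per-channel reliability over the mission
print(1 - single)                   # failure probability of one channel
print(1 - tmr_reliability(single))  # failure probability of the voted triple
```

Under these assumptions, voting only helps when each channel is better than a coin flip, and the gain grows the more reliable each channel already is. This is one reason hardened silicon and TMR are layered rather than treated as alternatives.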

Reading through the engineering details in the CACM article, what becomes clear is that fault-tolerant design for crewed spaceflight is less about any single clever technique and more about the systematic discipline of closing every gap. You harden the silicon, then you add TMR, then you isolate the channels, then you protect the sensors, then you verify the software formally, then you test the whole system under simulated radiation environments. Each layer closes a failure mode that the previous layer left open.

The computers that flew Apollo had roughly 4 KB of erasable memory and 72 KB of read-only core rope memory. The computers that will fly Artemis II are incomparably more capable, and the engineering rigor that protects them has kept pace with that capability. That combination, disciplined redundancy on capable hardware with formally verified software, is what makes it reasonable to put four people in a capsule and send them past the Moon.
