Redundancy All the Way Down: The Engineering Behind Artemis II's Flight Computer
Source: hackernews
Space is hostile to computing in ways that software engineers rarely have to think about. Cosmic rays punch through hardware and flip bits at random. Temperature swings of hundreds of degrees stress solder joints and capacitors. There is no reaching into the rack and swapping a failed drive. When NASA engineers sat down to design the flight computer for Artemis II, the first crewed lunar mission since Apollo 17 in 1972, every architectural decision carried the weight of that constraint. The result, as detailed in a recent Communications of the ACM piece, is a system that treats failure not as an exceptional condition but as a design input.
The Baseline: Triple Modular Redundancy
The core concept behind Artemis II’s flight computers is triple modular redundancy, or TMR. Three independent processors execute identical instruction streams simultaneously. Their outputs are fed into a voting circuit, and the majority result wins. If one processor produces a deviant output, it gets outvoted. The system continues operating correctly, and the anomaly is logged for later analysis.
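The voting step is simple enough to sketch in a few lines. What follows is a minimal software illustration, not Orion's voter, which the article describes as a circuit; the function name `tmr_vote` and the idea of returning the dissenting channel's index are assumptions made for the example.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def tmr_vote(outputs: Sequence[Hashable]) -> Tuple[Hashable, Optional[int]]:
    """Return (majority value, index of the outvoted channel or None).

    Hypothetical helper: a real voter is hardware and votes on every cycle.
    """
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three channel outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        # All three channels disagree, so no majority exists.
        raise RuntimeError("no majority: all three channels disagree")
    # Find the single channel (if any) that disagrees with the majority.
    dissenter = next((i for i, out in enumerate(outputs) if out != value), None)
    return value, dissenter


# Channel 2 has suffered a single bit flip (0x2A became 0x6A).
result, faulty_channel = tmr_vote([0x2A, 0x2A, 0x6A])
```

Returning the dissenter alongside the result mirrors the behavior described above: the system keeps going on the majority value while the anomaly is recorded for later analysis.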
TMR is not a new idea. It was used in the Saturn V’s guidance system in the 1960s, and the Space Shuttle’s four general-purpose computers (with a fifth backup running a completely independent software stack) extended the concept further. What has changed over the decades is the granularity at which voting happens, the sophistication of the error detection logic, and the level of software involvement in recovery decisions.
In Orion’s avionics suite, built substantially by Honeywell, redundancy is layered. The Command and Data Handling subsystem manages multiple vehicle computers running in lockstep. These are not simply redundant copies sitting idle; they are actively processing in parallel, synchronized to within microseconds. The voting logic runs continuously. A single-event upset, where a cosmic ray strikes a memory cell and flips a bit, gets caught before it can propagate into a flight-critical output like a thruster firing command.
Radiation Hardening: Hardware and Software Together
The processors inside Orion’s flight computers are radiation-hardened variants, purpose-built to tolerate the particle radiation environment of deep space. The workhorse of deep-space missions for the past two decades has been the BAE Systems RAD750, a radiation-hardened derivative of the PowerPC 750 processor. It runs at speeds that would look antique by consumer standards, 200 MHz at most, but it can absorb total ionizing doses measured in hundreds of kilorad and is designed to withstand the single-event latchup conditions that would permanently damage commercial silicon.
Radiation hardening happens at multiple levels. At the process level, chip fabricators use silicon-on-insulator techniques that reduce the charge collection volume when a particle strikes. At the circuit level, individual flip-flops are replaced with hardened variants that require multiple simultaneous bit flips to corrupt state, a physically unlikely event. At the system level, error-correcting code memory scrubs stored bits continuously, finding and correcting single-bit errors before they accumulate.
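To make the error-correcting step concrete, here is a minimal sketch of single-bit correction using a Hamming(7,4) code. This is a textbook code chosen for brevity, not the ECC scheme Orion's memory uses (flight memory typically protects much wider words), and the function names are assumptions for the example.

```python
from typing import List, Tuple


def hamming74_encode(nibble: int) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword (bit positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8  # index 0 unused so list indices match 1-based positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4, 5, 6, 7
    return code[1:]


def hamming74_scrub(word: List[int]) -> Tuple[List[int], int]:
    """Return (corrected word, 1-based position of the repaired bit, or 0)."""
    code = [0] + list(word)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set-bit positions is 0 for a valid codeword
    if syndrome:
        code[syndrome] ^= 1  # a single flipped bit sits exactly at the syndrome
    return code[1:], syndrome


def hamming74_decode(word: List[int]) -> int:
    """Extract the 4 data bits from an already-corrected codeword."""
    return word[2] | (word[4] << 1) | (word[5] << 2) | (word[6] << 3)


stored = hamming74_encode(0b1011)
stored[4] ^= 1                      # simulate an upset at position 5
repaired, fixed_at = hamming74_scrub(stored)   # fixed_at == 5
```

A code like this corrects one flipped bit per word, which is why scrubbing has to run continuously: the goal is to repair each single-bit error before a second upset lands in the same word.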
Software plays a role too. Memory scrubbing daemons run at scheduled intervals, reading and rewriting memory to catch and correct soft errors before they spread. Watchdog timers reset processors that stop responding within expected deadlines. The flight software itself is written with explicit assertions that sanity-check intermediate state, and out-of-range sensor readings trigger reconfiguration rather than being silently propagated.
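The watchdog pattern is easy to show in miniature. The sketch below is an assumption-laden illustration, not flight code: the `Watchdog` class, its method names, and the injected timestamps are all hypothetical, and a real watchdog is usually a hardware timer that resets the processor itself.

```python
class Watchdog:
    """Software stand-in for a watchdog timer (hypothetical API).

    Timestamps are passed in explicitly so the behavior is deterministic.
    """

    def __init__(self, timeout_s: float, now: float = 0.0) -> None:
        self.timeout_s = timeout_s
        self.last_kick = now
        self.resets = 0

    def kick(self, now: float) -> None:
        """Called by the monitored task to prove it is still alive."""
        self.last_kick = now

    def poll(self, now: float) -> bool:
        """Return True if the deadline was missed and a reset was triggered."""
        if now - self.last_kick > self.timeout_s:
            self.resets += 1
            self.last_kick = now  # the restarted task gets a fresh deadline
            return True
        return False


dog = Watchdog(timeout_s=0.5)
dog.kick(now=0.4)            # task checks in on time
on_time = dog.poll(now=0.8)  # 0.4 s since the last kick: no reset
missed = dog.poll(now=1.0)   # 0.6 s since the last kick: reset fires
```

The important property is that the monitored task has to keep proving it is alive; silence is treated as failure.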
The Software Stack: cFS and Formal Discipline
NASA’s Core Flight System (cFS) is the open-source software framework that underpins Orion’s flight software, along with other NASA missions including LADEE and the Global Precipitation Measurement mission. It provides a component-based architecture where software applications communicate through a publish-subscribe message bus called the Software Bus. Applications are isolated from each other; a bug in a non-critical component cannot directly corrupt the state of a flight-critical one.
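The publish-subscribe idea can be sketched in a few lines. This is not the cFS Software Bus API; the class, method names, and message IDs below are hypothetical, and the isolation shown here (catching a subscriber's exception) is only a loose analogy for the separation the article describes.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List, Tuple


class SoftwareBus:
    """Toy publish-subscribe bus (hypothetical; not the cFS interface)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)
        self.faults: List[Tuple[str, str]] = []

    def subscribe(self, msg_id: str, handler: Callable[[Any], None]) -> None:
        self._subscribers[msg_id].append(handler)

    def publish(self, msg_id: str, payload: Any) -> int:
        """Deliver payload to every subscriber; return successful deliveries."""
        delivered = 0
        for handler in self._subscribers.get(msg_id, []):
            try:
                handler(payload)
                delivered += 1
            except Exception as exc:
                # A failing subscriber is recorded, not allowed to stop the others.
                self.faults.append((msg_id, repr(exc)))
        return delivered


received = []


def buggy_logger(payload):
    raise ValueError("bug in a non-critical app")


bus = SoftwareBus()
bus.subscribe("NAV_STATE", buggy_logger)     # registered first, fails on delivery
bus.subscribe("NAV_STATE", received.append)  # the critical consumer
count = bus.publish("NAV_STATE", {"altitude_km": 400})
```

Publishers never hold a reference to their consumers, which is what lets a component be added, removed, or restarted without touching the others.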
cFS itself is written in C, but Ada has a long history in safety-critical aerospace software, and for reasons that go beyond institutional inertia. Ada’s type system enforces range constraints at the language level. A variable declared to hold a value between 0 and 360 degrees cannot silently wrap around to a negative number; an out-of-range assignment raises a runtime exception instead. The SPARK subset of Ada goes further, enabling formal proof that runtime errors are absent. When a component’s SPARK proof obligations are fully discharged, the proof tools have mathematically verified that certain classes of bugs are impossible.
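For readers who have not used Ada, the range-constraint idea can be approximated at runtime in Python. This is only an emulation under stated assumptions: the `Heading` class is hypothetical, and Python checks the bound when the code runs, whereas SPARK's tools aim to prove before execution that the check can never fail.

```python
class Heading:
    """Degrees constrained to 0..360, loosely mimicking an Ada range type."""

    FIRST = 0
    LAST = 360

    def __init__(self, degrees: int) -> None:
        if not (self.FIRST <= degrees <= self.LAST):
            # Comparable in spirit to Ada raising an exception on a range violation.
            raise ValueError(f"{degrees} is outside {self.FIRST}..{self.LAST}")
        self.degrees = degrees

    def __add__(self, delta: int) -> "Heading":
        # The result is re-validated, so an out-of-range sum cannot slip through.
        return Heading(self.degrees + delta)


ok = Heading(350) + 5          # 355 is in range
try:
    Heading(350) + 20          # 370 is out of range
    overflowed = False
except ValueError:
    overflowed = True
```

The point of the pattern is that an invalid value fails loudly at the place it is created instead of propagating into later calculations.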
This is a different discipline from how most application software is written. There are no null pointer exceptions in formally verified Ada code, not because the programmers were careful, but because the tools prove they cannot occur.
Byzantine Faults and the Limits of Voting
TMR handles a specific threat model: one processor fails in a detectable way, producing outputs that differ from the other two. But a subtler failure mode exists, the Byzantine fault, where a processor fails in a way that produces different wrong answers to different observers. A processor with damaged memory interconnects might, in principle, report one value to voter circuit A and a different value to voter circuit B. Three-way voting cannot reliably resolve this.
The academic treatment of Byzantine fault tolerance traces to Lamport, Shostak, and Pease’s 1982 paper, which established that tolerating up to f Byzantine faults requires at least 3f+1 components. Tolerating a single Byzantine fault requires four systems, not three. This is why critical infrastructure like aircraft fly-by-wire systems often use four independent computers rather than three, and it informed similar decisions in Orion’s architecture.
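The reason three parties are not enough can be sketched with the classic indistinguishability argument. In the toy model below (all names are hypothetical), a source sends a command to receivers A and B, and B relays what it heard to A. A sees exactly the same thing whether the source lied or B lied, so it cannot tell which one to distrust.

```python
from typing import Dict

OPPOSITE = {"FIRE": "HOLD", "HOLD": "FIRE"}


def view_of_a(source_sends: Dict[str, str], b_is_honest: bool) -> Dict[str, str]:
    """Return everything receiver A observes in one message round."""
    heard_by_a = source_sends["A"]
    heard_by_b = source_sends["B"]
    # An honest B relays what it heard; a faulty B relays the opposite.
    relayed = heard_by_b if b_is_honest else OPPOSITE[heard_by_b]
    return {"from_source": heard_by_a, "relayed_by_B": relayed}


# Scenario 1: the source is Byzantine and tells A and B different things.
source_lied = view_of_a({"A": "FIRE", "B": "HOLD"}, b_is_honest=True)

# Scenario 2: the source is honest, but B is Byzantine and misreports.
b_lied = view_of_a({"A": "FIRE", "B": "FIRE"}, b_is_honest=False)

indistinguishable = source_lied == b_lied
```

With a fourth participant, A would have a second independent relay to compare against, which is the intuition behind the 3f+1 bound.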
The practical mitigation in spacecraft is to design hardware so that Byzantine failure modes are physically unlikely, relying on components whose failure mechanisms are well understood rather than simply hoping Byzantine conditions never occur. Radiation-induced faults tend to produce detectable errors: stuck bits, corrupted memory that ECC flags, or completely unresponsive processors. The pathological case where a processor silently produces subtly wrong-but-plausible outputs is rare in well-characterized hardware.
Lessons for Non-Space Computing
It is a useful exercise to look at how spacecraft engineers think about fault tolerance and then map those concepts back onto distributed systems work. The parallels are instructive even when the stakes differ.
Voting logic appears in distributed consensus protocols. Raft and Paxos require a majority of nodes to agree before committing a write. Three nodes tolerate one failure; five nodes tolerate two. The mathematics is identical to TMR, applied at the level of networked servers instead of synchronized processors. Byzantine fault-tolerant consensus protocols like PBFT and HotStuff apply the 3f+1 requirement to distributed systems where nodes might behave arbitrarily.
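The arithmetic behind those numbers fits in three small functions. The function names are assumptions for the example; the formulas are the standard majority-quorum bound (n ≥ 2f+1) for crash faults and the 3f+1 bound for Byzantine faults.

```python
def majority_quorum(n: int) -> int:
    """Smallest number of nodes that forms a strict majority of n."""
    return n // 2 + 1


def crash_faults_tolerated(n: int) -> int:
    """Largest f such that n >= 2f + 1 (majority voting, Raft, Paxos, TMR)."""
    return (n - 1) // 2


def byzantine_faults_tolerated(n: int) -> int:
    """Largest f such that n >= 3f + 1 (PBFT-style protocols)."""
    return (n - 1) // 3


for n in (3, 4, 5, 7):
    print(n, majority_quorum(n), crash_faults_tolerated(n), byzantine_faults_tolerated(n))
```

Three nodes tolerate one crash fault but zero Byzantine faults, which is the same gap that separates three-way voting from the four-computer arrangements discussed earlier.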
Memory scrubbing maps onto the practice of checksumming data at rest and regularly verifying it. ZFS does this continuously, reading blocks and checking them against stored checksums, correcting errors when redundant copies exist. The failure mode being addressed, silent data corruption, is directly analogous to the soft errors that spacecraft memory scrubbers hunt.
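A checksum-and-repair pass over mirrored copies can be sketched as follows. This is an illustration of the general technique, not how ZFS is implemented: the function names are hypothetical, SHA-256 is used as a stand-in checksum, and the "disk" is just a list of in-memory buffers.

```python
import hashlib
from typing import List


def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


def scrub(copies: List[bytearray], expected: str) -> int:
    """Repair corrupted mirror copies in place; return how many were repaired."""
    # Find any copy that still matches the checksum stored at write time.
    good = next((bytes(c) for c in copies if checksum(bytes(c)) == expected), None)
    if good is None:
        raise RuntimeError("unrecoverable: no copy matches the stored checksum")
    repaired = 0
    for copy in copies:
        if checksum(bytes(copy)) != expected:
            copy[:] = good  # overwrite the bad copy with known-good data
            repaired += 1
    return repaired


original = b"telemetry frame 0001"
stored_sum = checksum(original)
mirrors = [bytearray(original), bytearray(original)]
mirrors[1][3] ^= 0x01                  # silent single-bit corruption in one mirror
fixed = scrub(mirrors, stored_sum)     # fixed == 1
```

Note that the checksum alone only detects corruption; repair is possible here because a redundant copy exists, which matches the condition stated above.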
Watchdog timers show up in Kubernetes liveness probes and health checks. A container that stops responding gets restarted, just as a spacecraft processor that stops meeting its deadlines gets reset.
The difference is not really conceptual. It is in the rigor of testing and verification. NASA subjects flight computers to accelerated radiation testing in particle accelerators, deliberately inducing the faults the system must handle and verifying that recovery proceeds correctly. Software is reviewed line by line, and coverage metrics are measured against strict thresholds. The DO-178C standard for aviation software and NASA’s own NPR 7150.2 software engineering requirements impose structural testing requirements that commercial software almost never faces.
History Compressed into Architecture
Every decision in Artemis II’s flight computer is, in some sense, a response to a past failure. The Apollo 1 fire was not a computing failure, but it reshaped NASA’s entire culture of verification. The loss of Challenger taught lessons about the management of known risks. Columbia reinforced them. The Mars Climate Orbiter, lost in 1999 because one piece of software reported thruster impulse in pound-force seconds while another expected newton-seconds, is why NASA flight software now has explicit unit annotations and cross-checks.
The Orion avionics team inherited this institutional memory. The redundancy architectures, the formal verification practices, the radiation testing, the voting logic, none of it was invented fresh. It is a distillation of fifty years of learning what happens when computers fail in space, applied with modern tools to a system that will carry human beings around the Moon.
Artemis II is scheduled to carry four astronauts on a roughly ten-day free-return trajectory around the Moon, the farthest from Earth that humans will have traveled. The computers running that mission will be unremarkable by modern performance standards and extraordinary by almost every other measure. That gap between raw speed and reliability is worth sitting with. It says something about what it costs to build software that cannot afford to fail.