Why Space Computers Vote: Inside Artemis II's Fault-Tolerant Architecture
Source: hackernews
The CACM piece on how NASA built Artemis II’s fault-tolerant computer is worth reading as a systems engineering case study, not just aerospace history. The technical constraints that shaped the Orion spacecraft’s computing architecture are the same constraints that show up in medical devices, aircraft fly-by-wire systems, and nuclear instrumentation, just pushed to an extreme that makes every trade-off sharper.
The Environment That Shapes Every Decision
Beyond Earth’s magnetosphere, the radiation environment turns ordinary computing into a probabilistic problem. Galactic cosmic rays and solar particle events send high-energy protons and heavy ions through spacecraft structure and into silicon, where they can deposit enough charge to flip bits in memory or alter the state of logic gates. These single-event upsets (SEUs) are effectively random writes to your hardware, triggered by physics rather than software bugs.
At ISS altitudes, Earth’s magnetic field still provides meaningful shielding. Artemis II is different. The Orion spacecraft carries four crew members on a trajectory that transits the Van Allen radiation belts and continues into cislunar space, where that protection disappears. SEU rates increase; total ionizing dose accumulates faster; the consequences of an undetected fault are not a degraded user experience but potentially the loss of crew.
Three failure modes dominate space computing reliability work. Single-event upsets are transient and usually recoverable with error-correcting code (ECC) memory, which can detect and correct single-bit errors and detect (though not correct) double-bit errors. Total ionizing dose (TID) is cumulative damage that gradually shifts transistor characteristics, increases leakage current, and degrades performance over the mission lifetime; components are rated in krad, and mission design must keep accumulated exposure within those limits. Latch-up is the most severe: high-energy particles can trigger parasitic bipolar transistors inherent in CMOS structures, causing a self-sustaining short circuit that draws catastrophic current until the device is power-cycled or destroyed. Radiation-hardened chips address latch-up at the process level through silicon-on-insulator construction, guard ring implants, and careful layout rules that physically suppress the parasitic paths.
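The single-error-correct, double-error-detect behavior described above can be sketched with an extended Hamming(8,4) code. This is a toy illustration, not the memory controller logic of any flight computer: real ECC memory typically protects much wider words, and the function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds an overall parity bit. Indices 1..7 follow the classic
    Hamming(7,4) layout: parity at 1, 2, 4 and data at 3, 5, 6, 7.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions with bit 2 set
    code[0] = sum(code[1:]) % 2             # makes total parity even
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of set positions; 0 if valid
    overall = sum(code) % 2                 # 1 means an odd number of flips

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: treat as a single flip. The syndrome is its position
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return None, "uncorrectable"

    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status
```

Flipping any one bit of `encode(0b1011)` and decoding returns the original nibble with status `corrected`; flipping any two bits returns status `uncorrectable`, which is the point at which a system has to fall back on something other than ECC.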
Triple Modular Redundancy in Practice
The solution NASA has used across crewed programs since the 1970s is majority voting across redundant hardware channels, a pattern called triple modular redundancy (TMR) when three channels are used. Three independent computing channels execute the same operations on the same inputs, compare results, and accept the majority output. Under a single-fault assumption, a channel whose result diverges from the other two is presumed faulty: the minority output is rejected, the anomalous channel is flagged, and the system continues operating on the remaining channels.
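A minimal sketch of the voting step, assuming channel outputs are hashable values that can be compared exactly. The function name `vote` and its return shape are illustrative assumptions; flight voters are usually implemented in hardware or compare within tolerances.

```python
from collections import Counter


def vote(outputs):
    """Majority-vote over exactly three channel outputs.

    Returns (value, suspects): `value` is the majority output, or None if
    all three channels disagree; `suspects` lists the indices of channels
    whose output differed from the majority.
    """
    if len(outputs) != 3:
        raise ValueError("TMR voter expects exactly three channel outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        # No two channels agree: the voter cannot pick a winner.
        return None, [0, 1, 2]
    return value, [i for i, out in enumerate(outputs) if out != value]
```

For example, `vote([42, 42, 17])` returns `(42, [2])`: the output is accepted and channel 2 is flagged. `vote([1, 2, 3])` returns `(None, [0, 1, 2])`, a case the surrounding system has to handle separately.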
Stating the principle takes one sentence. Building a system where it actually works reliably is substantially harder. The voting mechanism itself must be trustworthy; a fault in the voter corrupts the entire redundancy scheme. The three channels must be genuinely independent, sharing no failure modes, because common-cause failures defeat redundancy entirely. And the channels must be synchronized: in a flight control system, comparing results from three channels that are processing different points in time produces meaningless output.
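Why the voter matters can be shown with the textbook TMR reliability formula. The sketch below assumes channels fail independently with equal reliability and that a single voter sits in series with them; common-cause failures, which the paragraph above warns about, are deliberately not modeled. The function name and the numbers are illustrative only.

```python
def tmr_reliability(r_channel, r_voter=1.0):
    """Probability that a TMR system produces a correct output.

    Assumes independent, identical channels and one voter in series.
    The system works if the voter works and at least two channels work.
    """
    exactly_two = 3 * r_channel**2 * (1 - r_channel)
    all_three = r_channel**3
    return r_voter * (exactly_two + all_three)


# Illustrative numbers, not measured values for any real system.
ideal = tmr_reliability(0.9)               # perfect voter
weak_voter = tmr_reliability(0.9, 0.9)     # voter no better than a channel
```

With a perfect voter, three channels at 0.9 give about 0.972, better than one channel alone. With a voter that is only as reliable as a channel, the result drops to about 0.875, worse than flying a single channel. The same formula also shows that TMR only helps when each channel is better than 0.5 to begin with.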
The Space Shuttle addressed synchronization with five IBM AP-101 computers. Four machines formed a redundant set that executed the primary flight software in tight synchronization, continuously comparing outputs. A fifth ran an independently developed backup flight system written by a separate contractor to a separate specification, expressly to avoid common-mode software failures. If all four primary computers failed identically because of a software bug, the backup could still fly the vehicle. This architectural decision reflects a hard-won insight: redundant hardware running identical software can and does fail identically, so software diversity matters as much as hardware redundancy.
Orion’s avionics architecture extends this lineage: multiple independent computing strings each run the same flight software on separate radiation-hardened processors, with voting logic that compares outputs and isolates strings that diverge. The data buses connecting avionics units use the MIL-STD-1553 standard, a 1 Mbit/s serial bus that has flown on spacecraft since the 1970s and whose behavior under fault conditions is thoroughly characterized.
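Isolating a diverging string is usually a decision made over several frames rather than one, so that a transient upset does not cost a healthy string. The sketch below is a hypothetical illustration of that idea, not Orion's actual logic: the class name `StringMonitor`, the strike counter, and the limit of three consecutive miscompares are all assumptions.

```python
from collections import Counter


class StringMonitor:
    """Vote across active strings and isolate persistent miscomparers."""

    def __init__(self, names, strike_limit=3):
        self.active = list(names)
        self.strikes = {name: 0 for name in names}
        self.strike_limit = strike_limit   # hypothetical threshold

    def step(self, outputs):
        """Process one frame. `outputs` maps string name -> output value.

        Returns the voted value, or None if no strict majority exists.
        """
        ballots = {name: outputs[name] for name in self.active}
        value, count = Counter(ballots.values()).most_common(1)[0]
        if count * 2 <= len(ballots):
            return None                    # tie or total disagreement
        for name, out in ballots.items():
            # Agreement clears the counter; disagreement adds a strike.
            self.strikes[name] = 0 if out == value else self.strikes[name] + 1
        # Drop strings that have miscompared too many frames in a row.
        self.active = [n for n in self.active
                       if self.strikes[n] < self.strike_limit]
        return value
```

A string that disagrees once and then recovers keeps its place in the voting set; one that disagrees for three consecutive frames is removed, after which the remaining two can still detect a disagreement but can no longer outvote it.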
The Processor Problem
Commercial semiconductor development is driven by performance density: shrink the transistors, switch faster, pack more logic per square millimeter. Smaller transistors use less power and switch faster, but they also reduce the charge required to flip a bit, making them more susceptible to SEUs, and their smaller feature sizes make radiation-hardening techniques less effective.
This is why space-grade processors operate at clock speeds that would seem antique on a desktop. The BAE Systems RAD750, a radiation-hardened derivative of the PowerPC architecture, runs at roughly 200 MHz and carries a total ionizing dose tolerance exceeding 100 krad. It is the primary processor in the Curiosity and Perseverance Mars rovers and dozens of other deep-space spacecraft. The more recent RAD5545 offers quad-core operation at up to 1.2 GHz, still far behind commercial parts but a meaningful capability increase for onboard processing.
The European Space Agency’s LEON processor family, a SPARC-based RISC design originally developed by ESA, takes a similar approach. LEON3 and LEON4 variants appear in Sentinel Earth observation satellites, the ExoMars rover, and BepiColombo’s Mercury mission. The LEON architecture is notable partly because radiation-hardened variants are available from multiple vendors, reducing single-source dependency.
Orion’s flight computers use processors in this heritage: clock speeds constrained by radiation tolerance requirements, operating in redundant configurations with hardware voting logic. The flight software executes on a real-time operating system providing deterministic scheduling, which is a prerequisite for tight synchronization across channels. VxWorks has historically been the standard choice in crewed spaceflight applications for this reason.
Verifying What You Cannot Easily Test
Building the hardware and software is one engineering challenge; verifying that it behaves correctly under the actual radiation environment is another. Ground testing uses particle accelerator facilities where heavy ion beams can be directed at components to simulate cosmic ray strikes, and proton irradiation tests replicate solar particle events. Total ionizing dose testing uses sustained gamma irradiation, typically from cobalt-60 sources, running components to their specification limits and monitoring for degradation. This testing is expensive and slow, and it is destructive: irradiated parts cannot be flown, so each test consumes sample components that must be representative of the ones that will fly.
Software verification for fault-tolerant systems adds a distinct dimension beyond ordinary testing. You need to demonstrate not just that the normal operational case works, but that every combination of faults, including faults in the fault-detection and reconfiguration code itself, is handled correctly. NASA uses formal verification methods for safety-critical code paths and extensive hardware-in-the-loop simulation environments, but the ground-based test setup is always a model of the actual space environment rather than the environment itself. Some faults will only manifest in flight.
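For a small enough model, "every combination of faults" can be enumerated directly. The sketch below exhaustively injects wrong values into a toy 4-bit bitwise majority voter and counts which faults are masked and which escape. It illustrates the idea of a fault-injection campaign only; it is not formal verification and not NASA's tooling, and `WIDTH`, `vote`, and `campaign` are hypothetical names.

```python
from itertools import combinations, product

WIDTH = 4                        # toy word width so enumeration stays small
VALUES = range(1 << WIDTH)


def vote(a, b, c):
    """Bitwise 2-of-3 majority of three integer words."""
    return (a & b) | (a & c) | (b & c)


def campaign(faulty_channels):
    """Inject every wrong value into every choice of faulty channels.

    Returns (masked, escaped): how many injected faults the voter hid,
    and how many produced a wrong voted output.
    """
    masked = escaped = 0
    for good in VALUES:
        wrong = [v for v in VALUES if v != good]
        for channels in combinations(range(3), faulty_channels):
            for bad in product(wrong, repeat=faulty_channels):
                outputs = [good] * 3
                for ch, val in zip(channels, bad):
                    outputs[ch] = val
                if vote(*outputs) == good:
                    masked += 1
                else:
                    escaped += 1
    return masked, escaped
```

Running `campaign(1)` shows every single-channel fault masked, as TMR promises. Running `campaign(2)` shows a large number of escapes, which is the case the voter alone cannot handle and the reason fault detection and reconfiguration code must itself be tested.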
Sixty Years of the Same Hard Problem
The Apollo Guidance Computer ran at roughly 1 MHz with about 72 kilobytes of core rope memory (36,864 16-bit words) and about 4 kilobytes of erasable core memory. It was a remarkable piece of constrained engineering, but it was fundamentally a single computer with a separate backup (the lunar module’s Abort Guidance System) rather than true fault-tolerant redundancy in the TMR sense. Mission success depended heavily on the hardware working as designed, supplemented by defensive software design, including the famous priority scheduling that allowed the computer to shed less critical tasks and focus on guidance when overloaded during Apollo 11’s lunar descent.
The Shuttle’s multi-computer voting architecture represented a genuine advance, and it demonstrated real fault tolerance in operation across 135 missions. Multiple missions flew with degraded computing configurations and landed safely. The system worked.
Artemis II extends this progression with modern radiation-hardened processors, higher-bandwidth avionics buses, and more sophisticated fault detection and isolation software. The mission travels farther from Earth than any crewed mission since Apollo 17 in 1972; the radiation exposure accumulates; the fault tolerance requirements are correspondingly higher. But the fundamental architecture (redundancy, voting, software diversity, careful isolation of failure domains) traces directly back to decisions made in the 1970s and validated in operation.
The lead time involved in this work is part of the story. Space-grade components have qualification cycles measured in years. The processors flying in Artemis II were selected and qualified long before the mission; by the time they fly, they are already generations behind commercial parts. This is not a management failure; it is the correct outcome of the reliability requirements. Changing a component late in development means re-qualification, re-testing, and re-verification of all the fault tolerance logic that depends on that component’s timing and behavior characteristics. The conservatism is deliberate and appropriate.
The alternative is flying unqualified hardware and accepting the risk, which is not a trade-off that makes sense when the payload is four people in cislunar space.