
Redundancy as a First Principle: The Fault-Tolerant Architecture Behind Artemis II's Flight Computers

Source: hackernews

The computers aboard Artemis II have to work. Not “usually work” or “work 99.9% of the time.” They have to function correctly through radiation hits, hardware failures, and the kind of transient faults that shake loose from cosmic rays at 250,000 miles from the nearest repair shop. Building software systems that are merely reliable is an engineering problem. Building ones that cannot fail is a different discipline entirely.

NASA’s approach to this, detailed in a recent ACM CACM piece, sits at the intersection of hardware redundancy, formal software verification, and decades of hard-won lessons from Apollo, Skylab, and the Space Shuttle. The engineering choices in Orion’s avionics are not especially exotic in isolation; each technique has been used before. What makes Artemis II interesting is the specific way these techniques are composed, and what that composition implies for how we think about reliability at the system level.

From Apollo’s Single Point of Failure to Shuttle’s Voting Cabal

The Apollo Guidance Computer ran on a single processor with core rope memory. It had no hardware redundancy to speak of. Error detection existed, including memory parity checks and alarm-triggered software restarts, but if the AGC went down, it went down. The astronauts had some manual override capability, but the guidance computer was effectively a single point of failure in a mission where a failure meant death.

The Space Shuttle changed this dramatically. The Shuttle’s avionics used five IBM AP-101 general-purpose computers, four running identical software simultaneously in lockstep, one running a separately developed backup flight system written by a different contractor using different code. The four primary computers voted on outputs; if one disagreed, it was outvoted and isolated. The backup flight system was kept in sync but ran independent software so that a common-mode software bug could not take down the entire stack.

This five-computer approach worked, but it came with significant complexity costs. Keeping four computers synchronized tightly enough for meaningful voting required careful design of the synchronization frames, and the Shuttle software team spent decades chasing subtle bugs in the synchronization and voting logic. The backup flight system, written separately, diverged from the primary software in ways that occasionally caused confusion during testing and training. The whole stack was deeply coupled in ways that made it hard to evolve.

Orion’s Three-Lane Architecture

Artemis II’s Orion crew module uses a three-lane architecture: three independent flight computers running the same software, fed by independent sensor suites, with a hardware voter comparing outputs before they reach actuators and critical systems. The voter implements 2-of-3 majority logic. If one computer produces an output that disagrees with the other two, the voter selects the majority result and flags the disagreeing lane for monitoring. The spacecraft continues operating on the two agreeing lanes, in a degraded mode, until the anomaly is resolved or formally acknowledged.
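The 2-of-3 majority rule described above can be sketched in a few lines. This is only an illustration of the logic, not Orion's voter, which the article describes as hardware. The names `vote_2oo3` and `VoteResult` are hypothetical, and lane outputs are assumed to be integers that can be compared for exact equality.

```python
from typing import NamedTuple, Optional


class VoteResult(NamedTuple):
    value: Optional[int]         # majority value, or None if no two lanes agree
    suspect_lane: Optional[int]  # index of the single disagreeing lane, if any


def vote_2oo3(a: int, b: int, c: int) -> VoteResult:
    """Return the 2-of-3 majority value and flag the odd lane out."""
    if a == b == c:
        return VoteResult(a, None)
    if a == b:
        return VoteResult(a, 2)
    if a == c:
        return VoteResult(a, 1)
    if b == c:
        return VoteResult(b, 0)
    # All three disagree: more than one lane has failed, so no majority exists.
    return VoteResult(None, None)
```

A call such as `vote_2oo3(7, 9, 7)` selects 7 and flags lane 1, which matches the "select the majority, flag the dissenter" behavior in the text.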

Honeywell, Orion’s avionics prime contractor, designed the lane architecture to be electrically isolated. Power, ground references, and data buses for each lane are physically separated, so a power fault or a short circuit in one lane cannot propagate to another. The buses use MIL-STD-1553, a 1 Mbps serial protocol developed for military avionics in the 1970s that remains standard for safety-critical aerospace systems precisely because its timing and failure behavior are well understood and deterministic.

The choice of three lanes rather than four or five reflects a deliberate simplification relative to the Shuttle. Three-lane systems are easier to reason about formally: you have one voter, one voting threshold (majority), and one failure mode to manage (single-lane loss). Four-lane systems introduce the ambiguous case where two lanes disagree with two others, requiring an additional tiebreaker rule. The Shuttle’s four-primary architecture had to deal with this, and it was a recurring source of complexity in the software.
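The tie case is easy to see with a generic strict-majority function. This is a sketch with a hypothetical helper name, `majority`, and it assumes lane outputs are exactly comparable integers. It says nothing about how the Shuttle actually resolved its ties.

```python
from collections import Counter
from typing import Optional, Sequence


def majority(outputs: Sequence[int]) -> Optional[int]:
    """Return the value held by a strict majority of lanes, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


print(majority([5, 5, 9]))     # 5: three lanes, one dissenter
print(majority([5, 5, 9, 9]))  # None: four lanes split 2-2
```

With three lanes, a single faulty lane can never prevent a majority. With four, a 2-2 split leaves the voter with no answer unless an extra rule is added.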

The Processor: Radiation-Hardened and Deliberately Conservative

The flight computers use radiation-hardened processors derived from commercial designs but manufactured with processes that resist the effects of ionizing radiation. The BAE Systems RAD750, based on the PowerPC 750 (the same architecture as Apple’s G3), has been a workhorse of deep-space computing for over two decades, flying on the Mars Reconnaissance Orbiter, the Curiosity rover, the Lunar Reconnaissance Orbiter, and numerous other missions. It runs at up to 200 MHz with roughly 400 MIPS of throughput, which sounds modest next to a smartphone, but the point of a radiation-hardened processor is not raw performance. The point is predictable behavior when struck by a high-energy particle.

A cosmic ray passing through a conventional processor’s memory cell can flip a bit, causing a Single Event Upset (SEU). In commercial hardware this happens rarely enough that most applications never notice. In the Van Allen belts or in transit to the Moon, the flux is high enough that SEUs are a routine design consideration. The RAD750 uses triple modular redundancy internally for critical registers and logic, error-correcting code (ECC) memory for SRAM and DRAM, and latch-up protection circuitry to prevent a single particle strike from causing a destructive high-current event that could permanently damage the chip.
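To make the ECC idea concrete, here is a textbook Hamming(7,4) code, which corrects any single flipped bit in a 7-bit word. It is a stand-in chosen for brevity: the article does not say which code the RAD750's memory uses, and real ECC memory typically protects much wider words. The function names are hypothetical.

```python
from typing import List


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code: List[int]) -> List[int]:
    """Correct at most one flipped bit and return the 4 data bits."""
    word = list(code)
    syndrome = 0
    for position, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= position  # XOR of the positions of all set bits
    if syndrome:
        word[syndrome - 1] ^= 1   # a nonzero syndrome names the flipped position
    return [word[2], word[4], word[5], word[6]]


codeword = hamming74_encode([1, 0, 1, 1])
codeword[5] ^= 1                   # simulate a single event upset
print(hamming74_decode(codeword))  # [1, 0, 1, 1]
```

The upset is corrected on read without the software ever seeing the bad value, which is the property the paragraph above relies on.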

The operating system is VxWorks, Wind River’s real-time operating system, which has been used in aerospace and defense systems since the 1980s. VxWorks provides deterministic scheduling, meaning that the time between an event and its handler executing is bounded and known. This matters for voting systems because the voter’s logic assumes that all three lanes produce outputs on a regular, predictable schedule. If a lane’s software scheduler jitters, outputs arrive late, and the voter has to decide whether that lateness is a fault or a timing anomaly. Eliminating jitter simplifies that decision considerably.
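The lateness decision can be pictured as a three-way classification against a frame deadline. This is a sketch under assumed names and units (`classify_arrival`, microsecond timestamps, a grace window); the article does not describe the real voter's timing rules.

```python
from enum import Enum


class Arrival(Enum):
    ON_TIME = "on_time"
    LATE = "late"      # inside the grace window: treated as a timing anomaly
    MISSED = "missed"  # beyond the grace window: treated as a lane fault


def classify_arrival(arrival_us: int, deadline_us: int, grace_us: int) -> Arrival:
    """Classify one lane's output by when it arrived relative to the deadline."""
    if arrival_us <= deadline_us:
        return Arrival.ON_TIME
    if arrival_us <= deadline_us + grace_us:
        return Arrival.LATE
    return Arrival.MISSED
```

The less the scheduler jitters, the smaller `grace_us` can be. With no jitter at all the middle case disappears and the decision becomes binary.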

The Synchronization Problem

The hardest part of building a redundant voting system is not the voting. It is keeping the computers synchronized tightly enough that they produce comparable outputs at the right time.

Three computers running the same software from the same initial state will diverge if anything in their environments differs. Sensor readings arrive at slightly different times. Interrupts fire in slightly different orders. Floating-point operations, if they involve any non-determinism, can produce subtly different results. Any of these can cause the computers to reach different states, after which their outputs legitimately differ even without any hardware fault.
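The floating-point point is easy to reproduce on any machine with IEEE 754 doubles. In this small demonstration, two hypothetical lanes add the same three readings in a different order and end up with different bit patterns.

```python
readings = [0.1, 0.2, 0.3]

lane_a = (readings[0] + readings[1]) + readings[2]  # left-to-right grouping
lane_b = readings[0] + (readings[1] + readings[2])  # same values, regrouped

print(lane_a == lane_b)  # False
print(lane_a, lane_b)
```

The difference is about one part in 10^16, but an exact-match voter would still see two lanes disagreeing, and the gap can grow as the values feed back into later computations.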

NASA’s approach for Artemis uses synchronization frames: fixed-duration time windows in which all three computers must complete a deterministic set of operations and exchange state checksums before proceeding to the next frame. A computer that finishes early waits at the frame boundary for the others. A computer that falls too far behind is considered to have faulted. This frame-based synchronization is conceptually similar to barrier synchronization in parallel computing, but operating on microsecond timescales with hardware-enforced timing.
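A frame boundary can be modeled as a function that takes each lane's state and elapsed time and decides who is still in good standing. This is a simplified simulation, not flight code: the frame budget, the use of CRC32 as the checksum, and the names `close_frame` and `FRAME_BUDGET_US` are all assumptions made for illustration.

```python
import zlib
from collections import Counter
from typing import Dict

FRAME_BUDGET_US = 1000  # hypothetical frame length in microseconds


def close_frame(states: Dict[str, bytes],
                elapsed_us: Dict[str, int],
                budget_us: int = FRAME_BUDGET_US) -> Dict[str, str]:
    """Classify each lane at the end of one synchronization frame."""
    status: Dict[str, str] = {}
    checksums: Dict[str, int] = {}
    for lane, state in states.items():
        if elapsed_us[lane] > budget_us:
            status[lane] = "faulted"           # overran the frame budget
        else:
            checksums[lane] = zlib.crc32(state)
    if checksums:
        winner, votes = Counter(checksums.values()).most_common(1)[0]
        has_majority = votes * 2 > len(checksums)
        for lane, checksum in checksums.items():
            if not has_majority:
                status[lane] = "unresolved"    # no checksum holds a majority
            elif checksum == winner:
                status[lane] = "ok"
            else:
                status[lane] = "mismatch"
    return status


print(close_frame(
    {"A": b"state-42", "B": b"state-42", "C": b"state-43"},
    {"A": 800, "B": 810, "C": 805},
))
```

Exchanging checksums rather than full state keeps the comparison cheap while still catching any lane whose state has drifted from the other two.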

The navigation and control software is structured to be entirely deterministic within each frame: no dynamic memory allocation, no unbounded loops, no floating-point operations that could produce different results on slightly different hardware. The computation is a fixed sequence of operations that runs in a bounded, measured time. This is uncomfortable for software engineers trained in modern practices, but it is what makes formal reasoning about the system’s behavior tractable.
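The shape of such a computation can be shown with a fixed-point weighted sum. This is a sketch only: Python allocates memory dynamically behind the scenes, so it illustrates the arithmetic and loop structure rather than a real flight implementation. The Q16.16 format, the sample count, and the function names are assumptions.

```python
from typing import List

SCALE_BITS = 16
SCALE = 1 << SCALE_BITS  # Q16.16 fixed point: real value = raw / 65536
N_SAMPLES = 4            # fixed at design time, never data-dependent


def to_fixed(x: float) -> int:
    """Convert a real number to Q16.16 (used offline, not inside the frame)."""
    return int(round(x * SCALE))


def fixed_mul(a: int, b: int) -> int:
    """Multiply two Q16.16 values using integer arithmetic only."""
    return (a * b) >> SCALE_BITS


def filter_step(samples: List[int], weights: List[int]) -> int:
    """Weighted sum over exactly N_SAMPLES entries."""
    acc = 0
    for i in range(N_SAMPLES):  # bounded loop: always N_SAMPLES iterations
        acc += fixed_mul(samples[i], weights[i])
    return acc


samples = [to_fixed(v) for v in (1.0, 2.0, 3.0, 4.0)]
weights = [to_fixed(0.25)] * N_SAMPLES
print(filter_step(samples, weights) / SCALE)  # 2.5
```

Integer arithmetic gives bit-identical results on every lane, and the fixed iteration count means the worst-case execution time can be measured once and relied on.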

Testing: Fault Injection and Radiation Chambers

Validating a fault-tolerant system requires proving not just that it works correctly under normal conditions but that it responds correctly to every credible fault scenario. NASA uses fault injection testing, where failures are deliberately introduced into one or more lanes while the system is running and the response is verified against the expected behavior. This includes hardware faults (disconnecting power to a lane, injecting bit errors into memory), software faults (corrupting data structures), and timing faults (delaying a sensor input).
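A toy version of a fault injection campaign looks like this: flip one random bit in one random lane, run the voter, and check that the output is unaffected. The bitwise voter and the names `vote`, `inject_bit_flip`, and `run_campaign` are hypothetical. Real campaigns inject faults into running hardware and software rather than into integers.

```python
import random


def vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three integer outputs."""
    return (a & b) | (a & c) | (b & c)


def inject_bit_flip(value: int, bit: int) -> int:
    """Return value with one bit inverted, simulating a single event upset."""
    return value ^ (1 << bit)


def run_campaign(trials: int = 1000, width: int = 32, seed: int = 0) -> int:
    """Inject one single-bit fault per trial; return how many were masked."""
    rng = random.Random(seed)  # seeded so the campaign is reproducible
    masked = 0
    for _ in range(trials):
        good = rng.getrandbits(width)
        lanes = [good, good, good]
        lanes[rng.randrange(3)] = inject_bit_flip(good, rng.randrange(width))
        if vote(*lanes) == good:
            masked += 1
    return masked


print(run_campaign())  # 1000: every single-lane fault is masked
```

The more interesting test cases are the ones that should fail, such as the same bit flipped in two lanes at once, because they confirm where the fault tolerance budget actually ends.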

Radiation testing is done at particle accelerator facilities, where researchers expose hardware to controlled beams of protons and heavy ions at energies that simulate the space environment. Each component in the flight computer stack is characterized for its SEU rate and latch-up threshold under these conditions, and the system-level architecture is designed so that the expected number of SEUs per mission profile does not exceed the fault tolerance budget.
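The budgeting arithmetic can be sketched as an expected-count calculation followed by a Poisson tail probability. The model (independent upsets at a constant per-bit rate) is a common simplification, and every number in the usage lines is a placeholder chosen for illustration, not a measured value from any Artemis component.

```python
import math


def expected_upsets(rate_per_bit_day: float, bits: int, days: float) -> float:
    """Expected SEU count: per-bit daily rate times exposed bits times duration."""
    return rate_per_bit_day * bits * days


def prob_at_least(k: int, mean: float) -> float:
    """P(N >= k) for a Poisson-distributed count N with the given mean."""
    below = sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


# Placeholder inputs for illustration only.
mean = expected_upsets(rate_per_bit_day=1e-10, bits=8 * 2**30, days=10)
print(f"expected upsets: {mean:.2f}")
print(f"P(at least 2 upsets): {prob_at_least(2, mean):.3f}")
```

Accelerator testing supplies the measured rate. The architecture then has to show that the resulting counts stay within what ECC, voting, and scrubbing can absorb.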

The formal verification side uses a combination of requirements traceability, code coverage analysis, and in some cases formal model checking for the most critical state machines. The voting logic is a natural candidate for formal verification because it is small enough to be completely specified and its correctness is critical enough to justify the cost. Tools like SPIN for model checking and Frama-C for C code analysis have seen increasing use in safety-critical aerospace software over the past decade.
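Because a voter's input space is small, its key property can be checked exhaustively. The sketch below is brute-force enumeration in Python, standing in for what a model checker does over a formal model; it is not SPIN or Frama-C usage. The property checked is "if at least two lanes hold the correct value, the voter outputs it", and the function names are hypothetical.

```python
from itertools import product
from typing import Optional, Tuple


def vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three integer outputs."""
    return (a & b) | (a & c) | (b & c)


def check_single_fault_masking(width: int = 3) -> Optional[Tuple[int, int, int]]:
    """Return a counterexample (lane values) if one exists, else None."""
    values = range(1 << width)
    for correct, faulty in product(values, repeat=2):
        for lanes in ((faulty, correct, correct),
                      (correct, faulty, correct),
                      (correct, correct, faulty)):
            if vote(*lanes) != correct:
                return lanes  # property violated for these inputs
    return None


print(check_single_fault_masking())  # None: no counterexample found
```

Enumeration over a 3-bit width covers every combination of per-bit agreement and disagreement, which is why a small specification like this is such a good fit for exhaustive methods.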

Comparison: SpaceX’s Different Bet

SpaceX took a different approach with Dragon. Rather than hardware voting between identical redundant computers, Dragon uses a combination of software-based fault detection, command and data handling computers with independent watchdog timers, and a design philosophy that emphasizes software recovery over hardware elimination of faults. Dragon’s avionics run on x86-based hardware with less aggressive radiation hardening, relying on ECC memory and software restarts to handle SEU events rather than preventing them from affecting the output in real time.
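The watchdog-and-restart pattern mentioned above is generic and can be sketched without reference to any particular vehicle. Nothing here reflects Dragon's actual software, which the article does not detail; the class and function names and the tick-based timing are assumptions.

```python
from typing import Iterable


class Watchdog:
    """Countdown timer that must be kicked regularly to avoid expiring."""

    def __init__(self, timeout_ticks: int) -> None:
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks

    def kick(self) -> None:
        """Reset the countdown; called by a healthy task."""
        self.remaining = self.timeout_ticks

    def tick(self) -> bool:
        """Advance one tick; return True if the timer has expired."""
        self.remaining -= 1
        return self.remaining <= 0


def supervise(healthy_per_tick: Iterable[bool], timeout_ticks: int = 3) -> int:
    """Count the restarts triggered by a sequence of per-tick health flags."""
    dog = Watchdog(timeout_ticks)
    restarts = 0
    for healthy in healthy_per_tick:
        if healthy:
            dog.kick()
        elif dog.tick():
            restarts += 1  # task hung too long: restart it
            dog.kick()     # the restarted task begins with a fresh timer
    return restarts


print(supervise([True, False, False, False, True]))  # 1
```

The trade-off against hardware voting is visible in the loop: a fault is tolerated for up to `timeout_ticks` before recovery begins, rather than being masked in the same cycle it occurs.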

Neither approach is objectively superior. Dragon’s approach is cheaper, easier to update, and leverages commercial hardware and software ecosystems. Orion’s approach provides harder guarantees about in-flight fault handling and is more amenable to formal safety certification under crewed spaceflight regulations. The right choice depends on mission profile, risk tolerance, and the regulatory framework you are operating under.

What This Means Beyond Spacecraft

Most software does not run in the Van Allen belts, and most developers are not building triple-redundant avionics systems. But the design principles underlying Artemis II’s flight computers are not specific to spacecraft.

The value of electrical isolation between redundant channels shows up in datacenter design, where power domains, network paths, and failure domains are separated deliberately. The discipline of deterministic, frame-based execution is what makes real-time systems analyzable; the same discipline, applied less strictly, is what makes microservice latency predictable. The requirement that voting logic be formally specified and verified, rather than tested empirically, is increasingly standard practice in safety-critical automotive and medical device software under standards like ISO 26262 and IEC 62304.

What the Artemis II computers illustrate, beyond the aerospace specifics, is that fault tolerance is not a feature you add to a system after the fact. It is a constraint that shapes every architectural decision from the start, from processor selection to memory organization to scheduling policy to software structure. The computers are reliable because reliability was the first requirement, not a later concern. That order of operations is the lesson that applies regardless of whether your code is going anywhere near the Moon.
