
Redundancy Is Not Reliability: The Layered Engineering Behind Artemis II's Flight Computers

Source: hackernews

When Artemis II carries four crew members around the Moon, probably sometime in 2026, the flight computers keeping them alive will have been engineered to a level of rigor that most software developers never encounter. The ACM piece on how NASA built Artemis II’s fault-tolerant computer gives a rare window into that process, and it is worth unpacking what makes spacecraft computing so different from everything else.

The short version: redundancy alone does not give you reliability. What it gives you is the possibility of reliability, if you build every other layer correctly.

The Radiation Problem Is Worse Than You Think

The Orion spacecraft does not stay in low Earth orbit. Artemis II flies a free-return trajectory around the Moon, which means the crew spends days traveling through regions far beyond the Van Allen belts, where Earth’s magnetosphere no longer provides meaningful shielding. The radiation environment there is qualitatively different from what the International Space Station experiences.

Cosmic rays and energetic solar particles cause two categories of problems for electronics. Total ionizing dose (TID) accumulates over time and degrades semiconductor junctions, eventually causing permanent failures. Single event upsets (SEUs) are more insidious: a single high-energy particle passes through a memory cell or logic gate and flips a bit, right now, with no warning.

An SEU in a register during a guidance computation can corrupt a trajectory calculation. An SEU in a control flow variable can redirect execution to the wrong branch. Ordinary software has no defense against this because ordinary software assumes the hardware is deterministic. Spacecraft software cannot make that assumption.
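To make the guidance example concrete, here is a minimal sketch of what one flipped bit does to a number stored as a 64-bit IEEE 754 double. The `flip_bit` helper and the velocity value are hypothetical, chosen only for illustration; nothing here is taken from Orion's flight software.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 double encoding inverted."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in 0..63")
    # Reinterpret the double as a 64-bit unsigned integer, flip, and convert back.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


velocity = 7800.0  # hypothetical value in m/s, for illustration only

low = flip_bit(velocity, 0)    # least significant mantissa bit
high = flip_bit(velocity, 62)  # most significant exponent bit
sign = flip_bit(velocity, 63)  # sign bit

for label, result in [("bit 0", low), ("bit 62", high), ("bit 63", sign)]:
    print(f"{label}: {velocity!r} -> {result!r}")
```

Which bit is hit matters enormously. A flip in the lowest mantissa bit moves this value by about a trillionth, a flip in the sign bit reverses it, and a flip in the top exponent bit collapses it to effectively zero.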

This is why Orion’s flight computers use radiation-hardened processors, specifically variants built on the PowerPC architecture hardened against TID and SEU effects. The BAE Systems RAD750, which has been the workhorse of NASA’s deep-space missions for two decades, is rated for up to 1 million rads of total ionizing dose in its most hardened version and tolerates SEU rates that would crash or corrupt commercial processors within hours. It tops out at about 200 MHz, which sounds embarrassingly slow by consumer standards, but raw speed is not the design constraint. Predictability, determinism, and survivability are.

Even hardened processors need ECC memory. Error-correcting code memory uses Hamming codes or similar schemes to correct single-bit flips and to detect (but not correct) two-bit flips, a property usually abbreviated SECDED. The hardware does this transparently, and it periodically scrubs memory so that an isolated soft error is repaired before a second flip in the same word makes it uncorrectable.
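The correct-one, detect-two behavior is easy to see in a toy version. The sketch below implements an extended Hamming(8,4) code over 4-bit values, plus a simple scrubbing pass. Real ECC memory protects much wider words in hardware; the word size, the function names (`encode`, `decode`, `scrub`), and the list-of-bits representation are all assumptions made for readability.

```python
def encode(nibble: int) -> list:
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming positions, with parity at 1, 2, 4 and data at 3, 5, 6, 7.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1s even
    return code


def decode(code: list):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    parity_odd = sum(code) % 2 == 1
    word = list(code)
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd number of flips: treat as a single error at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even parity: two bits flipped.
        return None, "uncorrectable"
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return nibble, status


def scrub(memory: list) -> int:
    """Rewrite every correctable word in place; return how many were repaired."""
    repaired = 0
    for i, word in enumerate(memory):
        nibble, status = decode(word)
        if status == "corrected":
            memory[i] = encode(nibble)
            repaired += 1
    return repaired
```

Scrubbing matters because of the last branch in `decode`: a word with one flipped bit is recoverable, but if it sits unrepaired until a second particle hits the same word, the data is gone.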

Three Computers Voting on Reality

Hardware hardening reduces the SEU rate but does not eliminate it. So Orion’s avionics use triple modular redundancy: three independent flight computers running identical software on identical inputs, with a voting circuit comparing outputs before any command is executed.

If all three agree, the command goes through. If one disagrees with the other two, the majority wins and the disagreeing computer is flagged as potentially faulty. This architecture tolerates a single complete computer failure with no loss of function, and it detects faults that hardened hardware alone would miss.
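In software terms, the voting rule is only a few comparisons. The sketch below is a minimal illustration, assuming each computer's output can be represented as a single integer command word; the `vote` function and its return convention are hypothetical, and a real voter is implemented in hardware or in carefully verified low-level code.

```python
from typing import Optional, Sequence, Tuple


def vote(outputs: Sequence[int]) -> Tuple[Optional[int], Optional[int]]:
    """Majority-vote three command words.

    Returns (command, suspect_index). `suspect_index` is None when all
    three agree. `command` is None when no two outputs agree, in which
    case there is no majority and nothing should be executed.
    """
    if len(outputs) != 3:
        raise ValueError("exactly three outputs are required")
    a, b, c = outputs
    if a == b == c:
        return a, None
    if a == b:
        return a, 2
    if a == c:
        return a, 1
    if b == c:
        return b, 0
    return None, None


print(vote([0x2A, 0x2A, 0x2A]))  # unanimous
print(vote([0x2A, 0x6A, 0x2A]))  # computer 1 disagrees and is flagged
print(vote([0x01, 0x02, 0x03]))  # no majority
```

Note the third case. Voting tolerates one faulty computer; it does not help when two disagree with each other and with the third, which is why the layers below and above the voter still matter.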

The subtlety is in the word “identical.” For voting to work correctly, the three computers must produce bit-for-bit identical outputs from identical inputs at identical times. This means the software must be deterministic, the clocks must be synchronized, and the inputs must be distributed identically. Any nondeterminism, any timing jitter that causes one computer to read a sensor at a slightly different moment, can produce false disagreements that degrade the system’s ability to identify real faults.
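The timing-jitter problem can be shown with a toy model. In the sketch below, `sensor` is a hypothetical time-varying reading, and the jitter values are made up. The first function lets each computer sample the sensor on its own; the second latches one reading and hands the same value to all three.

```python
import math


def sensor(t: float) -> int:
    """Hypothetical sensor: a quantized reading that changes over time."""
    return int(1000 * math.sin(t))


def sample_independently(t: float, jitter: list) -> list:
    """Each computer reads the sensor itself, at its own slightly offset time."""
    return [sensor(t + offset) for offset in jitter]


def sample_latched(t: float, copies: int = 3) -> list:
    """One reading is latched once and the same value goes to every computer."""
    reading = sensor(t)
    return [reading] * copies


# The third computer reads 10 ms late (illustrative number only).
independent = sample_independently(1.0, [0.0, 0.0, 0.01])
latched = sample_latched(1.0)

print("independent:", independent)
print("latched:    ", latched)
```

With independent sampling, the late computer sees a different input and will compute a different output even though nothing is broken. A voter would flag it as faulty. Latching the input removes that source of false disagreement.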

This is one reason spacecraft flight software looks nothing like the software most engineers write day-to-day. It avoids dynamic memory allocation almost entirely. Interrupts are managed carefully. Execution timing is bounded and verified. The software is written largely in Ada or a restricted subset of C, with static analysis tools verifying properties that cannot be checked at runtime. Certification to DO-178C Level A, the most stringent avionics software standard, requires demonstrating that the code traces back to requirements, down to the object-code level, and that tests achieve modified condition/decision coverage (MC/DC), which is stricter than simply exercising every decision branch.

FDIR: The Software Side of Fault Tolerance

Triple modular redundancy (TMR) handles the hardware layer. Fault Detection, Isolation, and Recovery (FDIR) handles the software and system layer. Where TMR asks “did this computer produce the wrong answer?”, FDIR asks “is this component behaving correctly within the system?”

FDIR is a layered decision hierarchy. At the lowest level, hardware monitors detect out-of-range sensor values, overcurrent conditions, and communication timeouts. The flight software above that interprets patterns of lower-level faults, identifies the likely failing component, and executes a predefined recovery procedure: switching to a redundant unit, reconfiguring a data bus, or entering a safe mode.
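That hierarchy can be sketched as two small functions: one that detects raw limit violations, and one that interprets the pattern and picks a recovery. Everything in the sketch is an assumption made for illustration, including the telemetry fields, the unit names in the usage note, and the isolation rules themselves; real FDIR tables are mission-specific and far larger.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    NONE = auto()
    SWITCH_TO_REDUNDANT = auto()
    RECONFIGURE_BUS = auto()
    SAFE_MODE = auto()


@dataclass(frozen=True)
class Fault:
    unit: str
    kind: str  # "timeout" or "out_of_range"


def detect(telemetry: dict) -> list:
    """Lowest layer: flag stale data and out-of-range values per unit."""
    faults = []
    for unit, sample in telemetry.items():
        if sample["age_s"] > sample["timeout_s"]:
            faults.append(Fault(unit, "timeout"))
            continue  # stale data: do not also judge its value
        low, high = sample["limits"]
        if not low <= sample["value"] <= high:
            faults.append(Fault(unit, "out_of_range"))
    return faults


def isolate_and_recover(faults: list, bus_of: dict, spares: set) -> Action:
    """Upper layer: interpret the fault pattern and choose a predefined action."""
    if not faults:
        return Action.NONE
    buses = {bus_of[f.unit] for f in faults}
    all_timeouts = all(f.kind == "timeout" for f in faults)
    # Several units going silent on one bus points at the bus, not the units.
    if len(faults) >= 2 and len(buses) == 1 and all_timeouts:
        return Action.RECONFIGURE_BUS
    # A single misbehaving unit with a spare available: swap it out.
    if len(faults) == 1 and faults[0].unit in spares:
        return Action.SWITCH_TO_REDUNDANT
    # Anything outside the decision logic falls back to safe mode.
    return Action.SAFE_MODE
```

For example, if hypothetical units `imu_a` and `star_tracker` share a bus and both time out at once, this logic reconfigures the bus rather than declaring two independent failures. The final fallback is the important design choice: when the pattern is not one the logic recognizes, it does not guess.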

Safe mode is the spacecraft equivalent of a circuit breaker. When FDIR cannot isolate and recover from a fault automatically, or when conditions exceed the scope of its decision logic, the spacecraft transitions to a minimal-power, minimal-activity state where it can survive while ground controllers analyze the situation. Safe mode design is an exercise in asking what the spacecraft needs to keep doing even if almost everything else is wrong: maintain attitude control, maintain thermal limits, maintain communication. Everything else waits.

For Artemis II, safe mode has a harder constraint than for a robotic mission. The crew is on board. Life support continues regardless. The FDIR design has to account for situations where the crew’s autonomous intervention is both possible and preferable to waiting for ground commands, given the communication delays and the fact that the humans on the spacecraft have information the ground does not.

Testing What You Cannot Reproduce

The hardest part of building these systems is that you cannot fully test them on Earth. You can irradiate components in a particle accelerator to simulate TID and measure SEU cross-sections. You can inject faults into running software to verify that FDIR responds correctly. You can run the voting computers in a hardware-in-the-loop simulation for thousands of hours. What you cannot do is reproduce cislunar space in a laboratory.
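Software fault injection, at least, is simple to illustrate. The sketch below runs a small campaign: it computes the same value on three simulated computers, flips one random bit in the output of one or two of them, and records whether a majority vote still yields the correct answer. The `compute` function is a stand-in for a deterministic flight computation, and all names and parameters are hypothetical.

```python
import random
from collections import Counter


def compute(x: int) -> int:
    """Stand-in for a deterministic 16-bit flight computation (hypothetical)."""
    return (x * 3 + 7) & 0xFFFF


def majority(outputs: list):
    """Return the value at least two outputs agree on, or None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= 2 else None


def run_campaign(trials: int, faults: int = 1, seed: int = 0) -> dict:
    """Inject `faults` single-bit upsets per trial into distinct computers."""
    rng = random.Random(seed)  # seeded so the campaign is repeatable
    results = {"masked": 0, "wrong": 0, "no_majority": 0}
    for _ in range(trials):
        x = rng.randrange(1 << 16)
        expected = compute(x)
        outputs = [compute(x) for _ in range(3)]
        for victim in rng.sample(range(3), faults):
            outputs[victim] ^= 1 << rng.randrange(16)  # flip one output bit
        voted = majority(outputs)
        if voted == expected:
            results["masked"] += 1
        elif voted is None:
            results["no_majority"] += 1
        else:
            results["wrong"] += 1
    return results


print("one fault: ", run_campaign(1000, faults=1))
print("two faults:", run_campaign(1000, faults=2))
```

With one injected fault per trial, the vote masks every upset. With two, it never does: either all three outputs differ, or the two corrupted outputs happen to match and outvote the healthy computer. A campaign like this confirms the design's stated limits; it says nothing about how often two simultaneous upsets occur in cislunar space, which is the part a laboratory cannot answer.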

NASA’s approach is verification through analysis and heritage. If a component or software module has flown on previous missions, its behavior in the radiation environment is known empirically. The Space Shuttle’s IBM AP-101 flight computers flew on all 135 Shuttle missions; the knowledge accumulated from that fleet was invaluable for subsequent programs. Artemis draws on that Shuttle heritage and on Orion’s uncrewed Artemis I test flight in 2022, which was partly designed to validate the avionics in a real deep-space environment.

Artemis I’s 25-day mission exposed Orion’s systems to the actual cislunar radiation environment and confirmed that the computers performed within predictions. That data closes one of the gaps that purely ground-based testing cannot fill.

What This Engineering Discipline Costs

Building software to DO-178C Level A costs roughly 10 to 100 times what building equivalent commercial software costs, depending on how you measure. Static analysis, formal verification, full branch coverage testing, requirements tracing, independent verification and validation, configuration management down to the bit level: all of it takes time and specialized expertise that the commercial software industry rarely needs to employ.

There is a temptation, in an era when SpaceX is demonstrating that faster and cheaper iteration works for launch vehicles, to ask whether this level of rigor is still necessary for spacecraft avionics. The answer is that the two are not directly comparable. Launch vehicles can tolerate higher failure rates because they are designed with abort systems and, for cargo missions, because the cost model of occasional failure is acceptable. A crewed deep-space vehicle with no realistic abort option beyond a specific window cannot make that trade. The fault-tolerant computer is not over-engineered; it is engineered to match the actual failure cost.

The engineering behind Artemis II’s flight computers is the accumulated knowledge of sixty years of spaceflight, distilled into silicon, software, and voting logic. Every layer, from ECC memory to FDIR to triple modular redundancy, exists because some earlier mission taught NASA what happens when that layer is absent. That is a hard way to build a knowledge base, but in this domain, it is the only way to know if the design is actually right.
