
Consensus Under Radiation: What Artemis II's Fault-Tolerant Computer Had to Get Right

Source: hackernews

The crewed Artemis II mission will carry four astronauts around the Moon, the first time humans have traveled that far from Earth since Apollo 17 in 1972. The ACM piece on how NASA built the fault-tolerant computer for Artemis II is worth reading in full, but the engineering challenge it describes goes deeper than any single article can cover. The core problem is one that distributed systems engineers encounter in abstract form all the time, but in spacecraft it is entirely concrete: how do you make decisions when your hardware might be lying to you, and you cannot reboot your way out of it?

The Environment Makes Normal Assumptions Wrong

In low Earth orbit, Earth’s magnetosphere provides partial shielding, and spacecraft stay below most of the Van Allen radiation belts. On a cislunar trajectory, the spacecraft passes through the belts and then leaves that protection behind. Cosmic rays, solar protons, and heavy ions pass through spacecraft shielding and interact directly with silicon. A high-energy particle strike on a memory cell can flip a stored bit with no software error involved: no stack overflow, no null pointer, just a changed value. This is called a Single Event Upset, or SEU. A more energetic event can cause a Single Event Latchup, where a parasitic current path forms and the device draws excessive current until it is power-cycled or permanently damaged.

Radiation-hardened processors address this at the fabrication level. BAE Systems’ RAD750, a radiation-hardened derivative of the PowerPC 750, runs at up to about 200 MHz and is rated for total radiation doses of 200,000 rads or more. It is the same processor family used in the Mars Curiosity rover, the Mars Reconnaissance Orbiter, and the Kepler space telescope. The trade-off is significant: contemporary commercial processors run an order of magnitude faster or more. You are paying for reliability with performance, and the penalty is steep. For flight software that must execute in hard real-time with deterministic latency, that limited cycle budget shapes how the software is architected.

From Apollo’s Single Computer to the Shuttle’s Voting Architecture

Understanding what NASA built for Artemis II requires understanding what came before. The Apollo Guidance Computer, designed at MIT’s Instrumentation Laboratory, was a single 2.048 MHz machine with 2,048 words of erasable core memory and 36,864 words of fixed rope memory. It was remarkably reliable for its era, but there was no redundant voting mechanism; if the computer failed, you relied on manual procedures and ground-based backup. The AGC’s software was written entirely in assembly, managed by a priority-scheduled real-time executive that could shed lower-priority tasks when resources were exhausted. The famous 1202 alarm during Apollo 11’s lunar descent was that executive doing exactly its job, shedding lower-priority work to keep guidance running.

The Space Shuttle raised the bar substantially. The Shuttle used five IBM AP-101 computers, each executing around 1.2 million instructions per second. Four of them ran identical flight software as a synchronized redundant set and voted on outputs before commanding any actuator. The fifth ran the Backup Flight System, an independent implementation written by a separate team, precisely so that a common software bug could not bring down all five simultaneously. The primary software was written in HAL/S, a structured language specifically designed for avionics, with explicit support for real-time tasking and strict avoidance of dynamic memory allocation. The redundancy management software that coordinated the vote between the four primaries was itself one of the most complex pieces of the system.

Voting Is a Distributed Consensus Problem

The conceptual framework here is familiar from distributed systems: you have multiple nodes, each observing the world and producing outputs, and you need to agree on a single result despite the possibility that some nodes are wrong. In database systems, you call this consensus. In spacecraft avionics, the traditional solution is Triple Modular Redundancy, where three identical computing strings run in parallel and a voter takes the majority result. The voter is simpler than the computers it adjudicates, which means it is less likely to fail, but it remains a single point of failure. Distributed voting, where each string votes independently, addresses this at the cost of additional complexity in synchronization.
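As a minimal sketch of the centralized voter described above, the following takes the outputs of three computing strings and returns the majority value along with the indices of any strings that disagreed. The function name and the sample command values are hypothetical, chosen only for illustration; a flight voter would be far more constrained and would usually live in hardware or dedicated firmware.

```python
from collections import Counter

def majority_vote(outputs):
    """Return (value, dissenters); value is None if no strict majority exists."""
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        # No value is held by more than half the strings: refuse to decide.
        return None, list(range(len(outputs)))
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters

# Three hypothetical strings compute the same command word;
# string 1 has suffered a single bit flip.
commands = [0x3A7, 0x3A7 ^ 0x010, 0x3A7]
voted, suspects = majority_vote(commands)
```

Returning the dissenting indices matters as much as the voted value: it is what lets redundancy management flag a suspect string for resynchronization or removal.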

Fault-tolerance theorists distinguish between fail-stop faults, where a component simply stops producing outputs, and Byzantine faults, where a component produces arbitrary or inconsistent outputs. A three-node majority vote can mask one node that fails silently or that produces a wrong value every observer sees identically. To tolerate one Byzantine fault, you need four nodes, because a Byzantine node can report different plausible-looking values to different peers, and with only three nodes the healthy ones cannot work out which of the others is lying. In spacecraft, cosmic-ray-induced bit flips can produce exactly this kind of Byzantine behavior: a processor that appears to be running correctly but has a corrupted register or memory location.
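The node counts in this paragraph follow from two standard bounds: a simple majority vote needs 2f + 1 replicas to outvote f faulty ones, and Byzantine agreement without signed messages needs 3f + 1. A small sketch of that arithmetic, with hypothetical function names:

```python
def min_nodes_majority(f):
    """Replicas needed so that f wrong-but-consistent outputs are outvoted."""
    return 2 * f + 1

def min_nodes_byzantine(f):
    """Replicas needed to reach agreement with f Byzantine nodes
    when messages are not cryptographically signed."""
    return 3 * f + 1

one_fault = (min_nodes_majority(1), min_nodes_byzantine(1))
two_faults = (min_nodes_majority(2), min_nodes_byzantine(2))
```

For a single fault this gives three and four nodes respectively, matching the figures above. Each additional tolerated Byzantine fault costs three more nodes, which is part of why flight architectures budget their redundancy so carefully.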

For Artemis II, NASA needed to design for the lunar environment, where radiation levels are higher and where any ground-assisted recovery is subject to seconds of round-trip communication delay rather than milliseconds. The architecture described in the ACM article reflects lessons accumulated across decades of flight experience, including failures that ground teams managed to recover from and a few that they did not.

Software That Assumes Its Own Hardware Is Wrong

One of the more interesting aspects of modern spacecraft fault-tolerant software is that it cannot simply trust its own memory. Error-correcting codes, specifically SECDED (Single Error Correct, Double Error Detect) memory controllers, are standard in radiation-hardened designs. These detect and correct single-bit flips in RAM automatically at the hardware level, but typically only when a word is read, so errors in rarely accessed memory can accumulate until a second flip in the same word makes it uncorrectable. The solution is periodic scrubbing: software that reads and rewrites memory to clear accumulated errors before they compound. The processor’s own registers and caches are not always protected the same way, which is one reason redundancy across computing strings is still needed.
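To make SECDED and scrubbing concrete, here is a toy sketch using an extended Hamming(8,4) code: four data bits protected by three parity bits plus one overall parity bit. Real memory controllers use much wider codes and do this in hardware; the function names and the simulated memory below are assumptions for illustration only.

```python
DATA_POSITIONS = (3, 5, 6, 7)  # parity bits sit at 1, 2, 4; overall parity at 0

def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    bits = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    bits[0] = sum(bits[1:]) & 1  # makes the total number of set bits even
    return sum(b << pos for pos, b in enumerate(bits))

def decode(word):
    """Return (nibble, status); nibble is None if the error is uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if (word >> pos) & 1:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    odd_parity = bin(word).count("1") & 1
    if odd_parity:
        # Odd overall parity: treat as a single flip at position `syndrome`
        # (syndrome 0 means the overall parity bit itself was hit).
        word ^= 1 << syndrome
        status = "corrected"
    elif syndrome:
        # Even parity but nonzero syndrome: two flips, detectable only.
        return None, "uncorrectable"
    else:
        status = "ok"
    nibble = sum(((word >> pos) & 1) << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, status

def scrub(memory):
    """Walk memory, rewrite correctable words, count what was found."""
    corrected = uncorrectable = 0
    for addr, word in enumerate(memory):
        nibble, status = decode(word)
        if status == "corrected":
            memory[addr] = encode(nibble)
            corrected += 1
        elif status == "uncorrectable":
            uncorrectable += 1
    return corrected, uncorrectable

memory = [encode(v) for v in range(16)]
memory[5] ^= 1 << 6                # one simulated upset in word 5
memory[9] ^= (1 << 2) | (1 << 7)   # two simulated upsets in word 9
result = scrub(memory)
```

Word 5 is repaired in place, while word 9 can only be reported. That second case is the one scrubbing exists to prevent: run the scrubber often enough and a word rarely collects two upsets between passes.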

The flight software for Orion was built on NASA’s Core Flight System, known as cFS, an open-source reusable framework developed at NASA Goddard. The cFS architecture separates platform services from mission applications through a publish-subscribe message bus. Applications communicate by publishing messages on a shared software bus, identified by message ID, rather than calling each other directly. This decoupling makes it possible to run independent application instances, compare their outputs, and fail one over without the rest of the system needing to know about the internal structure of the failed application. It is the same architectural instinct behind microservices, applied to embedded real-time avionics.
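The publish-subscribe decoupling can be sketched in a few lines. This is not the cFS API, which is a C interface with its own message and delivery abstractions; the class, the message ID, and the payload below are all hypothetical and exist only to show the pattern.

```python
from collections import defaultdict

class SoftwareBus:
    """Toy message bus: publishers and subscribers know only message IDs."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, msg_id, handler):
        self._subscribers[msg_id].append(handler)

    def publish(self, msg_id, payload):
        # Deliver to every subscriber in subscription order.
        for handler in self._subscribers[msg_id]:
            handler(payload)

NAV_STATE_MID = 0x0801  # hypothetical message ID

bus = SoftwareBus()
received = []
bus.subscribe(NAV_STATE_MID, lambda p: received.append(("guidance", p)))
bus.subscribe(NAV_STATE_MID, lambda p: received.append(("telemetry", p)))
bus.publish(NAV_STATE_MID, {"altitude_km": 100.0})
```

The navigation publisher never references guidance or telemetry directly, so a failed subscriber can be replaced by another instance that subscribes to the same ID without touching the publisher.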

Flight-critical code in NASA programs typically follows the NASA/JPL Power of Ten rules, a set of ten coding constraints developed by Gerard Holzmann. Among them: no dynamic memory allocation after initialization, no recursion, fixed upper bounds on all loops, and restricted pointer use with no function pointers. These constraints exist because they make the software statically analyzable: you can reason about memory usage, stack depth, and control flow without running the program, which matters when your test environment cannot fully replicate deep space radiation.

The Testing Problem No Lab Can Fully Solve

You cannot verify a fault-tolerant architecture without injecting faults. NASA and contractors use fault injection campaigns where testers deliberately introduce hardware failures, corrupt memory, kill computing strings, and observe whether the voting and reconfiguration logic responds correctly. This is expensive and time-consuming but unavoidable: the entire value of redundancy is behavior under failure, and behavior under failure is exactly what routine testing skips.
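A software-only fault injection campaign can be sketched as a loop that corrupts replica outputs and checks whether the voter still produces the known-good answer. Everything here is a stand-in: `control_law` is a hypothetical computation, the faults are single bit flips in outputs rather than in real hardware state, and a seeded generator keeps the campaign repeatable.

```python
import random
from collections import Counter

def control_law(x):
    # Hypothetical stand-in for a flight computation on a 16-bit value.
    return (3 * x + 7) & 0xFFFF

def vote(outputs):
    """Strict majority of the outputs, or None if there is none."""
    value, votes = Counter(outputs).most_common(1)[0]
    return value if votes * 2 > len(outputs) else None

def run_campaign(trials, faults_per_trial, seed=0):
    """Count trials in which the voted output matched the golden value."""
    rng = random.Random(seed)
    masked = 0
    for _ in range(trials):
        golden = control_law(rng.randrange(1 << 16))
        outputs = [golden] * 3
        # Flip one random bit in each of the chosen strings.
        for string in rng.sample(range(3), faults_per_trial):
            outputs[string] ^= 1 << rng.randrange(16)
        if vote(outputs) == golden:
            masked += 1
    return masked

single = run_campaign(1000, 1)
double = run_campaign(1000, 2)
```

With one faulty string, every trial is masked. With two, none are: the voter either finds no majority or, when both strings flip the same bit, confidently returns the wrong value. That second outcome is the kind of behavior a campaign is meant to surface before flight.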

Radiation testing uses particle accelerators to bombard components with heavy ions and protons, characterizing the SEU cross-section of each chip, which quantifies the probability of a bit flip per unit of particle fluence. This data feeds mission planning models that estimate how many SEUs the system is likely to experience over the course of the mission, which in turn determines how often scrubbing cycles need to run and what error rates the fault-tolerance logic must handle.
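The arithmetic behind such an estimate is simple in outline: expected upsets are the per-bit cross-section times the particle flux times the number of bits times the exposure time. The sketch below uses entirely hypothetical numbers, and it collapses the particle energy spectrum into one effective cross-section and one average flux, which real mission models do not do.

```python
def expected_upsets(cross_section_cm2_per_bit, flux_per_cm2_s, n_bits, seconds):
    """Expected SEU count under a constant flux and a single effective cross-section."""
    return cross_section_cm2_per_bit * flux_per_cm2_s * n_bits * seconds

# All values below are illustrative assumptions, not measured data.
SIGMA = 1e-14            # cm^2 per bit
FLUX = 5.0               # particles per cm^2 per second
BITS = 8 * 2**30         # 1 GiB of memory
MISSION_S = 10 * 86400   # a ten-day mission

upsets = expected_upsets(SIGMA, FLUX, BITS, MISSION_S)
mean_seconds_between_upsets = MISSION_S / upsets
```

Under these assumed inputs the model predicts a few hundred upsets over the mission, or roughly one every forty minutes. A scrub interval would then be chosen so that a full pass over memory completes well inside that window.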

The gap between ground testing and flight is real and permanent. You can characterize components. You can simulate environments. You can inject known fault patterns. What you cannot do is run the actual mission in advance. The history of spacecraft anomalies is full of events that no one anticipated during testing because the specific combination of conditions had never been modeled. The Artemis II architecture, like every spacecraft before it, is a bet that the designers have correctly identified the failure modes that matter and have provided adequate margins for the ones they have not thought of yet.

The ACM article represents a rare public technical accounting of how this engineering actually works. Most of it happens quietly inside contractor facilities and NASA centers, producing reports that are never widely read. When it surfaces, it is a reminder that the hardest part of putting people in deep space is not thrust or trajectory; it is building computers that can still tell the truth after weeks of cosmic punishment.
