
The Computer That Cannot Afford to Fail: Engineering Fault Tolerance for Artemis II

Source: hackernews

There is a class of software engineering problem where the usual fallback strategies dissolve completely. You cannot roll back a deploy. You cannot page someone on-call. You cannot restart the process and hope the state recovers. The fault-tolerant computer aboard NASA’s Artemis II Orion spacecraft, detailed in a recent CACM piece, sits squarely in this category. Astronauts will be 240,000 miles from the nearest repair depot, passing through the Van Allen belts and into deep space, relying on a system that has to reason about its own failures in real time.

The engineering choices NASA made are worth understanding, not just because space computing is impressive, but because the constraints are so extreme that every trade-off is fully exposed.

Fault Avoidance Is Not Enough

The first instinct in safety-critical systems is to build hardware that simply does not fail. Use rad-hard components. Screen every part. Qualify every connector. This is fault avoidance, and it is necessary but insufficient. Deep space is an adversarial radiation environment in a way that most engineers never encounter in terrestrial systems.

Beyond Earth’s magnetosphere, the spacecraft encounters galactic cosmic rays and solar energetic particles, on top of the trapped radiation it passes through during Van Allen belt transit. Any of these can cause a Single Event Upset (SEU), where a high-energy particle flips a bit in memory or a register. Radiation-hardened processors, like the BAE Systems RAD5545 built for modern space avionics, are designed to resist this, but resistance is not immunity. An SEU that hits a register mid-computation can produce a wrong answer from an otherwise healthy processor. The system keeps running normally while producing incorrect outputs. This belongs to the broad class of arbitrary failures usually called Byzantine faults, in which a component keeps running but can no longer be trusted, and it is the hardest kind to handle.
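To make the stakes of a single flipped bit concrete, here is a minimal Python sketch that inverts one bit in the IEEE-754 encoding of a floating-point value. The helper name `flip_bit` and the `altitude_km` value are hypothetical, chosen only for illustration; this simulates the effect of an upset on data, not the radiation physics or any actual Orion data format.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 64-bit IEEE-754 encoding inverted."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in 0..63")
    # Reinterpret the double as an unsigned 64-bit integer, flip, reinterpret back.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


altitude_km = 400.0  # hypothetical sensor reading

print(flip_bit(altitude_km, 0))   # lowest mantissa bit: a negligible change
print(flip_bit(altitude_km, 62))  # top exponent bit: the value collapses toward zero
print(flip_bit(altitude_km, 63))  # sign bit: same magnitude, opposite sign
```

The processor that computed with the corrupted value would raise no exception and report no error, which is why detection has to come from somewhere other than the faulty unit itself.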

Silent crashes are simple to detect: the component stops responding, and you cut it out. Byzantine failures require that you have something to compare against.

Voting and Lockstep

The standard architecture for masking such faults in flight computers is redundant lockstep execution with a voting mechanism. The Orion avionics architecture uses multiple computers running the same software on the same inputs in parallel. At defined points, the computers compare outputs. If one disagrees with the others, it gets flagged. With three computers, a simple majority vote can identify and isolate a single faulty output, since two correct computers will agree and one faulty one will not. This assumes that every channel and the voter see the same values; tolerating a computer that reports different values to different peers takes at least four. With four, the system can lose one computer and still have three that agree, so it keeps the ability to outvote a second fault instead of being reduced to a pair that can detect a disagreement but cannot tell which side is wrong.
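The voting step can be sketched in a few lines of Python. This is a minimal illustration, not Orion's redundancy management: the function name `majority_vote` is hypothetical, it assumes a single trusted voter, and it assumes channels are deterministic enough that outputs can be compared for exact equality (real systems may compare within a tolerance).

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(
    outputs: Sequence[Hashable],
) -> tuple[Optional[Hashable], list[int]]:
    """Return (voted value, indices of dissenting channels).

    The voted value is None when no strict majority exists, in which case
    every channel is reported as suspect.
    """
    if not outputs:
        raise ValueError("at least one channel output is required")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


# Three channels, one faulty: the fault is masked and channel 2 is flagged.
print(majority_vote([42, 42, 41]))

# Four channels reduced to a 2-2 split: disagreement is detected,
# but the voter cannot tell which pair is wrong.
print(majority_vote([7, 7, 9, 9]))
```

The second call shows why a pair of surviving computers can only detect a fault: without a majority there is nothing to say which answer is the correct one.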

The Space Shuttle set an early precedent for this approach. Its five IBM AP-101 computers, four running in lockstep with a fifth on standby running independently developed software, established the template that influenced everything after it. The independent software for the backup system was a direct countermeasure against a specific threat: a common-mode software fault that could cause all primary computers to fail in the same way simultaneously.

Artemis II inherits this philosophy but on modern hardware. Lockheed Martin, Orion’s prime contractor, and Honeywell, which supplies its flight computers, build on decades of fault-tolerant avionics work. The communication fabric between redundant units has to be deterministic: Orion’s onboard data network uses Time-Triggered Ethernet, which, like the older MIL-STD-1553 serial bus developed for military aviation, bounds message latency. That matters because voting only works if you can synchronize comparison points reliably.

Memory Scrubbing and Transient Faults

Voting at output boundaries catches faults in computation, but SEUs in memory that have not yet affected an output need a different mechanism. Memory scrubbing is the continuous background process of reading memory, detecting errors through ECC (Error-Correcting Code), and rewriting corrected values before a second bit flip in the same word turns a correctable single-bit error into an uncorrectable multi-bit one.
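A scrub pass can be sketched with a toy error-correcting code. The Python below uses an extended Hamming code that protects 4 data bits with 4 check bits, which is an assumption made to keep the example small; flight memory typically protects much wider words. The function names `encode`, `decode`, and `scrub` are hypothetical.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    if not 0 <= nibble < 16:
        raise ValueError("nibble must be in 0..15")
    bits = [0] * 8  # bits[1..7]: Hamming positions, bits[0]: overall parity
    bits[3], bits[5], bits[6], bits[7] = ((nibble >> i) & 1 for i in range(4))
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    bits[0] = sum(bits[1:]) & 1
    return sum(b << i for i, b in enumerate(bits))


def decode(word: int) -> tuple[str, int, int]:
    """Return (status, repaired codeword, data nibble) for a stored word."""
    bits = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    if sum(bits) & 1:
        # Odd overall parity: treated as a single flipped bit at `syndrome`
        # (position 0 is the overall parity bit itself). Three or more flips
        # would be miscorrected here, which is a limit of SECDED codes.
        bits[syndrome] ^= 1
        status = "corrected"
    elif syndrome:
        status = "uncorrectable"  # even parity but nonzero syndrome: two flips
    else:
        status = "ok"
    fixed = sum(b << i for i, b in enumerate(bits))
    nibble = bits[3] | bits[5] << 1 | bits[6] << 2 | bits[7] << 3
    return status, fixed, nibble


def scrub(memory: list[int]) -> dict[str, int]:
    """Walk memory once, rewriting any word with a correctable error."""
    counts = {"ok": 0, "corrected": 0, "uncorrectable": 0}
    for addr, word in enumerate(memory):
        status, fixed, _ = decode(word)
        counts[status] += 1
        if status == "corrected":
            memory[addr] = fixed
    return counts


memory = [encode(n) for n in range(16)]
memory[3] ^= 1 << 5               # one upset: repaired by the next scrub pass
memory[9] ^= (1 << 1) | (1 << 6)  # two upsets in one word: detected only
print(scrub(memory))
```

The word at address 9 is exactly the case the scrubber exists to prevent: had a pass run between the first and second flips, the first would have been repaired and the second would have been an ordinary correctable error.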

ECC memory can typically correct one-bit errors and detect two-bit errors. In a low-radiation environment, the accumulation of two errors in the same word before a scrub pass is unlikely. In deep space, the flux is high enough that it becomes a real concern, which is why scrub rates are tuned to the expected radiation environment for a given mission phase. During Van Allen belt transit, the scrub rate goes up.
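The relationship between scrub interval and risk can be estimated with a short calculation. The sketch below assumes upsets arrive independently as a Poisson process, and the rate, word width, and intervals are hypothetical placeholders rather than figures for any real mission. It counts any two upsets in the same word, including the rare case of the same bit being hit twice, so it slightly overestimates.

```python
import math


def p_multi_upset(
    upset_rate_per_bit_s: float, bits_per_word: int, scrub_interval_s: float
) -> float:
    """Probability that one word collects two or more upsets between scrubs.

    With a Poisson mean of lam upsets per word per interval:
    P(>= 2) = 1 - exp(-lam) - lam * exp(-lam).
    expm1 keeps the result accurate when lam is very small.
    """
    lam = upset_rate_per_bit_s * bits_per_word * scrub_interval_s
    return -math.expm1(-lam) - lam * math.exp(-lam)


RATE = 1e-9     # hypothetical upsets per bit per second
WORD_BITS = 72  # hypothetical: 64 data bits plus 8 check bits

hourly = p_multi_upset(RATE, WORD_BITS, 3600.0)
half_hourly = p_multi_upset(RATE, WORD_BITS, 1800.0)
print(f"scrub every 60 min: {hourly:.3e} per word")
print(f"scrub every 30 min: {half_hourly:.3e} per word")
print(f"ratio: {hourly / half_hourly:.2f}")
```

For small rates the probability grows roughly with the square of the interval, so halving the time between scrub passes cuts the chance of an uncorrectable word by about a factor of four. That quadratic payoff is the reason to scrub faster when the flux rises.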

This is one of the places where space software engineering diverges sharply from embedded systems work elsewhere. The background scrubber is not a performance optimization or a reliability nicety. It is a primary safety mechanism that has to run at a precise rate and cannot be starved by other workloads. The scheduler design for a space RTOS reflects this: rate-monotonic scheduling or similar fixed-priority schemes give safety engineers formal tools to prove that high-priority tasks including scrubbers will always meet their deadlines.
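The kind of proof that fixed-priority scheduling allows can be sketched as follows. This Python example applies two standard checks for rate-monotonic priorities with deadlines equal to periods: the Liu and Layland utilization bound, which is sufficient but not necessary, and an iterative worst-case response-time analysis. The task set, including its names and timings, is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    name: str
    wcet_ms: int    # worst-case execution time
    period_ms: int  # period, also used as the deadline


def utilization_bound_ok(tasks: list[Task]) -> bool:
    """Sufficient test: total utilization <= n * (2^(1/n) - 1)."""
    n = len(tasks)
    utilization = sum(t.wcet_ms / t.period_ms for t in tasks)
    return utilization <= n * (2 ** (1 / n) - 1)


def response_times(tasks: list[Task]) -> dict[str, Optional[int]]:
    """Worst-case response time per task, or None if its deadline is missed."""
    ordered = sorted(tasks, key=lambda t: t.period_ms)  # shortest period first
    result: dict[str, Optional[int]] = {}
    for i, task in enumerate(ordered):
        r = task.wcet_ms
        while True:
            # Preemption by every higher-priority task released within r.
            interference = sum(
                -(-r // hp.period_ms) * hp.wcet_ms for hp in ordered[:i]
            )
            nxt = task.wcet_ms + interference
            if nxt > task.period_ms:
                result[task.name] = None
                break
            if nxt == r:  # fixed point reached
                result[task.name] = r
                break
            r = nxt
    return result


tasks = [
    Task("scrubber", wcet_ms=2, period_ms=10),
    Task("guidance", wcet_ms=5, period_ms=20),
    Task("telemetry", wcet_ms=10, period_ms=50),
]
print(utilization_bound_ok(tasks))
print(response_times(tasks))
```

In this made-up task set the scrubber has the shortest period and therefore the highest priority, so nothing can preempt it and its response time equals its execution time. The analysis turns "cannot be starved" from a hope into a checked property.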

Software Certification and the DO-178C Parallel

The software that runs Orion’s flight computers is developed under NASA’s software safety standards, including NASA-STD-8739.8, which requires rigorous traceability from requirements through code to test. The process is comparable to aviation’s DO-178C Level A, which governs software whose failure would cause catastrophic consequences.

At Level A, every requirement must be traced to tests, and those tests must achieve modified condition/decision coverage (MC/DC) of the code, a criterion that requires each boolean condition in every decision to be shown to independently affect the outcome. MC/DC was developed for avionics software because it catches classes of logic errors that statement and branch coverage miss. For a codebase controlling life support, propulsion, and guidance simultaneously, the test matrix is enormous.
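What "independently affect the outcome" means is easiest to see on a small decision. The Python sketch below enumerates, for each condition, the pairs of test vectors that differ only in that condition and produce different outcomes, then checks whether a test set contains at least one such pair per condition (the unique-cause form of MC/DC). The decision `abort_decision` and its parameter names are hypothetical, not taken from any flight software.

```python
from itertools import product
from typing import Callable, Iterable

Vector = tuple[bool, ...]


def independence_pairs(
    decision: Callable[..., bool], n_conditions: int
) -> dict[int, list[tuple[Vector, Vector]]]:
    """For each condition, list vector pairs that differ only in that
    condition and change the decision's outcome."""
    pairs: dict[int, list[tuple[Vector, Vector]]] = {
        i: [] for i in range(n_conditions)
    }
    for vec in product([False, True], repeat=n_conditions):
        for i in range(n_conditions):
            if vec[i]:
                continue  # visit each pair once, from its False side
            other = vec[:i] + (True,) + vec[i + 1:]
            if decision(*vec) != decision(*other):
                pairs[i].append((vec, other))
    return pairs


def achieves_mcdc(
    decision: Callable[..., bool], n_conditions: int, tests: Iterable[Vector]
) -> bool:
    """True if the tests include an independence pair for every condition."""
    test_set = set(tests)
    pairs = independence_pairs(decision, n_conditions)
    return all(
        any(a in test_set and b in test_set for a, b in pairs[i])
        for i in range(n_conditions)
    )


def abort_decision(pressure_low: bool, sensor_valid: bool, manual: bool) -> bool:
    # Hypothetical abort logic, for illustration only.
    return (pressure_low and sensor_valid) or manual


branch_only = [(True, True, False), (False, False, False)]
mcdc_set = [
    (False, True, False),
    (True, True, False),
    (True, False, False),
    (True, False, True),
]
print(achieves_mcdc(abort_decision, 3, branch_only))
print(achieves_mcdc(abort_decision, 3, mcdc_set))
```

The two-test set drives the decision both true and false, which satisfies branch coverage, yet it never shows that `sensor_valid` or `manual` matters on its own. The four-test set does, using one more test than there are conditions, which is the usual minimum for MC/DC.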

Many space and avionics programs use Ada for flight-critical code. Ada’s language-level support for tasking, protected objects, and strong typing rules out or reduces entire categories of concurrency and memory bugs that are common in C. The Ravenscar profile for Ada restricts the language further to a deterministic, formally analyzable subset suitable for high-integrity real-time systems. When you need to formally prove that your scheduler meets timing constraints, having a language that maps cleanly to formal models is a genuine engineering advantage.

What Orion’s Architecture Reveals About Reliability

The architecture of a fault-tolerant space computer is a layered response to a threat model. Radiation hardening reduces the fault rate at the hardware level. ECC and scrubbing handle transient bit flips at the memory level. Lockstep voting handles Byzantine failures at the computation level. Redundant buses handle communication failures. Independent software development handles common-mode software faults. Each layer addresses the failure modes that the layer below cannot prevent.

This layering is the distinguishing feature of systems where failure consequences are irreversible. In a web service, you can absorb a percentage of errors and compensate with retries and circuit breakers. In Orion, the system has to maintain mission capability through any single failure and many combinations of multiple failures, not by hiding errors from users, but by detecting, isolating, and reconfiguring around them automatically before the crew or ground control needs to intervene.

For Artemis II specifically, the stakes are clear. The mission will carry four astronauts on a free-return trajectory around the Moon. The computers controlling Orion’s systems have to work, because there is no backup plan outside them: the backup plan is the computer.

The engineering discipline that produces this kind of reliability differs in character from most software development. It is not faster or slower, but more formal, more adversarial in its assumptions about what can go wrong, and more rigorous about proving that the mitigations actually work. The CACM piece on Artemis II’s fault-tolerant computer is a window into what that discipline looks like when it is applied to hardware that astronauts will stake their lives on.
