The Engineering Discipline Behind a Computer That Cannot Fail

There is a category of software and hardware engineering where the failure modes are not bugs to be fixed in a patch but events that end human lives. The flight computer aboard Orion for the Artemis II mission sits squarely in that category. The CACM writeup on how NASA built it is worth reading alongside the background that makes the design decisions legible.

Artemis II will carry four astronauts on a free-return trajectory around the Moon, roughly a ten-day mission with no resupply and no quick abort back to the ground past a certain point. The computing system that manages propulsion, life support interfaces, guidance, navigation, and control has to keep working through hardware faults, software errors, and a threat that becomes far more severe beyond the shelter of Earth’s magnetosphere: ionizing radiation.

Triple Modular Redundancy Is the Foundation, Not the Whole Story

The core strategy in Orion’s flight computer architecture is triple modular redundancy (TMR). Three independent Flight Management Computers (FMCs) run the same software, compare outputs, and vote. If two agree and the third does not, the third is suspect and gets flagged. This is not a novel idea: the Space Shuttle flew four primary general-purpose computers plus a fifth backup loaded with independently written software, a design frozen in the late 1970s. What has changed is the integration density, the radiation environment targets, and the formal rigor applied to the software.

TMR handles transient faults well. A single event upset, where a cosmic ray flips a bit in memory or a register, typically affects one of the three computers. The other two vote it down, execution continues, and ground operators get a flag. What TMR does not inherently solve is a correlated failure mode: a software bug that manifests identically in all three instances because all three are running the same code path. This is the deeper problem, and it is why software verification in this class of system consumes far more engineering time than the hardware design.
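
The distinction between a transient fault and a correlated one can be sketched in a few lines of Python. This is a minimal illustration, not Orion's voter: the `vote` function, the `thrust_command` control law, and its buggy twin are all hypothetical names invented for the example.

```python
from collections import Counter
from typing import Hashable, List, Optional, Tuple


def vote(outputs: List[Hashable]) -> Tuple[Optional[Hashable], List[int]]:
    """Majority-vote the outputs of redundant channels.

    Returns (winning value, indices of dissenting channels). The winning
    value is None when no strict majority exists.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    suspects = [i for i, out in enumerate(outputs) if out != value]
    return value, suspects


def thrust_command(altitude_m: int) -> int:
    # Hypothetical control law standing in for real flight software.
    return altitude_m // 10


def buggy_thrust_command(altitude_m: int) -> int:
    # Same law with a hypothetical off-by-one defect in the code itself.
    return altitude_m // 10 + 1


# Transient fault: channel 2 suffers a single bit flip in its result.
good = thrust_command(1200)
upset = good ^ (1 << 3)
print(vote([good, good, upset]))  # (120, [2]): channel 2 is outvoted and flagged

# Correlated fault: all three channels run the same defective code,
# so they agree with each other and the voter sees nothing wrong.
wrong = buggy_thrust_command(1200)
print(vote([wrong, wrong, wrong]))  # (121, []): unanimous, and incorrect
```

The second call is the case the paragraph above warns about: the voter reports full agreement, and nothing in the redundancy scheme can tell that the agreed value is wrong.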

The Radiation Problem Is Different from the Reliability Problem

Terrestrial systems deal with component wear and environmental stress. Spacecraft in deep space deal with those plus a continuous flux of high-energy particles, primarily galactic cosmic rays and solar energetic particles. These produce two classes of problems.

Single Event Upsets (SEUs) are transient bit flips. Memory cells change state, a register holds a bad value for a cycle, and then execution continues. ECC memory catches many of these before they propagate; periodic memory scrubbing catches others by reading and rewriting memory to correct accumulated errors before they spread.
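
As a toy illustration of how error correction and scrubbing work together, the sketch below uses a Hamming(7,4) code, which corrects any single flipped bit in a 7-bit word. It is an assumption made for the sake of a short example: real flight memory uses wider codes, and nothing here reflects Orion's actual memory controller. All function names are hypothetical.

```python
from typing import List


def encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    bits = [0] * 8  # index 0 unused; parity bits live at positions 1, 2, 4
    bits[3], bits[5], bits[6], bits[7] = d
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    return sum(bits[p] << (p - 1) for p in range(1, 8))


def syndrome(word: int) -> int:
    """XOR of the positions of all set bits; 0 means no error detected."""
    s = 0
    for p in range(1, 8):
        if (word >> (p - 1)) & 1:
            s ^= p
    return s


def correct(word: int) -> int:
    """Flip the bit the syndrome points at, if any."""
    s = syndrome(word)
    return word ^ (1 << (s - 1)) if s else word


def decode(word: int) -> int:
    """Extract the 4 data bits from positions 3, 5, 6, 7."""
    return sum(((word >> (p - 1)) & 1) << i for i, p in enumerate((3, 5, 6, 7)))


def scrub(memory: List[int]) -> int:
    """Read every word, repair single-bit errors in place, return the count."""
    fixed = 0
    for addr, word in enumerate(memory):
        repaired = correct(word)
        if repaired != word:
            memory[addr] = repaired
            fixed += 1
    return fixed


memory = [encode(n) for n in (0x3, 0xA, 0xF)]
memory[1] ^= 1 << 4  # simulate an SEU flipping one bit of word 1
print(scrub(memory))                 # 1 word repaired
print([decode(w) for w in memory])   # [3, 10, 15]
```

The reason to scrub periodically rather than only correct on read is that a single-error-correcting code fails once a second upset lands in the same word. Rewriting corrected words on a schedule keeps isolated flips from accumulating into uncorrectable pairs.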

Single Event Latchup (SEL) is worse. A heavy ion can cause a parasitic current path in a CMOS device that latches on, draws increasing current, and destroys the component if not power-cycled within milliseconds. Radiation-hardened chips are fabricated in specialized processes, silicon-on-insulator or fully-depleted SOI, that eliminate the parasitic structures. The tradeoff is performance: radiation-hardened parts often trail commercial equivalents by a process node or two. The BAE Systems RAD750, derived from the PowerPC 750 and used across many NASA deep-space missions including Mars Science Laboratory and Artemis, runs at clock speeds that would have been modest in a consumer laptop fifteen years ago. Speed matters less than determinism and survival.

Orion’s flight computers use radiation-hardened variants of established processor architectures. The choice of architecture matters less than the radiation characterization: every part that flies gets tested under a particle accelerator to establish its linear energy transfer (LET) threshold and cross-section curve, data that feeds directly into the reliability analysis and informs how much scrubbing and redundancy the design needs.
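
To make the link between accelerator data and the reliability analysis concrete, here is the general shape of the calculation. The Weibull form below is a common way to fit a measured cross-section curve; whether Orion's analysis used this exact form is an assumption, and the rate integral is deliberately simplified (it ignores angle of incidence and the geometry of the sensitive volume).

```latex
% Weibull fit to a measured single-event cross-section curve
\sigma(L) =
\begin{cases}
  \sigma_{\text{sat}} \left[ 1 - \exp\!\left( -\left( \dfrac{L - L_{\text{th}}}{W} \right)^{s} \right) \right], & L > L_{\text{th}} \\[1ex]
  0, & L \le L_{\text{th}}
\end{cases}

% Simplified upset-rate estimate for a given mission environment
R \approx \int_{L_{\text{th}}}^{\infty} \sigma(L)\, \phi(L)\, \mathrm{d}L
```

Here L is the linear energy transfer of the incident particle, L_th is the threshold below which no upsets are observed, sigma_sat is the saturation cross-section, W and s are the width and shape parameters of the fit, and phi(L) is the differential particle flux per unit LET expected in the mission environment. A part with a higher threshold or a smaller saturation cross-section yields a lower rate R, which is what lets designers budget scrubbing intervals and redundancy.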

The Software Layer Is Where the Real Work Lives

The hardware redundancy buys time and resilience against physics. The software has to be correct in a much more demanding sense than commercial software.

NASA Goddard’s Core Flight System (cFS) provides a reusable software framework that many NASA missions now share, including components of the Orion software. cFS separates the platform-specific executive layer from portable application components, which means flight-proven code from one mission can be reused and reverified rather than rewritten from scratch. This matters because verification cost scales with lines of code.

The applicable standard for flight software at this safety level is effectively DO-178C at DAL A (Design Assurance Level A), the level assigned to software whose failure could cause or contribute to a catastrophic outcome. DAL A requires, among other things, modified condition/decision coverage (MC/DC): every condition in every decision in the code must be shown to independently affect that decision's outcome. This is expensive to achieve and to demonstrate. Every Boolean expression needs test cases in which one condition changes while the others are held fixed and the result changes with it.
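
A small Python sketch shows what the MC/DC requirement means in practice. It checks the strict "unique-cause" form of MC/DC by brute force, which only works for tiny decisions. The decision `abort_enable` and its condition names are hypothetical, and this is not how a qualified coverage tool works internally.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Tuple[bool, ...]
Pair = Tuple[Vector, Vector]


def independence_pairs(decision: Callable[..., bool], n: int) -> Dict[int, List[Pair]]:
    """For each condition, list pairs of inputs that differ only in that
    condition and produce different decision outcomes."""
    pairs: Dict[int, List[Pair]] = {i: [] for i in range(n)}
    for vec in product([False, True], repeat=n):
        for i in range(n):
            if vec[i]:
                continue  # visit each pair once, starting from its False side
            flipped = vec[:i] + (True,) + vec[i + 1:]
            if decision(*vec) != decision(*flipped):
                pairs[i].append((vec, flipped))
    return pairs


def achieves_mcdc(decision: Callable[..., bool], n: int, tests: Sequence[Vector]) -> bool:
    """True if the test set contains an independence pair for every condition."""
    chosen = set(tests)
    return all(
        any(lo in chosen and hi in chosen for lo, hi in candidates)
        for candidates in independence_pairs(decision, n).values()
    )


def abort_enable(armed: bool, pressure_low: bool, manual: bool) -> bool:
    # Hypothetical decision with three conditions.
    return armed and (pressure_low or manual)


T, F = True, False
branch_only = [(T, T, T), (F, F, F)]                    # hits both outcomes
mcdc_set = [(T, F, F), (T, T, F), (T, F, T), (F, T, F)]  # 4 of the 8 possible inputs

print(achieves_mcdc(abort_enable, 3, branch_only))  # False
print(achieves_mcdc(abort_enable, 3, mcdc_set))     # True
```

The first test set exercises both outcomes of the decision yet proves nothing about any individual condition. The second needs only four vectors for three conditions, but each vector has to be chosen deliberately, and that selection effort repeated across every decision in the codebase is where the cost comes from.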

Ada has been the language of choice for safety-critical flight software in U.S. programs for decades, and for good reason. The language was designed around strong typing and predictable exception semantics, and Ada 2012 added contract-based programming in the form of preconditions and postconditions. SPARK, a formally analyzable subset of Ada, allows proof-based verification: mathematical proofs that specific properties hold for all possible inputs, not just the ones you thought to test. Partial use of SPARK for the highest-criticality components lets programs prove the absence of runtime errors, something testing alone cannot achieve.

Fault Detection, Isolation, and Recovery

Fault tolerance is not just about surviving a fault. It requires detecting that a fault occurred, isolating the failed component so it cannot corrupt the healthy ones, and recovering either by switching to a backup or by reconfiguring around the damage.

Orion implements Fault Detection, Isolation, and Recovery (FDIR) at multiple levels. At the hardware level, built-in test (BIT) circuitry continuously monitors power supplies, clock signals, and communication buses. At the software level, watchdog timers require each critical task to periodically signal that it is alive; a missed heartbeat triggers a task restart or, if that fails, a computer switch. At the system level, the redundancy management software tracks the health state of each FMC and manages the voting configuration.
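
The software-level escalation described above (heartbeat, then task restart, then computer switch) can be sketched as a small state machine. This is an illustrative model using logical ticks rather than a real-time clock; the `Watchdog` class, its parameters, and the task name `gnc` are hypothetical and not taken from Orion's software.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Watchdog:
    timeout_ticks: int   # longest allowed gap between heartbeats
    max_restarts: int    # restarts to attempt before escalating
    last_kick: Dict[str, int] = field(default_factory=dict)
    restarts: Dict[str, int] = field(default_factory=dict)

    def register(self, task: str, now: int) -> None:
        self.last_kick[task] = now
        self.restarts[task] = 0

    def kick(self, task: str, now: int) -> None:
        """Called by a healthy task to signal that it is alive."""
        self.last_kick[task] = now
        self.restarts[task] = 0

    def check(self, now: int) -> List[str]:
        """Return the recovery actions required at time `now`."""
        actions = []
        for task, last in self.last_kick.items():
            if now - last <= self.timeout_ticks:
                continue
            if self.restarts[task] < self.max_restarts:
                self.restarts[task] += 1
                self.last_kick[task] = now  # restarted task gets a fresh window
                actions.append(f"restart {task}")
            else:
                actions.append(f"switch computer ({task} unrecoverable)")
        return actions


wd = Watchdog(timeout_ticks=2, max_restarts=1)
wd.register("gnc", now=0)
wd.kick("gnc", now=1)
print(wd.check(now=2))  # []: heartbeat is recent enough
print(wd.check(now=4))  # ['restart gnc']: heartbeat missed, first response
print(wd.check(now=7))  # ['switch computer (gnc unrecoverable)']: restart did not help
```

Keeping the policy as pure logic driven by an explicit `now` value makes every escalation path deterministic and testable, which matters when each path has to be validated before flight.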

What makes deep-space FDIR harder than Earth-orbit FDIR is the round-trip light time. At lunar distance, the one-way signal travel time is roughly 1.3 seconds, so a ground command sent in response to telemetry cannot arrive sooner than about 2.6 seconds after the event, before any human reaction time is counted. Artemis II will not go to lunar orbit, but the free-return trajectory takes the spacecraft nearly 7,400 kilometers beyond the Moon at its farthest point. Ground controllers cannot intervene in real time. Every FDIR decision the computer might need to make during a critical maneuver has to be pre-planned, validated, and loaded. The flight computer is genuinely autonomous during those windows.
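
The light-time figures are simple arithmetic, worked through below. The mean Earth-Moon distance used here is an approximation (the real distance varies over the lunar orbit, and the path to a particular ground station depends on geometry), so treat the results as order-of-magnitude checks rather than mission numbers.

```python
C_KM_PER_S = 299_792.458      # speed of light in vacuum (exact by definition)
MOON_DISTANCE_KM = 384_400.0  # mean Earth-Moon distance, approximate
BEYOND_MOON_KM = 7_400.0      # farthest point past the Moon, per the text


def one_way_light_time_s(distance_km: float) -> float:
    """Seconds for a radio signal to cover the given distance."""
    return distance_km / C_KM_PER_S


at_moon = one_way_light_time_s(MOON_DISTANCE_KM)
at_farthest = one_way_light_time_s(MOON_DISTANCE_KM + BEYOND_MOON_KM)

print(f"one-way at lunar distance:  {at_moon:.2f} s")          # 1.28 s
print(f"one-way at farthest point:  {at_farthest:.2f} s")      # 1.31 s
print(f"round trip at farthest:     {2 * at_farthest:.2f} s")  # 2.61 s
```

A control loop that must react within tens of milliseconds cannot wait more than two and a half seconds for the ground, which is the quantitative reason the recovery logic has to live on board.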

What the Space Shuttle Got Right and What Changed

The Shuttle’s five-computer architecture had an elegant property: the backup flight system was developed by a completely separate team, at Rockwell rather than at IBM, which wrote the primary software in HAL/S, specifically to avoid correlated software bugs. If the primary computers all agreed on something wrong, the backup had an independent shot at getting it right.

Orion does not take the same approach at the software level. Instead, the formal verification investment, the model-based design tools, the automated test generation, and the structured requirements traceability are meant to catch errors before flight rather than provide a diverse fallback. Both approaches reflect rational engineering trade-offs given the cost and schedule environments of their respective programs.

The toolchain for Artemis-era flight software is also dramatically more capable. Model-based design tools like MATLAB/Simulink, paired with qualified code generators, can produce flight software directly from verified models and then generate test cases from those same models. The equivalence between the model and the generated code becomes the verification artifact rather than a separate testing campaign against manually written code.

The Part Nobody Talks About: Integration Testing

All of the above (the TMR architecture, the radiation-hardened silicon, the SPARK proofs, the MC/DC coverage) has to be integrated and tested as a system. For Orion, this happens on the Integrated Avionics Test Facility at Johnson Space Center, a hardware-in-the-loop simulation environment that connects actual flight computers to simulated sensors, actuators, and vehicle dynamics. You cannot find every emergent failure mode in a model; you need the real hardware responding to realistic timing.

The anomalies that surface during integrated testing are often the most instructive. A race condition that only appears when a sensor dropout coincides with a mode transition at millisecond precision. A voltage droop on a power supply that shifts a computation just enough to cause a voter disagreement. These are the kinds of failures that TMR and ECC do not prevent; the system survives them only if the FDIR logic handles the resulting state correctly.

Building a computer that cannot fail, for a ten-day crewed mission beyond the Moon, turns out to be mostly the work of building a system that fails gracefully in every way you can imagine, and then spending years trying to imagine new ways it could fail.