The Code That Cannot Fail: Software Engineering for Artemis II's Flight Computer
Source: hackernews
Most of the conversation around NASA’s Artemis II fault-tolerant computer gravitates toward the hardware: triple modular redundancy, radiation-hardened silicon, voting circuits. That framing is reasonable, because the hardware constraints are visually dramatic and the engineering is genuinely impressive. But it undersells the software half of the equation. The language choices, operating system design, middleware architecture, and verification methodology for Artemis II are not an afterthought layered on top of fault-tolerant hardware; they are themselves a fault-tolerance mechanism, aimed at a class of failures that redundant processors cannot detect or recover from.
The Language Is a Design Decision
Orion’s flight software is written in Ada 2005 and falls under the Class A (human-rated) requirements of NPR 7150.2D, NASA’s Software Engineering Requirements. The case for Ada in that setting is that it eliminates categories of runtime error at the language level rather than relying on discipline or testing to catch them.
Strong static typing in Ada means that a value typed as a guidance angle cannot be accidentally assigned to a propellant mass variable, even if both are floating-point numbers at the hardware level. Distinct numeric types can carry range constraints enforced by the compiler and runtime: a variable declared as Angle_Degrees range 0.0 .. 360.0 will raise Constraint_Error if anything tries to write 400.0 into it. Array bounds checking is on by default and can only be turned off explicitly, through pragma Suppress or a compiler switch, which leaves a reviewable trace in the source or build configuration. The Ariane 5 Flight 501 failure in 1996 shows what is at stake: a 64-bit floating-point value was converted to a 16-bit signed integer, the conversion overflowed, and the resulting unhandled exception shut down the inertial reference system about 37 seconds into flight. That software was itself written in Ada, with the protection on that particular conversion deliberately left out, so the lesson is that the checks only help when they stay enabled and their exceptions are handled.
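The idea of distinct, range-constrained numeric types can be sketched outside Ada as well. The following is a minimal Python illustration, not Orion code: the class names, the ConstraintError exception, and the set_gimbal and to_int16 helpers are all hypothetical, and Python performs these checks at run time, whereas an Ada compiler rejects the type mix-up before the program ever runs.

```python
from dataclasses import dataclass


class ConstraintError(ValueError):
    """Stand-in for Ada's Constraint_Error (hypothetical name)."""


@dataclass(frozen=True)
class AngleDegrees:
    value: float

    def __post_init__(self):
        # Range constraint: 0.0 .. 360.0, checked on every construction.
        if not 0.0 <= self.value <= 360.0:
            raise ConstraintError(f"angle {self.value} outside 0.0 .. 360.0")


@dataclass(frozen=True)
class PropellantMassKg:
    value: float

    def __post_init__(self):
        if self.value < 0.0:
            raise ConstraintError(f"mass {self.value} is negative")


def set_gimbal(angle: AngleDegrees) -> float:
    # Reject anything that is not an angle, even if it wraps a float.
    if not isinstance(angle, AngleDegrees):
        raise TypeError("set_gimbal expects AngleDegrees")
    return angle.value


def to_int16(x: float) -> int:
    """Checked narrowing conversion to a signed 16-bit integer."""
    n = round(x)
    if not -32768 <= n <= 32767:
        raise ConstraintError(f"{x} does not fit in a signed 16-bit integer")
    return n
```

Calling set_gimbal(PropellantMassKg(90.0)) fails with a TypeError, AngleDegrees(400.0) fails at construction, and to_int16(40000.0) refuses the conversion instead of wrapping around. The checked conversion is the shape of guard that was missing in the Ariane case.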
Ada’s concurrency model is equally relevant to fault tolerance. Tasks communicate through protected objects and rendezvous, which the language defines at the semantic level rather than delegating to a threading library. Shared state that goes through these constructs is something the compiler and static analysis tools can reason about, rather than something the programmer is simply trusted to get right. For a system with hard real-time scheduling requirements, where multiple tasks share sensor data and actuator outputs, that is more than an ergonomic benefit.
The comparison with the Space Shuttle’s HAL/S language is instructive. HAL/S was a NASA-specific language with its own toolchain, designed in the early 1970s for the AP-101 avionics computers. It had many of the same safety properties Ada later formalized. What it lacked was a broad support base; when the Shuttle program ended, HAL/S expertise largely went with it. Ada persists because it has ISO standardization, commercial compiler support from vendors like AdaCore, and continued use across defense and avionics programs worldwide. The language choice carries a maintenance dimension that extends decades past initial deployment.
Partitioning as a Fault Isolation Mechanism
The Orion avionics run VxWorks, Wind River’s hard real-time operating system, configured with ARINC 653 partitioning. ARINC 653 is an avionics application software standard originally developed for civil aviation, and its defining feature is spatial and temporal partitioning: each application partition gets a fixed, exclusive time window in each scheduling cycle, and hardware memory protection enforced by the MMU prevents any partition from reading or writing another partition’s address space.
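Temporal partitioning amounts to a fixed timetable that repeats forever. The sketch below models that timetable in Python under stated assumptions: the partition names and window lengths in MAJOR_FRAME are invented for illustration, and a real ARINC 653 schedule lives in the system configuration and is enforced by the operating system, not by application code.

```python
from bisect import bisect_right

# Hypothetical major frame: (partition name, window length in ms).
MAJOR_FRAME = [("GNC", 20), ("FAULT_MGMT", 10), ("TELEMETRY", 5), ("DISPLAYS", 15)]


def build_schedule(windows):
    """Return (window start offsets, partition names, major frame length)."""
    starts, names, t = [], [], 0
    for name, length_ms in windows:
        if length_ms <= 0:
            raise ValueError(f"window for {name} must have positive length")
        starts.append(t)
        names.append(name)
        t += length_ms
    return starts, names, t


def active_partition(t_ms, windows=MAJOR_FRAME):
    """Which partition owns the processor at absolute time t_ms?"""
    starts, names, frame_ms = build_schedule(windows)
    offset = t_ms % frame_ms  # the schedule repeats every major frame
    # Find the last window whose start is at or before the offset.
    return names[bisect_right(starts, offset) - 1]
```

Because the answer depends only on the clock and the table, a partition that overruns or hangs cannot take time away from another one: at offset 20 ms the processor goes to FAULT_MGMT regardless of what GNC was doing at 19 ms.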
The consequence is that a software fault in a lower-criticality partition, say a telemetry formatting application or a crew display renderer, cannot corrupt the memory of the guidance, navigation, and control (GN&C) partition running alongside it. The GN&C software does not need to defend itself against its own operating environment because the OS and hardware together enforce the boundaries. This matters because the failure mode it prevents is subtle: a pointer error or stack overflow in one software component silently corrupting data structures in another is exactly the kind of fault that hardware voting will not catch. Three redundant processors running three copies of corrupted software will agree on a wrong answer.
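The limit of hardware voting is easy to demonstrate with a toy voter. This is a simplified sketch: the vote function, the two "software" functions, and the flip_bit helper are hypothetical, and a real voter works on hardware signals or message frames instead of Python integers.

```python
from collections import Counter


def vote(outputs):
    """Majority vote across redundant channels; raise if there is no strict majority."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among redundant channels")
    return value


def good_software(x):
    return x * 2


def buggy_software(x):
    # Hypothetical common-mode defect: the same wrong logic on every channel.
    return x * 2 + 1


def flip_bit(value, bit):
    """Model a hardware upset on one channel by flipping a single bit."""
    return value ^ (1 << bit)


# One channel suffers a bit flip: the voter masks it.
masked = vote([good_software(21), flip_bit(good_software(21), 3), good_software(21)])

# All channels run the same defective code: the voter is unanimous and wrong.
unanimous = vote([buggy_software(21)] * 3)
```

Here masked is the correct value 42, because the single upset channel is outvoted, while unanimous is 43, delivered with full agreement. Voting protects against faults that differ between channels; it has nothing to say about a fault every channel shares.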
The Middleware That Makes Components Independent
NASA’s Core Flight System (cFS), developed at Goddard Space Flight Center and released as open source, serves as the software bus and service layer across Orion’s flight software applications. cFS provides a publish-subscribe messaging model: applications publish messages tagged with a message ID, and each application receives the messages it has subscribed to through its own software bus pipe, without direct coupling between producers and consumers.
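The decoupling that a software bus provides can be shown with a toy model. This Python sketch is not the cFS API: the SoftwareBus class, its method names, and the two message IDs are assumptions made for illustration, and the overflow policy (drop the oldest message) is a simplification chosen to keep the example short.

```python
from collections import defaultdict, deque

# Hypothetical message IDs.
IMU_DATA_MID = 0x0801
THERMAL_DATA_MID = 0x0802


class SoftwareBus:
    """Toy publish-subscribe bus: routes messages by ID to subscriber pipes."""

    def __init__(self):
        self._routes = defaultdict(list)  # message ID -> list of pipes

    def create_pipe(self, depth):
        # A bounded queue; in this toy model a full pipe drops its oldest message.
        return deque(maxlen=depth)

    def subscribe(self, msg_id, pipe):
        self._routes[msg_id].append(pipe)

    def publish(self, msg_id, payload):
        """Deliver the message to every subscribed pipe; return how many received it."""
        pipes = self._routes.get(msg_id, [])
        for pipe in pipes:
            pipe.append((msg_id, payload))
        return len(pipes)


bus = SoftwareBus()
gnc_pipe = bus.create_pipe(depth=4)
telemetry_pipe = bus.create_pipe(depth=4)
bus.subscribe(IMU_DATA_MID, gnc_pipe)
bus.subscribe(IMU_DATA_MID, telemetry_pipe)
bus.subscribe(THERMAL_DATA_MID, telemetry_pipe)
```

The publisher of IMU data never names its consumers. Adding or removing a subscriber changes the routing table and nothing else, which is what lets each application be built and tested against its message interface alone.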
Table services within cFS allow configuration data, including fault management response rules, limit thresholds, and reconfiguration policies, to be stored separately from compiled code. These tables can be uplinked from the ground and validated in-flight without a software patch cycle. When mission experience reveals that a sensor’s noise floor requires a wider tolerance band, or that a fault response is triggering on a known benign anomaly, operators can update the response table without touching the code that processes it. This is the configuration management equivalent of reducing blast radius: the thing that changes is isolated from the thing that must be trusted.
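The separation of trusted code from changeable data can be sketched as a limit checker whose behavior comes entirely from a validated table. Everything named here is hypothetical (LimitEntry, KNOWN_RESPONSES, validate_table, LimitChecker), and the validation rules are examples chosen for illustration, not the checks any real flight table uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LimitEntry:
    sensor: str
    low: float
    high: float
    response: str  # name of the fault response to trigger when out of limits


KNOWN_RESPONSES = {"LOG_ONLY", "SWITCH_TO_BACKUP", "SAFE_MODE"}  # hypothetical


def validate_table(entries):
    """Return a list of problems; an empty list means the table may be activated."""
    problems, seen = [], set()
    for e in entries:
        if e.low >= e.high:
            problems.append(f"{e.sensor}: low {e.low} is not below high {e.high}")
        if e.response not in KNOWN_RESPONSES:
            problems.append(f"{e.sensor}: unknown response {e.response}")
        if e.sensor in seen:
            problems.append(f"{e.sensor}: duplicate entry")
        seen.add(e.sensor)
    return problems


class LimitChecker:
    """Fixed code; its behavior is driven entirely by the active table."""

    def __init__(self, table):
        self._active = {}
        if not self.load(table):
            raise ValueError("initial table failed validation")

    def load(self, table):
        """Swap in a new table only if it validates; otherwise keep the old one."""
        if validate_table(table):
            return False
        self._active = {e.sensor: e for e in table}
        return True

    def check(self, sensor, value):
        """Return the response name if the value is out of limits, else None."""
        e = self._active[sensor]
        return None if e.low <= value <= e.high else e.response
```

Widening a tolerance band is then a call to load with a new table. A table that fails validation is rejected and the previous one stays active, so a bad uplink cannot leave the checker without limits.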
Individual cFS applications can be developed, tested, and formally verified in isolation. That decomposition directly supports the verification workload, because you can demonstrate correctness of a GN&C component against its interface specification without requiring a complete vehicle software integration.
How You Know the Code Is Correct
NASA Class A software verification requires 100% Modified Condition/Decision Coverage (MC/DC), the most demanding structural coverage criterion in widespread use. MC/DC requires not merely that every branch is executed, but that each individual boolean condition within every decision is independently shown to affect the overall outcome. For a decision like if (temperature_exceeded AND pressure_nominal), you need test cases that demonstrate that flipping temperature_exceeded alone changes the result, and that flipping pressure_nominal alone changes the result. This rules out coincidental coverage, where a test happens to execute a branch without actually exercising the condition that controls it.
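The difference between branch coverage and MC/DC can be checked mechanically for the two-condition decision above. This is a minimal sketch of the unique-cause form of MC/DC; the function names are illustrative, and qualified coverage tools measure this on instrumented code or object code instead of on a Python model.

```python
from itertools import combinations


def decision(temperature_exceeded, pressure_nominal):
    return temperature_exceeded and pressure_nominal


def independence_pairs(fn, n_conditions, tests):
    """For each condition index, collect pairs of tests that differ only in that
    condition and produce different decision outcomes."""
    pairs = {i: [] for i in range(n_conditions)}
    for a, b in combinations(tests, 2):
        diff = [i for i in range(n_conditions) if a[i] != b[i]]
        if len(diff) == 1 and fn(*a) != fn(*b):
            pairs[diff[0]].append((a, b))
    return pairs


def achieves_mcdc(fn, n_conditions, tests):
    """True if every condition has at least one independence pair."""
    return all(independence_pairs(fn, n_conditions, tests).values())


# Both branches are taken, but neither condition is shown to act alone.
branch_only = [(True, True), (False, False)]

# Three tests (n + 1 for n conditions) that isolate each condition in turn.
mcdc_set = [(True, True), (False, True), (True, False)]
```

The branch_only set executes both outcomes of the decision yet fails the check, because its two tests differ in both conditions at once. The mcdc_set passes: (False, True) against (True, True) isolates temperature_exceeded, and (True, False) against (True, True) isolates pressure_nominal.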
MC/DC at 100% across a complete flight software baseline is expensive to achieve, which is the point. The cost forces engineers to write software that can be tested at this level: simple control flow, limited nesting, clear boolean conditions. Complex conditional logic that is difficult to decompose for MC/DC coverage is usually a sign that the logic itself needs to be simplified.
Fault injection testing runs alongside structural coverage analysis. Dedicated campaigns deliberately introduce bit flips into memory, kill power to individual redundant strings, inject false sensor readings, and corrupt inter-partition message payloads to verify that the fault management software responds correctly. Single-event upset (SEU) injection simulates radiation events by flipping specific bits through software while the system is running, confirming that error detection and correction (EDAC) and memory scrubbing work as specified.
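A bit-flip injection campaign against error-correcting memory can be modeled in a few lines. The sketch below uses a textbook Hamming(7,4) single-error-correcting code as a stand-in; it is an assumption for illustration and not the EDAC scheme of any particular flight computer. Production memory typically protects wider words and adds double-error detection, which this toy code lacks.

```python
DATA_POSITIONS = (3, 5, 6, 7)    # codeword bit positions that carry data
PARITY_POSITIONS = (1, 2, 4)     # positions that carry parity (powers of two)


def syndrome(word):
    """XOR of the positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if (word >> pos) & 1:
            s ^= pos
    return s


def encode(nibble):
    """Hamming(7,4): place 4 data bits, then set parity so the syndrome is 0."""
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (nibble >> i) & 1:
            word |= 1 << pos
    s = syndrome(word)
    for pos in PARITY_POSITIONS:
        if s & pos:
            word |= 1 << pos
    return word


def decode(word):
    """Return (nibble, error_position); error_position is 0 if none was found."""
    s = syndrome(word)
    if s:
        word ^= 1 << s  # a single flipped bit sits at the position the syndrome names
    nibble = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (word >> pos) & 1:
            nibble |= 1 << i
    return nibble, s


def inject_seu(word, pos):
    """Fault injection: flip one bit of a stored codeword."""
    return word ^ (1 << pos)


def scrub(memory):
    """Rewrite every correctable word in place; return the number of corrections."""
    fixed = 0
    for addr, word in enumerate(memory):
        nibble, s = decode(word)
        if s:
            memory[addr] = encode(nibble)
            fixed += 1
    return fixed
```

An exhaustive campaign is small enough to run in full here: every one of the 16 data values, with every one of the 7 bit positions flipped in turn, should decode back to the original value and report the injected position. Two flips in the same word would be miscorrected by this code, which is why scrubbing runs periodically, before upsets can accumulate.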
Independent Verification and Validation (IV&V) is performed by a facility that has no organizational connection to the development team. The IV&V program repeats analysis and testing with its own methods, and the development team is obligated to resolve every discrepancy IV&V raises. This is the institutional version of the same principle behind the Space Shuttle’s Backup Flight System: an independent team looking at the same problem catches assumptions that the original team has baked in without noticing.
The BFS Trade-Off
The Shuttle’s BFS, developed by Rockwell under different management than IBM, the prime contractor for the Primary Avionics Software System (PASS), was software diversity in the most literal sense: a completely independent implementation written to the same functional specification. If a common-mode bug in all four PASS computers caused a simultaneous failure, the BFS would take over running code that had never shared a design assumption with PASS. The scrub of the first Shuttle launch attempt in April 1981, caused by a timing initialization bug in PASS that left the BFS unable to synchronize with the four primary computers, showed how a single latent defect is shared by every redundant copy of the same code.
Orion does not have an equivalent BFS. The design rationale holds that the BFS itself contained bugs, that the maintenance cost of two independent software baselines was prohibitive, and that rigorous IV&V and fault injection testing of a single baseline provides equivalent protection. That conclusion is defensible but not universally accepted. The category of bugs that software diversity catches is precisely the category that a single-baseline IV&V process is least likely to find, because both the developers and the independent verifiers share the same specification and the same mental model of what the software is supposed to do. A bug in the specification, or a systematic misreading of the specification, propagates through both.
NASA’s bet is that the combination of Ada’s type safety, ARINC 653 partitioning, formal analysis, and MC/DC coverage closes enough of that gap to make a second independent baseline unnecessary. Whether that bet holds at the tail end of the reliability distribution, where the failure modes are the ones nobody modeled, is a question that only flight experience can answer.
What Stays Constant
The RAD750 running at 200 MHz is slower than embedded processors in consumer devices by orders of magnitude, and it remains the right choice for this application. The same calculus applies to the language, OS, and middleware: Ada, VxWorks, and cFS are not cutting-edge choices by any metric other than the one that matters for this problem, which is demonstrated correctness and understood failure behavior under adversarial conditions.
SpaceX’s approach on Falcon 9 and Dragon, using commodity x86 processors running Linux in triple redundancy, is a coherent alternative engineering philosophy. The argument that rapid iteration and software robustness can substitute for radiation-hardened silicon makes more sense in low Earth orbit, with ground support available in minutes, than it does on a cislunar trajectory where the vehicle must handle faults autonomously for days. The environment changes what the right answer is.
For Artemis II, the software’s job is to ensure that hardware redundancy is meaningful rather than nominal. Three processors voting on a wrong answer, computed in a type-unsafe language by unpartitioned software without coverage-verified logic, is not fault tolerance. It is fault amplification. The engineering work visible in the Artemis II software stack is the work of making sure the vote is worth counting.