Liveness, Refinement, and the Real Cost of Proving a Distributed System Correct

Formal verification of distributed systems has a long history of promising more than it delivers. You get a beautifully proven spec, or a model-checked protocol that only terminates on small state spaces, or a verified algorithm that lives in a paper and never runs. IronFleet, a 2015 SOSP paper from Microsoft Research, is the counterexample to that pattern. It proved a real Multi-Paxos implementation correct, end-to-end, including liveness, and the resulting system runs at roughly half the throughput of Zookeeper. That last part, the performance story, is almost as important as the proof.

You can watch the SOSP presentation on YouTube or read the full paper. Both are worth your time. But neither spends much time on what makes liveness verification categorically harder than safety, or why the two-level refinement approach was the key insight that made the whole thing tractable. That’s what I want to dig into here.

Safety Is the Easy Part

Most verified distributed systems prove safety properties: nothing bad ever happens. For a replicated state machine, that means linearizability. Every operation appears to take effect atomically at some point between its invocation and completion, and all clients agree on the order. This is provable with inductive invariants. You write down everything that’s true at every reachable system state, show it holds at initialization, show every possible step preserves it, and then show the invariants imply linearizability. Hard, but tractable.

Liveness is different. Liveness says something good eventually happens. In IronRSL, the Paxos-based replicated state machine that IronFleet verifies, the liveness claim is: if a quorum of replicas is correct and the network is eventually synchronous, every submitted request eventually receives a response. Proving that requires reasoning about what can persist indefinitely and what cannot.

To prove a request eventually completes, you have to prove leader election eventually terminates. To prove that, you have to prove that competing ballot numbers cannot cycle forever. To prove that, you need fairness assumptions: the scheduler actually runs every enabled node, and the network eventually stops dropping messages long enough for progress to happen. You cannot encode these assumptions as simple state predicates. You need temporal reasoning over infinite execution traces.

Prior to IronFleet, mechanically verified liveness proofs for practical distributed protocols did not exist. Verdi, the Coq-based distributed systems verifier from UW that appeared around the same time, verified Raft’s log replication for safety but explicitly did not tackle liveness. The IronFleet paper’s claim that liveness required the most effort in the entire project is credible. It took roughly three person-years of work across the team.

The methodology IronFleet uses is worth understanding carefully, because it’s not obvious and it turns out to be broadly applicable. The key insight is that distributed systems verification has two conceptually distinct hard problems, and conflating them into one layer makes both harder.

Problem one is the protocol-level reasoning: what happens when messages get lost, duplicated, or reordered? How do quorums interact? What invariants does Multi-Paxos need to maintain to prevent two leaders from committing different values to the same log slot?

Problem two is the implementation-level reasoning: does the actual code correctly implement the protocol? Are the data structures right? Is the serialization correct? Does the retry logic match the spec?

IronFleet introduces a middle layer between the high-level spec and the executable code. The structure has three levels:

A high-level spec that describes the system as a single atomic state machine, with no network, no concurrency, and no implementation details. This is where linearizability and liveness are stated.
A distributed protocol spec that models the actual distributed execution: separate per-node state, message passing, asynchrony, failures. Each node processes one message at a time. This layer is where the hard Paxos invariants live.
The executable implementation, written in Dafny, which compiles to C# and runs on .NET.

You prove refinement from level 1 to level 2 (this distributed protocol implements the atomic spec), then separately from level 2 to level 3 (this code correctly implements the distributed protocol). Each refinement is an inductive argument in Dafny.

Without the intermediate layer, you would be trying to connect sequential linearizability directly to bytes on the wire. The protocol-level invariants for Multi-Paxos are already complex enough. Adding serialization logic and data structure details to the same proof context makes finding and maintaining those invariants much harder. The two-level approach keeps the concerns separate.

Why Dafny and Not TLA+

The obvious question for anyone who has worked with distributed systems is: why not TLA+? Leslie Lamport’s TLA+ is used in production at AWS, Azure, and MongoDB to find protocol bugs. It has a proof system, TLAPS. It is the standard industrial tool for reasoning about distributed protocols.

The answer is that TLA+‘s model checker, TLC, works by exhaustive state-space exploration on bounded finite instances. It finds bugs in protocols by checking all behaviors up to some bound. It cannot produce machine-checked proofs over infinite execution traces. TLAPS can construct proofs, but it is limited and requires substantial manual effort that has not scaled to systems of IronFleet’s complexity.

Dafny, built by Rustan Leino at Microsoft Research, takes a different approach. It is a verification-aware programming language that uses the Z3 SMT solver as its backend. You write specifications (preconditions, postconditions, loop invariants, inductive invariants) alongside your code. Z3 checks them automatically. The result is executable code with machine-verified correctness properties. The trusted base is the Dafny compiler, the .NET runtime, the OS, and Z3 itself. Notably, the implementation code is not in the trusted base, which is the entire point.

The IronFleet codebase is approximately 70,000 lines of Dafny, of which roughly half are proof annotations rather than executable code. That ratio gives you a sense of the overhead.

The Invariant Discovery Problem

The technical problem that consumes most of the effort in projects like IronFleet is not writing the code. It is finding the right inductive invariants for the protocol layer.

For Multi-Paxos, the invariant needs to capture: which ballot numbers have been promised by which acceptors; which values have been accepted in which slots; how accepted values across replicas relate to each other; and the quorum intersection properties that prevent split-brain. The invariant for IronRSL spans hundreds of lines of Dafny. Getting it wrong means either the proof fails (you cannot establish the invariant holds) or you discover mid-proof that a step you thought was safe actually breaks it.

This difficulty directly motivated subsequent work. I4 (2019) applies IC3/PDR model checking to automatically synthesize inductive invariants for distributed protocols, explicitly citing the manual invariant discovery burden in systems like IronFleet. Ivy from McMillan and colleagues at Microsoft Research addresses the same problem with a different methodology, using a modular verification approach that makes invariants more composable.

Performance: Close Enough to Matter

IronRSL, the verified Paxos library, achieves roughly 10,000 to 14,000 requests per second at median latencies in the low single-digit milliseconds for small payloads with three replicas. Zookeeper, unverified and written in Java, achieves higher throughput in comparable configurations. The gap is roughly two to three times lower throughput for IronRSL.

The paper attributes the gap to the C#/.NET execution environment, the single-threaded model IronFleet uses per node to keep verification tractable, and some protocol-level overhead from the verified implementation. The single-threaded constraint is significant: real-world distributed systems implementations often use multiple threads and careful locking to maximize throughput. Proving correctness of concurrent code is substantially harder, and IronFleet deliberately avoided that problem.

Two to three times slower than Zookeeper is not negligible. But it is within a constant factor, not orders of magnitude. For many use cases, the safety and correctness guarantees would be worth that overhead. This was the practical claim the paper needed to make, and it made it.

What Came After

IronFleet’s two-level refinement methodology has been genuinely influential. Grove (SOSP 2023) verifies a sharded key-value store using Perennial, an Iris-based framework on top of Coq. It handles crash safety and uses a similar layered refinement approach, though with separation logic rather than Dafny’s invariant style.

Armada (PLDI 2020) from Microsoft Research tackles the problem IronFleet explicitly avoided: verifying concurrent systems where multiple threads share state. IronFleet’s single-threaded model was a deliberate choice to keep proofs tractable. Armada provides a framework for lifting proofs from single-threaded models to concurrent implementations.

Dafny itself has grown significantly since IronFleet. AWS uses Dafny for verified components in S3 and other services. The language has matured, gained a larger ecosystem, and is now maintained as an open-source project.

The broader distributed systems verification space has expanded considerably. The combination of better tools, more experience with invariant methodology, and more researchers who have internalized the IronFleet approach means that verified distributed systems are no longer purely a research curiosity. They remain expensive to build, but the existence proof is real.

The Real Contribution

What IronFleet proved is not just that IronRSL and IronKV are correct. It proved that the methodology works: that you can take a practical distributed protocol, specify it at multiple levels of abstraction, prove refinement between those levels using machine-checked proofs, compile the result into a running system, and end up with something that performs within striking distance of unverified alternatives.

The liveness result is the hardest part technically. The performance result is the hardest part to argue for practically. IronFleet delivered both. That combination is what made it a landmark result in 2015 and why it remains a reference point for everyone working on distributed systems correctness today.