SMT Automation vs. Interactive Proof: The Tool Divide Behind IronFleet and Verdi

The comparison between IronFleet and Verdi is among the most informative in distributed systems verification. Both projects appeared at major venues within months of each other in 2015, solved essentially the same problem, and made almost opposite choices about how to do it. One used Dafny and Z3; the other used Coq. The difference is not merely tooling; it shapes everything about what each system can prove and how much effort it takes.

IronFleet (Hawblitzel et al., SOSP 2015) verified IronRSL, a Multi-Paxos replicated state machine, and IronKV, a sharded key-value store built on top of it. Verdi (Wilcox et al., PLDI 2015) verified Raft’s log replication safety. You can watch the IronFleet SOSP talk on YouTube to see how the team frames the problem. A decade on from that talk, the tool decision remains the most technically consequential choice the two teams made, and understanding it clarifies a lot about where distributed system verification is and is not today.

The SMT Automation Bet

Dafny compiles specifications and invariants into verification conditions, hands them to the Z3 SMT solver, and Z3 either discharges them or reports a counterexample. When it works, this is fast. You write your invariant, annotate the method, and Z3 verifies it in seconds. The Z3 backend handles linear arithmetic, array theory, quantifier instantiation, and algebraic datatypes natively.

For the kinds of invariants distributed protocol verification requires, this automation pays off. The core Multi-Paxos safety invariants state things like: at most one value is decided per ballot; an acceptor only accepts a value in slot s if that value matches the highest-balloted promise it has made; if a quorum of acceptors has accepted value v in slot s with ballot b, then any higher-ballot leader learning slot s must also propose v. These invariants involve universal quantification over acceptors, ballots, and slots, but the quantifiers are relatively shallow and the arithmetic is linear. Z3 handles them well with proper trigger setup.

In Coq, the same invariants require constructing proof terms by hand through interactive tactics. There is no oracle that discharges quantified goals automatically; you case-split, apply induction, instantiate quantifiers explicitly, and close goals one at a time. Getting there can take hours of tactic scripting for a lemma that Dafny handles automatically. The IronFleet team estimated that automation reduced their total proof effort significantly relative to what a Coq-based approach would have required for the same systems.

What Coq Gets in Exchange

Verdi chose Coq for a reason. Coq’s type theory is far more powerful than what Z3 handles, and the key feature Verdi exploits is Verified System Transformers (VSTs), a compositional abstraction for network semantics. A VST takes a protocol verified against a simple network model (synchronous, perfect delivery) and transforms it into a protocol verified against a more complex model (asynchronous, with message loss, duplication, and reordering). The transformation is itself machine-checked.

This compositionality is difficult to achieve in Dafny. When you add a new feature to your network model, you need to reprove properties that should follow automatically from the composition. In Coq, the VST abstraction lets you prove the composition once and reuse it. For building a toolkit of verified network layers, this is the right design.

The trade-off is that Coq’s expressiveness requires manual proof construction. Verdi’s Raft verification is widely respected but explicitly does not prove liveness. The IronFleet team proved both safety and liveness for Multi-Paxos. That liveness result, which the Verdi team explicitly declined to attempt, came from a temporal logic library IronFleet built in Dafny, and it was achievable partly because Z3 automation kept the total proof burden manageable.

Quantifier Triggers in Practice

The practical friction point in Dafny verification for distributed protocols is quantifier trigger management. When you write an invariant like:

predicate QuorumIntersection(nodes: set<Node>, quorums: set<set<Node>>)
{
    forall q1, q2 {:trigger q1 in quorums, q2 in quorums} ::
        q1 in quorums && q2 in quorums ==>
            exists n :: n in q1 && n in q2
}

Z3 needs to know when to instantiate q1 and q2. The {:trigger} annotation tells it which terms to watch for. The wrong trigger leaves the quantifier unused and invariant-dependent proofs silently fail. An overly broad trigger causes Z3 to instantiate the quantifier on too many terms, driving verification time into minutes or causing timeouts altogether.

The IronFleet codebase contains hundreds of trigger annotations. Getting them right requires both intuition about how Z3’s pattern matching works and knowledge of where the invariant will be needed in downstream proofs. This is the main form of proof brittleness in Dafny systems: a change in how an invariant is used can make previously valid triggers insufficient, breaking proofs that appeared stable.

Coq does not have this problem in the same form, because tactic-based proofs apply quantifier instantiations manually. You always know exactly which instantiation is in use, because you wrote it. The trade-off is that manual instantiation is tedious and time-consuming for invariants with many quantifiers.

Verus as the Synthesis Attempt

Verus, developed at CMU and VMware Research starting around 2022, can be read as the direct descendant of IronFleet’s automation approach applied to Rust. Verus generates SMT verification conditions from Rust code annotated with specifications, using Z3 as the backend. The ownership and borrowing model that Rust enforces statically eliminates whole classes of aliasing questions that Dafny has to prove via ghost invariants. This narrows the gap between the language’s type-level guarantees and the SMT verification conditions, making proofs more concise and more robust to code changes.

Anvil (OSDI 2024) applied Verus to verify Kubernetes controllers, showing the approach scales to realistic infrastructure code. The verification methodology, layered refinement, separation of protocol reasoning from implementation reasoning, is directly inherited from IronFleet. The trigger management problems that plagued Dafny are partially mitigated in Verus by Rust’s ownership semantics reducing the aliasing state space that quantifiers need to range over.

The Coq lineage continues through Grove (SOSP 2023), which verified a sharded key-value store using Coq and the Iris concurrent separation logic framework. Grove handles crash safety and concurrent execution within nodes, using separation logic to reason about shared mutable state. Iris’s expressiveness directly addresses the compositionality advantages that originally made Verdi choose Coq, but with a more modern foundation than Coq’s older separation logic libraries.

The Persistent Cost

Neither approach has made distributed system verification cheap. Both IronFleet and Verdi run at roughly three to five lines of proof per line of implementation code. Verification times are long. Proofs break across tool version upgrades, a problem severe enough that the Verus team cited it explicitly as a motivation for building their own system rather than continuing with Dafny. The specification must itself be correct, and there is no mechanical way to verify that a spec captures your intent rather than some adjacent property you did not mean to state.

The major distributed systems in production today, including etcd, TiKV, and CockroachDB, remain unverified in any formal sense. AWS uses Dafny for Cedar, their policy evaluation language, showing one viable path: rigorous verification applied to a bounded, high-value component rather than a full distributed protocol stack.

What the IronFleet and Verdi comparison reveals is that the tool choice is not separable from the verification strategy. SMT automation favors protocols with well-structured quantifiers and linear arithmetic; interactive proof favors compositional abstractions and expressive type-based specifications. The right tool depends on what you are trying to prove, not just on which proof assistant your team knows.

IronFleet made the right bet for Multi-Paxos in 2015. The decade of follow-on work in both lineages suggests the field has not yet found a tool that resolves the trade-off cleanly, but Verus is the closest attempt so far, and the results from Anvil and projects like VeriBetrFS suggest it may be crossing a threshold that IronFleet’s original Dafny implementation could not reach.