Constructing Executions That Break Paxos

Murat Demirbas recently published a short post framed around breaking Paxos, and the Lobsters submission called it “Beat Paxos.” The framing is adversarial by design, and that is exactly the right frame. Paxos is provably safe; it is not provably live, and Multi-Paxos as a family leaves enough open that you can construct executions, write TLA+ specs, or run Jepsen against real implementations and find failures that the theory does not prohibit.

This is different from Paxos being wrong. It means the protocol has a precise boundary, and the interesting engineering is understanding where that boundary runs and what falls outside it.

What Basic Paxos Actually Guarantees

Lamport’s Paxos Made Simple (2001) is explicit about this. The safety property is unconditional: at most one value is ever chosen, and any chosen value was proposed. The liveness property is conditional on message delivery and a single active proposer. Those conditions can be violated, and the protocol does not prevent it.

This is a deliberate design choice, not an oversight. Consensus in asynchronous networks cannot be both safe and guaranteed to terminate; that is the FLP impossibility result (Fischer, Lynch, Paterson, 1985). Paxos chooses safety unconditionally and accepts conditional liveness. The question is how bad the liveness failure can be made in practice.

Dueling Proposers as a Formal Execution

The canonical liveness violation is the dueling proposers scenario, and it is worth describing precisely as a constructed execution rather than as a vague hazard.

Let there be five acceptors and two proposers, P1 and P2. P1 sends Prepare(1) to a quorum of three acceptors and collects promises. Before P1 can send Accept(1, v), P2 sends Prepare(2) to an overlapping quorum, invalidating P1’s Phase 2. P1 detects the failure, increments to Prepare(3), collects promises from a quorum, and before it can send Accept(3, v), P2 sends Prepare(4) to an overlapping quorum. The cycle continues indefinitely.

This is not a timing coincidence or a pathological hardware fault. A sufficiently adversarial scheduler can maintain it forever. The standard mitigation is a single designated leader, but Basic Paxos specifies nothing about leader election, and Multi-Paxos specifies it only informally. The mechanism that breaks liveness is baked into the permission structure of the algorithm: any node is allowed to issue a Prepare at any time, and a Prepare(n) with a high enough n will invalidate any in-flight Accept from a lower-numbered proposal.

The adversarial construction here is not exotic. It requires only two concurrent proposers and a network that delivers messages in a specific interleaved order. In a real system, the same effect arises from aggressive timeout settings, split-brain network conditions, or a leader that is slow but not dead.

The Phase 1 Sweep as an Attack Vector

Multi-Paxos introduces a more subtle liveness and correctness hazard in the leader transition. When a new leader takes over, it cannot safely append to the log from the tip. Every slot that the previous leader might have had in flight could be in an indeterminate state: some acceptors may have received and stored an Accept for that slot while others received nothing. The new leader’s Phase 1 sweep exists to resolve this.

The sweep works by running Prepare(n) for every potentially uncommitted slot. If any acceptor returns a previously accepted value for slot k, the new leader must propagate that value to completion, even if it is not the value the new leader would have chosen. This is the mechanism that maintains safety across leadership changes.

The attack vector is in the scope of the sweep. The new leader needs to know how many slots might be uncommitted. If it uses an estimate that is too low, it misses slots that were partially accepted and will create conflicting decisions for those slots when it later re-uses them. If it uses an estimate that is too high, the sweep is expensive and blocks new writes during recovery.

Paxos Made Live (Chandra, Griesemer, Redstone, 2007) is the most direct public account of this problem. The paper describes how Chubby’s implementation handles leader recovery and is unusually candid about the engineering complexity that Basic Paxos does not address. The authors found that correctly bounding the sweep required tracking a per-slot watermark across leadership terms, a piece of bookkeeping that the paper-level algorithm says nothing about.

This is not a bug in Paxos. It is a gap in the specification of Multi-Paxos that every implementation fills differently. The gap means that two systems both described as “Paxos-based” can behave differently during leader transitions, even if both are safe in isolation.

TLA+ and the Decisions That Specification Forces

Formal model checking via TLA+ does not find bugs by running tests. It finds bugs by exhaustively exploring the state space of a model up to a bounded depth and checking whether any reachable state violates a specified invariant. For Paxos, this is directly applicable.

When you write a TLA+ spec for Basic Paxos, the safety invariant is easy to express and the model checker confirms it holds across all reachable executions up to the state bound. When you extend the spec to Multi-Paxos, you immediately encounter the underspecification in a concrete form: you cannot write down the leader election mechanism, the sweep bounds, or the log compaction policy without making explicit choices. The model checker then checks your choices, not the abstract algorithm.

This is where TLA+ provides value that reading the papers does not. The papers can be underspecified in ways that are easy to miss. A TLA+ spec must be complete enough to run, and every gap in the spec is a gap you have to fill. When you fill the sweep bounds with a specific policy and the model checker finds a state where two leaders simultaneously believe they own overlapping log ranges, the spec has surfaced a design decision that the original paper deferred.

Leslie Lamport’s own TLA+ spec of Paxos is available in the TLA+ examples repository. It covers Basic Paxos. The Multi-Paxos extension has to be written by the implementor, and the exercise of writing it is substantially more informative than reading any prose description of the protocol.

Jepsen on etcd

Kyle Kingsbury’s Jepsen analyses have tested several Raft-based systems because Raft is what most Paxos-family systems in production use. The etcd 2.2.0 analysis from 2015 found that etcd lost acknowledged writes under specific network partition scenarios. The etcd 3.4.3 analysis in 2020 found no consistency anomalies, but found correctness issues in the client library that allowed stale reads through bookkeeping bugs at the client layer rather than the consensus layer.

This distinction matters. Jepsen tests the full system, not the algorithm. A correct Raft or Paxos implementation can still be exposed to clients in a way that violates the guarantees the consensus layer provides. Lease expiry bugs, client retry logic that does not account for partial failures, and read paths that bypass the consensus log are all surfaces where a formally correct algorithm can fail at the integration layer.

For Paxos-specific systems, the Jepsen analysis of Apache Cassandra found lost writes in Cassandra’s lightweight transactions, which run a full Basic Paxos instance per conditional write. The failure was not in the Paxos protocol itself but in how Cassandra’s implementation handled the commit phase notification, specifically an inconsistency in when a value was considered chosen versus when it was durably applied. Constructing the failing execution required fault injection at a specific phase boundary, the kind of adversarial timing that unit tests do not reach.

Reconfiguration Races

Reconfiguration, adding or removing nodes from a Paxos group, is arguably the hardest part of Multi-Paxos. Lamport addressed it in a separate paper, Reconfiguring a State Machine (2010), which runs to over thirty pages and introduces alpha-based reconfiguration, a mechanism where a configuration change takes effect only after a fixed number of commands in the log. The length of the paper relative to the original algorithm description is a reasonable proxy for the difficulty ratio.

The adversarial scenario is a race between an old configuration completing a consensus round and a new configuration starting one before the transition is fully propagated. If the quorum definition changes during an active Phase 2, a value can be accepted by a majority of the old quorum without being seen by a majority of the new quorum. The result is two separate consensus decisions that each believe they are correct.

This is the split-brain scenario at the reconfiguration boundary. The alpha-based approach prevents it by enforcing a gap between when the reconfiguration command is agreed upon and when it takes effect, ensuring no in-flight decisions span the boundary. The implementation requirement is that the system must decline to use any slot in the gap range for non-reconfiguration purposes, which is coordination overhead that does not appear in Basic Paxos at all.

Raft handles this via joint consensus, a two-phase approach where during the transition period both the old and new quorums must agree before any decision is complete. The Raft paper (Ongaro and Ousterhout, 2014) was partly motivated by the difficulty of getting this right under Multi-Paxos’s informal specification. Raft’s explicit joint consensus protocol makes the reconfiguration race visible and closes it by construction.

Flexible Paxos and What It Changed

Flexible Paxos (Howard, Malkhi, Spiegelman, 2016) exposed a constraint in classical Paxos that was more conservative than safety required. The classical protocol uses the same quorum size for Phase 1 and Phase 2, requiring a majority in both. Flexible Paxos shows that the only requirement is that the Phase 1 quorum and the Phase 2 quorum intersect. You can use a larger quorum for Phase 1 and a smaller one for Phase 2, or any combination as long as the intersection property holds.

This matters because it breaks the implicit assumption that Paxos performance is symmetric. In a five-node cluster, a system that expects reads to be rare could use a Phase 1 quorum of four and a Phase 2 quorum of two: any write completes with two acknowledgments, but any leader transition requires hearing from four nodes. This trades leader transition cost for write latency, which is the right trade-off in many read-heavy workloads.

The broader significance of Flexible Paxos is that it forced a re-examination of which constraints in classical Paxos were load-bearing and which were defaults inherited from the original formulation. The quorum intersection requirement is non-negotiable; the specific quorum sizes are not. Systems like CockroachDB have incorporated similar insights into their replication layer.

The Formal and Adversarial Frame

The value of treating Paxos adversarially, whether through TLA+ model checking, Jepsen fault injection, or constructing specific bad executions by hand, is that it turns vague correctness claims into testable propositions. Saying “Paxos is safe” is only useful if you can also say precisely what execution would need to occur to violate safety, and confirm that the protocol rules out that execution.

For liveness and the Multi-Paxos gaps, the adversarial frame is even more important. The dueling proposers scenario, the Phase 1 sweep boundary problem, and the reconfiguration race are not theoretical curiosities. They correspond to real failure modes that have appeared in production implementations. Constructing them explicitly, in a spec or a test harness, is the engineering method that closes the distance between the algorithm on paper and the system running in production.