The Gap Between Paxos on Paper and Paxos in Production

Lamport’s abstract in “Paxos Made Simple” opens with a statement that has become almost a provocation: “The Paxos algorithm, when presented in plain English, is very simple.” A 2021 post from mydistributed.systems takes that claim seriously and proves it out, walking through the protocol in clean prose without the Paxon parliament allegory that made Lamport’s original 1989 technical report so opaque. The result holds up. The core two-phase protocol, stripped of its historical packaging, is not difficult to follow.

The catch is that “the Paxos algorithm” in that context means something more specific than most people assume. It means single-decree Paxos: an algorithm for reaching agreement on a single value, once. That is a solved problem with a clean correctness proof. Multi-Paxos, which extends the mechanism to a replicated log and is what any practical system actually needs, is a different problem. Lamport describes it only informally, and the gap between the two is where most of the difficulty in Paxos-based systems lives.

The Core Protocol

Basic Paxos involves three kinds of participants: proposers, who initiate consensus; acceptors, who vote; and learners, who observe the outcome. Any node can fill multiple roles. The algorithm runs in two phases, and each proposal carries a unique, monotonically increasing proposal number, typically implemented as a (sequence_number, server_id) pair to ensure uniqueness without coordination.

In Phase 1, a proposer broadcasts Prepare(n) to a majority quorum of acceptors. Each acceptor that receives a prepare message with proposal number n higher than any it has previously responded to sends back a Promise(n, n_a, v_a): a commitment not to accept proposals numbered below n, plus the highest-numbered proposal it has already accepted, if any.

In Phase 2, the proposer collects promises from a majority and sends Accept(n, v). The value v is constrained by one rule: if any promise carried a previously accepted value, v must be the value from the promise with the highest accepted proposal number. If no acceptor has accepted anything before, the proposer can use whatever value it wants. Acceptors that receive Accept(n, v) and have not since promised to ignore proposal n record the proposal and notify learners.

The acceptor logic fits in a few lines:

on receive Prepare(n):
  if n > max_promised_n:
    max_promised_n = n
    reply Promise(n, accepted_n, accepted_v)

on receive Accept(n, v):
  if n >= max_promised_n:
    accepted_n = n
    accepted_v = v
    reply Accepted(n, v)

The safety argument rests on one invariant, which Lamport calls P2c: if a proposal with number n and value v is issued, there must exist a majority of acceptors such that either none of them has accepted any proposal numbered below n, or v is the value of the highest-numbered accepted proposal among them. This rule forces a new proposer to inherit whatever value a prior round may have already committed. It ensures that once a value is chosen by any majority, all future proposals with higher numbers carry that same value, preserving agreement across failures.

The proof follows by induction. The quorum intersection property guarantees that any two majorities share at least one acceptor, so Phase 1 of any new proposal will always discover the highest previously accepted value in the system. The minimum cost of a decision is four message delays, two round trips: Prepare and Promise, then Accept and Accepted.

The Liveness Hazard

Safety is unconditional in Basic Paxos. Liveness is not. Two proposers can indefinitely prevent each other from making progress: P1 runs Phase 1 with n=1, P2 runs Phase 1 with n=2 (invalidating P1’s upcoming Phase 2), P1 runs Phase 1 with n=3 (invalidating P2’s Phase 2), and the cycle continues. Neither proposer ever completes Phase 2. This dueling proposers problem is a structural property of the algorithm, not a bug, and it has a specific cause: any node is allowed to act as a proposer at any time.

The standard mitigation is to designate a single leader so competing proposals do not arise in steady state. But Basic Paxos specifies nothing about how that leader is chosen or maintained. Leader election is the first of several problems that Multi-Paxos introduces without resolving.

What Multi-Paxos Leaves Unspecified

For a replicated log, each log position requires a separate consensus instance. Running full two-phase Paxos per log slot costs at minimum four message delays per write. Multi-Paxos amortizes Phase 1: if a stable leader exists, it runs Phase 1 once to establish authority over a range of log positions and skips directly to Phase 2 for each new entry. Steady-state cost drops to two message delays per slot, one round trip for Accept and the corresponding Accepted replies.

The optimization is real and significant. The underspecification is also real. Lamport’s papers leave the following to implementers, with little guidance:

Leader election. How a node becomes leader, how long its authority lasts, and how the transition is managed. Randomized timeouts are common, but nothing about the election mechanism itself is specified.

Log recovery. When a new leader takes over, it cannot simply append from the current log tip. Every log position that might have had a proposal in flight under the old leader is potentially in an indeterminate state: the old leader may have sent Accept to some acceptors but not achieved a majority before crashing. The new leader must run Phase 1 on each such position, discover whether any value was already accepted, and either confirm or re-propose as needed. This recovery process adds latency on every leader transition and requires careful implementation.

Log compaction. No production system can maintain an unbounded log. Snapshots must be coordinated with ongoing consensus in a way the papers do not specify.

Membership changes. Adding or removing nodes from the quorum requires a protocol that prevents two overlapping majorities from existing simultaneously during the transition period.

Read safety. A client reading from any follower may see stale data if that follower has not received recent writes. Safe reads require either routing through the leader or running a quorum read protocol, each of which has performance implications that depend on the specific system design.

Google’s Chubby lock service paper from 2006 is the first widely known account of Paxos in production. The subsequent “Paxos Made Live” paper, co-authored by the Chubby engineering team, catalogs the decisions required to bridge from the published algorithm to a working system: leader leases, epoch numbers for reads, disk corruption handling, snapshot coordination, and more. None of it was covered by Lamport’s papers. The authors note that implementing Paxos was significantly harder than the publications suggested.

Apache Cassandra uses Paxos for a specific, narrow purpose: lightweight transactions, which provide compare-and-set semantics on individual keys. For general replication, Cassandra uses an eventually consistent model, avoiding the full complexity of Multi-Paxos at scale. That is a deliberate scoping decision that reflects how much engineering Paxos-based replication actually requires.

Why Raft Was a Useful Contribution

Raft was designed to solve the underspecification problem directly. Diego Ongaro and John Ousterhout’s 2014 paper “In Search of an Understandable Consensus Algorithm” positions Raft as a complete specification of Multi-Paxos, with explicit choices at every point where Paxos implementations diverge. Leader election uses randomized timeouts with a defined term structure. Log entries flow strictly from leader to followers in sequence order, so there are no log holes requiring recovery: a candidate can only win an election if its log is at least as up-to-date as a majority, guaranteeing that the winner always has the complete committed log. Membership changes use joint consensus. Snapshotting is specified.

Raft and Multi-Paxos are equivalent in what they guarantee; the difference is completeness of specification. This is why etcd, which underpins Kubernetes, and CockroachDB use Raft rather than Paxos. The algorithm is not inherently better, but the specification leaves substantially less room for implementation-defined behavior, which matters for correctness in open-source systems with many contributors who need to reason about the protocol independently.

Google Spanner uses Paxos groups for per-shard replication and layers TrueTime on top for external consistency across shards. These are systems maintained by teams deep enough to work through the underspecified parts of Multi-Paxos. For projects that need broad community maintainability, Raft’s completeness is a practical advantage.

Calibrating the Claim

The claim that Paxos in plain English is simple is accurate for what Lamport defined: a two-phase protocol for single-decree consensus with a clean invariant and a short correctness proof. It is not accurate as a description of Multi-Paxos, which adds leader management, log recovery, compaction, membership changes, and read consistency, none of which the papers address in full.

For engineers approaching distributed consensus, the source article and “Paxos Made Simple” provide a genuine foundation. Understanding why Phase 1 must report previously accepted values, and what the quorum intersection property guarantees, gives you the theoretical grounding to reason about any consensus-based system. But building something requires more than that foundation. Read the Raft paper afterward, or study the etcd source. The jump from the algorithm to a correct implementation requires a map that Lamport never drew, and knowing that gap exists is half the work of navigating it.