Paxos has carried a reputation for inscrutability since Leslie Lamport submitted “The Part-Time Parliament” to ACM Transactions on Computer Systems in 1989. The paper sat in review for nearly a decade, partly because reviewers found its Greek parliament allegory confusing and a little eccentric. When ACM published it in 1998, the damage to Paxos’s reputation was largely done. Lamport followed up in 2001 with Paxos Made Simple, a nine-page paper that dropped the allegory entirely and opened with this: “The Paxos algorithm, when presented in plain English, is very simple.”
A plain-English walkthrough of the core protocol really is short. The algorithm fits in a handful of paragraphs. So the question worth examining is why, if the algorithm is so simple, it became the emblem of distributed systems complexity and why teams building consensus infrastructure keep reaching for Raft instead.
There are two different things called Paxos. One is the single-decree consensus protocol. The other is the production system you build when you need replicated state at scale. They are related but not the same thing, and conflating them is where most of the confusion originates.
The Protocol, Stated Concretely
Single-decree Paxos solves one specific problem: getting a set of nodes to agree on a single value, even if some nodes are unavailable or slow. It involves three roles. Proposers suggest values. Acceptors vote on proposals. Learners observe committed values. The protocol runs in two phases.
In Phase 1, a Proposer picks a proposal number n that is unique across all Proposers and broadcasts a Prepare(n) message to a quorum of Acceptors, typically a majority. Each Acceptor checks whether n exceeds any proposal number it has previously promised to honor. If so, it responds with Promise(n, n_a, v_a): a commitment not to accept any proposal numbered less than n, plus the highest-numbered proposal (n_a, v_a) it has already accepted, or null if it has accepted nothing.
In Phase 2, the Proposer collects a quorum of promises. If any Acceptor returned a previously accepted value, the Proposer must use the value paired with the highest n_a among all responses. If no prior value exists, the Proposer may choose freely. It then broadcasts Accept(n, v) to the quorum. Each Acceptor that has not since made a higher promise accepts the value and notifies Learners.
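The two phases described above fit comfortably in a few dozen lines. The following is a minimal in-memory sketch, assuming synchronous, loss-free message delivery; the names (Acceptor, run_round) are illustrative, not drawn from any reference implementation.

```python
class Acceptor:
    def __init__(self):
        self.promised_n = -1    # highest proposal number promised
        self.accepted_n = -1    # highest proposal number accepted
        self.accepted_v = None  # value accepted alongside accepted_n

    def prepare(self, n):
        """Phase 1: return a Promise if n exceeds any prior promise."""
        if n > self.promised_n:
            self.promised_n = n
            return (self.accepted_n, self.accepted_v)  # prior accepted pair, if any
        return None  # reject: already promised a higher number

    def accept(self, n, v):
        """Phase 2: accept unless a higher promise has been made since."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_v = v
            return True
        return False


def run_round(acceptors, n, value):
    """One Proposer round against all acceptors; majority quorum."""
    # Phase 1: gather promises from a majority.
    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) <= len(acceptors) // 2:
        return None  # no quorum of promises
    # If any Acceptor reported a prior value, re-propose the one
    # paired with the highest n_a; otherwise choose freely.
    prior = max(promises, key=lambda p: p[0])
    chosen = prior[1] if prior[0] >= 0 else value
    # Phase 2: ask the quorum to accept.
    acks = sum(a.accept(n, chosen) for a in acceptors)
    return chosen if acks > len(acceptors) // 2 else None


acceptors = [Acceptor() for _ in range(5)]
print(run_round(acceptors, n=1, value="A"))  # first round chooses "A"
print(run_round(acceptors, n=2, value="B"))  # later round rediscovers "A"
```

The second call demonstrates the re-proposal obligation: round 2 tries to propose "B" but discovers "A" among the promises and is forced to re-propose it.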
The invariant that makes this safe: if a value v was chosen in a round numbered n, then Phase 1 of any subsequent round with a higher number will discover v among the promise responses and be obligated to re-propose it. This follows from quorum intersection. Any two majorities share at least one node, so at least one Acceptor in any new quorum has already accepted v and will report it.
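The quorum-intersection fact the invariant rests on is small enough to check exhaustively for a fixed cluster size. This brute-force sketch (the node count of five is an arbitrary choice) enumerates every pair of majorities and confirms they always share a node:

```python
from itertools import combinations

nodes = set(range(5))
# Every subset of size 3, 4, or 5 is a majority of a 5-node cluster.
majorities = [set(c) for k in range(3, 6) for c in combinations(nodes, k)]
# Any two majorities intersect: together they would need 6 of 5 nodes otherwise.
assert all(q1 & q2 for q1 in majorities for q2 in majorities)
print("every pair of majorities intersects")
```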
The message types are Prepare, Promise, Accept, and Accepted, covering two phases and one quorum-intersection invariant. The protocol is implementable in a few hundred lines of code. Lamport’s claim holds up.
Where the Complexity Lives
The problem is that single-decree Paxos solves a problem almost no production system actually has. Production systems need a replicated log: an ordered sequence of commands that every node applies in the same order, indefinitely. The standard approach is Multi-Paxos, which runs a separate Paxos instance for each log slot.
For efficiency, Multi-Paxos elects a stable leader and lets that leader skip Phase 1 for subsequent slots after winning an initial round, since it already holds valid promises from a quorum. Without this optimization, every log append costs two round trips. With it, steady-state appends cost one. The optimization is essential for any write-heavy workload.
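The steady-state saving can be sketched as follows, under the assumption that the leader has already won Phase 1 with proposal number n across a quorum and so can go straight to Phase 2 for each new slot. SlotAcceptor and LeaderLog are hypothetical names for illustration.

```python
class SlotAcceptor:
    """Acceptor tracking one promise number across all log slots."""
    def __init__(self):
        self.promised_n = 0
        self.accepted = {}  # slot -> (n, command)

    def accept_slot(self, slot, n, command):
        if n >= self.promised_n:
            self.accepted[slot] = (n, command)
            return True
        return False


class LeaderLog:
    def __init__(self, n, quorum_size):
        self.n = n                    # proposal number won in the initial Phase 1
        self.quorum_size = quorum_size
        self.log = []                 # committed commands, in slot order

    def append(self, acceptors, command):
        """Phase 2 only: one round trip per append in the steady state."""
        slot = len(self.log)
        acks = sum(a.accept_slot(slot, self.n, command) for a in acceptors)
        if acks >= self.quorum_size:
            self.log.append(command)
            return slot
        return None  # leadership lost: a higher promise exists somewhere


accs = [SlotAcceptor() for _ in range(3)]
leader = LeaderLog(n=1, quorum_size=2)
leader.append(accs, "set x=1")  # slot 0, single round trip
leader.append(accs, "set y=2")  # slot 1, single round trip
```

Each append is a single Accept broadcast rather than a Prepare/Accept pair, which is the one-round-trip steady state described above.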
But this optimization requires leader election, and Lamport’s paper says almost nothing useful about how to implement it. A new leader must detect that the old one has failed, broadcast its own Phase 1 to establish authority, and then recover the log by determining which slots have been committed, which have been accepted by a quorum but not yet learned by all nodes, and which are genuine gaps because no quorum ever formed. The recovery protocol has correctness requirements that interact with the core invariants in non-obvious ways.
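One piece of that recovery, deciding what the new leader must re-propose per slot, can be sketched as a pure function over the Phase 1 responses. This is a deliberate simplification: it ignores retries, learner notification, and the filling of genuine gaps with no-op commands, and recover is an illustrative name.

```python
def recover(promises):
    """promises: one {slot: (n_a, value)} map per Acceptor in the new
    leader's Phase 1 quorum. Returns {slot: value} the leader must
    re-propose: for each reported slot, the value paired with the
    highest accepted proposal number."""
    by_slot = {}
    for acceptor_state in promises:
        for slot, (n_a, value) in acceptor_state.items():
            best = by_slot.get(slot)
            if best is None or n_a > best[0]:
                by_slot[slot] = (n_a, value)
    return {slot: value for slot, (n_a, value) in by_slot.items()}


# Two acceptors report: slot 0 was accepted twice under different numbers,
# slot 1 only once. The leader must re-propose the higher-numbered value.
promises = [{0: (1, "a"), 1: (1, "b")}, {0: (2, "c")}]
print(recover(promises))  # {0: "c", 1: "b"}
```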
None of this appears in Paxos Made Simple. Membership changes, log compaction, snapshot transfer, and client interaction are also absent. These are not minor implementation details. They represent the majority of the engineering work in any real deployment.
Google documented this gap directly in Paxos Made Live, a 2007 paper by the engineers who implemented Paxos for the Chubby lock service. They reported that the distance between Lamport’s description and the requirements of a real system was substantial: master leases, group membership, and failure recovery required significant engineering work that the literature did not cover. The paper’s observation, that Paxos requires practitioners to work out much of the surrounding machinery independently, was influential in how the distributed systems community began to think about the algorithm’s true difficulty.
How This Shaped the Ecosystem
Raft, designed by Diego Ongaro and John Ousterhout and published at USENIX ATC in 2014, was a direct response to this gap. Their stated goal was understandability, and their central argument was not that Paxos is incorrect but that it is underspecified as a practical system. They decomposed consensus into three subproblems (leader election, log replication, and safety) and made explicit decisions for each one that Paxos leaves open.
The strong leader model in Raft constrains the design in ways that simplify reasoning. In Paxos, multiple nodes can act as Proposers concurrently, which is safe but produces scenarios where one Proposer discovers and incorporates the prior proposals of another. Raft eliminates this by routing all writes through a single elected leader. The tradeoff is less flexibility in consensus topology, but the benefit is a system whose behavior under failure is easier to trace and reproduce.
Raft defines an explicit term counter that increments with every leader election. Stale messages from old leaders are identifiable by their term number and can be safely discarded. This makes failure recovery deterministic in a way that aids both implementation and debugging. etcd, which provides the state store behind Kubernetes, uses Raft. CockroachDB uses Raft for per-range consensus. TiKV, the storage engine behind TiDB, uses Raft. The breadth of adoption reflects what teams find tractable to implement and reason about under pressure, not a verdict on Paxos’s theoretical foundations.
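The term check itself is a few lines, assuming every message carries the sender's term. RaftNode and handle_append are illustrative names for this sketch, not the API of etcd or any other implementation.

```python
class RaftNode:
    def __init__(self):
        self.current_term = 0

    def handle_append(self, leader_term, entries):
        """Reject any message from a leader with a stale term."""
        if leader_term < self.current_term:
            return False  # stale leader: the sender will learn the newer
                          # term from the reply and step down
        # A higher term means a newer leader exists; adopt its term.
        self.current_term = max(self.current_term, leader_term)
        # ... append entries, reset election timer ...
        return True


node = RaftNode()
node.handle_append(5, [])         # current leader at term 5: accepted
print(node.handle_append(3, []))  # old leader at term 3: False
```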
Apache ZooKeeper takes a different path. It uses ZAB (ZooKeeper Atomic Broadcast), a Paxos variant expressed in terms of atomic broadcast rather than generic state machine replication. The Yahoo engineers found that framing the protocol as broadcast made some recovery invariants easier to specify. ZAB’s epoch concept plays the same role as Raft’s term: a monotonic counter that lets nodes detect and reject messages from stale leaders. ZAB diverges from Multi-Paxos in its recovery semantics but shares the same foundational intuition about quorum intersection.
What the Simplicity Claim Is Actually For
Lamport’s point in 2001 was not that distributed systems engineering is easy. It was that the core idea behind consensus, the quorum-intersection argument that forces new rounds to propagate prior accepted values, is not mysterious. The complexity practitioners encountered was complexity they added when building the surrounding machinery that the protocol does not specify.
This distinction has practical value when you’re learning the material. Starting with the single-decree algorithm builds a mental model for the core invariants. Once you understand why Phase 1 forces a Proposer to discover and re-propose any previously accepted value, you can reason about what a leader failover must accomplish: the new leader runs Phase 1 across all uncertain log slots to establish their state before accepting new writes. Once you understand quorum intersection, you understand why you need 2f+1 nodes to tolerate f failures, and why that bound is tight rather than conservative.
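The tightness of that bound follows from two constraints pulling against each other: a quorum must fit inside the n - f nodes that survive f failures, and any two quorums must intersect, which forces quorum size above n/2. Both hold exactly when n >= 2f + 1. A quick mechanical check, with majority as an illustrative helper:

```python
def majority(n):
    """Smallest subset size guaranteeing any two such subsets intersect."""
    return n // 2 + 1

for f in range(1, 10):
    n = 2 * f + 1
    # With 2f+1 nodes, f failures still leave a full quorum of survivors.
    assert n - f >= majority(n)
    # With only 2f nodes, f failures leave f survivors, one short of a
    # majority (f+1), so the bound cannot be lowered.
    assert (n - 1) - f < majority(n - 1)
print("2f+1 is sufficient and necessary for f in 1..9")
```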
Starting with a complete system description like Raft gives you a more immediately useful specification for building production software. The two approaches serve different purposes. Raft is better for implementing; single-decree Paxos is better for understanding the mechanism underneath.
The distributed systems community spent roughly a decade conflating these two levels of description, treating the absence of a complete Multi-Paxos specification as evidence that the core algorithm was obscure. What was missing was documentation of the surrounding system, and Raft provided that documentation explicitly rather than leaving it to practitioners to discover through implementation and failure.
Single-decree Paxos remains worth understanding because it exposes the minimum required for consensus without obscuring it in engineering decisions about leader election or log management. The algorithm is small and clean. The engineering built on top of it is substantial and was largely undocumented until Raft and similar specifications made it explicit. Those are two separate observations, and keeping them separate is what makes both of them useful.