Basic Paxos Is Simple, Multi-Paxos Is Not: The Distinction Matters

Leslie Lamport opened his 2001 paper “Paxos Made Simple” with a sentence that has irritated distributed systems engineers ever since: “The Paxos algorithm, when presented in plain English, is very simple.” The linked article attempts to prove that claim, and it succeeds. The confusion is that Lamport and his critics are talking about different things.

The algorithm itself is simple. The system you have to build around it is not. These two things are simultaneously true, and conflating them has produced decades of unnecessary argument.

What Basic Paxos Actually Says

Basic Paxos solves one narrow problem: getting a set of distributed nodes to agree on a single value, despite some nodes crashing at arbitrary points. Three roles participate: proposers, acceptors, and learners. A proposer wants to get some value committed. Acceptors vote. Learners observe what was decided.

The protocol runs in two phases.

In Phase 1, a proposer picks a proposal number n that is unique and higher than any it has used before, then sends Prepare(n) to a majority of acceptors. Each acceptor that has not already promised to ignore proposals below n responds with a promise: it will not accept anything numbered below n, and it includes whatever value it has already accepted, if any.

In Phase 2, if the proposer receives promises from a majority, it sends Accept(n, v) to those acceptors. The value v is constrained: if any acceptor in the majority had already accepted a value under a prior proposal, the proposer must use the value associated with the highest such proposal number. If none had, the proposer may use any value it wants. An acceptor that receives Accept(n, v) accepts it unless it has since promised a higher number to someone else.

A value is committed when a majority of acceptors have accepted it. Learners find out through some notification mechanism, which the protocol does not specify further.

The safety invariant at the core of this, which Lamport calls P2c, is: if a proposal with number n and value v is issued, then there exists a majority of acceptors such that either none of them has accepted any proposal numbered below n, or the highest-numbered proposal accepted among them has value v. Quorum intersection enforces this automatically: any two majorities share at least one acceptor, so a new proposer always discovers what was previously accepted.

That is genuinely the whole algorithm. The safety proof fits in nine paragraphs. The implementation runs to a few hundred lines of code. Lamport is not wrong.

A Decade in a Drawer

The actual history of Paxos makes the simplicity claim more interesting. Lamport wrote “The Part-Time Parliament” in 1989, framing the protocol as a legislative procedure on a fictional Greek island where senators would occasionally leave the chamber to handle other business. He submitted it to SIGACT and then to ACM TOCS. Reviewers at both venues rejected it, unable to see through the allegory to recognize the distributed systems content.

The paper sat unpublished for nine years. Lamport eventually sent a cleaned-up version to TOCS in 1998, which accepted it. Then in 2001 he wrote “Paxos Made Simple” with the parliament fiction stripped out entirely, as a direct response to the complaints that his earlier presentation was incomprehensible.

The irony is that the algorithm did not change between the 1989 draft and the 2001 simplification. The presentation changed. Which suggests that a significant portion of Paxos’s reputation for difficulty was always a presentation problem, not an algorithmic one.

Where the Difficulty Actually Lives

Basic Paxos solves a single-value consensus problem. Real systems need to replicate a sequence of values over time, which is a log of state machine commands. Multi-Paxos is the name for the extension that runs separate Paxos instances for each log slot. The name implies a specification. It is not one.

Lamport described the key optimization in “Paxos Made Simple”: if a single proposer acts as stable leader, it can skip Phase 1 for subsequent log slots after establishing its authority in the first. This reduces the steady-state cost from four message delays to two, one round trip per log entry. That optimization is real and significant.

Everything else is left open. Leader election is not specified. How to handle the transition when a leader fails is not specified. Log recovery when a new leader takes over is not specified. Snapshots, membership changes, how clients safely read committed state: none of it.

The log recovery problem is particularly subtle. When a new leader takes over, every log slot that had a proposal in flight under the old leader is in an indeterminate state. The old leader may have sent an Accept message to some acceptors but not achieved a majority before crashing. The new leader cannot simply start appending at the end of whatever log it sees locally. It must run Phase 1 on every uncertain slot to discover whether a value was already accepted by a majority, then either confirm that value or fill the slot before accepting new writes. This adds latency on every leader transition and requires careful bookkeeping.

The “dueling proposers” livelock is another artifact of underspecification. Two proposers can invalidate each other’s promises indefinitely: A issues Prepare(1), B issues Prepare(2) before A can send Accept(1), A issues Prepare(3) to invalidate B, and so on. No safety violation occurs but no decision is ever reached. The standard fix is a stable leader with randomized election timeouts, but Paxos does not tell you that.

This is why “Paxos Made Live”, the 2007 paper from Google’s Chubby team that included Lamport as a co-author, is such a useful complement. It documents the engineering required to turn the algorithm into a production system: leader leases to allow safe reads, epoch numbers to identify stale messages from old leaders, disk corruption handling, snapshot coordination, and multi-master reconfiguration. The paper is not a critique of Paxos. It is an acknowledgment that the algorithm specifies the minimum necessary for correctness and leaves the rest to implementers.

Why Raft Won

Raft, Diego Ongaro and John Ousterhout’s 2014 consensus algorithm, was an explicit response to this underspecification problem. Raft and Multi-Paxos provide equivalent safety guarantees. The difference is that Raft makes the implementation decisions that Paxos left open.

Leader election in Raft uses randomized timeouts and a monotonically increasing term counter. Any message from a node with a lower term can be safely ignored. Log entries flow strictly from leader to followers in sequence, with no holes that require recovery. A candidate can only win an election if its log is at least as up-to-date as a majority. Membership changes use a specified joint consensus approach. Snapshotting is specified. Read safety is specified.

Ongaro’s user studies showed that students who learned Raft significantly outperformed those who learned Paxos on comprehension questions. That result is often cited as evidence that Raft is simpler. The more accurate reading is that Raft is more completely specified. Completeness and simplicity overlap but are not the same thing.

The practical consequence is that every major open-source distributed database built in the last decade runs Raft: etcd (which underpins Kubernetes), CockroachDB, TiKV, Consul. The algorithm’s completeness made it more maintainable in open-source contexts where implementation decisions cannot be handed off to a small team of distributed systems specialists.

Raft’s tradeoff is that its strong leader requirement limits throughput flexibility. Egalitarian Paxos (EPaxos), a 2013 variant, allows any replica to commit non-conflicting commands in a single round trip without a designated leader, which is only possible because Paxos leaves leader behavior unspecified. That flexibility is still being explored in research systems, though production deployments remain rare.

What the Simplicity Debate Is Actually About

The engineers who call Paxos hard are not wrong. They are describing the experience of building production systems that Paxos does not fully specify. The researchers who call Paxos simple are also not wrong. They are describing the algorithm’s core correctness proof.

Lamport’s approach was to provide the minimum correct solution to a well-scoped problem. The scoping was intentional. Basic Paxos solves single-value consensus; Multi-Paxos extends it to logs; everything else is system design. That is a reasonable place to draw the line for an algorithm paper.

The problem was that engineers read “Paxos Made Simple” expecting a complete blueprint and found a correct but underspecified sketch. The gap between what the paper provides and what production requires is where the difficulty lives, and that gap is real.

Raft resolved the argument not by inventing a new algorithm but by completing the specification. That distinction is worth keeping in mind when evaluating any distributed systems paper: correctness and completeness are separate properties, and a proof of the former does not imply the latter.