Basic Paxos Is Simple; Multi-Paxos Is Not, and the Distinction Matters
Source: lobsters
There is a recurring frustration in distributed systems: engineers read the Paxos algorithm, find it manageable, then try to implement it in a real system and discover the gap between the two experiences is enormous. A 2021 walkthrough on mydistributed.systems makes the claim that Paxos, stated plainly, is not complicated. That claim is correct. It is also only half the story.
The confusion lives in a distinction that most resources skip over: Basic Paxos and Multi-Paxos are different things. The former is a protocol for reaching agreement on a single value. The latter is what you need to actually build anything, and it carries substantially more complexity. Treating them as the same algorithm is where the “Paxos is impenetrable” reputation comes from.
What Basic Paxos Actually Says
Leslie Lamport described the core of Paxos in Paxos Made Simple (2001), a paper deliberately written to be readable. The algorithm proceeds in two phases and involves three roles: proposers, acceptors, and learners.
Phase 1 is a handshake. A proposer picks a proposal number n and sends a Prepare(n) message to a quorum of acceptors. Each acceptor that has not already seen a higher proposal number responds with a promise: it will not accept any proposal numbered below n, and it reports back the highest-numbered proposal it has already accepted, if any.
Phase 2 is commitment. If the proposer collects promises from a quorum, it sends Accept(n, v) where v is either the value associated with the highest-numbered promise received (if any acceptor had already accepted something) or the proposer’s own value otherwise. Acceptors accept the proposal unless they have since seen a higher-numbered prepare request. Once a quorum accepts, the value is chosen.
That is the complete algorithm. The quorum requirement, typically a majority of nodes, ensures that any two quorums overlap by at least one member. That overlap is the mechanism that enforces consistency: a new proposer will always hear from at least one acceptor that knows about any previously accepted value.
The safety guarantee is clean. Only one value can be chosen. No value can be chosen that was not proposed. Any process that learns of a chosen value learns the correct one. The algorithm provides no liveness guarantees; if two proposers keep outbidding each other with increasing proposal numbers, the system can stall indefinitely. In practice this means Paxos needs a stable leader, but the algorithm itself does not specify one.
The Jump to Multi-Paxos
Basic Paxos solves single-shot consensus. Any real system needs to agree on a sequence of values, a log of commands that drives a replicated state machine. Running independent instances of Basic Paxos, one per log slot, is the starting point for Multi-Paxos, and the starting point is where the simplicity ends.
The first problem is phase-1 overhead. Running a full prepare-promise round for every log entry is expensive. Multi-Paxos collapses this by electing a stable leader that runs phase 1 once, receives promises covering all future slots, and then submits values directly with phase-2 messages until the lease or leadership expires. This optimization halves message latency per entry but introduces the concept of leadership with all its attendant concerns: how leadership is detected, how it expires, and what happens when a leader crashes mid-stream.
Leader recovery is where the real complexity begins. When a new leader takes over, it cannot simply start writing to the log from where the old leader stopped. Some slots may have been accepted by some acceptors but not committed. Some may be genuinely empty. The new leader must run phase 1 across every uncommitted slot to learn what values, if any, were accepted before taking over. Depending on how many uncommitted slots exist, this can be a large operation, and it must complete before the new leader can make progress.
Garbage collection adds another dimension. Acceptors must retain state for all slots that might still be in play; they cannot discard a slot’s state until they know a value has been committed and all replicas have caught up. Coordinating this across a cluster while the system continues processing new entries requires bookkeeping that is not obvious from the core algorithm.
Pipelining compounds the state management. To achieve throughput, a leader must have multiple slots in flight simultaneously rather than waiting for each to commit before proposing the next. This means slots can commit out of order relative to when they were proposed, and learners must handle gaps in their knowledge of the log. Filling those gaps on a live system without stalling or corrupting state is a genuine engineering challenge.
None of these complexities are unique to Paxos. They arise from the requirements of any replicated state machine protocol.
Why Raft Felt Simpler
Raft, introduced by Diego Ongaro and John Ousterhout in their 2014 USENIX paper, was explicitly designed to be more understandable than Paxos. It achieves this primarily by making the design decisions that Multi-Paxos leaves implicit into first-class protocol concerns.
Raft mandates a leader at all times and defines a clear election protocol. It organizes consensus around an explicit log with a strict append-only contract. It specifies exactly how a newly elected leader recovers uncommitted entries: any entry not present in a majority of nodes is considered uncommitted and cannot be preserved by a new leader that lacks it. These are the same constraints that Multi-Paxos must enforce; Raft simply names them and describes them directly.
The user-study component of the Raft paper found that students understood Raft better after reading both papers. That result is real and meaningful for pedagogy. It also reflects the fact that Multi-Paxos, as typically described, requires readers to derive the recovery and garbage-collection policies themselves from the algorithm’s invariants. Raft hands them to you.
This is why systems like etcd, CockroachDB, and Consul adopted Raft rather than Paxos. The underlying guarantees are equivalent; the specification was easier to implement correctly.
What Production Experience Reveals
Google’s Chubby paper (2006) describes a distributed lock service built on Paxos and notes that while the algorithm was understood, implementing it correctly in a real system required solving numerous problems not addressed by the algorithm description. The paper treats this as expected engineering work, not as a failure of Paxos.
Apache ZooKeeper uses a Paxos-inspired protocol called ZAB (ZooKeeper Atomic Broadcast), which makes different trade-offs around leader-based ordering to simplify implementation. ZooKeeper is used at scale in Kafka, HBase, and many Hadoop-adjacent systems precisely because ZAB is a well-specified protocol that implementors can reason about.
The pattern across these systems is consistent: the challenge is never the core consensus protocol. It is recovery semantics, membership changes, snapshot and compaction policies, and lease management. These concerns exist in every consensus system; they are properties of the problem domain, not of any particular algorithm.
The Useful Framing
For anyone trying to understand distributed consensus, the distinction between Basic Paxos and Multi-Paxos is worth holding clearly. Basic Paxos is simple. Read the 2001 paper and the explanation in the mydistributed.systems walkthrough and you will understand it without much difficulty.
Multi-Paxos is hard in the same way that Raft is hard: not in the core algorithm but in the production requirements that any consensus implementation must satisfy. Leader failure recovery, log gap filling, membership changes, and snapshot management are hard because the underlying problem is hard. Raft explains the solutions more clearly; Paxos leaves more derivation to the reader.
Knowing this changes how you evaluate both algorithms. Paxos is not a poorly designed protocol with a needlessly complex core. It is a well-proven algorithm with an underspecified implementation guide. The reputation for difficulty is accurate but misdirected; it attaches to the algorithm when it should attach to the problem.