· 7 min read ·

The Nine-Year Delay That Shaped Paxos's Reputation for Complexity

Source: lobsters

Leslie Lamport wrote “The Part-Time Parliament” in 1989. It described a consensus protocol through the extended metaphor of a fictional Greek legislature on the island of Paxos, where members could come and go mid-session without disrupting the proceedings. Reviewers at ACM Transactions on Computer Systems found the metaphor more confusing than illuminating and asked Lamport to rewrite the paper in a conventional style. He declined. The paper circulated as a DEC Systems Research Center technical report for nearly a decade before TOCS finally published it in 1998.

In 2001, Lamport published “Paxos Made Simple”, a 14-page distillation of the algorithm with an opening line that reads: “The Paxos algorithm, when presented in plain English, is very simple.” The line was pointed. Nine years of informal circulation had given the algorithm a reputation for inscrutability, and Lamport was correcting the record. A 2021 walkthrough at mydistributed.systems makes the same case: the core protocol, presented directly, is comprehensible to anyone who can follow a message exchange.

Both Lamport and that walkthrough are right about basic Paxos. The controversy around the algorithm’s difficulty comes from conflating it with what production systems require, and from the fact that Lamport left the production-relevant parts as an informal sketch.

The Protocol in Plain Terms

Basic Paxos solves single-decree consensus: getting a set of nodes to agree on exactly one value. There are two roles, proposers and acceptors, and the key structural element is the quorum. A quorum is any majority of acceptors; for five acceptors, three form a quorum. Any two quorums share at least one node, which is the property that carries information between rounds.

The protocol runs in two phases. In Phase 1, a proposer picks a globally unique ballot number n and sends a Prepare(n) message to a quorum of acceptors. Each acceptor that receives a ballot number higher than any it has previously promised to respond with two things: a promise to reject all future proposals below n, and its current accepted state, specifically the highest ballot it has accepted (acceptedN) and the value under that ballot (acceptedV).

Proposer → Acceptors (quorum): PREPARE(n)
Acceptors → Proposer:          PROMISE(n, acceptedN, acceptedV)

In Phase 2, the proposer collects a quorum of promises and selects a value. If any promise includes a previously accepted value, the proposer must use the one associated with the highest acceptedN among its responses. It cannot substitute its own preferred value. The proposer then sends Accept(n, v) to a quorum; acceptors that have not since made a higher promise respond with Accepted(n, v).

Proposer → Acceptors (quorum): ACCEPT(n, v)
Acceptors → Proposer:          ACCEPTED(n, v)

A value is chosen when a quorum of acceptors have sent Accepted for the same ballot. The acceptor logic is compact:

# Persistent state: min_proposal, accepted_n, accepted_v

def on_prepare(n):
    if n > min_proposal:
        min_proposal = n    # persist before responding
        return Promise(n, accepted_n, accepted_v)
    return Nack(n)

def on_accept(n, v):
    if n >= min_proposal:
        min_proposal = n    # persist before responding
        accepted_n = n
        accepted_v = v
        return Accepted(n, v)
    return Nack(n)

The persistence annotations are not optional. An acceptor that crashes and recovers without durable state can violate the safety invariant by promising the same ballot twice with different knowledge. This requirement is mentioned in the papers but understated; implementations that get it wrong have caused data loss in production systems.

The safety proof is an induction over ballot numbers. Once a value has been accepted by a quorum, any proposer that completes Phase 1 will see that value in at least one promise response, because any two quorums overlap. The proposer is then required to propagate it, which is what the value-selection rule enforces.

Lamport’s claim holds; the algorithm is genuinely understandable at this level of description.

Where Simple Ends

No production system runs basic Paxos. Production systems need a replicated log, a sequence of agreed values, not a single-value agreement. Getting there efficiently requires Multi-Paxos, and Lamport’s coverage of Multi-Paxos in “Paxos Made Simple” spans roughly one page, presented as an informal sketch rather than a specification.

The core optimization in Multi-Paxos is stable leadership. A leader can run Phase 1 once across all future log slots and then proceed directly to Phase 2 for each new entry, halving the per-consensus message count from four to two. This is the right intuition. The sketch leaves everything else to the implementor.

How does a new leader discover the state of uncommitted slots from the previous term? The new leader must run Phase 1 across all potentially uncommitted slots to determine which values might have been chosen, which requires knowing how many slots might exist. What happens to gaps in the log if the leader fails mid-slot? What happens to in-flight proposals from the old leader that a quorum may or may not have accepted? How does the cluster handle membership changes, adding or removing acceptors, without violating quorum properties during the transition? None of this is in Lamport’s paper.

Google’s experience building Chubby (Burrows, OSDI 2006), a distributed lock service backed by Paxos, is the most candid account of this gap in the literature. The paper notes that “we failed to appreciate the subtlety and difficulty of building a reliable, production-quality replicated state machine.” Chubby required implementing leader leases to allow safe local reads, epoch-based leader transitions with full log reconciliation, and crash recovery with careful replay semantics. None of that came from the papers.

Apache Cassandra takes a narrower approach. It uses Paxos only for lightweight transactions, the IF NOT EXISTS and compare-and-set operations, scoped to individual partition keys. The protocol requires four round trips in the worst case: prepare, promise, accept, commit. The Cassandra documentation is explicit that this path is expensive and should be avoided in throughput-sensitive code paths. That is Paxos used with appropriate constraint, not as a general coordination substrate.

Google Spanner (Corbett et al., ACM TODS 2013) uses Multi-Paxos with leader leases across thousands of Paxos groups, one per data shard, combined with TrueTime for globally consistent timestamps. The engineering surface is substantial: leader leases for performance, careful handling of leader changes to preserve read-your-writes semantics, and log compaction across geographically distributed replicas. Spanner is what production Multi-Paxos looks like, and it bears limited resemblance to the page-long sketch in “Paxos Made Simple.”

Van Renesse and Altinbuken’s “Paxos Made Moderately Complex” (ACM Computing Surveys, 2015) is the most complete attempt to fill the specification gap. It runs to approximately 50 pages and still requires implementors to make choices the paper leaves open.

Why Raft Exists

Diego Ongaro and John Ousterhout published “In Search of an Understandable Consensus Algorithm” (USENIX ATC 2014) as a direct response to the Multi-Paxos specification problem. The paper opens with the observation that Paxos is difficult not because consensus is inherently complex, but because the literature leaves too many design decisions implicit. Each team building a Paxos-based system makes those decisions independently, producing systems that are all called “Paxos” but behave differently: Chubby, ZooKeeper’s ZAB protocol, and Spanner share a theoretical foundation while being mutually incompatible.

Raft’s design response was decomposition and restriction. Leader election uses explicit RequestVote RPCs with term numbers. Log replication requires that a leader always has all committed entries before becoming leader, eliminating the new-leader recovery complexity that Multi-Paxos leaves unresolved. Raft forbids gaps in the log entirely. Membership changes are specified in the paper through a joint consensus phase, not deferred to the implementor.

Raft has no asymptotic or theoretical advantage over Paxos as a consensus algorithm. Its appeal comes from specification completeness, with choices made explicitly to favor comprehensibility over generality. The Raft paper includes a user study at Stanford and Berkeley in which students who learned Raft answered questions about leader election and log consistency more accurately than those who learned Paxos, and the gap was specifically traceable to the underspecified areas of Multi-Paxos. The study is not definitive, but it is the most empirical evidence available that the specification gap has measurable consequences.

etcd, the key-value store backing Kubernetes, uses Raft rather than Paxos for exactly this reason. When the CoreOS team evaluated consensus protocols, the availability of a complete specification and multiple reference implementations made Raft the pragmatic choice.

The Actual Lesson

The mydistributed.systems walkthrough makes the right point: basic Paxos is worth reading, and it is comprehensible. Lamport’s “Paxos Made Simple” is 14 pages and should be read directly. The two-phase structure, the quorum invariant, and the value-selection rule are all teachable in an afternoon, and understanding them gives you real intuition about distributed safety properties.

The confusion around Paxos as a production technology comes from the distance between that clean foundation and the requirements of an actual replicated state machine. Multi-Paxos, as it appears in the literature, is not one algorithm; it is a family of algorithms that share the basic Paxos safety core while making independent, underspecified choices about leader election, log management, membership changes, and read semantics. Every team that has shipped a production consensus system has made those choices differently, which is the honest reason why the distributed systems community has spent thirty years writing papers explaining Paxos.

The algorithm is simple; what you build on top of it is not. That distinction matters, and keeping it clear is what prevents the next team from quoting Lamport’s opening line while underestimating what comes after it.

Was this interesting?