Leslie Lamport opened his 2001 paper “Paxos Made Simple” with a sentence that reads almost as a rebuke: “The Paxos algorithm, when presented in plain English, is very simple.” The source article at mydistributed.systems echoes this claim, and the claim is true. The core algorithm can be stated in a page. Two phases, a quorum requirement, a single safety-preserving rule about which value a proposer must adopt. That is Paxos.
But the reputation Paxos has accumulated over three decades is not for its simplicity. It is for the difficulty of turning the algorithm into a running distributed system. Understanding why both things are true at once requires separating what Paxos actually specifies from what engineers need when they reach for it.
The Algorithm Itself
Paxos solves a specific problem: getting a distributed set of nodes to agree on a single value, even if some of those nodes crash. The nodes play three roles: proposers (which nominate values), acceptors (which vote), and learners (which observe the outcome). In practice these roles overlap, and a single server often plays all three.
The protocol runs in two phases.
Phase 1: Prepare and Promise. A proposer picks a proposal number n, higher than any it has used before, and broadcasts a Prepare(n) message to a majority of acceptors. Each acceptor that receives this message, if n is higher than any proposal number it has previously seen, replies with a promise: it will not accept any future proposal numbered less than n. Crucially, the acceptor also sends back any value it has already accepted, along with that value’s proposal number.
Phase 2: Accept and Accepted. The proposer collects a quorum of promises. If any promise carried a previously accepted value, the proposer must adopt the one with the highest proposal number rather than its own value. If no promise carried a value, the proposer is free to propose whatever it wants. It then sends Accept(n, value) to the acceptors. An acceptor that has not since promised to a higher-numbered proposer stores the value and replies Accepted.
If a quorum of acceptors reply Accepted, the value is decided.
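The acceptor's side of both phases fits in a few dozen lines. Here is a minimal sketch in Go (the `Acceptor` and `Promise` names are illustrative, not from any particular library), ignoring networking and persistence:

```go
package main

import "fmt"

// Promise is an acceptor's reply to Prepare. It may carry a
// previously accepted proposal, which the proposer must honor.
type Promise struct {
	OK            bool
	AcceptedN     int    // proposal number of any accepted value (0 = none)
	AcceptedValue string
}

// Acceptor holds the three pieces of state single-decree Paxos requires.
type Acceptor struct {
	promisedN     int    // highest proposal number promised so far
	acceptedN     int    // proposal number of the accepted value, 0 if none
	acceptedValue string
}

// Prepare handles Phase 1: promise not to accept proposals below n,
// and report back any value already accepted.
func (a *Acceptor) Prepare(n int) Promise {
	if n <= a.promisedN {
		return Promise{OK: false}
	}
	a.promisedN = n
	return Promise{OK: true, AcceptedN: a.acceptedN, AcceptedValue: a.acceptedValue}
}

// Accept handles Phase 2: store the value unless a higher-numbered
// Prepare has been promised in the meantime.
func (a *Acceptor) Accept(n int, value string) bool {
	if n < a.promisedN {
		return false
	}
	a.promisedN = n
	a.acceptedN = n
	a.acceptedValue = value
	return true
}

func main() {
	acc := &Acceptor{}
	fmt.Println(acc.Prepare(1).OK)            // true: nothing promised yet
	fmt.Println(acc.Accept(1, "x"))           // true: value stored
	fmt.Println(acc.Prepare(2).AcceptedValue) // "x": prior value reported back
	fmt.Println(acc.Accept(1, "y"))           // false: superseded by Prepare(2)
}
```

The final call shows the promise doing its work: once the acceptor has promised 2, the stale Accept(1, "y") is refused.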
The safety invariant is this: once a value has been accepted by a quorum, any future proposer that completes Phase 1 will discover at least one acceptor in its quorum that holds that value, and the rule forcing adoption of the highest-numbered previously accepted value guarantees the same value propagates forward. No two different values can ever be chosen.
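The value-adoption rule itself is a one-pass scan over the quorum's promises. A hypothetical helper (not from the paper) that returns the value a proposer must use in Phase 2:

```go
package main

import "fmt"

// promise mirrors an acceptor's Phase 1 reply: acceptedN is zero
// when that acceptor has never accepted a value.
type promise struct {
	acceptedN     int
	acceptedValue string
}

// chooseValue applies the safety-preserving rule: if any promise in
// the quorum carries a previously accepted value, adopt the one with
// the highest proposal number; otherwise the proposer's own value is free.
func chooseValue(promises []promise, own string) string {
	bestN, value := 0, own
	for _, p := range promises {
		if p.acceptedN > bestN {
			bestN, value = p.acceptedN, p.acceptedValue
		}
	}
	return value
}

func main() {
	// Two acceptors report nothing; one already accepted "v1" at n=3.
	quorum := []promise{{}, {}, {acceptedN: 3, acceptedValue: "v1"}}
	fmt.Println(chooseValue(quorum, "v2")) // "v1": the proposer must adopt it
}
```

Because any two quorums intersect, a value accepted by a quorum is guaranteed to appear in at least one of these promises, which is exactly how the invariant propagates forward.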
That really is the whole algorithm. An implementation of single-decree Paxos in Go runs to roughly 200 lines, including network boilerplate.
Why Paxos Spent Nine Years Unpublished
Lamport circulated the original paper, “The Part-Time Parliament,” in 1989. Reviewers at SIGACT and TOCS repeatedly rejected it. The paper framed the protocol as a legislative procedure on the fictional Greek island of Paxos, with senators, ballots, and decrees. Reviewers could not see through the metaphor to recognize it as a distributed systems paper. It finally appeared in ACM TOCS in 1998, along with a note from Lamport expressing considerable exasperation.
“Paxos Made Simple” arrived in 2001 with the metaphor stripped out and the algorithm written in direct prose. The algorithm had not changed. The presentation had.
This history matters because it is easy to conclude that Paxos was hard because of poor presentation. That conclusion is only half right. The algorithm itself, once stated plainly, is not hard to follow. What is hard is everything the algorithm does not say.
The Gap Between the Algorithm and a Real System
Single-decree Paxos agrees on one value. Production systems need to agree on a sequence of values: a replicated log of operations to apply to a state machine. The extension to a log is called Multi-Paxos, and Lamport described it informally but never specified it completely. Each slot in the log requires its own Paxos instance. A stable leader can skip Phase 1 for subsequent slots once it has established authority, reducing steady-state consensus to a single round trip per log entry. But leadership itself is not specified. Nor is:
- How to elect a leader in the first place
- How to detect leader failure and trigger a new election
- How to handle log gaps when a leader crashes mid-replication
- How to add or remove nodes from the cluster (reconfiguration)
- How learners reliably discover that a value has been decided
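The log structure Multi-Paxos implies, and the gap problem in the third bullet, can be sketched as a map from slot index to an independent consensus instance. This shape is an assumption on my part, since Lamport never specified Multi-Paxos completely:

```go
package main

import "fmt"

// instance is one single-decree Paxos run; Multi-Paxos assigns one
// per log slot.
type instance struct {
	value   string
	decided bool
}

// paxosLog is the replicated log: each slot gets its own instance.
type paxosLog struct {
	slots map[int]*instance
}

// decide records a decided value in a slot. In a real system this
// would run Phase 2 (and Phase 1, unless the node holds stable
// leadership established earlier); here we only model the structure.
func (l *paxosLog) decide(slot int, v string) {
	if l.slots == nil {
		l.slots = map[int]*instance{}
	}
	l.slots[slot] = &instance{value: v, decided: true}
}

// firstGap returns the first slot with no decided value -- the kind
// of hole a new leader must repair after a crash mid-replication,
// typically by running full Paxos for that slot.
func (l *paxosLog) firstGap() int {
	for i := 0; ; i++ {
		if inst, ok := l.slots[i]; !ok || !inst.decided {
			return i
		}
	}
}

func main() {
	var l paxosLog
	l.decide(0, "set x=1")
	l.decide(2, "set y=2") // slot 1 never decided: leader crashed mid-replication
	fmt.Println(l.firstGap()) // 1
}
```

Notice that nothing here says who gets to call decide, when, or what happens to slot 1; those are precisely the questions the bullet list leaves open.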
Every team that has implemented Paxos in production has had to invent answers to these questions independently. Google’s Chubby lock service, described in a 2006 OSDI paper by Mike Burrows, was one of the first public accounts of Paxos in a large-scale system. The paper is candid about how much engineering exists beneath the protocol: session management, master leases, health checking, failure detection. The algorithm is the easy part.
Apache Cassandra uses Paxos for its lightweight transactions, specifically for compare-and-set operations like INSERT ... IF NOT EXISTS. Each such operation runs a four-round protocol: the standard two Paxos phases, plus a read to check the condition and a commit broadcast. The latency cost is roughly four times a normal write, which is why Cassandra’s documentation warns against using lightweight transactions for anything on the hot path. The safety guarantee is real; the performance cost is non-trivial.
Google Spanner runs one Multi-Paxos group per shard and coordinates cross-shard transactions using TrueTime, a clock-synchronization system backed by GPS receivers and atomic clocks. This is arguably the highest-scale Paxos deployment that exists, and it required hardware infrastructure that most organizations cannot replicate.
The FLP Shadow
Paxos guarantees safety unconditionally: it will never allow two nodes to decide different values. It does not guarantee liveness: it may never decide anything at all.
This is not an oversight. The FLP impossibility result (Fischer, Lynch, Paterson, 1985) proves that no deterministic consensus algorithm can guarantee both safety and termination in a purely asynchronous system where even a single process may crash. Paxos navigates this by guaranteeing safety always and making progress only under partial synchrony, meaning when the network is well-behaved and a stable leader exists.
The concrete failure mode is dueling proposers. Suppose Proposer A sends Prepare(1) and gets promises. Before A can send Accept, Proposer B sends Prepare(2), invalidating A’s promises. A increments to Prepare(3), invalidating B. This cycle can continue indefinitely. No safety violation occurs, but no decision is ever reached. The fix in practice is a stable leader with randomized timeout-based election, which requires assuming partial synchrony. Paxos alone says nothing about this.
What Raft Actually Fixed
Diego Ongaro and John Ousterhout’s 2014 USENIX ATC paper, “In Search of an Understandable Consensus Algorithm,” presented Raft as a response to exactly this gap. Ongaro’s complaint, developed at length in his PhD thesis, was not that Paxos was mathematically incorrect. It was that Paxos provided no complete foundation for building a real system.
Raft makes different choices. It enforces a strong leader at all times; all writes flow through the leader to followers. Leader election uses randomized timeouts and is a first-class protocol component, not something implementors must invent. The replicated log is the primary abstraction, not an afterthought bolted onto single-decree consensus. Membership changes have a specified protocol using joint consensus. Ongaro’s user studies found that students who learned Raft significantly outperformed those who learned Paxos on comprehension questions.
The tradeoff is real. Raft’s strong leader can bottleneck throughput in write-heavy workloads. Paxos’s generality has enabled variants that Raft cannot easily match. Egalitarian Paxos (EPaxos; Moraru et al., SOSP 2013) allows any replica to commit non-conflicting commands in a single round trip with no designated leader, achieving near-optimal throughput and latency in geo-distributed deployments. This is only possible because Paxos does not prescribe a leader.
Reading the Source Article in Context
The mydistributed.systems post makes the same argument Lamport made in 2001: the algorithm is simple when you present it plainly. This is a useful corrective to the mysticism that has built up around Paxos. Engineers sometimes treat it as arcane when the core mechanics are genuinely accessible.
But the article’s claim should be understood precisely. Single-decree Paxos, the algorithm that achieves consensus on one value among a fixed set of nodes, is simple. The machinery that production systems need, the leader election, the log replication, the reconfiguration, the persistence requirements (acceptors must fsync their state before replying, or the safety guarantees collapse), the monitoring, the recovery procedures: none of that is specified by Paxos, and none of it is simple.
Lamport gave distributed systems a precise, minimal, correct solution to a well-scoped problem. The engineering work that surrounds it in any real deployment is substantial and largely left as an exercise. Recognizing both of these facts, rather than collapsing them into either “Paxos is simple” or “Paxos is impossible,” is what actually prepares you to work with consensus in production.
If you want a complete protocol that answers most of those implementation questions out of the box, Raft is the right starting point. If you want to understand why distributed consensus works at all, reading the short “Paxos Made Simple” is still the clearest path to the core insight.