Certification Is Not Serializability: What Jepsen Found in MariaDB Galera Cluster
Kyle Kingsbury’s Jepsen analysis of MariaDB Galera Cluster 12.1.2 lands in a long tradition of Jepsen reports that reveal the distance between what a distributed system advertises and what it delivers under fault injection. Galera’s case is interesting because the gap is not primarily an implementation bug. It is architectural, and it is shared by every system built on certification-based replication, including Percona XtraDB Cluster and MySQL Group Replication in multi-primary mode. Understanding the report means understanding the protocol.
How Certification Replication Actually Works
Galera is a synchronous multi-master replication plugin for MariaDB and MySQL, built around the wsrep API. Every node accepts writes. There are no read replicas or designated primaries in the traditional sense.
When a transaction commits on any node, the wsrep layer converts it into a write set: a binary structure containing the row-level changes and a key set, which is the collection of primary key and unique index values for every row the transaction wrote. This write set is broadcast to all cluster nodes via the group communication system, a custom protocol called EVS (Extended Virtual Synchrony) that guarantees all nodes receive write sets in the same total order.
Each node then runs a certification test. It checks whether the incoming write set’s key set overlaps with the key sets of any write sets that were certified after the incoming transaction took its snapshot. If there is an overlap, the transaction fails certification and the client receives ER_LOCK_DEADLOCK (error 1213), which signals that the application should retry. If there is no overlap, the write set is certified and queued for application.
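The certification rule described above can be sketched as a toy model. This is a minimal illustration, not Galera's implementation: the names WriteSet, Certifier, and snapshot_seqno are hypothetical, and sequence numbers are only assigned to write sets that pass, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteSet:
    txn_id: str
    snapshot_seqno: int   # last certified seqno the transaction had seen at its snapshot
    keys: frozenset       # keys of the rows the transaction wrote


class Certifier:
    """Toy certifier. Every node would run the same logic on the same ordered stream."""

    def __init__(self):
        self.seqno = 0
        self.certified = []   # list of (seqno, keys) for write sets that passed

    def certify(self, ws: WriteSet) -> bool:
        # Fail if any write set certified after our snapshot wrote one of our keys.
        for seqno, keys in self.certified:
            if seqno > ws.snapshot_seqno and keys & ws.keys:
                return False
        self.seqno += 1
        self.certified.append((self.seqno, ws.keys))
        return True


certifier = Certifier()
row = ("accounts", 1)
print(certifier.certify(WriteSet("T1", 0, frozenset({row}))))  # True
print(certifier.certify(WriteSet("T2", 0, frozenset({row}))))  # False: T1 wrote the row after T2's snapshot
print(certifier.certify(WriteSet("T3", 1, frozenset({row}))))  # True: T3's snapshot already includes T1
```

Because the decision depends only on the ordered stream of write sets, every node reaches the same verdict without a second round of messages. In this model the second transaction is the one whose client would see error 1213.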
Here is the critical distinction: certification happens in a globally agreed order, but application to the storage engine on non-originating nodes is asynchronous. The originating node commits immediately after certification. Other nodes apply the write set in the background via applier threads. What Galera’s documentation calls “synchronous replication” therefore covers delivery, ordering, and certification of the write set, not its application on every node, which is a narrow definition of synchrony.
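A toy model makes the consequence of asynchronous application concrete. Everything here is hypothetical (the Node class, the commit helper, the sync_wait flag), and certification is assumed to have already succeeded; the point is only to show the window in which a remote node has received a write set but not yet applied it.

```python
class Node:
    """Toy cluster node: certified write sets queue up until an applier runs."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.pending = []   # certified but not yet applied write sets

    def receive(self, changes):
        self.pending.append(changes)

    def apply_pending(self):
        for changes in self.pending:
            self.data.update(changes)
        self.pending.clear()

    def read(self, key, sync_wait=False):
        # With sync_wait, catch up on everything certified before answering.
        if sync_wait:
            self.apply_pending()
        return self.data.get(key)


def commit(origin, others, changes):
    """Assumes certification passed: origin commits now, the rest only enqueue."""
    origin.data.update(changes)
    for node in others:
        node.receive(changes)


a, b = Node("A"), Node("B")
commit(a, [b], {"balance": 100})
print(a.read("balance"))                  # 100
print(b.read("balance"))                  # None: certified, but not yet applied on B
print(b.read("balance", sync_wait=True))  # 100
```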
The Read-Set Problem
The certification test only tracks write sets. It does not track what a transaction read.
Consider two concurrent transactions on different nodes, T1 and T2. Both read rows A and B to check a condition that spans the two, for example that at least one of two doctors is still on call. T1 then updates row A based on what it read; T2 updates row B on the same basis. Their write sets touch different rows, so the key set overlap is empty. Both pass certification. Both commit. The resulting database state reflects a decision made by each transaction on a premise that the other transaction has invalidated.
This anomaly has a formal name: write skew, or G2-item in the Adya isolation taxonomy that Jepsen’s Elle checker uses. Snapshot isolation permits it; serializability does not. Galera’s certification protocol, by tracking only written rows, cannot detect it. This is not a bug that can be patched in a point release; fixing it would require carrying read sets alongside write sets, which would substantially increase their size and the cost of certification checks.
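The blind spot is easy to reproduce in a few lines. The sketch below compares a write-only certification rule with a hypothetical read-aware rule, using two transactions that each read rows A and B and then write one of them. Both function names are invented for illustration, and the read-aware rule is not something Galera offers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    name: str
    reads: frozenset
    writes: frozenset


def certify_write_only(txn, concurrent):
    """Rule as described in the text: compare written keys only."""
    return not any(txn.writes & other.writes for other in concurrent)


def certify_with_reads(txn, concurrent):
    """Hypothetical stricter rule: also fail if a concurrent commit wrote a row we read."""
    return not any((txn.writes | txn.reads) & other.writes for other in concurrent)


t1 = Txn("T1", reads=frozenset({"A", "B"}), writes=frozenset({"A"}))
t2 = Txn("T2", reads=frozenset({"A", "B"}), writes=frozenset({"B"}))

# T1 is ordered first; T2 is certified against it.
print(certify_write_only(t1, []))     # True
print(certify_write_only(t2, [t1]))   # True: disjoint write sets, so write skew slips through
print(certify_with_reads(t2, [t1]))   # False: T1 wrote row A, which T2 read
```

The second rule catches the conflict only because each transaction's read keys travel with it, which is exactly the extra payload and certification cost mentioned above.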
What Jepsen Found
The Jepsen report documents several classes of anomaly, and the Elle checker’s transaction history analysis provides the formal evidence. The most significant findings fall into a few categories.
Stale reads are the most operationally common problem. With the default setting wsrep_sync_wait=0, a read on node B proceeds without waiting for pending certified write sets to be applied. A write acknowledged on node A may not yet be visible on node B. Applications that route reads to any node in the cluster, which is the common pattern with load balancers, silently read stale data. This violates read-your-writes consistency. The fix exists: setting wsrep_sync_wait=1 forces a synchronization check before any read, ensuring all certified write sets have been applied. The cost is added latency on every read, which is why most deployments leave it at zero.
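Because wsrep_sync_wait can be set per session, an application can pay the latency only on the reads that need it. The helper below is a sketch, assuming a Python DB-API cursor from a MySQL-compatible driver; the function name read_with_sync_wait and its restore parameter are hypothetical, and restoring to 0 assumes the session started at the default.

```python
def read_with_sync_wait(cursor, query, params=(), mask=1, restore=0):
    """Run one read with the causality check enabled, then restore the session value.

    `cursor` is assumed to be a DB-API cursor on a Galera node.
    `restore` is the value to put back afterwards (assumed default: 0).
    """
    cursor.execute("SET SESSION wsrep_sync_wait = %d" % int(mask))
    try:
        cursor.execute(query, params)
        return cursor.fetchall()
    finally:
        cursor.execute("SET SESSION wsrep_sync_wait = %d" % int(restore))
```

A call might look like read_with_sync_wait(cur, "SELECT balance FROM accounts WHERE id = %s", (42,)), where the table and column are placeholders. Reads that tolerate staleness can keep using the cursor directly.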
Lost updates under certification gaps are rarer but more dangerous. Jepsen found conditions, particularly involving concurrent inserts and certain index structures, where two conflicting write sets both passed certification, both committed on their originating nodes, and one was silently overwritten during application on remote nodes. The client that “won” certification had no indication that its data was subsequently overwritten. This is the kind of anomaly that causes money to disappear from bank accounts or inventory counts to go negative without any error surfacing.
Stale reads during network partitions revealed a problem with Galera’s primary component mechanism. When the cluster partitions, nodes without quorum are supposed to stop serving writes. With certain configurations, including pc.ignore_quorum or misconfigured pc.weight values, a partitioned node continued serving reads from a state that was diverging from the rest of the cluster. An application reading from that node believed it was reading authoritative data.
The wsrep_sync_wait Knob Is Not a Minor Setting
Galera’s documentation treats wsrep_sync_wait as a performance tuning parameter. The Jepsen report frames it as a correctness parameter. The distinction matters for how operators reason about their deployments.
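The parameter is a bitmask. Bit 1 enables the sync check before reads (SELECT and BEGIN/START TRANSACTION); bit 2 before UPDATE and DELETE; bit 4 before INSERT and REPLACE; bit 8 before SHOW. Bits combine, so 3 covers reads, updates, and deletes, and 7 covers reads and all writes. The default is 0, which enables none of them. For most OLTP applications, setting it to 1 is the minimum required to get read consistency that matches what developers intuitively expect from a database cluster.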
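When auditing a configuration it can help to decode the mask explicitly. The sketch below follows the bit assignments as I understand them from the MariaDB documentation (1 for reads, 2 for UPDATE and DELETE, 4 for INSERT and REPLACE, 8 for SHOW); the SyncWait enum and the describe helper are hypothetical names, and the bit meanings should be checked against the documentation for the version in use.

```python
from enum import IntFlag


class SyncWait(IntFlag):
    READ = 1             # SELECT and BEGIN/START TRANSACTION
    UPDATE_DELETE = 2
    INSERT_REPLACE = 4
    SHOW = 8


def describe(value):
    """Return the names of the statement classes a wsrep_sync_wait value covers."""
    mask = SyncWait(value)
    return [flag.name for flag in SyncWait if flag & mask]


print(describe(0))  # []
print(describe(1))  # ['READ']
print(describe(3))  # ['READ', 'UPDATE_DELETE']
print(describe(7))  # ['READ', 'UPDATE_DELETE', 'INSERT_REPLACE']
```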
The MariaDB Knowledge Base entry on wsrep_sync_wait explains the semantics, but the connection between this setting and the consistency guarantees the cluster provides is underemphasized in deployment guides and tutorials. Many teams discover the stale-read behavior in production rather than in testing, which is precisely what Jepsen-style testing is designed to prevent.
How This Compares to MySQL Group Replication
MySQL Group Replication, which powers MySQL InnoDB Cluster, uses Paxos (via the XCom component) for transaction ordering rather than Galera’s EVS protocol. This gives Group Replication stronger formal guarantees around membership changes and split-brain scenarios.
But on write-set certification, Group Replication shares Galera’s fundamental limitation. In multi-primary mode, write sets are certified against each other using row-level key overlap, read sets are not tracked, and write skew is possible. The Group Replication documentation explicitly notes multi-primary mode limitations and recommends single-primary mode for workloads requiring strong consistency.
Single-primary Group Replication avoids the write-skew problem by routing all writes through one node, making it effectively a primary-replica setup with automatic failover. Galera has no single-primary mode; all nodes are always writable. This is both a feature and a constraint. Applications that need the write scalability of multi-master replication must accept the consistency trade-offs that come with it, or they must implement application-level conflict resolution.
The Broader Lesson from Jepsen’s Elle Checker
The fact that these anomalies were found with Elle rather than a traditional linearizability checker like Knossos is worth noting. Knossos models distributed systems as single-object registers and checks whether operations could have occurred in some legal sequential order. It is the right tool for key-value stores.
Elle models multi-object transactional workloads. It constructs a dependency graph where edges represent relationships between transactions (write-write, write-read, and read-write anti-dependency) and checks for cycles. Cycles in this graph correspond to named anomalies in Adya’s taxonomy, such as G0, G1c, G-single, and G2-item, which cover phenomena like lost updates and write skew; non-cyclic anomalies such as aborted reads (G1a) are checked separately. This is the correct framework for analyzing relational databases with multi-row transactions.
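The core idea of cycle-based checking fits in a short sketch. This is not Elle's algorithm, which also has to infer the edges from an observed history. It is a plain depth-first cycle search over hand-written edges, with a simplified reading of Adya's classes; find_cycle and classify are hypothetical names.

```python
def find_cycle(edges):
    """edges: list of (src, dst, kind), kind in {"ww", "wr", "rw"}.
    Returns the list of edges forming one cycle, or None if the graph is acyclic."""
    graph = {}
    for src, dst, kind in edges:
        graph.setdefault(src, []).append((dst, kind))
        graph.setdefault(dst, [])
    state = {}   # node -> "visiting" or "done"
    path = []    # edges along the current DFS stack

    def visit(node):
        state[node] = "visiting"
        for dst, kind in graph[node]:
            path.append((node, dst, kind))
            if state.get(dst) == "visiting":
                # dst is on the stack: the cycle starts at its outgoing edge.
                start = next(i for i, e in enumerate(path) if e[0] == dst)
                return path[start:]
            if dst not in state:
                found = visit(dst)
                if found:
                    return found
            path.pop()
        state[node] = "done"
        return None

    for node in list(graph):
        if node not in state:
            found = visit(node)
            if found:
                return found
    return None


def classify(cycle):
    """Simplified mapping from edge kinds in a cycle to an Adya anomaly name."""
    kinds = [kind for _, _, kind in cycle]
    rw = kinds.count("rw")
    if rw >= 2:
        return "G2-item"
    if rw == 1:
        return "G-single"
    return "G1c" if "wr" in kinds else "G0"


# Write skew: each transaction read a row the other one overwrote.
skew = [("T1", "T2", "rw"), ("T2", "T1", "rw")]
print(classify(find_cycle(skew)))  # G2-item
```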
Galera had been deployed in production for over a decade before this analysis. The wsrep_sync_wait stale-read issue was known in the community. The certification false-negative and write-skew findings are less widely documented, and their formal characterization through Elle gives operators a precise vocabulary for the risks they are accepting.
The report does not conclude that Galera is unusable. It concludes that Galera should not be used as a drop-in replacement for a strongly consistent single-node database without careful attention to configuration and application design. That is a reasonable conclusion, and it is the kind of precision that distinguishes a Jepsen analysis from a vague caveat in a README. For teams running Galera in production, the report is a configuration audit checklist as much as it is a criticism.