Galera's Certification Protocol Tells You About Writes, Not Reads

Kyle Kingsbury’s Jepsen analysis of MariaDB Galera Cluster 12.1.2 is worth reading carefully, not because it finds surprising bugs, but because it makes explicit what the protocol was never designed to prevent. The findings are rooted in how write-set certification works at the protocol level, and any database built on this approach, including Percona XtraDB Cluster and MySQL Group Replication in multi-primary mode, shares the same structural limitations.

The Certification Protocol

Galera is a multi-master replication plugin for MariaDB and MySQL, built around the wsrep API (Write-Set Replication). Every node accepts reads and writes. When a transaction commits on any node, the wsrep layer converts it into a write set: a binary structure containing the actual row-level changes and a key set, which is the collection of primary key and unique index values for every row the transaction wrote. This write set is broadcast to all cluster nodes via a group communication system called EVS (Extended Virtual Synchrony), which guarantees that every node receives write sets in the same total order.

Each node then runs a certification test. The test compares the incoming write set’s key set against the key sets of write sets certified after the incoming transaction took its snapshot. If any keys overlap, the transaction fails with ER_LOCK_DEADLOCK (error 1213) and the client is expected to retry. If no keys overlap, the write set is certified, queued for application, and the originating node commits immediately. Other nodes apply the write set asynchronously in the background via applier threads controlled by wsrep_slave_threads.
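The test described above can be sketched in a few lines of Python. This is a toy model, not Galera's implementation: the WriteSet and Certifier names, the snapshot_seqno field, and the tuple-shaped keys are all assumptions made for illustration. It shows only the rule stated in the text: reject a write set if a write set certified after its snapshot touched an overlapping key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteSet:
    """Hypothetical, simplified write set: only the key set matters here."""
    txn_id: str
    snapshot_seqno: int   # last sequence number visible when the transaction began
    keys: frozenset       # primary/unique key values of the rows it wrote


class Certifier:
    """Toy certification index; each node would run this on the same ordered stream."""

    def __init__(self):
        self.seqno = 0
        self.certified = []   # (seqno, keys) for every certified write set

    def certify(self, ws: WriteSet) -> bool:
        # Conflict: something certified after ws took its snapshot wrote the same key.
        for seqno, keys in self.certified:
            if seqno > ws.snapshot_seqno and keys & ws.keys:
                return False  # the client would see error 1213 and retry
        self.seqno += 1
        self.certified.append((self.seqno, ws.keys))
        return True


certifier = Certifier()
t1 = WriteSet("t1", snapshot_seqno=0, keys=frozenset({("accounts", 1)}))
t2 = WriteSet("t2", snapshot_seqno=0, keys=frozenset({("accounts", 1)}))
t3 = WriteSet("t3", snapshot_seqno=0, keys=frozenset({("accounts", 2)}))
results = [certifier.certify(ws) for ws in (t1, t2, t3)]
print(results)  # [True, False, True]
```

Because every node sees the same write sets in the same order and the check is deterministic, each node reaches the same verdict without any further coordination.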

The key asymmetry is one of timing: the originating node commits as soon as certification succeeds, while every other node applies the write set some time later. Galera’s documentation describes this as “synchronous replication,” but the synchrony applies specifically to the certification ordering step, not to when data becomes readable on other nodes. That distinction is the source of most of the problems Jepsen found.

What Certification Cannot See

The certification test tracks written keys. It does not track what a transaction read.

This matters because many real conflict scenarios turn on reads, not writes. The canonical example is write skew: two transactions running concurrently on different nodes each read the same row, each decide to write to different rows based on what they read, and both succeed because their write sets touch disjoint keys. The certification test sees no overlap and certifies both. The final database state reflects two decisions made against a shared precondition that neither transaction could see the other invalidating.

In Atul Adya’s isolation taxonomy, which Jepsen’s Elle checker uses to classify findings, this is called G2-item. Detecting it requires tracking read sets alongside write sets. Galera does not track read sets. Closing this gap would require including read keys in write sets, which increases their size and the cost of certification checks proportionally to how many rows a transaction reads. For OLAP-style transactions touching large ranges, that cost would be substantial. The current design reflects a deliberate trade-off, not an oversight.
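A minimal sketch of the gap, using the familiar on-call example: two doctors each check that the other is on call, then take themselves off. The dictionary layout and the certify_all and conflicts helpers are hypothetical, and all transactions are assumed to be concurrent. The track_reads flag models a certifier that also compared read keys, which the text says Galera does not do.

```python
def conflicts(txn, concurrent, track_reads):
    """Return True if txn must abort given already-certified concurrent transactions."""
    for other in concurrent:
        if txn["writes"] & other["writes"]:
            return True   # write-write overlap: the only thing write-set certification sees
        if track_reads and txn["reads"] & other["writes"]:
            return True   # txn read a row that a concurrent transaction overwrote
    return False


def certify_all(txns, track_reads):
    certified = []
    for txn in txns:      # processed in the agreed total order
        if not conflicts(txn, certified, track_reads):
            certified.append(txn)
    return [t["id"] for t in certified]


# Both transactions read both on-call rows, then each takes a different doctor off call.
t1 = {"id": "t1", "reads": {"alice", "bob"}, "writes": {"alice"}}
t2 = {"id": "t2", "reads": {"alice", "bob"}, "writes": {"bob"}}

write_only = certify_all([t1, t2], track_reads=False)
with_reads = certify_all([t1, t2], track_reads=True)
print(write_only)  # ['t1', 't2']  -> write skew: nobody is left on call
print(with_reads)  # ['t1']        -> t2 is rejected and must retry
```

The extra loop over read keys is where the cost mentioned above comes from: it grows with every row a transaction reads, not only with the rows it writes.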

What Jepsen Found

The Jepsen analysis uses Elle to instrument real transactional workloads, inject faults, record the full history of operations, and check that history for anomaly signatures using dependency graph analysis. Four categories of problems emerged.

Stale reads at default configuration. With the default wsrep_sync_wait=0, reads on any node proceed without waiting for pending certified write sets to be applied. A write acknowledged on node A may not yet be visible on node B. Any application load-balancing reads across cluster nodes, which is the common deployment pattern, can silently read stale data. When a client’s write goes to node A and its next read goes to node B, this violates read-your-writes consistency. Setting wsrep_sync_wait=1 forces a synchronization check before reads, ensuring all certified write sets are applied first, at the cost of added read latency.

Lost updates from certification false negatives. Under concurrent inserts with certain index structures, two conflicting write sets both passed certification, both were acknowledged as committed on their originating nodes, and one was silently overwritten during application on remote nodes. Neither client received an error. This is a lost update, labeled P4 in Berenson et al.’s critique of the ANSI SQL isolation levels. The client whose write was overwritten had already received a successful commit and had no way to discover the loss.

Write skew confirmed structurally. Elle found G2-item anomalies in transaction histories, confirming that write skew occurs as a consequence of the read-set gap described above. This is not a configuration-dependent finding; it reflects what the certification protocol can and cannot track.

Partition misconfigurations allow diverging reads. Galera’s primary component mechanism is supposed to stop non-quorum nodes from serving writes. With pc.ignore_quorum=true or misconfigured pc.weight values, a partitioned node continued serving reads from a state diverging from the rest of the cluster. Applications reading from that node believed they were reading authoritative data.
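To see why pc.weight values matter, here is a rough sketch of a weighted quorum check. It assumes the rule given in the Galera documentation: a component stays primary when its total weight exceeds half the weight of the last primary component, not counting nodes that left gracefully. The has_quorum function and the node names are hypothetical.

```python
def has_quorum(current, last_primary, left_gracefully=frozenset()):
    """Weighted quorum check (sketch).

    current, last_primary: dicts mapping node name -> pc.weight
    left_gracefully: nodes from last_primary that shut down cleanly
    """
    expected = sum(w for node, w in last_primary.items() if node not in left_gracefully)
    present = sum(current.values())
    return present > expected / 2


# Equal weights: a lone partitioned node loses quorum, the majority side keeps it.
equal = {"a": 1, "b": 1, "c": 1}
lone_equal = has_quorum({"a": 1}, equal)
majority_equal = has_quorum({"b": 1, "c": 1}, equal)

# Skewed weights: node "a" alone outweighs the other two nodes combined.
skewed = {"a": 3, "b": 1, "c": 1}
lone_skewed = has_quorum({"a": 3}, skewed)
majority_skewed = has_quorum({"b": 1, "c": 1}, skewed)

print(lone_equal, majority_equal)    # False True
print(lone_skewed, majority_skewed)  # True False
```

With skewed weights, a single isolated node can remain primary while the two-node side of the partition cannot, which is rarely what an operator intends.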

The `wsrep_sync_wait` Framing Problem

Galera’s documentation and most deployment guides treat wsrep_sync_wait as a performance tuning parameter. The Jepsen report reframes it as a correctness parameter, which is the more accurate characterization.

The MariaDB documentation for wsrep_sync_wait describes it as a bitmask: bit 0 (value 1) synchronizes before reads (SELECT and BEGIN/START TRANSACTION), bit 1 (value 2) before UPDATE and DELETE, bit 2 (value 4) before INSERT and REPLACE, and bit 3 (value 8) before SHOW statements. The default is 0, no synchronization. For any application expecting that a read following an acknowledged write will see that write, wsrep_sync_wait=1 is the minimum required configuration, not an optional performance trade-off.
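If enabling the setting globally is too expensive, it can be scoped to the reads that need it, because wsrep_sync_wait is also a session variable. The sketch below assumes a DB-API style cursor with an execute method. The causal_reads helper, the RecordingCursor stand-in, and the accounts table are hypothetical, and the helper assumes the session default was 0 when it restores the setting.

```python
from contextlib import contextmanager


@contextmanager
def causal_reads(cursor, mask=1):
    """Enable wsrep_sync_wait for the enclosed statements, then restore 0."""
    cursor.execute(f"SET SESSION wsrep_sync_wait = {int(mask)}")
    try:
        yield cursor
    finally:
        # Assumption: the session started at the default of 0.
        cursor.execute("SET SESSION wsrep_sync_wait = 0")


class RecordingCursor:
    """Stand-in for a real DB-API cursor so the sketch runs without a server."""

    def __init__(self):
        self.statements = []

    def execute(self, sql, params=None):
        self.statements.append(sql)


cur = RecordingCursor()
with causal_reads(cur) as c:
    c.execute("SELECT balance FROM accounts WHERE id = %s", (42,))
print(cur.statements)
```

Only the statements inside the with block pay the synchronization latency; every other read on the connection keeps the default behavior and can still be stale.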

Jepsen found the stale-read behavior through systematic testing. The behavior is known in the Galera community and documented, but the connection between this setting and the consistency guarantees that applications implicitly assume is not emphasized in most operational documentation. Teams discover it in production.

Historical Context

This is not Jepsen’s first look at certification-based replication. Kyle Kingsbury’s 2015 “Call Me Maybe” post on Percona XtraDB Cluster found similar stale-read behavior and split-brain scenarios under partition. The difference between that analysis and the 2026 Galera 12.1.2 report is primarily the tooling: the 2015 work used Knossos, which models a distributed system as a single-object register and checks for linearizability. Elle models multi-object transactional workloads and checks for cycles in a dependency graph encoding write-write, write-read, and anti-dependency (read-write) edges. The mapping to named anomalies in Adya’s taxonomy makes the newer findings more precise and more actionable for operators.
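The cycle check can be illustrated with a small sketch. This is not Elle's code, and it skips the hard part, which is inferring the edges from an observed history. It assumes the edges are already known, finds one cycle by depth-first search, and labels it using Adya's definitions: a cycle containing an anti-dependency (rw) edge is G2-item, a cycle of ww and wr edges is G1c, and a cycle of ww edges alone is G0. The find_cycle and classify names are hypothetical.

```python
def find_cycle(edges):
    """edges: list of (src, dst, kind), kind in {'ww', 'wr', 'rw'}.
    Returns the edges of one cycle, or None if the graph is acyclic."""
    graph = {}
    for src, dst, kind in edges:
        graph.setdefault(src, []).append((dst, kind))

    def dfs(node, path, on_path):
        for dst, kind in graph.get(node, []):
            step = path + [(node, dst, kind)]
            if dst in on_path:
                # Trim the path so it starts where the cycle starts.
                start = next(i for i, e in enumerate(step) if e[0] == dst)
                return step[start:]
            found = dfs(dst, step, on_path | {dst})
            if found:
                return found
        return None

    for start in graph:
        cycle = dfs(start, [], {start})
        if cycle:
            return cycle
    return None


def classify(cycle):
    kinds = {kind for _, _, kind in cycle}
    if "rw" in kinds:
        return "G2-item"
    if "wr" in kinds:
        return "G1c"
    return "G0"


# Write skew: each transaction read a row the other one then overwrote.
skew = [("t1", "t2", "rw"), ("t2", "t1", "rw")]
skew_cycle = find_cycle(skew)
print(classify(skew_cycle))  # G2-item

serial = [("t1", "t2", "ww"), ("t2", "t3", "wr")]
print(find_cycle(serial))    # None
```

The exhaustive search is fine for a handful of transactions but would not scale to real histories.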

How Other Systems Handle the Read-Set Problem

The systems that close the write-skew gap do so by tracking read sets. CockroachDB implements serializable snapshot isolation using a distributed MVCC approach with hybrid logical clocks. Transactions track both their write sets and their read sets; conflicts on either dimension trigger retries. Jepsen has tested CockroachDB across multiple versions with generally positive results on isolation guarantees, though with some findings around clock skew handling in older releases.

YugabyteDB takes a similar approach, also targeting serializable isolation via distributed MVCC and Raft-based consensus per shard. YugabyteDB’s documentation describes its SSI implementation explicitly in terms of read-set and write-set conflict detection.

MySQL Group Replication in single-primary mode avoids write skew by routing all writes through one node, effectively trading multi-master write scalability for stronger consistency. MySQL’s documentation explicitly recommends single-primary mode for workloads requiring strong consistency and lists multi-primary limitations in detail. Galera has no built-in single-primary mode; operators who want one have to enforce it outside the cluster, for example by routing all writes to a single node through a proxy.

What the Report Is Actually Saying

The Jepsen report does not conclude that Galera is unusable. It concludes that Galera should not be used as a drop-in replacement for a strongly consistent single-node database without explicit attention to wsrep_sync_wait, careful application-level handling of retry logic, and acceptance of the write-skew risk inherent to certification-based replication.
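For the retry-logic part, a minimal sketch of a retry wrapper follows. It assumes the driver raises an exception whose first argument is the MySQL error code, which is the convention in common Python MySQL drivers but should be checked against the one in use. The run_with_retry name, the backoff parameters, and the flaky_transfer demo are hypothetical.

```python
import random
import time

ER_LOCK_DEADLOCK = 1213  # what a client sees when certification fails


def run_with_retry(txn_fn, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call txn_fn(); retry with jittered exponential backoff on error 1213."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_fn()
        except Exception as exc:
            # Assumption: the driver puts the MySQL error code in exc.args[0].
            code = exc.args[0] if exc.args else None
            if code != ER_LOCK_DEADLOCK or attempt == max_attempts:
                raise
            sleep(base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5))


# Demo with a fake transaction that fails certification twice, then commits.
attempts = []


def flaky_transfer():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError(1213, "Deadlock found when trying to get lock")
    return "committed"


result = run_with_retry(flaky_transfer, sleep=lambda _: None)
print(result, len(attempts))  # committed 3
```

The function passed in has to rerun the whole transaction, including its reads, because the values it read on the failed attempt may no longer be current. A retry wrapper handles only certification failures that are reported to the client; it does nothing for write skew or lost updates, which produce no error.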

For write-heavy workloads where write skew is acceptable by design, where reads can tolerate eventual consistency, and where multi-master write distribution across nodes is valuable, Galera is a reasonable choice. The risk profile changes significantly for workloads that require read-your-writes consistency across all nodes, or for any transaction pattern where two concurrent transactions may both read and write based on shared state.

The precise vocabulary Elle provides matters here. Knowing that the system can exhibit named anomalies such as G2-item (write skew) and P4 (lost update) lets an operator reason about whether those anomalies affect their specific workload rather than treating “consistency issues” as a vague category of risk. That is what distinguishes a Jepsen analysis from a README caveat, and it is why this report is worth reading even for teams not currently running Galera.