When the Correction Never Comes: On Technical Lies and the Systems We Built to Ignore Them

Kyle Kingsbury, better known as aphyr, has spent the better part of a decade running Jepsen against distributed databases and publishing the results. The findings are almost always the same: a database claims linearizability or serializability, Jepsen finds violations, the vendor acknowledges, patches, or quietly updates their documentation, and then the cycle repeats with the next product. His latest essay steps back from the individual test reports and asks the harder question: after all that work, what has actually changed?

The answer, as the title implies, is not much.

What Jepsen Actually Proved

Jepsen’s contribution to distributed systems isn’t just a list of bugs. It’s a methodological proof that self-reported correctness guarantees in the database industry are essentially meaningless without independent verification. The Jepsen analyses catalog reads like a hall of shame: MongoDB, Cassandra, Elasticsearch, etcd, CockroachDB, TiDB, and dozens more, each caught in some violation of the consistency guarantee they advertised.

Most of these systems weren’t lying maliciously. Engineers made architectural decisions that didn’t hold up under network partition, or documented guarantees that assumed conditions the real world doesn’t provide, or simply described aspirational properties rather than measured ones. The gap between “what we designed” and “what we tested under adversarial conditions” turned out to be enormous, and almost nobody was checking.

What Jepsen revealed is that the distributed systems field had an honor system where nobody was checking for cheating, not because the engineers were dishonest, but because the tooling to check was hard to build and the incentives to deploy it were weak.

The Incentive Structure That Makes Lies Rational

This is the part that Kingsbury’s essay pushes on that most technical post-mortems skip. A database vendor faces a choice: spend months doing adversarial correctness testing before announcing “strong consistency”, or ship with the claim and wait for someone like Jepsen to find the edge cases. The second path gets you to market faster, the marketing claim is already in the wild by the time a correction surfaces, and the correction rarely gets the same distribution as the original announcement.

This isn’t unique to databases. It’s the same dynamic behind SSL/TLS implementation bugs that lasted years, behind CPU speculation vulnerabilities that shipped with incorrect security documentation, behind cloud providers publishing SLAs that carefully avoid committing to anything measurable. The correction never reaches the people who made decisions based on the original claim.

And now we’re in an AI moment where the same dynamic is running at a much larger scale and a much faster cycle time. Model providers publish benchmark numbers. The benchmarks get gamed or cherry-picked. Researchers find the gaps. The next model ships. Nobody updates the downstream applications that were built on the assumption that the benchmark reflected real-world performance.

Verification Doesn’t Scale the Way We Need It To

The uncomfortable thing about Jepsen is that it’s artisanal. Each analysis requires Kingsbury or a collaborator to understand the system’s claimed semantics, write a workload that exercises the relevant failure modes, and interpret the results. It’s months of skilled work per system. The database industry has produced hundreds of systems. The math doesn’t work.

Formal verification is the obvious answer people reach for, and it’s genuinely useful in bounded domains. TLA+ has caught real bugs in real systems, including Amazon’s use of it for DynamoDB and S3. seL4 is a formally verified microkernel. CompCert is a verified C compiler. These are genuine achievements.

But formal verification works on specifications, and specifications are where the lies often live. You can verify that your implementation matches your spec while your spec omits the network partition behavior, the clock drift behavior, the behavior under specific GC pause patterns. Verification closes the gap between spec and implementation. It doesn’t guarantee the spec is honest.

This is the epistemological trap Kingsbury is pointing at. You can build more rigorous tools, and they’ll find more bugs, and vendors will fix those bugs, and the marketing claims will still race ahead of the evidence.

The AI Version of This Problem Is Worse

At least with databases, there are properties you can formally define and test. Linearizability has a precise mathematical definition. You can write a checker. Elle, Jepsen’s transactional consistency checker, works because “serializable” means something specific enough to verify mechanically.

Large language model capabilities don’t have this property. When a provider claims a model “reasons about code” or “understands context”, there’s no formal definition to check against. MMLU, HumanEval, MATH, and similar benchmarks are the closest approximation, but they’re known to be gameable, they measure narrow slices of claimed capabilities, and the benchmark results saturate before the underlying capability claims stop inflating.

The result is a landscape where capability claims are essentially unauditable by third parties. You can run the benchmarks, but if the model was trained on benchmark-adjacent data, the benchmark isn’t measuring what it claims to measure. Contamination detection is an active research area precisely because this is a real and widespread problem.

What “Where Do We Go From Here” Actually Means

Kingsbury’s question at the end of the title isn’t rhetorical despair. It’s asking what the viable paths are when the verification problem is intractable at scale and the incentive structure doesn’t self-correct.

A few things seem worth investing in, none of them individually sufficient.

Regulatory disclosure requirements are one lever. If vendors are required to publish adversarial test results before making specific correctness claims, the cost of a false claim goes up. The EU’s AI Act gestures at this for high-risk AI systems, requiring conformity assessments and technical documentation. It’s incomplete and enforcement is uncertain, but it’s a structural change in who bears the cost of a lie.

Independent red-teaming infrastructure is another. Jepsen is essentially a one-person red-teaming operation that the industry has come to rely on despite never collectively funding. If the security industry’s model for CVE disclosure and coordinated response has any lessons, it’s that adversarial testing needs institutional support to scale.

Longer-term, better specification languages for the things we care about would help. If “this model correctly handles context windows up to N tokens” had a precise enough definition to write a checker for, independent verification becomes possible. The database world got Jepsen because linearizability has a definition precise enough to build tools around. Expanding that surface area is slow, unglamorous work.

Why This Keeps Mattering

I’ve been building Discord bots and tooling for long enough to know that the systems I build on top of are not what their documentation says they are. The message delivery guarantees Discord publishes are not the same as the guarantees I observe under load. The consistency behavior of the database I use is not exactly what the README claims. I build in retries, idempotency keys, and reconciliation passes not because I’m paranoid but because I’ve been burned by believing documentation.

Most engineers who’ve been doing this long enough have the same practiced skepticism. The problem Kingsbury is identifying isn’t that experienced engineers don’t know to distrust claims. It’s that the industry’s default is to publish the optimistic number and make skepticism the buyer’s problem. New engineers, product teams making build-vs-buy decisions, and organizations without the expertise to run adversarial tests all pay the price.

Jepsen has been one of the few systematic efforts to shift that cost back toward vendors. The essay’s title suggests some exhaustion with how little that’s shifted the underlying dynamic. That exhaustion is earned. But the alternative, which is to stop checking and trust the claims, is clearly worse.