
Kyle Kingsbury Has Been Watching the Industry Lie for a Decade. It's Getting Worse.

Source: hackernews

Kyle Kingsbury has been testing distributed databases since 2013. The Jepsen project, which he built and has maintained for over a decade, applies formal consistency analysis to production systems, then publishes the results. The findings are rarely flattering. MongoDB lost data under network partition despite claiming otherwise. Cassandra’s “lightweight transactions” were not serializable. Redis Sentinel’s failover created split-brain states. CockroachDB’s default isolation level was not serializable in the way its documentation implied. In every case, a company made a strong technical claim, a developer built on top of that claim, and Jepsen demonstrated the claim was wrong.

His latest essay broadens that lens considerably. The question is no longer whether a specific database is honest about its consistency model. The question is whether the information environment that engineers depend on is even meaningfully oriented toward truth anymore, and what a person is supposed to do when the answer is no.

How Jepsen Made False Claims Visible

To understand why this essay lands the way it does, you have to understand what the Jepsen project actually proved about the industry’s relationship with accuracy.

Jepsen runs a cluster, injects faults (network partitions, clock skew, process pauses, power loss), executes concurrent operations, and checks whether the observed history of operations is consistent with the guarantees the database claims to provide. The linearizability checks rely on Knossos, a checker Kingsbury wrote (he publishes as Aphyr), which implements a variant of the Wing-Gong algorithm. The process is thorough, reproducible, and grounded in definitions drawn from the published distributed systems literature.
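The checking step can be sketched in a few lines of Python. This is a minimal brute-force illustration in the spirit of the Wing-Gong search, restricted to a single read/write register; it is not how Knossos is actually implemented, and the Op type and is_linearizable function are names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Op:
    """One client operation as observed from outside the database (hypothetical type)."""
    kind: str        # "write" or "read"
    value: int       # value written, or value the read returned
    invoke: float    # time the client issued the call
    complete: float  # time the client received the response


def is_linearizable(history: Tuple[Op, ...], state: Optional[int] = None) -> bool:
    """Search for a sequential order that respects real time and register semantics."""
    if not history:
        return True
    earliest_complete = min(op.complete for op in history)
    for i, op in enumerate(history):
        # An op can be linearized first only if no remaining op
        # finished before this one was invoked.
        if op.invoke > earliest_complete:
            continue
        if op.kind == "write":
            next_state = op.value
        elif op.value == state:
            next_state = state
        else:
            continue  # the read returned a value the register did not hold
        if is_linearizable(history[:i] + history[i + 1:], next_state):
            return True
    return False


# A read that returns a value already overwritten by a completed write.
stale_read = (
    Op("write", 1, invoke=0, complete=1),
    Op("write", 2, invoke=2, complete=3),
    Op("read", 1, invoke=4, complete=5),
)

# A read that overlaps the write it observes.
concurrent_read = (
    Op("write", 1, invoke=0, complete=5),
    Op("read", 1, invoke=1, complete=2),
)
```

Here is_linearizable(stale_read) returns False and is_linearizable(concurrent_read) returns True. The search is exponential in the worst case, which is part of why real checkers need more careful algorithms and why a full analysis is expensive.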

The point is not that database engineers are lazy or malicious. The point is that correctness is hard to verify and easy to claim, and the market rewards confident claims over accurate ones. A sales team that says “our database is eventually consistent with no single-point-of-failure under network partition, subject to the following precisely defined failure modes” will lose deals to one that says “our database is always-on and never loses data.” The incentive gradient toward overclaiming is steep, and for a long time the cost of that overclaiming was absorbed by engineers who debugged production outages years later.

Jepsen raised the cost of false claims by making them visible before deployment. After Jepsen published its MongoDB report, MongoDB issued a response and improved their documentation. After the Redis report, the Redis team changed some defaults. The mechanism worked because falsified claims in distributed systems are testable: you can run the software, inject the failure, and observe what actually happens.

The New Problem Is That Claims Are Becoming Untestable

AI systems present a categorically different challenge. When a database vendor claims linearizability, you can run knossos against it. When an AI company claims their model is “aligned,” “safe,” “reliable,” or “capable of reasoning,” there is no corresponding formal definition to check against, and no test harness that produces a definitive verdict.

This is not an accident. Capability claims for AI systems are intentionally vague because vagueness survives scrutiny and specificity does not. “GPT-X reasons like a senior engineer” is unfalsifiable. “GPT-X achieves 87.3% on HumanEval” is specific but downstream of benchmark selection, which is itself controllable by the company running the evaluation.

The benchmarking problem in AI has been documented extensively. Goodhart’s Law applies with unusual force: once a benchmark becomes the measure by which models are evaluated and sold, companies train specifically toward that benchmark. SWE-bench, HumanEval, MMLU, and MATH have all seen this dynamic. A model that achieves state-of-the-art performance on a published benchmark may still fail on the type of work the benchmark was meant to measure, because the training data may have included benchmark problems or near-equivalents.
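One common heuristic for spotting this kind of leakage is to look for long word sequences shared between benchmark items and the training corpus. The sketch below assumes you have access to both as plain strings, which outside evaluators often do not; the function names and the choice of n are illustrative, and exact n-gram matching will miss paraphrased near-equivalents.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items: List[str],
                       corpus_docs: Iterable[str],
                       n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    if not benchmark_items:
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & corpus_grams]
    return len(flagged) / len(benchmark_items)


# Toy data, invented for illustration only.
benchmark = [
    "def add(a, b): return a + b",
    "write a function that reverses a linked list",
]
corpus = ["some forum post: def add(a, b): return a + b thanks"]
rate = contamination_rate(benchmark, corpus, n=3)
```

With this toy data, rate is 0.5: the first item appears verbatim in the corpus and the second does not. A low rate from a check like this does not show that a benchmark is clean, only that exact copies were not found.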

The AI Safety Institute in the UK and its counterpart at NIST have made genuine efforts to produce evaluations that are harder to game, but the fundamental problem remains: the organizations commissioning evaluations and the organizations being evaluated have very different interests, and the evaluating organizations do not have the resources to run Jepsen-scale adversarial testing across every capability claim every major lab makes.

The Information Ecosystem Problem

Kingsbury’s concern in the essay extends beyond AI company marketing. The downstream effects of AI-generated content on the general information environment are the broader target.

Search results, documentation, Stack Overflow answers, blog posts, and technical tutorials are now partially generated by systems that produce fluent text with no particular connection to ground truth. The generation is cheap enough that content volume has increased dramatically while the average epistemic quality has decreased. A developer searching for how to configure distributed locking in Redis is now more likely to encounter a confident, well-formatted article that is wrong than they were in 2020. The article may link to real documentation; it may use correct terminology; it may be wrong in a way that is very hard to detect without already knowing the answer.

This is the part of the problem that does not have a Jepsen-style solution. Jepsen works because there is a ground truth to check against. Code either maintains linearizability under fault injection or it does not. Information about how to correctly configure a system is harder to verify because verification requires expertise, and if you have the expertise to verify the information, you probably did not need the information.

The social infrastructure that used to partially compensate for this (peer review, reputation systems, community correction) is under pressure from the same dynamics. When a wrong Stack Overflow answer gets upvoted by people who found it plausible rather than by people who tested it, the community correction mechanism fails silently.

Where the Incentives Actually Point

The deepest part of Kingsbury’s argument is about incentives, not technology. False claims persist because the cost of making them is low and the benefit is high. Database vendors who overclaim face occasional Jepsen reports and the reputational cost of being exposed by a respected researcher. AI companies who overclaim face skeptical blog posts and, occasionally, regulatory attention. In neither case does the market price the dishonesty accurately enough to deter it.

The regulatory response in the EU through the AI Act’s capability and transparency requirements, and in the US through NIST’s AI Risk Management Framework, is an attempt to change this. Mandatory third-party audits and standardized evaluation requirements would shift the cost of false claims back onto the parties making them more effectively than voluntary disclosure does. But regulation that is not grounded in clear definitions of the properties being claimed will reproduce the same problem at a different level of abstraction. “This AI system must be reliable” is as unfalsifiable as “this database never loses data” without a shared technical specification of what “reliable” and “loses data” mean in each context.

What Jepsen succeeded at was establishing a shared vocabulary, and the definitions in it were precise enough to make claims checkable. Distributed systems engineers now have a reasonably common understanding of what linearizability, serializability, and causal consistency mean, and why the differences matter. That vocabulary developed over decades, through papers like Herlihy and Wing’s 1990 linearizability paper, through the CAP theorem, through the critiques of the CAP theorem. The AI field has not developed an equivalent vocabulary for the properties its systems are claimed to have, and the commercial pressure to avoid developing one is substantial.

What Is Actually Tractable

Kingsbury’s question, “where do we go from here,” is not rhetorical. Some things are tractable.

For distributed systems, the Jepsen model has already shown that adversarial third-party testing by technically credible researchers changes vendor behavior. The DBMS Testing Alliance and similar academic efforts extend this. Where the model breaks down is capacity: Jepsen analyses take weeks per database, and the number of databases, AI systems, and cloud services making correctness claims has grown far faster than the ability to test them.

For AI systems, the most tractable near-term path is probably narrowly scoped capability claims with clear testing protocols attached to them. “This model correctly classifies toxic content in English at 94% precision and 91% recall on the following benchmark” is verifiable and useful. “This model is safe” is neither. The pressure on companies to make only the former type of claim would have to come from procurement, regulation, or sufficiently painful reputational incidents, because the market as currently structured does not reward it.
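What makes the narrow claim verifiable is that anyone holding the labeled benchmark can recompute the numbers. A minimal sketch, assuming binary labels where 1 means toxic; the function names and the toy data are invented for illustration:

```python
from typing import Sequence, Tuple


def precision_recall(predictions: Sequence[int],
                     labels: Sequence[int]) -> Tuple[float, float]:
    """Compute precision and recall for binary predictions (1 = toxic)."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p and y)
    fp = sum(1 for p, y in pairs if p and not y)
    fn = sum(1 for p, y in pairs if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def claim_holds(predictions: Sequence[int], labels: Sequence[int],
                claimed_precision: float, claimed_recall: float) -> bool:
    """Check measured precision and recall against the vendor's stated figures."""
    precision, recall = precision_recall(predictions, labels)
    return precision >= claimed_precision and recall >= claimed_recall


# Toy evaluation set: 2 true positives, 1 false positive, 1 false negative.
preds = [1, 1, 0, 1, 0]
truth = [1, 0, 0, 1, 1]
verified = claim_holds(preds, truth, claimed_precision=0.94, claimed_recall=0.91)
```

On this toy data both precision and recall come out to 2/3, so verified is False. No comparable function can be written for “this model is safe,” because there is nothing to count.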

For the information ecosystem problem, the solutions are slower and more cultural than technical. Source literacy, provenance tracking, and the rehabilitation of slow, careful, verifiable writing over fast, fluent, confident generation are directionally correct but take a long time to shift. The model cards framework developed at Google, whatever its limitations, represents the right instinct: make the claims specific, bounded, and testable, and attach them permanently to the artifact being claimed about.

What Kingsbury has been documenting for a decade is not a new problem but an old one that technology keeps amplifying. The solution is not more technology. It is more precision, more willingness to say “I don’t know,” and more institutional capacity to verify claims before they become load-bearing assumptions in production systems. That is harder to build than a new database, and the incentives for building it are diffuse. But so was linearizability checking, once.
