The Cost of Technical Dishonesty, and What It Takes to Keep It High

For more than a decade, Kyle Kingsbury’s Jepsen project operated on a deceptively simple methodology: take a database vendor’s consistency claims, set up a cluster, inject network partitions and clock skews, run concurrent workloads, and compare what the vendor promised against what the data showed. The gap, when it existed, went into a public report. MongoDB, Cassandra, Redis, VoltDB, CockroachDB, Elasticsearch, MariaDB Galera, and dozens of others went through this process. The results were embarrassing enough, and technically specific enough, that the database industry measurably improved its documentation and its actual guarantees over the years that followed.

Kingsbury’s new essay, “The future of everything is lies, I guess: Where do we go from here?”, reads differently from a Jepsen report: less like an investigation and more like a reckoning. The title carries a weight that “Jepsen: CockroachDB 22.2” never did. Something has changed, and the subtitle suggests he is not sure the methods that worked before are still adequate.

What Jepsen Actually Built

The lasting contribution was not any specific finding. It was the establishment of a credible adversarial testing methodology for distributed systems claims, and the public pressure that came with it. Before Jepsen, vendors could ship documentation claiming “linearizable reads” or “no data loss on network partition” with relatively low risk of public contradiction. A competitor might know those claims were false, but had every incentive to stay quiet, and customers rarely had the tooling or expertise to verify the claims themselves.

After Jepsen established its methodology publicly, the costs of making false claims shifted. A vendor claiming consistency properties knew those claims might end up in a Jepsen report. The reports were specific enough, and technically credible enough, that denying them was difficult. The result was industry-wide improvement in how distributed systems properties were documented and sometimes implemented.

Jepsen changed incentive structures by raising the cost of dishonesty. Any given vendor’s claims might never come under scrutiny, but enough databases were tested that the possibility was real. That is a different kind of pressure from a regulatory requirement or a bug report; it is a community-maintained credibility tax on exaggeration.

The Scale Problem

The difficulty now is that the surface area of consequential technical claims has expanded well beyond what any Jepsen-style project can cover.

Database consistency claims are a bounded domain. There are a finite number of major databases, their claims are relatively precise, and the methodology for testing those claims is well understood. Jepsen’s Clojure-based testing framework is publicly available. Reproducibility is possible. The community has internalized enough of the methodology that some projects now run Jepsen themselves.

The claims that matter now are made in a much wider space. AI model providers make claims about reasoning, accuracy, and safety that are both harder to specify and harder to test adversarially. Companies building on top of AI make implicit claims about the reliability of outputs that customers have no way to independently evaluate. The volume of technical content online has grown sharply, and an increasing fraction of it is generated by systems optimized for plausibility rather than accuracy.

The specific challenge with AI systems is not just that they can be wrong. It is that evaluating their claims requires a fundamentally different methodology. Jepsen tests have a clear oracle: either the data is consistent or it is not; either the acknowledged write is present or it is not. AI output evaluation often requires ground truth data that does not exist for the claim being made, or expensive human evaluation, or AI-assisted evaluation with its own reliability questions. The tooling for adversarial evaluation of AI systems exists in early form, but it has not established the same credibility or community reach that Jepsen built for databases.
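To make the idea of a crisp oracle concrete, here is a minimal sketch of an acknowledged-write check in Python. It is not Jepsen’s actual checker (Jepsen is written in Clojure and its checkers are considerably more involved); the names WriteOp and check_acknowledged_writes are hypothetical, and the sketch assumes a simple workload in which clients add unique integers to a set and the set is read once after the faults have healed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteOp:
    """One client write attempt recorded during the test run (hypothetical type)."""
    value: int
    acknowledged: bool  # True if the database confirmed the write to the client


def check_acknowledged_writes(history, final_read):
    """Compare the client-side history against the final state of the set.

    Returns a dict with:
      lost       - acknowledged writes that are missing from the final read
      unexpected - values in the final read that no client ever attempted
    Unacknowledged writes may or may not appear; either outcome is allowed.
    """
    attempted = {op.value for op in history}
    acknowledged = {op.value for op in history if op.acknowledged}
    final = set(final_read)
    return {
        "lost": sorted(acknowledged - final),
        "unexpected": sorted(final - attempted),
    }


history = [
    WriteOp(1, True),
    WriteOp(2, True),
    WriteOp(3, False),  # timed out; the database may or may not have applied it
    WriteOp(4, True),
]
result = check_acknowledged_writes(history, final_read=[1, 3, 4])
print(result)  # {'lost': [2], 'unexpected': []}
```

The verdict is a set difference: write 2 was acknowledged and is gone, so the claim fails. Nothing comparable exists for a statement like “the model reasons well about contracts”, which is the gap the paragraph above describes.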

The LLM Dishonesty Surface

Large language models introduce a new form of unverifiable claim: claims made not by vendors but by the systems themselves, in the course of answering questions.

When a database vendor misrepresents linearizability, there is a human decision somewhere in the chain. When an LLM confidently states incorrect technical information, there is no human decision; there is a probability distribution over tokens. The error is not intentional in the usual sense. But from the perspective of someone trying to make a sound technical decision based on that information, the distinction matters less than it might seem. The output is wrong, it is presented confidently, and there is no internal signal distinguishing it from correct output.

Research on LLM calibration shows that model confidence scores are imperfectly correlated with accuracy, and that GPT-4-class models tend to be overconfident on factual recall tasks. Work on hallucination detection and mitigation has produced partial improvements, but nothing approaching reliable detection of wrong answers. The practical result is that any technical decision-making process that incorporates LLM-generated information without independent verification is carrying an unknown and typically underestimated error rate.
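One common way to quantify the gap between stated confidence and accuracy is a binned expected calibration error. The sketch below is illustrative only: the function name, the ten equal-width bins, and the toy data are assumptions made here, not figures from any study, and it presumes you already have a confidence value in [0, 1] and a verified correct/incorrect label for each answer. Obtaining those labels is exactly the independent verification step the paragraph above says cannot be skipped.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1], one per answer
    correct:     booleans, True where the answer was verified as right
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one sample")

    # Group samples into equal-width confidence bins; 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - mean_conf)
    return ece


# Toy data (made up): the model claims 90% confidence but is right half the time.
confidences = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
correct = [True, True, False, False, True, False]
ece = expected_calibration_error(confidences, correct)
print(f"{ece:.2f}")  # 0.30
```

A perfectly calibrated system would score 0. In this toy case the average stated confidence is 0.8 against an accuracy of 0.5, which is what overconfidence looks like in numbers.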

This is not an argument against using language models. It is a structural description of a new source of plausible-sounding incorrect technical information, operating at a scale that few previous information channels have matched.

What Verification Infrastructure Looks Like

The honest answer to “where do we go from here” is probably: the same direction Jepsen pointed, but much harder.

Jepsen’s model works because the testing methodology was public and the results were public. Reproducibility matters. When someone can run the same test suite and get the same result, findings accumulate credibility independent of the author. The key was not just finding bugs, but building a community of practice around adversarial evaluation of specific claim types, and doing it consistently enough that it changed how vendors calculated the risk of exaggeration.

For AI systems, this looks like investing in benchmark infrastructure that is diverse enough and adversarial enough to resist Goodhart’s law. It looks like organizations such as METR doing independent capability evaluations. It looks like the HELM benchmark suite at Stanford, imperfect as it is. None of these has established the same degree of credibility, or the same feedback loop with industry incentives, that Jepsen built over a decade of consistent work. But they are the direction.

The ARC-AGI benchmark series attempts to maintain challenge integrity by keeping its test sets out of public training pipelines. LMSYS Chatbot Arena uses blind human preference ratings to sidestep benchmark contamination. These are partial solutions to a hard problem, and each has its own methodological weaknesses. The goal, a testing regime that makes it expensive to claim capabilities you do not have, is the same goal Jepsen had for databases. Establishing that regime for AI systems is substantially harder, because the claims are less formally defined and the test oracle is far less crisp.
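To show how blind pairwise preferences can be turned into a ranking, here is a small sketch using an Elo-style update. The choice of Elo is an assumption made for illustration, since the text above does not say how any particular leaderboard aggregates its votes; the function name, the starting rating of 1000, the K-factor of 32, and the model names are likewise hypothetical.

```python
def update_elo(ratings, winner, loser, k=32.0, start=1000.0):
    """Apply one blind pairwise vote to a ratings dict, in place.

    Elo is an illustrative assumption here, not a claim about any real leaderboard.
    """
    r_win = ratings.get(winner, start)
    r_lose = ratings.get(loser, start)
    # Probability the eventual winner was expected to win, given current ratings.
    expected_win = 1.0 / (1.0 + 10 ** ((r_lose - r_win) / 400.0))
    delta = k * (1.0 - expected_win)
    ratings[winner] = r_win + delta
    ratings[loser] = r_lose - delta
    return ratings


# Each vote is (preferred_model, other_model); raters never saw which was which.
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]
ratings = {}
for preferred, other in votes:
    update_elo(ratings, preferred, other)

for name, score in sorted(ratings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.1f}")
```

The sketch also makes the weakness visible: the oracle is whatever raters happen to prefer, so the ranking measures preference rather than correctness. That is the “far less crisp” oracle in practice.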

The Accountability Gap

There is a structural reason the problem has gotten harder, beyond the expanded surface area. Jepsen worked in part because its targets, database companies, had direct financial relationships with the engineers who would encounter the reports. A startup CTO reading a Jepsen report about their chosen database had a clear path for taking those findings to the vendor. Vendors knew their enterprise customers had technical staff who read Jepsen.

The accountability loop for AI claims is weaker. The people making claims about model capabilities are typically not the same people whose production systems will behave incorrectly when those claims turn out to be wrong. The downstream effects are diffuse, often delayed, and difficult to trace back to a specific false claim. This diffusion reduces the pressure that credible adversarial testing can create.

Kingsbury’s title lands with weight because it comes from someone who has done the adversarial verification work for years and is watching the problem scale past what the methodology was designed to handle. The frustration is earned. The methodology itself remains the most useful template available: adversarial testing, publicly reproducible results, community credibility, and changing incentive structures rather than trying to eliminate bad actors. The work got substantially harder; it did not stop being the work.