
The Cost Equation That Has Kept Formal Verification Out of Production Software

Source: hackernews

Formal verification tooling has been mature for a long time. Coq's lineage dates to the mid-1980s; Isabelle has been under active development since 1986; PVS, HOL4, and ACL2 are all decades old and used in production at organizations like NASA, Intel, and major aerospace contractors. The gap was not in the tools. It was in the labor required to use them.

A conservative estimate from formal methods literature puts the cost of formal verification at 10 to 50 times that of unverified development when done by experts. For most software, that ratio places formal verification entirely outside the cost envelope. It has persisted in niches where the cost of an undetected bug exceeds the cost of verification: avionics certification under DO-178C, cryptographic protocol design, OS microkernel verification. The seL4 project at NICTA (later folded into CSIRO's Data61) verified the functional correctness of a microkernel in Isabelle, representing tens of person-years of proof engineering. The result is extraordinary; it is also not replicable for typical software projects.

Mistral AI’s Leanstral, an open-source agent for Lean 4 proof engineering released in March 2026, is interesting in this context because it specifically targets the cost driver where AI assistance is most mechanically applicable: the labor of constructing proofs once a specification exists.

Where the Cost Actually Lives

Formal verification cost breaks down into three distinct phases, and they are not equally automatable.

The first is specification writing. Before anything can be verified, the property to be verified must be stated precisely in the proof assistant’s language. Writing a specification for a sorting function is straightforward; writing one for a concurrent cache that correctly characterizes all behaviors under contention requires deep understanding of the system’s intended semantics. The AI can suggest specifications and flag inconsistencies, but choosing what to specify, and verifying that the specification captures the right behavior, remains an engineering decision with semantic weight that no model output substitutes for.
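Even the "straightforward" sorting case shows why specification carries semantic weight. As a sketch (assuming Mathlib's `List.Sorted` and `List.Perm`; `SortSpec` is an illustrative name, not a library definition), the correctness statement already has two independent conjuncts the specifier must remember to state:

```lean
import Mathlib

-- Hypothetical spec for a sorting function. Sortedness alone is not
-- enough: `fun _ => []` produces sorted output, so the spec must also
-- require that the output is a permutation of the input.
def SortSpec (sort : List Nat → List Nat) : Prop :=
  ∀ l : List Nat, (sort l).Sorted (· ≤ ·) ∧ (sort l).Perm l
```

Omitting either conjunct leaves a spec a trivially wrong implementation can satisfy, and no kernel check will flag the omission.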

The second is proof development: given a specification, find and construct a proof that the implementation satisfies it. This is where most of the mechanical labor has historically lived. Mathematical proofs on paper compress years of work into a few paragraphs; formalizing them into a proof assistant has traditionally required substantially more effort, because every step that a human reader accepts as obvious must be explicitly constructed and kernel-verified. This is the phase where AI assistance is most directly applicable. The search through tactic space is mechanical, the feedback from the Lean 4 kernel is unambiguous and instantaneous, and the scale of Lean 4’s training corpus gives a language model substantial pattern-matching material.
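A minimal illustration of that gap, using only core Lean 4:

```lean
-- A fact a human reader accepts at a glance still needs an explicit,
-- kernel-checked construction. First, a manual term-level proof:
theorem le_succ_of_le_manual {a b : Nat} (h : a ≤ b) : a ≤ b + 1 :=
  Nat.le_trans h (Nat.le_succ b)

-- The same goal closed by tactic search: `omega` decides linear
-- arithmetic, so the agent's job reduces to recognizing the goal's
-- shape and emitting the right tactic. The kernel then checks the
-- generated proof term exactly as it checks the manual one.
example {a b : Nat} (h : a ≤ b) : a ≤ b + 1 := by omega
```

The tactic call compresses the search; the kernel's check of the resulting term is what makes either version count as a proof.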

The third is proof maintenance. Software changes, and proofs about software must change with it. A refactoring that renames a function, changes a data structure, or alters a type signature may break dozens of dependent proofs. This phase is chronically underestimated; it has led some organizations that adopted formal verification for critical systems to later abandon it as maintenance costs compounded. AI assistance can help here too, since a model that can search for tactics can also repair proofs after changes, but the problem is structurally harder: the model must diagnose why a previously valid proof now fails and identify which lemma structure needs to change rather than just which tactic to apply next.
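A toy sketch (names hypothetical) of how a rename propagates into proofs:

```lean
-- Before refactoring: a small lemma about a helper definition.
def size (l : List Nat) : Nat := l.length

theorem size_append (xs ys : List Nat) :
    size (xs ++ ys) = size xs + size ys := by
  simp [size]

-- If `size` is renamed to `count`, this lemma and every proof that
-- invokes it fail to elaborate. A repair agent must diagnose that only
-- the name changed and rewrite the proofs accordingly, rather than
-- re-searching each one from scratch.
```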

The Historical Infrastructure

Understanding why Leanstral matters requires understanding the design lineage it builds on. The architectural principle underlying all serious proof assistants was established by Nicolaas de Bruijn’s Automath in 1967: maintain a small, trusted kernel that validates proofs generated by larger and potentially buggy automation layers. The de Bruijn criterion holds that only this kernel needs to be trusted, and that it should be kept small enough to audit. Lean 4, designed by Leonardo de Moura and Sebastian Ullrich and described in their 2021 CADE paper, implements this rigorously; its kernel is a few thousand lines of C++ and is the sole ground truth for proof correctness.
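The separation is visible in everyday Lean 4 use: tactics are untrusted search machinery that elaborate to a proof term, and only that term faces the kernel. A sketch:

```lean
-- `omega` is a large, untrusted decision procedure; what it emits is a
-- proof term, and the small kernel is the sole judge of that term.
theorem demo (a b : Nat) (h : a < b) : a + 1 ≤ b := by omega

-- Inspect what the proof ultimately rests on; the kernel re-checks the
-- term regardless of which automation produced it.
#print axioms demo
```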

Tactic languages trace back to Edinburgh LCF in the 1970s; Coq carried the idea into dependent type theory, letting users write proof scripts instead of raw proof terms and substantially reducing proof development labor by allowing higher-level reasoning. Isabelle developed Sledgehammer, a tool that dispatches proof subgoals to external automated theorem provers such as E, Z3, and Vampire, automatically filling in the low-level connective tissue between high-level proof steps. The automation tactics in Lean 4 and Mathlib (omega for linear arithmetic, ring for polynomial identities, norm_num for numeric computations, and decide for decidable propositions) serve the same role: dispatching entire classes of goals automatically and reducing the search space the agent has to navigate.
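One-line illustrations of each tactic's class of goals (ring and norm_num come from Mathlib; omega and decide are core Lean):

```lean
import Mathlib

example (x y : Int) (h : x < y) : x + 1 ≤ y := by omega        -- linear arithmetic
example (a b : Int) : (a + b) ^ 2 = a ^ 2 + 2 * a * b + b ^ 2 := by ring  -- polynomial identity
example : (2 : ℚ) ^ 10 = 1024 := by norm_num                   -- numeric computation
example : 17 % 5 = 2 := by decide                              -- decidable proposition
```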

What none of these systems had was a language model trained on the full corpus of formal mathematics. Sledgehammer calls first-order provers with limited context and cannot recognize that a specific lemma from algebraic topology would reframe a number-theoretic goal in a way that makes a subsequent automation apply. A model trained on Mathlib, Lean 4’s community mathematics library containing over 150,000 theorems, can build this kind of associative reasoning across the library’s full breadth.

DeepMind’s AlphaProof demonstrated what becomes possible when that training is combined with reinforcement learning using the kernel’s binary reward signal. Together with AlphaGeometry 2 it solved 4 of the 6 problems from IMO 2024, including problems never previously formalized in Lean, and it required no human annotation of proof correctness. The kernel accepted or rejected each attempt, and the model learned from millions of such interactions. The critical design choice was the same one Automath established in 1967: keep the verifier separate from the generator, and trust only the verifier.

The Open-Source Distinction

Leanstral’s release as a locally-deployable open-source model matters specifically for the contexts where formal verification is most valuable. Aerospace and defense contractors cannot send proprietary system specifications to an external API; cryptographic protocol designers working on sensitive systems face similar constraints; organizations subject to export control or data residency requirements have limited options with cloud-hosted AI. A locally-runnable model that does useful formal work is a categorically different tool from a cloud-hosted code completion service, regardless of capability comparison.

The LeanDojo project, led from Caltech, provided an open gym-style interface for Lean 4 in 2023, enabling the community to build and benchmark AI proof systems. Their ReProver system, which retrieves relevant premises from Mathlib using a retrieval model trained on proof co-occurrence, reached around 51% on their benchmark in 2023; state-of-the-art systems have since pushed toward 65-70% on the miniF2F test set of 244 competition-level mathematics problems. Leanstral enters this landscape as an open-source, purpose-built model rather than a general code completion model adapted for Lean, which should affect its coverage of Lean-specific tactic idioms and library structure.

The Specification Gap Remains

A kernel-accepted Lean 4 proof is a correct proof of the stated proposition. Whether the stated proposition captures what the developer intended is outside the kernel’s scope, and no current AI agent changes that.

The seL4 verification layered three levels of specification: the abstract mathematical model, the Haskell prototype, and the C implementation. Each transition required human effort to specify correctly, and the value of the entire verification chain depends on the quality of the abstract specification. A verified sorting function that sorts in the wrong order, a verified cryptographic protocol with a flawed threat model, a verified compiler for incorrect language semantics: all are kernel-certified and semantically wrong. Formal verification has always been contingent on correct specifications, and no current AI system has changed this.
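A compact sketch of the failure mode (assuming Mathlib; `WeakSpec` is an illustrative name, not a library definition):

```lean
import Mathlib

-- A spec that demands only sortedness, forgetting the permutation
-- requirement the developer actually intended.
def WeakSpec (f : List Nat → List Nat) : Prop :=
  ∀ l, (f l).Sorted (· ≤ ·)

-- A function that discards its input entirely is kernel-certified
-- against this spec; the proof is valid, the specification is wrong.
theorem discard_verified : WeakSpec (fun _ => []) := by
  intro l
  simp
```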

Some research is directed at inferring specifications from test cases, documentation, and informal descriptions, but these approaches remain preliminary. The more certain near-term benefit of systems like Leanstral is narrower: a developer who writes a correct specification can increasingly delegate proof search to an AI agent and receive a kernel-verified answer, rather than spending weeks constructing the proof manually.

What Moves in the Cost Calculation

The 10 to 50x cost ratio for formal verification assumed expert human proof engineers working without AI assistance. If AI-assisted proof search reduces proof development time significantly for tractable specifications, the cost ratio for suitable properties drops into a range more software projects can justify. The properties best suited to this approach are local, bounded, and specifiable without deep domain knowledge: data structure invariants, cryptographic protocol properties, parser correctness, compiler intermediate representations.
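A hypothetical example of such a "local, bounded" property: a ring-buffer head index that must stay below capacity. The invariant travels in the type, and core automation discharges the proof obligation:

```lean
-- Illustrative names; not from any library. `omega` plus `Nat.mod_lt`
-- close the proof obligation the refinement creates.
structure RingBuffer where
  capacity : Nat
  head : Nat
  inv : head < capacity

def advance (b : RingBuffer) : RingBuffer where
  capacity := b.capacity
  head := (b.head + 1) % b.capacity
  inv := by
    have h := b.inv
    have hpos : 0 < b.capacity := by omega
    exact Nat.mod_lt _ hpos
```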

These are the same properties where formal verification has historically had the clearest cost-benefit argument. What changes is not what is worth verifying but how many teams can afford to verify it. The Lean 4 ecosystem, the 2021 paper’s careful design of its reference-counting runtime with in-place updates and compilation to native code, and Mathlib’s breadth as both training corpus and benchmark, form the infrastructure. Leanstral, following the same architectural approach AlphaProof validated at competition mathematics level, provides the agent. Whether the combination moves the cost calculation far enough for routine software development depends on how well such systems handle the distance between mathematical benchmarks and the specifications engineers actually need to write.
