From Silver Medal to Production Use: The Remaining Distance in AI Formal Verification
Source: hackernews
In 2024, DeepMind’s AlphaProof, paired with AlphaGeometry 2, solved four of six problems from the International Mathematical Olympiad, reaching silver-medal level; AlphaProof itself handled the algebra and number theory problems. The system had no prior exposure to those specific problems. The problems were translated into Lean 4, and AlphaProof searched for proofs using reinforcement learning with the kernel as the sole reward signal, finding proofs that checked out. The result made formal verification, long considered too expensive and specialized for routine engineering work, look like a problem that AI was beginning to solve.
Mistral’s Leanstral is the most recent system in this lineage. It is open-source, locally deployable, and targets both formal proof engineering and verified coding in Lean 4. Understanding where it fits requires context on how this field has moved and where the remaining obstacles are.
The Systems Before It
The foundation for AI-assisted formal verification in Lean 4 was laid by LeanDojo from MIT and Caltech in 2023. LeanDojo built an open gym interface for Lean 4, exposing proof states as observations and tactics as actions. Their ReProver system added retrieval-augmented generation: before generating each tactic, a retrieval model scans Mathlib for lemmas likely to be relevant at the current proof state, based on co-occurrence patterns from existing proofs. ReProver reached approximately 57% on miniF2F, a set of 488 competition-level math problems.
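In Lean 4 terms, the interaction loop looks like this (a toy illustration, not LeanDojo's actual API): the goal is the observation, a tactic is the action, and the kernel's verdict is the feedback:

```lean
import Mathlib

-- The proof state the model observes:   a b : ℕ ⊢ a + b = b + a
-- The action it emits is a tactic. A retrieval model would suggest
-- Nat.add_comm from Mathlib because similar goals were closed with it.
example (a b : ℕ) : a + b = b + a := by
  exact Nat.add_comm a b
```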
COPRA showed the same year that GPT-4, equipped with a backtracking search strategy and no task-specific fine-tuning, achieved competitive results. The lesson was that search strategy matters roughly as much as model quality at moderate difficulty levels: a weaker model with effective backtracking can outperform a stronger model generating proofs one-shot.
Draft, Sketch, Prove tried a different decomposition: the model drafts an informal proof in natural language, maps it to a formal sketch with explicit gaps, then fills each gap. This mirrors how mathematicians work and anchors the search to a high-level plan before tactic generation starts, which sidesteps some of the premise-selection difficulty.
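A minimal sketch of the pattern (my own toy example, not one from the paper): each sentence of the informal plan becomes one step of a formal skeleton, which is then discharged step by step:

```lean
import Mathlib

-- Informal sketch: "both squares are nonnegative, so their sum is too."
theorem sum_sq_nonneg (a b : ℤ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have h1 : 0 ≤ a ^ 2 := sq_nonneg a   -- step 1: the first square is nonnegative
  have h2 : 0 ≤ b ^ 2 := sq_nonneg b   -- step 2: so is the second
  exact add_nonneg h1 h2               -- step 3: a sum of nonnegatives is nonnegative
```

In the actual method, the skeleton's gaps are initially left as `sorry` placeholders and filled by automated provers afterwards.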
AlphaProof scaled all of this with full reinforcement learning, and the IMO result changed the implied ceiling. It uses a variant of Monte Carlo tree search over tactic sequences, with the Lean 4 kernel providing binary accept/reject rewards; no human annotation is needed, because the kernel’s verdict is automatic and unambiguous. Its proofs of problems it had never seen were found through extended search time (reportedly up to days per problem), not a lookup in training data. The architecture generalized.
The limitation was that AlphaProof is proprietary and cloud-only. The research result stood; the capability did not transfer to practitioners. Leanstral, along with DeepSeek-Prover, is part of the wave of open-weight systems attempting to address that.
What the Benchmarks Measure
Current state-of-the-art systems score in the 60-70% range on miniF2F. This matters as a calibration signal, but miniF2F is competition mathematics: algebra, number theory, combinatorics. The distribution reflects what is available in formalized form, which is overwhelmingly Mathlib theorems and IMO-style problems.
Software verification looks different. A proof of memory safety for a systems program involves specifications of pointer aliasing, lifetime constraints, and invariants on data structures. A proof of protocol correctness involves concurrent message-passing systems and state machine invariants. A proof of cryptographic security involves probabilistic reduction arguments. Few of these appear in competition math training data; few appear in Mathlib in forms that transfer directly to engineering contexts.
This is why miniF2F scores do not predict how well Leanstral handles a team trying to verify a parser, a scheduler, or a concurrent data structure. The mathematical prerequisites are similar, but the domain knowledge is entirely different. The benchmarks tell you what proof search achieves on a well-defined mathematical distribution; they say less about performance on specifications an engineering team would actually write.
The gap is not just about training data. Competition math proofs tend to be self-contained: the relevant mathematical objects are fully described in the problem statement. Software specifications depend on a large context of type definitions, invariants, and library behavior that must be correctly modeled. Getting the specification right is harder than the proof search that follows from it.
The Design of Lean 4
Lean 4 is the right target for this kind of agent because of a design decision described in its 2021 CADE paper by Leonardo de Moura and Sebastian Ullrich: Lean 4 is simultaneously a dependently typed theorem prover and a compiled systems programming language.
In Coq, verifying an algorithm produces a proof and a separate extracted artifact in OCaml or Haskell. In Isabelle/HOL, verification and execution are separated similarly. The verified specification and the running code are different artifacts; while the extraction is formally justified, the gap is real enough that most software teams skip formal verification entirely. Lean 4 eliminates this. A function proven correct in Lean 4 is a Lean 4 function, compiled (by way of generated C code) to a native binary. The specification, the proof, and the executable implementation are the same artifact.
The FBIP (functional but in-place) discipline of Lean 4's reference-counting runtime is part of what makes this practical: values with a reference count of one at runtime are updated in place rather than copied, achieving near-imperative performance while maintaining functional semantics.
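A sketch of what this means in practice (the reuse is a runtime property, invisible in the source):

```lean
-- If `a` is uniquely referenced at the call site, the runtime reuses its
-- memory for the result instead of copying the whole array; with other
-- live references, it falls back to a copy, preserving functional semantics.
def zeroFirst (a : Array Nat) : Array Nat :=
  if a.size > 0 then a.set! 0 0 else a
```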
For an AI agent, this matters because the output is deployable code. A Leanstral-generated verified sorting function is something you can link against. The proof and the code do not live in separate worlds that require translation between them.
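A minimal illustration of that unity (a toy example, not Leanstral output): the definition below is at once the subject of the theorem, the subject of the proof, and the function that compiles and runs:

```lean
def double (n : Nat) : Nat := 2 * n

-- A property proved about the very definition that will be compiled.
theorem double_eq_add (n : Nat) : double n = n + n := by
  unfold double; omega

-- The same definition executes; no extraction step intervenes.
#eval double 21   -- 42
```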
Why Open Source Matters Differently Here
In standard AI coding assistance, open-source models matter partly for correctness auditing: understanding what the model might generate. In formal verification, correctness is handled by the kernel, not the model. Every proof, regardless of how it was generated, passes through Lean 4’s kernel, a few thousand lines of C++ implementing the Calculus of Inductive Constructions, following the de Bruijn criterion that only the kernel needs to be trusted. The model can generate incorrect proof attempts freely; they are rejected before they count.
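The division of labor is easy to see in a toy example: a candidate the model emits either type-checks or it does not, and only the kernel's answer counts:

```lean
example : 2 + 2 = 4 := by rfl       -- accepted: the proof term type-checks

-- example : 2 + 2 = 5 := by rfl    -- rejected by the kernel, no matter how
--                                  -- confidently the attempt was generated
```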
This separates the reasons for caring about open-source models in this domain. The model being open-source does not make proofs more correct; the kernel already handles that. It matters for different reasons:
- Local deployment. Formal verification happens in regulated environments: aerospace certification, cryptographic protocol design, safety-critical embedded systems. Sending code to an external API is often not acceptable. A locally-runnable model fits certification workflows that cloud services cannot.
- Fine-tuning. A team verifying domain-specific infrastructure can fine-tune on their type definitions, proof conventions, and domain lemmas. This is not possible with closed models.
- Auditability of the pipeline. Certification processes require auditing the full toolchain. Open weights are auditable weights in a way that API endpoints are not.
AlphaProof’s results demonstrated what the architecture can achieve; they did not make the capability accessible to engineers who need local deployment and domain customization. Leanstral’s open-source release addresses that constraint more directly than a better miniF2F score would.
The Remaining Hard Problem
Benchmark progress on competition mathematics does not fully represent the difficulty of software verification. Writing a correct formal specification of what software is supposed to do remains a human problem, and it is frequently harder than writing the software itself.
The Lean 4 kernel verifies that a proof term has the type corresponding to a stated proposition; it has no access to intent. Consider a sorting function specification:
def sort_correct (sort_fn : List Nat → List Nat) : Prop :=
∀ l, List.Sorted (· ≤ ·) (sort_fn l) ∧ sort_fn l ~ l
Drop the permutation constraint (sort_fn l ~ l) and you have formally verified a function that returns an empty list for every input. The kernel accepts the proof, but the specification was wrong. This is not a gap that better models will close; it is a fundamental property of what formal verification means.
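The failure is concrete. Assuming Mathlib's `List.Sorted` and `List.sorted_nil`, the weakened specification is satisfied by a function that discards its input:

```lean
import Mathlib

-- The weakened spec: sortedness only, permutation clause dropped.
def sorted_only (f : List Nat → List Nat) : Prop :=
  ∀ l, List.Sorted (· ≤ ·) (f l)

-- The kernel accepts this "verified sorting function" without complaint:
-- the empty list is vacuously sorted for every input.
theorem empty_is_verified : sorted_only (fun _ => []) := by
  intro l
  exact List.sorted_nil
```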
What AI assistance changes is the cost of the proof step, once a correct specification exists. The proof obligations for a verified sort function, or a verified parser, or a verified cryptographic primitive, have historically required significant expertise in Lean tactic selection and library navigation. A system that discharges those obligations automatically, given a specification, makes formal verification accessible without requiring that expertise at every step. The Liquid Tensor Experiment, which formalized Peter Scholze’s condensed mathematics in Lean, took years of substantial community effort; the proof-writing was hard, but deciding exactly what each lemma’s formal statement should be was equally demanding.
The near-term scope for systems like Leanstral is narrower than full-scale software verification: critical algorithms in security-sensitive or safety-critical contexts, where the investment in specification is justified by the cost of being wrong. For that scope, the combination of Lean 4’s unified proof-and-code model, Mathlib’s depth as a lemma library, and open weights for local deployment and fine-tuning represents a meaningful advance over what existed before. The competition math benchmarks measure the lower end of that scope. The rest is being built.