The Proof Repair Loop: Leanstral and the State of AI-Assisted Formal Verification

Mistral AI announced Leanstral this week: an open-source agent for formal proof engineering in Lean 4. The announcement is framed around “trustworthy coding,” which is a reasonable pitch for a general audience, but the more interesting story is about architecture. Leanstral represents the current consensus design for AI proof assistants, and understanding why that design looks the way it does requires backing up to look at what makes formal verification fundamentally different from ordinary code generation.

Why Lean 4 Is the Substrate Everyone Converged On

Formal proof assistants have existed since the 1970s, and the competition is substantial: Coq, Isabelle/HOL, Agda, Metamath, ACL2. So why has AI-assisted proof research concentrated so heavily on Lean 4?

The answer is Mathlib4. The Lean 4 standard mathematics library contains over 150,000 theorems spanning algebra, analysis, topology, number theory, and combinatorics. It is the largest single formal mathematics library in the world, and its size creates a virtuous cycle: more theorems means more training data, which means better AI, which attracts more contributors, which grows the library further. No other proof assistant ecosystem has this scale.

Beyond the library, Lean 4 itself is an unusual piece of software. The 2021 CADE paper by de Moura and Ullrich describes a language where the theorem prover and general-purpose programming language share the same core. Lean 4 compiles to C and produces binaries that are competitive with OCaml in performance benchmarks. Its metaprogramming system is written in Lean 4 itself, meaning that tactics (the DSL you use to construct proofs interactively) are just programs of type TacticM Unit. This self-hosting design gives the ecosystem an unusual coherence: the same language that proves theorems about cryptographic primitives can also implement those primitives.

The type theory underneath is the Calculus of Inductive Constructions, extended with universe polymorphism, quotient types, and a handful of classical axioms (propositional extensionality, function extensionality, choice). Propositions are types; a proof of P is a term of type P. The kernel that type-checks programs is identical to the kernel that verifies proofs. This is not a distinction with a difference: it means the correctness oracle is the same system you use for everything else, and it is fast, deterministic, and machine-checkable.

The Proof-Repair Loop

Most discussion of AI coding tools focuses on the model: what architecture, how many parameters, what training data. For formal proof systems, the more important question is what the agent does after the model generates something.

The central insight in Leanstral, and in every competitive AI prover today, is the proof-repair loop. The agent generates a proof attempt, submits it to the Lean 4 kernel, reads the error output, and iterates. The kernel’s feedback is binary and precise: either a proof type-checks or it does not, and if it does not, the error message tells you exactly where it failed and why. “Unknown identifier Nat.add_comm'” means the lemma name is wrong. “Type mismatch: expected n + m = m + n but got m + n = n + m” means the arguments are reversed. “Unsolved goals: ⊢ 0 < n” means the proof is structurally complete but left a subgoal open.

This is qualitatively different from the feedback loop in ordinary code generation, where test failures are ambiguous (is the test wrong, or is the code wrong?), and lint errors are soft suggestions. The Lean kernel is an oracle. There is no false positive, no flaky test, no environment-specific behavior. The repair loop therefore gives the model reliable signal on every iteration, which is what makes agentic architectures viable here.

In practice, a Lean 4 proof state looks like this:

-- Tactic state:
-- n m : ℕ
-- h : n < m
-- ⊢ n + 1 ≤ m
example (n m : ℕ) (h : n < m) : n + 1 ≤ m := by
  exact Nat.succ_le_of_lt h

The model receives the tactic state as context, generates the next tactic, and the kernel either advances the proof state or rejects the step. Leanstral is specifically trained to interpret these proof states and kernel errors, not just to produce syntactically plausible Lean code.

What General Coding LLMs Get Wrong About Proofs

A general code model like Codestral or GPT-4o can write Lean syntax. Lean code appears in public training data. But type-checking success rates without a feedback loop are low, and the failure modes are specific.

The most common is hallucinated premises. Lean 4 proofs frequently cite Mathlib lemmas by their exact name: Real.sqrt_sq, Finset.sum_comm, Int.ediv_add_emod. A model that does not know the Mathlib namespace deeply will generate plausible-sounding names that do not exist, or that exist with slightly different types. With 150,000 lemmas in the library, premise selection is a genuine search problem, not a recall problem.

LeanDojo’s ReProver addressed this with retrieval augmentation: a bi-encoder retrieves relevant premises from Mathlib based on the current proof state, and those premises are fed into the generator as context. The model still has to choose and apply them correctly, but at least it is choosing from a relevant shortlist rather than hallucinating from scratch.

A second failure mode is the binary correctness requirement. Unlike natural language, where a mostly-correct sentence is still readable, a proof with a single wrong step fails entirely. Models trained on natural language tend to generate outputs that are roughly right, with the expectation that the reader can fill in gaps. The Lean kernel has no such tolerance.

Third: multi-step dependency. A non-trivial theorem proof might require 30 tactic steps, each building on the previous. An error at step 20 means steps 21 through 30 are meaningless. Models that generate proofs autoregressively without checkpointing intermediate states will compound errors across long proof chains.

The Trajectory That Got Here

This architecture did not appear fully formed. The lineage is traceable.

GPT-f (Polu and Han, 2020) was the first serious application of large language models to formal proof search, targeting Lean 3 and Metamath. The key contribution was using the language model as a policy in a best-first proof search: generate candidate next steps, rank them, execute the best, and expand the search tree from the resulting states. It achieved around 36% on miniF2F, a benchmark of 488 competition math problems.

LeanDojo (2023) contributed the open infrastructure: a gym-style environment wrapping Lean 4’s interactive server, plus the LeanDojo Benchmark for standardized evaluation. ReProver, their retrieval-augmented prover, reached ~26% on a harder Mathlib-derived benchmark, but more importantly gave the research community reproducible tooling.

DeepSeek-Prover-V1 (2024) showed what scale of synthetic data could do. The team generated roughly 8 million Lean 4 proof pairs by translating informal problems to formal statements, then using a draft model to generate proof candidates and filtering them with the Lean kernel. Only correct proofs were kept for training. This self-filtering pipeline drove miniF2F pass rates above 50%. DeepSeek-Prover-V1.5 added MCTS-based proof search on top, pushing further.

AlphaProof (Google DeepMind, 2024) solved four of six problems from IMO 2024 in Lean 4, combining a Gemini-based language model with AlphaZero-style tree search and a value network trained through self-play. This remains the highest-profile result in the field. It is not open-source.

Leanstral is Mistral’s open-source answer: a fine-tuned model plus the agentic scaffolding, released under terms that allow downstream use. The competitive landscape now has open-source options (LeanDojo, DeepSeek-Prover, Leanstral) alongside closed systems (AlphaProof, internal tools at Google and OpenAI). That the open-source tier exists at all is not obvious; a year ago it was less clear.

What “Trustworthy Coding” Actually Means

Mistral’s framing of Leanstral as a “trustworthy coding” tool is not purely a marketing angle. The formal verification community has spent decades arguing that theorem provers should be used for software, not just mathematics, and the argument is straightforward: if you can specify the property you care about as a type, the Lean kernel will check that your implementation satisfies it.

For a cryptographic hash function, the property might be collision resistance under a formal model of the adversary. For a parser, it might be that every output is a valid AST. For a memory allocator, it might be that allocated regions never overlap. These are not hypothetical: Lean 4’s practical programming facilities (compiled output, do-notation, FFI) make it a viable implementation language for systems-adjacent code, not just a proof scratchpad.

The practical barrier has always been the labor cost of writing proofs. A tool that can fill in proof obligations given the theorem statement and library access materially reduces that cost. This is the actual promise of Leanstral: not that it replaces the mathematician, but that it handles the tedious middle steps (the omega calls, the simp configurations, the lemma-chaining) that currently require detailed Mathlib knowledge to navigate.

Whether Leanstral delivers on that in practice depends on benchmarks that are still being evaluated against the released weights. But the architectural approach is sound, the open-source release is a genuine contribution to a field that has been dominated by lab-internal tools, and Lean 4 is the right substrate. The interesting question going forward is how much of the remaining gap between open-source provers and AlphaProof is a data problem, a search problem, or a model capability problem. Leanstral gives the community one more variable to test against.