Leanstral and the Proof-Repair Loop That Open-Source AI Theorem Proving Needed

When DeepMind published AlphaProof’s results in July 2024, the system solved four of six International Mathematical Olympiad problems at a silver medal level. That was a genuine capability achievement. The limitation was access: AlphaProof is proprietary, trained on infrastructure most teams cannot replicate, with weights that have never been public. The research community could study the results but not reproduce or build on the system.

Leanstral, announced by Mistral AI, is the open-source answer to that problem. It is an agent designed for formal proof engineering in Lean 4, and its release changes what independent researchers, open-source math libraries, and software verification teams can actually do.

Why Lean 4

The choice of Lean 4 as the target language is not arbitrary. Several serious theorem provers are in active use: Coq, Isabelle/HOL, Agda, and others all have communities and libraries. Lean 4 has accumulated structural advantages that make it the most productive target for LLM-assisted proving.

The first advantage is Mathlib4, the largest single formalized mathematics library in existence. Mathlib4 contains over 150,000 theorems spanning undergraduate and graduate mathematics, and its growth rate has accelerated as the Lean 4 community expanded. For any LLM being trained or fine-tuned on formal mathematics, this corpus is the primary source of signal. More formalized proofs means better coverage of proof patterns, tactic idioms, and lemma libraries. Coq has Mathcomp and the standard library; Isabelle has the Archive of Formal Proofs. Neither approaches Mathlib4’s density in pure mathematics.

The second advantage is Lean 4’s tactic state feedback. When you write a tactic proof in Lean 4, the type checker shows you the proof state at each step: the hypotheses available, the goal to prove, and what changed after each tactic application. This is not just convenient for humans. It gives an LLM a structured signal at every decision point. The model does not have to predict a complete proof from scratch; it observes the current state and suggests the next move. This transforms proof generation from a one-shot completion problem into something closer to a search problem with dense, formally grounded feedback.

example (n m : ℕ) (h : n < m) : n ≤ m := by
  -- Proof state: h : n < m  ⊢  n ≤ m
  exact Nat.le_of_lt h
  -- Proof state: no goals

Even in a simple case, the structure is visible: the tactic state is explicit before and after each move, and the type checker immediately confirms or rejects it. For complex proofs spanning dozens of steps, this intermediate state visibility is what makes repair loops tractable. The model always knows exactly where it stands.

The third advantage is programmatic access. Lean 4 ships with a full Language Server Protocol implementation, meaning external tools can query proof states, check individual tactics, and observe errors without reimplementing the type checker. A system verifying tactics step-by-step can do so through the LSP rather than constructing a bespoke interface. This is the infrastructure that makes agent-style interaction practical to build.

The Lean 4 paper from 2021, by Leonardo de Moura and Sebastian Ullrich, laid out a system designed for both theorem proving and general-purpose programming. The bet that these two use cases could coexist productively in a single language has paid off: the programmer-friendly ergonomics brought in contributors who then formalized mathematics, which attracted AI researchers who needed a rich training corpus, which brought in more contributors. The feedback loop compounded.

The Proof-Repair Loop

Leanstral is described as an agent rather than a completion model, and the distinction carries real weight. A pure completion approach takes a problem statement and generates a proof in a single pass. The problem is that formal proofs are zero-tolerance: the type checker either accepts the proof or it does not, and a single wrong step invalidates the entire chain. One-shot generation against that constraint fails at a rate that makes it practically useless for anything beyond toy examples.

The agent approach replaces one-shot generation with an iterative loop. The model proposes a proof or a partial proof, Lean’s type checker evaluates it, and the errors, including the current proof state at the point of failure, are returned to the model. The model revises its attempt using this feedback. The loop continues until the proof succeeds or some computational budget is exhausted.

This pattern appeared earlier in research systems like LeanDojo and ReProver from Stanford, which demonstrated retrieval-augmented theorem proving using Lean 4’s infrastructure. Leanstral packages the same core loop as a production-quality, openly licensed system built on a competitive base model.

The proof-repair loop has a property that distinguishes it from other LLM feedback loops: the feedback is formally correct. When a compiler returns a borrow error in Rust, some interpretation is still required to connect the error to the underlying intent. When Lean returns a type mismatch or an unsolved goal, it is giving an exact description of the failure in the language of the type theory itself. The model receives machine-verified information about the state of the proof. This makes the feedback loop considerably cleaner than test-failure-driven code repair, where the gap between test signal and underlying bug can be large.

There is also a useful affordance in Lean 4’s sorry tactic, which allows proofs to be stubbed out. A complex proof can be decomposed into subgoals, each admitted with sorry initially, and then filled in one by one. This lets the agent work on pieces of a proof without requiring the entire structure to typecheck simultaneously, reducing the difficulty of each individual generation step considerably.

theorem complex_result (a b c : ℝ) (ha : 0 < a) (hb : 0 < b) (hc : 0 < c) :
    a / (b + c) + b / (a + c) + c / (a + b) ≥ 3 / 2 := by
  have h1 : 0 < b + c := by linarith
  have h2 : 0 < a + c := by linarith
  have h3 : 0 < a + b := by linarith
  sorry  -- agent fills in the main inequality argument

This decomposition strategy is what separates serious proof automation from naive generation. AlphaProof used a similar insight via its search procedure; making the same capability available through a clean open-source interface is what Leanstral contributes.

What Open-Source Changes

AlphaProof demonstrated proof-of-concept capability at the IMO level, but IMO problems are not the primary use case for formal verification in practice. Software engineers verifying protocol implementations, mathematicians extending Mathlib, and researchers encoding domain models in dependent types all have different needs. They need tools that run locally, can be fine-tuned on specific domains, and integrate into existing development workflows.

A proprietary system cannot serve these use cases. The open-source release of Leanstral matters because it puts the infrastructure in the hands of people who will specialize it. Mathlib contributors can use it to discharge routine proof obligations that currently require manual effort. Cryptography verification projects, like the work being done to formalize proofs around elliptic curve arithmetic, can fine-tune it on their specific lemma libraries. Security-focused projects using Lean 4 for specification and verification can embed it in their CI pipelines. This broadened accessibility changes who can use formal methods at all, not just whether formal methods can reach higher mathematical levels.

Formal verification has a long history of excellent tools that remained confined to specialists. Coq was the dominant prover for proof-carrying code and certified compilers for decades, producing landmark systems like CompCert, a formally verified C compiler. Isabelle/HOL has anchored hardware and protocol verification work. The knowledge required to use these tools effectively kept them out of most engineering organizations. AI assistance lowers that barrier, but only if the AI tooling is accessible. A proof assistant paired with a locked-down model is still a specialist tool.

The miniF2F Lens

The miniF2F benchmark, which covers 488 formal mathematics problems drawn from AMC, AIME, and IMO competitions, has become the standard measure for AI theorem proving systems. It is worth understanding what it does and does not capture. The benchmark tests whether a system can generate machine-checkable proofs for competition problems, most of which are a few lines in a skilled human prover’s hands. Performance on miniF2F correlates with general proof automation capability, but it does not directly measure usefulness for software verification or large-scale Mathlib development.

The systems that have performed well on miniF2F share a common pattern: they use Lean 4 as the target language, they employ some variant of the proof-repair loop, and they retrieve relevant lemmas from Mathlib during generation. Leanstral fits this pattern and extends it with Mistral’s model quality and the engineering rigor of a production release.

The Practical Picture

For most software engineers, Lean 4 and formal verification remain niche. The overhead of writing machine-checked proofs for production software is still significant, even with AI assistance. Leanstral does not remove that overhead overnight.

What it changes is the marginal cost of extending proofs once you have started, and the barrier to getting started at all. The combination of a capable open model, a loop that exploits Lean 4’s structured feedback, and the Mathlib4 corpus creates conditions where incremental improvement compounds. Researchers who were blocked on proof automation have a new tool to iterate with. Educators teaching formal methods have something students can run without institutional access to proprietary systems. Teams that want to experiment with formally specifying parts of their codebase have a lower entry point.

The parallel to earlier AI coding tools is imperfect but instructive. When AI-assisted code completion first appeared, the immediate impact was on boilerplate; the deeper impact accumulated over time as tool quality improved and usage shaped the ecosystem. Leanstral begins a similar loop for formal verification, and it starts from a stronger technical foundation than those tools had. Lean 4’s type-checked feedback is more precise than test results or linter output, and Mathlib4’s scale gives the model an unusually rich training surface. The conditions for rapid iteration are present.