Formal Proof as a Developer Tool: What Leanstral Gets Right

The Oracle Problem in AI-Assisted Development

When an LLM writes code, you have no immediate way to verify the output is correct. You run tests, read the diff, trust your intuition. The feedback is slow, indirect, and incomplete. Tests check a finite set of inputs; they say nothing about the rest. This is not a failure of LLMs specifically. It is a property of the task: general code generation has no ground-truth oracle.

Formal theorem proving in Lean 4 is a different task. The Lean kernel checks every proof step immediately and unambiguously. A tactic either closes a goal or it does not. An agent working in this environment gets feedback that is binary, instantaneous, and mathematically certain. Leanstral, Mistral AI’s open-source agent for Lean 4, is built around this property.

What Lean 4 Is

Lean 4 is a theorem prover and general-purpose programming language developed by Leonardo de Moura and Sebastian Ullrich, described in their 2021 CADE paper. The key design decision is that the proof language and the programming language are the same language. You write your data structures, algorithms, and correctness proofs in a single file, and the same compiler handles all of it.

The foundation is the Calculus of Inductive Constructions. Every proposition is a type, every proof is a term, and the kernel that checks this relationship is around 5,000 lines of C++. A kernel you can audit is a kernel you can trust. Whatever proof term the kernel accepts is correct; there is no escape hatch.

Proofs in Lean 4 are written using tactics, which are commands that transform proof goals:

theorem add_comm (n m : ℕ) : n + m = m + n := by
  induction n with
  | zero => simp
  | succ n ih => simp [Nat.succ_add, ih]

After each tactic, the Lean Language Server reports the current goal state. You see exactly what remains to be proved. This interactive feedback is the same mechanism an AI agent exploits at inference time: generate a tactic, check the result, generate the next one.

The Mathlib library, maintained by the Lean community, provides over 180,000 theorems and definitions covering undergraduate and graduate mathematics. Any system that can write Mathlib-idiomatic proofs has access to this entire corpus as a vocabulary of building blocks.

The Path from GPT-f to Leanstral

The idea of using language models for theorem proving has been developing for about five years. OpenAI’s GPT-f (2020) was the first serious application: fine-tune a transformer on proof corpora, use it to suggest the next tactic, verify with the proof assistant, search the resulting tree. The approach worked in Metamath and showed early promise in Lean 3.

Meta AI’s HyperTree Proof Search (2022) took this further by training a value network alongside a policy network, bringing reinforcement learning into the loop. The proof assistant served as both verifier and training signal generator.

The landmark result came in 2024 with Google DeepMind’s AlphaProof, which solved four of six problems at the International Mathematical Olympiad at silver medal level. AlphaProof trained on a massive synthetic dataset generated by taking existing problems and creating variants, then using the Lean kernel as a reward signal for reinforcement learning. The combination of scale, reinforcement learning, and binary feedback from Lean’s kernel produced something genuinely new.

Leanstral enters this space as an open-source agent, making the approach accessible without requiring the compute budget of a research lab. Where AlphaProof was a research demonstration, Leanstral is a deployable tool.

How the Agent Loop Works

The core of any LLM-based theorem prover is the search procedure. One-shot generation fails on anything nontrivial; proofs require backtracking, lemma retrieval, and goal decomposition. Leanstral operates as an agent in a loop: generate a tactic sequence, submit it to the Lean type checker, receive the resulting goal state or error, revise.

This loop has a property that code generation loops do not. Each piece of feedback from Lean is formally meaningful. A goal state after apply Nat.add_comm tells the agent precisely what remains. An error like type mismatch: expected m + n, got n + m is not a vague signal; it is an exact description of the proof obligation that was not met. A model that can read these messages and use them to revise its next move has a fundamentally better learning signal than one debugging a failing test.

Retrieval augmentation plays a central role here. Mathlib’s 180,000-plus theorems mean the relevant lemma almost certainly exists; the challenge is finding it. Systems like Copra demonstrated that retrieving relevant premises at each step substantially improves LLM-based proof search. Leanstral operates in this same setting, using Mistral’s models to retrieve useful Mathlib lemmas and generate tactics conditioned on them.

Mistral’s broader model lineup, particularly the more recent Mistral Large variants, brings substantial reasoning capacity to this task. Fine-tuning on Lean 4 proof corpora from Mathlib, competition mathematics, and synthetic problem variants gives the model the specialized vocabulary it needs. The tactic language is regular enough that models can learn it; the hard part is learning which tactic to apply in which goal context, and that is where the agent loop’s iterative feedback matters most.

Beyond Mathematics: Trustworthy Coding

The “trustworthy coding” framing in Leanstral’s announcement is the more ambitious claim. Pure mathematical theorem proving is valuable, but it operates on abstractions. Trustworthy coding means applying the same machinery to real software.

Lean 4’s design makes this plausible in a way earlier proof assistants did not. When Coq proved the correctness of the CompCert C compiler or Isabelle/HOL verified the seL4 microkernel, the proofs were written in a separate language from the implementation, creating a semantic gap that required careful bridging. In Lean 4, the function and its correctness proof coexist in the same language with the same type checker.

A practical example: a sorting function can carry a proof that its output is sorted and is a permutation of its input, as a type-level guarantee enforced at compile time:

def insertionSort (l : List Nat) : { l' : List Nat // l'.Sorted (· ≤ ·) ∧ l'.Perm l } :=
  ...

Any code that calls this function gets those guarantees for free, without runtime checks. This is what “trustworthy” means here: the compiler, not the programmer, guarantees correctness.

An AI agent that can help write these specifications and fill in the proof obligations lowers the barrier to verified software considerably. The bottleneck today is not the theory; it is the human time required to write proofs. If Leanstral can automate the routine parts of that work, the economics of formal verification change. A developer who spends two hours writing a function and six hours proving it correct is going to skip the proof. A developer who spends two hours on the function and thirty minutes on an agent-assisted proof is making a different calculation.

What Open-Source Means Here

AlphaProof demonstrated what is possible at scale. Leanstral demonstrates that a capable formal proof agent can exist as an open-source, deployable tool. This distinction matters for the field. Research demonstrations influence what is believed possible. Open-source tools influence what gets built.

The proof engineering community is small. Lean’s Zulip chat has a few thousand active users. Mathlib has a few hundred contributors. Tooling that makes formal proofs more accessible, even at the margin, has an outsized effect on those numbers. An open-source agent that a researcher can run locally, inspect, and modify is a more useful contribution to this community than a closed system with better benchmark numbers.

Mistral has consistently taken a permissive approach to model weights and tooling. Releasing Leanstral under an open license keeps the feedback loop between the research community and the tool open, which is how these systems improve.

The Remaining Gap

The miniF2F benchmark, which tests on competition math problems formalized in Lean 4, sees top systems in the 60-70% pass range. ProofNet, which uses undergraduate textbook problems, is harder; the best systems sit around 20-30%. These numbers indicate the state of the art handles olympiad-style problems reasonably well and struggles more with the kind of structured, definition-heavy reasoning that characterizes most real mathematical and software verification work.

Closing that gap requires better premise retrieval, better handling of long proof obligations, and more training data. The training data problem is circular in the best possible way: as more Lean 4 proofs are written and added to Mathlib and other libraries, future models have more to learn from. Leanstral is one node in that feedback loop.

The piece that makes formal proof engineering genuinely tractable for software correctness at scale is not fully here yet. But the direction is clear, the tooling is improving, and having Mistral’s resources behind an open-source effort moves the field forward in a way that closed research systems cannot.