Leanstral and the Architecture That Makes LLM Theorem Proving Work

Mistral released Leanstral as an open-source agent for formal proof engineering in Lean 4. The announcement landed with 746 upvotes on Hacker News and 182 comments, which is a reasonable proxy for how much the formal verification community has been waiting for something like this. But the more interesting story is not the release itself. It is why Leanstral is built the way it is, and what that design says about the current state of AI-assisted formal verification.

Why Lean 4 and Not Something Else

Lean 4 is simultaneously a theorem prover and a general-purpose programming language, a combination that sounds like an awkward compromise but turns out to be a genuine design win. The language was rewritten from scratch by Leonardo de Moura and Sebastian Ullrich, as described in their 2021 CADE paper. The core logic is the Calculus of Inductive Constructions, the same dependent type theory that underlies Coq. But where Coq extracts verified programs to OCaml or Haskell as a separate step, Lean 4 programs are Lean 4 programs directly, with a real runtime, reference-counted memory management, and multi-threading support.

What matters for an AI agent, though, is the kernel architecture. Lean 4 follows the de Bruijn principle: the trusted kernel is small, and everything else, the elaborator, the tactic engine, the LSP server, is untrusted. When a tactic succeeds, it does so by producing a kernel-level term that the small trusted core has verified. This means the kernel is a ground-truth oracle with a clean API surface. An agent can call it without worrying about whether the verifier itself has bugs; the surface area is small enough to audit.

There is also Mathlib4, the community mathematics library, now over a million lines of Lean 4 covering analysis, algebra, topology, number theory, measure theory, and more than 100,000 formalized theorems. Mathlib is significant for AI systems because it provides an enormous corpus of (goal state, tactic, resulting state) triples, which is exactly the kind of structured training data these agents need.

The Agent Loop Is Not Optional

The name “Leanstral” signals the key design decision: this is an agent, not a fine-tuned completion model. The distinction matters.

A completion model would take a theorem statement as a prompt and produce a proof attempt in one shot. This approach has been tried extensively. Even very capable models fail on non-trivial theorems because formal proofs are long, precise, and unforgiving. A single wrong token, a missing implicit argument, an inapplicable lemma, and the whole attempt is invalid. The error rate compounds across hundreds of tactic steps.

Leanstral works the way every high-performing LLM proof system works: it runs an iterative loop. The agent observes the current proof state (the list of open goals and hypotheses as reported by Lean’s LSP server), generates one or more candidate tactics, submits them to the Lean 4 kernel, reads back the result, and branches accordingly. A tactic either closes the goal, produces new subgoals, or errors. Successful and goal-producing steps become nodes in a search tree; errors are pruned. The search is guided by the model’s confidence scores, so the agent explores plausible paths first rather than searching blindly.

This structure has two properties that make it work. First, the kernel prevents hallucination. The model cannot “invent” a proof that looks plausible but is wrong, because the kernel will reject it immediately with a precise error message. Second, the loop turns a hard one-shot generation problem into a series of easier local decisions. Each tactic step advances the proof from one well-defined state to another, and the model only needs to make a locally reasonable choice, not plan the entire proof in advance.

How Leanstral Was Trained

The training methodology follows what the formal verification literature calls expert iteration. You start with supervised fine-tuning on existing Lean 4 proofs, primarily from Mathlib4, to give the model a working vocabulary of tactics and proof patterns. Then you use the model to generate proof attempts on a large corpus of problems, run those attempts through the Lean kernel, collect the ones that succeed, and add them to the training set. Iterate.

The Lean kernel is the reward function. There is no ambiguity about whether a proof is correct; the kernel either accepts it or it does not. This clean binary signal is what makes expert iteration tractable here, where it would be much messier in domains where “correctness” requires human judgment. The model gets better at proving theorems because it is trained on proofs that are, by construction, correct, and the set of provable theorems it can reach grows with each iteration.

This is the same fundamental approach that DeepMind used in AlphaProof, the system that solved four of six problems at the 2024 International Mathematical Olympiad at a silver-medal level. AlphaProof used a Gemini-based backbone and enormous compute, and it remains closed-source. Leanstral is Mistral’s open-source demonstration that the same paradigm works at a more accessible scale.

miniF2F and What the Benchmark Means

The standard evaluation target for this class of system is miniF2F, a benchmark of 488 competition mathematics problems (AMC 10/12, AIME, and IMO shortlist problems) formalized in Lean 4 using Mathlib definitions. The pass rate on miniF2F-test is the number the community uses to track progress.

To put Leanstral’s results in context: early GPT-based approaches in 2021 sat around 25-30%. HyperTree Proof Search from Meta (2022) reached around 41% using Monte Carlo tree search over Lean 3 proofs. By 2024, AlphaProof-class systems were posting results in the 65%+ range depending on search budget. Leanstral places in the competitive range for open-source systems, comparable to DeepSeek-Prover, which DeepSeek released in 2024.

The search budget is worth understanding. “Pass rate” in this domain is always conditioned on how many Lean kernel calls you allow per problem. More attempts means higher success, but also more compute. Reporting a single number without the search budget context can obscure how the tradeoff was made. Leanstral’s published numbers use specific budget configurations, and the agent architecture is designed to use that budget efficiently through guided tree search rather than random sampling.

The Gap Between Competition Math and Software Verification

MiniF2F is a useful benchmark, but it measures a specific capability: solving problems that have known elegant proofs, in a domain where Mathlib already has most of the necessary machinery. Real software verification looks different.

Verifying that a distributed consensus protocol is correct, or that a compiler pass preserves semantics, requires formalizing the specification from scratch and building a proof that may span thousands of lemmas with no obvious Mathlib analogue. The “specification gap,” writing a formal spec that correctly captures what you actually want to prove, is often harder than the proof itself. No current LLM system solves that problem; what they solve is the tactical search once a specification exists.

That said, a capable open-source agent that can automate the routine lemma-proving work in a formal development would be genuinely useful. Most large formal verification projects spend a large fraction of their proof effort on tedious but mechanically derivable steps, exactly the kind of thing a search-based agent handles well. This is the practical niche where Leanstral can contribute right now, and where the Coq-side equivalent CoqPilot and related tools have been making inroads.

Why Open Source Matters Here

The formal verification community is small and historically underserved by commercial AI tooling. Coq, Isabelle, and Lean 4 have dedicated communities but nowhere near the market size that drives investment in, say, Python tooling. AlphaProof showed that the LLM-plus-kernel paradigm works at scale, but it was closed, compute-intensive, and built for a research showcase. It did not give practicing formal methods engineers anything they could run locally.

Leanstral’s model weights on Hugging Face and agent code on GitHub change that. Researchers can fine-tune it on domain-specific corpora, whether cryptographic protocol proofs, operating system verification, or programming language metatheory. They can integrate it into their existing Lean 4 workflows. They can study the training methodology and reproduce or improve on it.

The Lean 4 ecosystem has been building toward this. The language itself is designed to be programmable; tactics and elaboration strategies are written in Lean 4 using the same macro system as the rest of the language. That programmability makes it possible to build tight integration between an external agent and the Lean kernel without fighting the tooling. The Lean 4 LSP server exposes goal states and proof checking through a standard protocol, which is what gives Leanstral’s agent loop its clean feedback channel.

Formal verification has been stuck for decades at “too expensive for most teams.” AI-assisted proof engineering will not eliminate that cost overnight, but systems like Leanstral move the practical frontier. The architecture is right: small trusted kernel, agent loop, kernel-as-reward. What improves now is the model quality, the training corpus, and the integration tooling, and all three of those can be worked on in the open.