Leanstral and the Open-Source Case for AI-Assisted Formal Verification

Mistral AI’s Leanstral landed on Hacker News with 746 points and the kind of energy you see when something closes a gap the community has been watching for a while. The gap is this: the most impressive AI theorem proving result in recent memory, DeepMind’s AlphaProof solving 4 out of 6 IMO 2024 problems at gold-medal level, is entirely closed. No weights, no code, no training pipeline. Leanstral is a complete open-source release: model, data extraction pipeline, training code, evaluation harness, and agent scaffold. That’s the headline, but the more interesting story is what the system actually does and why Lean 4 specifically enables this kind of AI interaction.

What Lean 4 Gives You That Other Proof Assistants Don’t

Formal verification has existed for decades. Coq, Isabelle, Agda, ACL2, and PVS all have substantial research communities and industrial users. The reason nearly all recent AI-assisted proof work converges on Lean 4 is not fashion; it’s a set of specific architectural properties.

Lean 4 has a small, trusted kernel written in C++ that implements only the core type-checking rules for its Calculus of Constructions-based type theory. Everything else, the elaborator, the tactic engine, the macro system, live outside the kernel and produce terms that get re-checked by it. This is the de Bruijn criterion applied rigorously: you only have to trust the kernel, and the kernel is small enough to audit. When an AI agent submits a tactic step, Lean’s kernel either accepts it or rejects it with a precise error message. There is no ambiguity.

The tactic engine itself is written in Lean 4, not in a separate metalanguage the way Coq’s Ltac is. Tactics operate in the TacticM monad, which is StateT MetavarContext (ReaderT Context (EIO Exception)). This means writing new tactics is ordinary Lean programming, and the proof state exposed to those tactics is the same structured representation the AI agent sees. The state after each tactic is a list of remaining goals with all types fully elaborated. The feedback is clean, typed, and machine-readable by design.

Then there’s Mathlib4, the community mathematics library. Over 150,000 theorems and definitions, active development from hundreds of contributors, covering everything from basic number theory to condensed mathematics. The sheer scale of Mathlib4 creates a training corpus that dwarfs what’s available for any other proof assistant. Every tactic invocation in every Mathlib4 proof is a labeled (proof state, tactic) pair that can be extracted for supervised learning.

Coq has a longer history and a large Archive of Formal Proofs, but its tactic language is a separate metalanguage with weaker tooling, and its LSP support and programmatic interaction capabilities lag Lean 4’s. Isabelle has Sledgehammer, a mature integration with external ATP solvers that can close many arithmetic and logical subgoals automatically, something Lean 4 still lacks at the same level. Agda is more research-oriented with cubical type theory support but deliberately avoids classical logic, which limits its use for mainstream mathematics. Lean 4 hits a pragmatic sweet spot: strong tooling, large corpus, fast compilation, classical logic when you need it.

The Proof Repair Loop

Leanstral is not a completion model. It’s an agent, and the distinction matters. The core loop looks like this:

state = initial_proof_state(theorem)
while not state.is_closed():
    tactic = model.generate(state)
    result = lean_kernel.apply(state, tactic)
    if result.is_error():
        state = state.with_error_context(result.error_message)
    else:
        state = result.new_state
    if budget_exhausted():
        break

When the kernel rejects a tactic, it returns an error that says precisely what went wrong: a type mismatch with the expected and actual types spelled out, a variable not in scope, a universe level conflict. The agent feeds this error back into its context alongside the unchanged proof state and generates a corrected tactic. This is qualitatively different from code completion where failures are often silent or require running a test suite. Lean’s kernel gives structured, immediate, verifiable feedback on every single step.

This feedback loop is why LeanDojo, the Python library that provides programmatic access to Lean 4’s elaborator and tactic engine, became the foundation for most serious research in this space. ReProver (NeurIPS 2023) used LeanDojo to release the first large-scale open benchmark: 98,734 theorems from Mathlib4 with tactic traces. ReProver with retrieval augmentation achieved 48.8% on that benchmark. Leanstral builds directly on this infrastructure.

The choice of search strategy matters a lot. The simplest approach is sequential attempts: try a sequence of tactics, repair on error, abandon if stuck. Best-first search maintains a priority queue of partial proof states scored by the model’s log-probability of the generated tactic, expanding the most promising state at each step. Beam search keeps the top-k partial proofs. AlphaProof uses MCTS with the language model as both policy and value network, which is substantially more compute-intensive but allows deeper planning. Leanstral’s agent scaffold supports multiple search strategies and releases all of them.

The repair behavior specifically requires training on (error, fixed tactic) pairs, not just (proof state, tactic) pairs. Models fine-tuned only on successful tactic sequences struggle when they hit a novel error message because they’ve never seen error context during training. Leanstral’s training data includes error trajectories extracted from proof development histories in Mathlib4, where proofs were revised and corrected.

What Leanstral Actually Adds

The landscape before Leanstral already had strong open-source entries. DeepSeek-Prover-V1.5 achieves around 60% on the miniF2F benchmark with tree search, which is the standard 488-problem evaluation set of formalized competition mathematics problems from AMC, AIME, and IMO. InternLM2-Math is another competitive fine-tuned system. What Leanstral adds is specifically:

Mistral’s model architecture and tokenizer applied to this domain. Mistral models are known for efficient inference and strong instruction-following per parameter count. Instruction-following quality matters for tactic generation because the prompt encodes a structured proof state with specific formatting, and the model must produce output that parses correctly as a Lean 4 tactic. A model that reliably follows formatting constraints makes fewer recoverable errors in the first place.

Full pipeline reproducibility. DeepSeek-Prover released weights but the data extraction and training details were sparse. Leanstral releases the complete pipeline from Lean 4 source to trained agent, which means researchers can retrain on updated Mathlib4, experiment with different base models, or apply the same methodology to other Lean 4 libraries outside the math domain.

The agent scaffold as a first-class component. Leanstral treats the scaffolding, the proof repair loop, the search strategy, the state management, as something worth documenting and releasing separately from the model. This is the right decomposition. As base language models improve, you can swap the model while keeping the scaffold. As Lean 4 tooling evolves, you can update the scaffold while keeping the fine-tuned weights.

The Open Versus Closed Divide

AlphaProof’s IMO result was genuinely impressive and its architecture description is worth reading carefully. The key technique is reinforcement learning from proof verification: generate proof attempts, let Lean accept or reject them, and use successful proofs as training signal. Over millions of self-play iterations the model learns from its own verified successes. This is the AlphaGo approach applied to formal mathematics, and it requires the kind of compute infrastructure that a Google research lab can provide.

For the community, closed systems that achieve remarkable benchmarks are simultaneously inspiring and frustrating. The benchmark number gets published, but the methodology that produced it can’t be built on, audited, adapted, or reproduced. Leanstral’s miniF2F scores are more modest than DeepSeek-Prover’s 60%, which is itself far below AlphaProof’s IMO performance. But Leanstral is fully open, and that’s a different kind of value.

The history of the field shows why this matters. GPT-f (OpenAI, 2020) proved transformers could do useful premise selection for Metamath proofs. That paper was open. The broader community built on it. HyperTree Proof Search (Meta AI, 2022) applied MCTS to Lean 3 with a transformer policy network. That paper described the approach in detail. LeanDojo and ReProver built on those ideas and released their code and data. Leanstral continues that lineage. The closed AlphaProof result sits outside it, impressive but not directly buildable-upon.

Where the Problems Still Are

MiniF2F is increasingly saturated for strong systems, which is a sign that the benchmark was well-constructed but not that the problem is solved. The harder challenges are clearer now than they were two years ago.

Proofs requiring hundreds of tactic steps remain largely out of reach for pure search-based approaches. The issue is not raw intelligence but proof decomposition: a human mathematician facing a hard theorem doesn’t try to find the proof in one long tactic sequence. They introduce intermediate lemmas, build up theory, and structure the proof hierarchically. Teaching an AI agent to decide when to introduce a have step (an intermediate claim) versus continuing toward the main goal is a planning problem at a different level from tactic generation.

Autoformalization, translating informal mathematical prose into formal Lean 4 statements, is still largely unsolved at scale. This is the practical barrier to broader use: most mathematical knowledge lives in papers written in LaTeX and natural language, not in Lean source files. Draft, Sketch, Prove made progress by using informal proofs as intermediate representations, but the gap between a LaTeX proof and a verifiable Lean 4 proof remains wide for anything beyond undergraduate mathematics.

Lean 4 also lacks a mature equivalent of Isabelle’s Sledgehammer, which calls external SMT solvers and ATP systems to close arithmetic and logical subgoals automatically. Lean 4’s omega tactic handles linear arithmetic over integers and naturals well, and decide handles decidable propositions over finite types. But the general case of nonlinear arithmetic, set theory lemmas, and algebraic manipulations that Sledgehammer handles via Vampire and E prover still requires manual tactics or careful retrieval of relevant Mathlib4 lemmas. Projects like duper and aesop are heading in this direction, but they’re not at Sledgehammer’s level yet.

What This Signals

Mistral’s investment in Leanstral makes sense if you think about where verified software development is heading. The long-term vision, which is still years away from practical reality but directionally coherent, is a pipeline where you write a formal specification in Lean 4, an AI agent produces a proof that the implementation meets the specification, and the proof is mechanically verified by the kernel. The specification becomes an executable contract, and correctness becomes a compilation artifact rather than a test result.

For that pipeline to exist, you need the proof engineering layer to be automated enough that software developers without formal methods expertise can use it. Leanstral is a step toward making that layer open, reproducible, and improvable by the broader community rather than locked inside a research lab. The benchmark numbers matter less than the infrastructure being in the open.