The Infrastructure Beneath the Proof: What Made Leanstral Possible

When Mistral released Leanstral, the announcement highlighted the agent architecture, the open-source release, and the Lean 4 focus. What it did not highlight, because it is foundational rather than novel, is the infrastructure that had to exist before any of this was tractable.

Formal proof agents do not work in a vacuum. They require structured training data, a machine-readable proof state interface, a retrieval system that can navigate a library of 150,000+ lemmas at inference time, and benchmarks that actually measure the right capability. Most of that infrastructure did not exist five years ago. It was built piece by piece, largely through LeanDojo and the Mathlib community, and Leanstral is among the first open-source systems to benefit from all of it being in place simultaneously.

What LeanDojo Actually Built

LeanDojo (Yang et al., NeurIPS 2023) is often cited as a benchmark dataset, but it is better understood as a structured extraction pipeline. The core problem it solves: Lean 4 proofs are structured programs, not free-form text, and extracting useful training data from them requires tracing through the elaboration process to get proof states that correspond to actual decision points in the proof search.

LeanDojo instruments the Lean 4 elaborator to record, at each tactic invocation, the complete proof state before and after the tactic runs. A proof state in Lean 4 includes the list of open goals, the local context for each goal (named hypotheses and their types), the target type, and the universe levels and metavariable assignments in scope. The resulting dataset contains (proof state, tactic, resulting state) triples extracted from all of Mathlib4.

This is significantly harder to produce than it sounds. Lean 4’s elaboration is lazy and incremental; a single tactic invocation may trigger universe unification, type class resolution, and implicit argument synthesis that changes what the goal state looks like to a downstream consumer. LeanDojo handles this by hooking directly into the elaborator rather than parsing proof text after the fact. The result is a dataset where every proof state is an accurate snapshot of what Lean’s kernel actually saw at that point in the proof.

ReProver, the retrieval model trained on LeanDojo data, uses a transformer to retrieve relevant premises from Mathlib given the current proof state. Without retrieval, an agent must either memorize all of Mathlib or reconstruct needed facts from first principles. ReProver demonstrated that lemma retrieval is load-bearing, not an optimization: it achieved roughly 26.5% pass rate on miniF2F, a significant improvement over non-retrieval baselines, because identifying the right existing lemma is often the difference between a tractable proof step and an impossible one.

Mathlib as a Training Corpus

Mathlib4 is the enabling corpus. Its naming conventions are specifically useful for retrieval: Nat.add_comm, List.length_append, Finset.sum_congr follow a module-qualified dot-notation pattern that reflects the structure of the theorem being named. A model trained on Mathlib can generalize this pattern, predicting lemma names it has not seen by analogy with the structure of names it has.

This is a significant advantage over Coq (now Rocq) and Isabelle. Both have large libraries, but Isabelle’s Archive of Formal Proofs uses less systematic naming across its thousands of entries, and Coq’s standard library and community packages have historically been more fragmented. Lean 4 enforces naming consistency through a style guide reviewed at contribution time. That decision, made for human readability, turns out to matter for machine learning systems as well. The naming pattern is learnable, which means retrieval over Mathlib is a tractable dense retrieval problem rather than a keyword lookup problem.

Scale matters too. With coverage spanning analysis, algebra, topology, number theory, measure theory, and combinatorics, Mathlib provides training signal for a wide range of proof patterns. A model trained on it has seen induction over natural numbers, quotient constructions, category theory arguments, and linear arithmetic proofs, giving it a broad tactical vocabulary before any domain-specific fine-tuning.

The REPL Interface

The practical interface between an LLM agent and Lean 4 is the REPL exposed through the lean4-repl package. The agent sends JSON commands:

{"cmd": "example (n : Nat) : n + 0 = n := by\n  omega", "env": 0}

Lean returns either success:

{"env": 1, "messages": [], "sorries": []}

or failure with a structured error:

{
  "messages": [{
    "severity": "error",
    "pos": {"line": 2, "column": 2},
    "data": "omega could not close the goal\nn : Nat\n⊢ n + 0 = n"
  }]
}

The error message includes the full proof state at the point of failure. This is the feedback channel that makes the proof repair loop work. When a tactic fails, the message says exactly what the goal was and why the tactic did not apply, which is precisely the information the LLM needs to generate a better candidate on the next attempt.

The env parameter in the protocol is also important: each successfully evaluated command returns an environment identifier encoding all definitions and theorems introduced so far. An agent can branch proof search by forking from the same environment identifier, exploring multiple continuations in parallel without rerunning earlier successful steps. This is the enabling mechanism for best-first tree search in systems like Leanstral, and it is available because Lean 4’s elaborator was designed to treat environments as immutable, composable values.

Other proof assistants have REPLs, but Lean 4’s error messages are unusually structured and precise. Coq’s error messages can be verbose and context-sensitive in ways that are harder to parse programmatically. Isabelle’s PIDE architecture exposes proof state through a different protocol that has seen less investment in machine-friendly formatting. The Lean 4 REPL is clean enough that an agent can reliably parse its output without special-casing, which keeps the scaffolding layer simple.

The AlphaProof Context

AlphaProof demonstrated in 2024 that reinforcement learning over Lean 4 proofs, with the kernel as the reward signal, can solve IMO-level problems. It solved four of six problems at the 2024 International Mathematical Olympiad using a Gemini-scale backbone, months of self-play training, and substantial compute budget per problem at evaluation time. The gap it left is the entire distribution of formal verification work below the frontier: contributing lemmas to Mathlib, verifying algorithm implementations, formalizing paper proofs, building certified software libraries.

This is the work that occupies most practitioners in the formal methods community, and it requires something they can run locally, adapt to their domain, and integrate into a development workflow. Leanstral is built for this use case. It is a Mistral base model fine-tuned on LeanDojo’s extracted training data, running an agent loop with retrieval tools, with weights and scaffolding released under an open license. The benchmark performance sits in the competitive range for open-source systems, comparable to DeepSeek-Prover from 2024, in the 40-55% range on miniF2F depending on search budget. That positions it well for practical use without requiring the compute infrastructure that AlphaProof demands.

What the Infrastructure Enables Going Forward

The Lean 4 ecosystem now has a training corpus (Mathlib), a data extraction pipeline (LeanDojo), a retrieval model (ReProver), a REPL interface (lean4-repl), benchmarks (miniF2F and ProofNet), and an open-source agent (Leanstral). These components are modular and extensible.

The natural next step is domain-specific fine-tuning. A team verifying cryptographic protocols can fine-tune on their own proof corpus. A group formalizing programming language metatheory can train retrieval models on the relevant Mathlib extension. A compiler verification project can build a specialized agent that knows the relevant library’s lemma naming patterns and proof idioms. The infrastructure supports this because each layer is separable, and because Lean 4’s metaprogramming system makes the language itself extensible without forking.

The open question is whether formal verification becomes practical enough to reach a broader population of software engineers writing high-assurance components. Leanstral reduces one of the friction points, the labor cost of proof construction, by a meaningful amount. The Lean 4 paper was published in 2021; the LeanDojo infrastructure followed in 2023; Leanstral arrives in 2026. The pace suggests that the remaining gaps, specification authoring, proof state coherence over long derivations, integration with standard software toolchains, are the targets for the next few years of work. The infrastructure to support that work is now in place.