Why Leanstral Runs on Lean 4 and Not Coq

Lean 4 has become the platform of record for AI-assisted formal proof. Not Coq, which has the longer history and larger academic install base. Not Isabelle, which has the Archive of Formal Proofs and deep roots in European verification research. Not Agda or F* or Dafny. When DeepMind built AlphaProof, they built it on Lean 4. When DeepSeek released DeepSeek-Prover-V2, it targeted Lean 4. When Mistral built Leanstral, they built it on Lean 4.

The convergence is not accidental and is not explained by Lean 4 being the most powerful or most expressive theorem prover. It is explained by three specific design decisions that happen to be exactly what AI agents need.

Why Lean 4 Won the Platform Race

The first decision is language unification. In Lean 4, tactics are ordinary Lean programs. The tactic type is TacticM Unit, a monadic computation in Lean’s do-notation, using the same typeclass dispatch and pattern matching as the rest of the language. A model fine-tuned on Lean 4 code learns the proof language as part of learning the host language; there is no separate tactic metalanguage to acquire. In Coq, the Ltac tactic language is syntactically and semantically distinct from Gallina, the term language. Writing Coq proofs requires fluency in two separate systems. Isabelle’s situation is similar. Lean 4 erased that gap in its 2021 redesign by Leonardo de Moura and Sebastian Ullrich, and the decision has downstream consequences for everything built on top of it.

The second decision is the Language Server Protocol implementation. After each tactic in Lean 4, the LSP returns structured data: the current hypothesis context, the remaining goal state, and any error messages, delivered as JSON over a stable protocol. An agent that wants to know what changed after applying simp [Nat.add_comm] reads a JSON object, not terminal output. Isabelle and Coq have REPL interfaces that historically required output parsing; Lean 4 was designed from the start to be machine-readable. That is not a minor implementation convenience. It eliminates an entire class of fragile string-matching code that would otherwise sit between the model and the kernel.

The third decision is Mathlib. Over 200,000 theorems and definitions, all kernel-verified, maintained with CI-enforced naming conventions on every pull request. The naming conventions are the structural advantage: Nat.add_comm, List.length_append, and Finset.sum_comm are predictable from their mathematical content. A model generating tactics for a goal about list lengths can construct candidate lemma names from the goal structure directly, without relying entirely on semantic similarity search over dense embeddings. Isabelle’s Archive of Formal Proofs has around 800 independent entries with varied conventions and no shared naming system, so retrieval there requires embedding-based search regardless of goal structure. At 200,000 entries, the ability to reason about names as well as meaning is a compounding advantage.

These three properties together, unified language, machine-readable LSP, and naming-disciplined scale, make Lean 4 the right substrate for an AI agent. The community built the library; the design made the interface; the result is that every serious AI proof effort since 2023 has reached the same conclusion about where to build.

The Path to Leanstral

The trajectory from GPT-f (OpenAI, 2020) to Leanstral runs through several landmarks. GPT-f was the first demonstration that transformers could suggest useful proof tactics when fine-tuned on proof corpora, working in Metamath with early results in Lean 3. Meta AI’s HyperTree Proof Search (2022) added reinforcement learning: a value network trained alongside the policy, using the proof assistant’s binary feedback as the reward signal. AlphaProof (2024) brought that approach to scale, training on synthetic problem variants generated from existing proofs, and solved four of six IMO 2024 problems at a silver medal level.

AlphaProof then closed. The weights were never released. The approach was never reproducible outside Google DeepMind. For the formal verification community, a closed research result that demonstrates ceiling performance is less useful than an open tool that people can run, modify, and adapt to their own codebases.

DeepSeek-Prover-V2 arrived as an open-weight answer in the competition mathematics direction. It targets Lean 4, achieves strong numbers on miniF2F (where top systems reach the 60-70% pass range), and uses a subgoal decomposition strategy: decompose a proof goal into a tree of subgoals, prove the leaf nodes, and assemble the proof bottom-up. The approach handles the structured mathematical reasoning that competition problems require.

Leanstral’s orientation is different. The “trustworthy coding” framing in Mistral’s announcement positions it as a tool for software verification alongside mathematical proof, not only for competition problems. That framing implies a different capability profile: navigating a codebase to understand which lemmas exist and how they compose, keeping proofs valid as type signatures evolve, interpreting Lean’s elaboration errors well enough to revise failed proof attempts, and handling the cross-module dependencies that characterize real software rather than self-contained competition problems.

The Kernel as Verification Oracle

The core of Leanstral is a generate-verify-revise loop mediated by Lean’s kernel. The kernel is approximately 5,000 lines of C++, following the LCF architecture originated by Robin Milner in the 1970s: proof terms are values of an abstract type whose constructors are exactly the primitive inference rules. User tactics are untrusted programs that transform goal states; every tactic’s output passes through the kernel before acceptance. A model that generates a syntactically plausible but logically incorrect tactic sequence receives a kernel rejection with a precise error message.

theorem length_append (l₁ l₂ : List α) :
    (l₁ ++ l₂).length = l₁.length + l₂.length := by
  induction l₁ with
  | nil  => simp
  | cons hd tl ih => simp [ih, Nat.add_assoc]

Each tactic application produces a distinct goal state that the LSP reports to the agent. If the model generates a wrong tactic at the cons case, the LSP returns the remaining unsolved goal rather than silently accepting it. The loop terminates only when every subgoal is closed by the kernel, not by the model’s confidence.

That property distinguishes proof search from code generation. Code generation feedback comes from a test suite, which is finite and incomplete. Proof search feedback comes from the kernel, which is mathematically definitive. Reinforcement learning on kernel feedback is tractable in a way that RL on test suites is not, because the kernel’s signal is noiseless.

The Sorry Problem

The hardest design problem for a proof agent is not generating correct tactics. It is avoiding sorry.

In Lean 4, sorry is a tactic that closes any goal unconditionally, producing only a warning. An agent that uses sorry to move past a difficult subgoal produces output that type-checks and compiles but contains unverified obligations. The kernel marks any compiled artifact as axiom-dependent on sorry, which propagates through the dependency graph, but in a codebase with hundreds of modules this is easy to miss.

An agent that inserts sorry to make progress is worse than no agent. The codebase appears verified while the actual verification coverage is incomplete. The gap between apparent and actual status is invisible until something downstream fails for reasons the sorry was masking. Sorry hygiene is the requirement that a proof agent must either close a goal correctly or report failure explicitly, never use sorry as a progress placeholder. Meeting this requirement is a scaffolding problem as much as a model quality problem, and it is the primary thing that distinguishes a trustworthy coding tool from a fast but unreliable one.

What Open Weights Enable Here

The cases where formal verification matters most tend to involve proprietary code: cryptographic protocol implementations, safety-critical embedded systems, financial settlement logic, formally verified compiler components. These cannot be sent to a cloud API without leaking the codebase. An open-weight model that runs locally removes that constraint entirely.

Domain-specific fine-tuning also becomes possible. A team verifying cryptographic primitives has different proof patterns than a team working on operating system components or financial contracts. The LeanCopilot project, built on the LeanDojo framework from MIT and Caltech, demonstrated that an agent loop integrated into Lean’s VS Code extension is practically useful for real proof development. Leanstral builds on that infrastructure while bringing Mistral’s fine-tuning resources to the model component.

The Gap miniF2F Does Not Measure

miniF2F measures pass rate on competition mathematics problems formalized in Lean 4, with relevant imports already in scope and self-contained proof obligations. It does not measure navigating a codebase to find which of 200,000 Mathlib lemmas applies to a given goal, handling elaboration errors that require understanding Lean’s type inference to diagnose, or keeping proofs valid as the codebase evolves.

The scenario that no current benchmark evaluates is proof repair. A function’s type signature changes in a core module, and several dozen downstream proofs break. Each presents a different error. Some need a tactic swap; some need structural rethinking; some reveal the original proof relied on a coincidental corollary of the old type. The agent must identify which obligations changed, determine which subgoal structures remain valid, and generate patches for the rest, all without introducing sorry anywhere. This is the central workflow for software verification as an engineering practice, and the distance between current benchmark performance and handling it reliably is the honest measure of where the field stands.

Leanstral’s “trustworthy coding” framing points directly at this scenario. Whether the current version handles it is not yet clear from Mistral’s announcement. But framing the problem correctly, as software verification rather than competition mathematics, is the necessary precondition for building toward it, and having open weights is how that building actually happens.