Why Every Serious AI Theorem Prover Runs on Lean 4

Mistral AI’s Leanstral is the latest in a line of AI theorem-proving agents that have each made the same choice: Lean 4. Google DeepMind’s AlphaProof, MIT and Caltech’s LeanDojo, the GPT-4-based COPRA, and the interactive LeanCopilot all run against the same Lean 4 kernel. This convergence was not coordinated. It reflects something structural about Lean 4 that the other major proof assistants do not have.

The interesting question is what that something is, because Leonardo de Moura and Sebastian Ullrich were not designing for AI agents when they wrote the 2021 CADE paper that described Lean 4’s ground-up redesign. They were designing for human mathematicians and software engineers who wanted tighter integration between their proofs and their programs. The choices they made for that audience turned out to be exactly the choices an AI agent needs.

Coq and Isabelle Have the Same Trusted Kernel

The usual framing for what makes proof assistants trustworthy is the LCF architecture: Robin Milner’s 1972 insight that if proof terms are an abstract type whose values can only be constructed through primitive inference rules, then user tactics are untrusted by construction. The kernel checks whatever the tactic produces; tactics cannot forge proof terms. Lean 4, Coq, and Isabelle all implement this design. On the trust architecture alone, they are equivalent.

The difference lies elsewhere. Coq’s tactic language is Ltac, a domain-specific language embedded in Coq but separate from Gallina, Coq’s term language. Writing a Coq tactic requires switching between two distinct syntaxes and two distinct semantic models. Lean 4’s tactic system works differently: tactics are ordinary Lean 4 programs of type TacticM Unit, using the same do notation, the same typeclass dispatch, the same if/else and match constructs as the rest of the language. There is no separate tactic metalanguage. A model fine-tuned on Lean 4 code is simultaneously fine-tuned on the tactic language at no additional representational cost.

Isabelle’s interface situation is more limiting. Machine clients interact with Isabelle through its Standard ML layer, which is architecturally distant from the kind of JSON-based tooling that AI agents and modern editors expect. The friction is real; it substantially slowed research velocity in Isabelle-based AI work relative to what the Lean 4 community produced in the same period.

The LSP Server as a First-Class Component

Lean 4’s Language Server Protocol implementation is not a retrofit. The LSP server ships as a first-class component alongside the kernel; it was part of the original design, not added later for editor compatibility. This architectural decision means AI agents using Lean 4 get an interface that was built to be machine-readable from the start.

The consequence is that the proof state an AI agent works with is the same proof state a developer sees in VS Code’s InfoView panel. After each tactic application, the LSP server returns the current hypotheses and goal in a structured format. An agent reads this, generates the next tactic, submits it, and reads the response. The round-trip is a standard JSON exchange over a stable protocol. There is no need for screen scraping, parsing REPL output formats that were designed for human consumption, or handling the idiosyncratic output of a separate interactive session.

Compare this to what AI researchers attempting to use Coq would encounter: SerAPI, Coq’s machine interface, serializes state as S-expressions and has historically lagged behind Coq’s own development cycle. The interface exists and research groups have used it, but it is not a load-bearing part of the system the way Lean 4’s LSP server is.

What Mathlib Provides That No Other Library Does

Mathlib4 contains over 200,000 theorems and definitions with CI-enforced naming conventions. The naming convention is not informal guidance; it is checked on every pull request. A theorem about Nat.add being commutative is named Nat.add_comm. A theorem about list length after appending is List.length_append. The names are predictable from the content, which means a model that has internalized the naming system can retrieve the right lemma by constructing its likely name from the proof goal.

Isabelle’s Archive of Formal Proofs contains roughly 800 independent entries with varied conventions. Each entry was contributed by a different team working in their own style. There is no shared naming system across the archive. Searching it programmatically requires either semantic similarity over embeddings or exhaustive retrieval; the structured name lookup that Mathlib enables is not available.

Coq’s standard library and MathComp cover substantial mathematical territory but lack Mathlib’s breadth and coherence. MathComp is internally consistent but uses a packed classes methodology that requires familiarity with specific encoding patterns. The kind of seamless cross-domain coverage Mathlib offers, where a proof in algebraic number theory can directly invoke topological lemmas that invoke measure-theoretic infrastructure, comes from the community’s decade-long effort to build a single unified library rather than a collection of independent contributions.

Mathlib is also kernel-verified training data. Every theorem in Mathlib was accepted by the same kernel used at inference time. There is no separate annotation process, no human-labeled correctness judgments, no synthetic examples that might contain subtle errors. The training signal and the reward signal are the same thing.

No Extraction Step

When a Coq proof verifies that a function satisfies a specification, the resulting verified implementation is in Gallina, Coq’s internal language. Deploying it requires extraction to OCaml, Haskell, or Scheme. The extraction step is trusted separately, and the extracted code is a different artifact from the verified specification. You have two things: a proof and a program.

Lean 4 compiles directly to native code via LLVM. The verified function is the function. There is no extraction, no separate artifact, no semantic gap to bridge between the proof and the deployed binary. A developer writing a hash table implementation in Lean 4 can prove invariants about it in the same file, using the same language, and compile the result to native code that runs without any proof-related overhead at runtime. Lean 4’s Prop universe is proof-irrelevant: proof terms that inhabit Prop are erased at compilation. The runtime pays no cost for carrying proofs alongside code.

For the “trustworthy coding” framing that Leanstral emphasizes, this matters. The end state of a Leanstral session is not a verified specification that must then be translated into deployable code. It is deployable code, with correctness proofs that the kernel has verified, that compiles via LLVM to a native binary.

Automation as Search Space Reduction

Lean 4 ships with automation tactics that collectively cover large, commonly occurring classes of proof obligations. omega decides linear arithmetic over integers and natural numbers completely. ring discharges polynomial equalities in commutative rings. norm_num handles numeric computations. decide evaluates decidable propositions at compile time by running the decision procedure in the kernel. aesop performs configurable rule-based proof search with attribute annotations.

These are not convenience features for AI agents. They represent a substantial reduction in the search space. A proof obligation in linear arithmetic that might require a twenty-step manual argument is a single omega call. An agent that knows when to invoke these tactics avoids the search entirely for those subgoals. The uniform applicability of these tactics across the Mathlib corpus, reinforced by Mathlib’s own heavy use of them, means fine-tuning data is full of examples demonstrating exactly when each automation applies.

-- omega handles the arithmetic entirely; no manual case analysis
theorem mod_add_div (m k : ℕ) (hk : k ≠ 0) : m % k + k * (m / k) = m := by
  omega

-- ring handles the algebraic identity
theorem sq_sub_sq (a b : ℤ) : a ^ 2 - b ^ 2 = (a + b) * (a - b) := by
  ring

An agent generating proofs of the form above is doing something qualitatively different from an agent guessing sequences of low-level inference steps. The automation is expressive enough that many practical proof obligations reduce to a single tactic call once the right framing is found.

The Practical Consequence for Leanstral

Leanstral benefits from all of these choices without having made any of them. The LSP interface was already designed for machine clients. The tactic language was already in the same grammar as the proofs the model is fine-tuned on. Mathlib was already a unified corpus with predictable naming. The LLVM compilation pipeline was already there. Mistral’s contribution is a fine-tuned model and an agent loop that uses these interfaces effectively, plus the decision to release everything under an open license.

That last point has practical weight. AlphaProof demonstrated what is achievable at research lab scale with closed systems and proprietary training pipelines. Leanstral demonstrates that the same architectural pattern, an LLM generating tactics and submitting them to the Lean kernel in a loop, is viable as an open-source tool that practitioners can inspect, fine-tune on domain-specific proof corpora, and deploy in environments where cloud API access is not an option. Regulated industries verifying cryptographic implementations or safety-critical embedded systems cannot send proprietary code to a cloud API. An open-weight agent that runs locally changes the cost model for those use cases.

The field’s convergence on Lean 4 is not accidental and not reversible. Mathlib grows by thousands of theorems per month. Every new theorem is training data for future systems. The feedback loop between Lean 4 development and AI proof research is self-reinforcing: better tools produce more formalized mathematics, which produces better training data, which produces better tools. Leanstral enters that loop as a deployable, open-weight participant, which is what the community needed at this stage.

None of this was planned. De Moura and Ullrich designed a better theorem prover and programming language for human beings. The LSP-first interface, unified tactic language, LLVM backend, and the Mathlib community’s coherent naming discipline were all decisions made for ergonomics. They happened to be exactly the right decisions for an AI agent operating in the same environment.