Mistral’s Leanstral is the latest in a sequence of AI systems, DeepMind’s AlphaProof, the LeanDojo framework and its ReProver model, LeanCopilot, and the COPRA system built on GPT-4, all of which run against Lean 4. The concentration is notable. Coq has been in production since 1989 and carries the pedigree of CompCert, the verified C compiler. Isabelle/HOL hosts the seL4 microkernel verification and the Archive of Formal Proofs, covering thousands of verified results across mathematics and computer science. Agda and Idris have active research communities. None of them has attracted AI research volume comparable to Lean 4, which saw its first official release only in 2021.
The reasons are traceable to specific design decisions. Understanding them is useful for evaluating where AI-assisted formal verification is likely to go next.
The Interface Problem
Every AI system that interacts with a proof assistant needs a programmatic interface to that assistant. It needs to submit a tactic, observe the resulting proof state, detect errors, and branch on what it sees. The quality and stability of that interface determines how much engineering effort a research team spends on plumbing versus on the actual AI problem.
Coq’s traditional interface is coqtop, an interactive REPL driven over a terminal. For machine clients, SerAPI, developed by Emilio Jesús Gallego Arias, wraps Coq’s proof engine in a structured S-expression protocol. SerAPI works, and several AI systems have used it, but it is a layer over a tool that was not designed for programmatic access. Isabelle’s interface for AI clients runs through Isabelle/ML, the Standard ML layer at the core of the system, which is expressive but architecturally distant from how modern tooling integrates. Both ecosystems have seen AI work, including the Tactician system for Coq, but interface friction has constrained research velocity.
Lean 4 was designed from the start to integrate with the Language Server Protocol. Its LSP server is a first-class component, not a third-party adaptation. Sending a tactic and receiving the new proof state is a JSON round-trip, the same protocol that editors use to provide hover documentation and diagnostics. When AlphaProof needed to run Lean at scale for reinforcement learning rollouts, the interface was already well-specified. When LeanDojo built a gym-style environment for theorem proving, the LSP provided a clean substrate. The interface difference runs deeper than convenience: a stable, documented, machine-oriented protocol makes the proof assistant composable with other systems in a way that a REPL-first tool is not.
Tactics as Lean Programs
Coq proofs are written in Ltac, a tactic language that is syntactically and semantically separate from Coq itself. Ltac2, introduced more recently, is more principled, but the metaprogramming gap between the tactic language and the object language remains real. Writing a custom tactic in Coq requires understanding Ltac’s own operational semantics, distinct from Gallina, Coq’s core type theory.
Lean 4 eliminates this separation. A tactic in Lean 4 is a term of type TacticM Unit, where TacticM is a monad carrying proof state as an effect. Writing a new tactic means writing an ordinary Lean 4 program, using the same type system, the same do notation, and the same standard library as any other Lean 4 code. The macro system, built around Lean 4’s Syntax type and quasiquotation, extends this further: new syntax forms are defined in terms of existing ones, all within the language itself.
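A minimal sketch of what this looks like in practice, using the elab command to register a tactic as an ordinary Lean 4 program. The names here (getGoals, logInfo, the Lean.Elab.Tactic namespace) follow Lean 4's metaprogramming API, though exact namespaces can shift between toolchain versions; trace_goal_count is a hypothetical tactic name chosen for illustration.

-- A tactic defined as a plain Lean 4 program in the TacticM monad
open Lean Elab Tactic in
elab "trace_goal_count" : tactic => do
  let goals ← getGoals          -- read the current proof state
  logInfo m!"{goals.length} goal(s) remaining"

The point is that nothing here steps outside the language: the same do notation, the same monadic style, and the same standard library used for any other Lean 4 code.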
-- A custom tactic defined using Lean 4's macro system
syntax "try_ring_then_simp" : tactic
macro_rules
| `(tactic| try_ring_then_simp) => `(tactic| first | ring | simp_arith)
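A hedged usage sketch, assuming Mathlib is imported so that ring is available: on a goal that ring can close, the macro's first branch succeeds and simp_arith is never tried.

-- ring closes this commutative-semiring equality on the first branch
example (a b : Nat) : (a + b) * (a + b) = a * a + 2 * a * b + b * b := by
  try_ring_then_simp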
For an AI agent, the implication is that the action space, the set of valid tactic expressions it can generate, is coextensive with the set of valid Lean 4 programs. There is no impedance mismatch between what the language model outputs and what the proof engine accepts. Tactic syntax and Lean 4 code syntax are the same grammar. A model fine-tuned on Lean 4 code is simultaneously fine-tuned on the tactic language, at no additional representational cost.
Mathlib as a Unified Corpus
The most significant practical advantage Lean 4 has over its peers for AI training is Mathlib. The library contains over 200,000 formalized theorems and definitions covering most of undergraduate mathematics and substantial graduate-level material across algebra, analysis, topology, number theory, and category theory. All of it lives in a single repository, with consistent naming conventions and a shared tactic vocabulary maintained by active community review.
This did not happen by default. Kevin Buzzard’s sustained effort to bring working mathematicians to Lean, through the Xena Project and his Formalising Mathematics course at Imperial College, generated both contributors and theorems. The community’s deliberate choice to maintain Mathlib as a unified library, rather than a collection of independent packages, preserved the naming consistency and cross-library coherence that makes it valuable as a machine learning corpus. The migration from Mathlib3 to Mathlib4 was a coordinated community effort rather than a fragmentation event.
Isabelle’s Archive of Formal Proofs is comparable in mathematical depth but structured as a collection of independent entries, each maintained separately with its own conventions. Coq’s MathComp covers algebraic structures in depth but does not span the breadth of Mathlib’s analysis and topology coverage. For an AI training run, a unified library with consistent naming conventions across a wide domain is categorically more useful than an archive of independent entries. The model learns naming patterns that transfer across the corpus, and the retrieval-augmented approaches pioneered by LeanDojo’s ReProver can build a coherent retrieval index across the whole library.
LeanDojo’s approach made this concrete. ReProver retrieves potentially relevant premises from Mathlib before generating each tactic, using a retrieval model trained on premise co-occurrence in existing proofs. The result was competitive performance on miniF2F, the 488-problem formal mathematics benchmark, from a comparatively small model. The retrieval approach was tractable because Mathlib’s structure allows a usable index to be built; the same approach applied to a fragmented archive would produce noisier retrieval.
A Rich Automation Vocabulary
Lean 4 ships with a set of decision procedure tactics that cover large fragments of commonly-occurring proof obligations. omega decides linear arithmetic over integers and natural numbers. norm_num handles numerical computations and algebraic normalizations. ring proves equalities in commutative rings, including polynomial arithmetic. decide evaluates decidable propositions at compile time by running the decision procedure. aesop, Automated Extensible Search for Obvious Proofs, implements configurable proof search using attribute-marked lemmas.
example (n m : Nat) : n + m + 0 = m + n := by omega
example : (2 : Int) ^ 10 = 1024 := by norm_num
example (n m k : Nat) : (n + m) * k = n * k + m * k := by ring
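The remaining two tactics from the list above can be exercised the same way; a sketch assuming the Aesop rule set is in scope (it ships as its own package and is re-exported by Mathlib):

-- decide runs the Decidable instance for the proposition at elaboration time
example : 2 ^ 5 = 32 := by decide

-- aesop's default rules dispatch routine logical structure
example (p q : Prop) (h : p ∧ q) : q ∧ p := by aesop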
For an AI agent, this vocabulary is a set of high-level subroutines. A model that learns to recognize when omega will close a linear arithmetic goal does not need to generate a multi-step proof; it outputs one word. The search budget is spent on structurally interesting decisions, not arithmetic bookkeeping. The availability of aesop as a configurable search tactic means that well-understood lemma classes can be delegated automatically, leaving the agent to handle the proof obligations that require genuine structural insight.
Coq’s automation, auto, tauto, lia (the successor to the now-removed omega), and the ring and field tactics, covers similar territory but with less uniformity and fewer configuration options. The Lean 4 automation vocabulary is both broader and more consistently applicable across the Mathlib ecosystem, where attribute annotations are maintained to make aesop effective.
What Convergence Means for Leanstral
Leanstral arrives into an ecosystem that has been shaped by these design decisions. The Lean4 REPL project provides an additional programmatic interface for AI agents that need batch proof checking rather than interactive session management. The LSP server handles interactive development. Mathlib provides the corpus. The tactic-as-program design means models fine-tuned on Lean 4 general code transfer to proof tactic generation without domain mismatch.
Mistral’s contribution is making a capable, locally-deployable model available in this ecosystem. AlphaProof demonstrated that the architecture, AI tactic generation validated by the Lean kernel, works at IMO competition level. That system remains proprietary. LeanCopilot demonstrated IDE-integrated real-time tactic suggestion. Leanstral extends access to practitioners who need local deployment and auditable weights, settings where sending proofs through an external API introduces compliance problems that the kernel guarantee alone cannot resolve.
The challenges that remain, long proof chains requiring twenty or more tactic steps, introduction of auxiliary lemmas not found in Mathlib, and non-obvious problem decompositions, are not specific to Lean 4. They reflect the distance between the distribution of training proofs and the distribution of proofs required in real software verification workflows. That distribution gap closes as more formal proof engineering is done with AI assistance and the resulting proofs feed back into training.
The choice of substrate shapes how quickly that feedback loop operates. An LSP-first interface handles parallel rollouts. A unified corpus with consistent naming provides coherent training signal. Tactics as Lean programs eliminate the representation mismatch between generation and execution. Lean 4’s design concentrated these advantages in one system. The AI theorem proving field responded to what the infrastructure made tractable, and Leanstral is a product of that convergence.