Mathlib Is the Infrastructure: Why Lean 4 Became the Center of Gravity for AI Proof Research
Source: hackernews
Mistral AI released Leanstral, an open-source agent for formal proof engineering in Lean 4. The capability claims are worth examining, but the more interesting story is the infrastructure the model sits on top of. Three major interactive theorem provers have large, active communities: Coq (INRIA, late 1980s), Isabelle/HOL (Cambridge and Munich, 1986), and Lean 4 (de Moura and Ullrich’s 2021 ground-up redesign). All three have been used for serious verified mathematics and verified software. Coq verified the CompCert C compiler and the four-color theorem. Isabelle/HOL verified the seL4 microkernel. The difference that pushed LLM-assisted proof research toward Lean 4 specifically has little to do with the expressiveness of their type theories; it has to do with corpus coherence.
Why Corpus Coherence Matters for Machine Learning
Coq and Isabelle have substantial formal libraries, but those libraries accumulated incrementally from independent groups with different conventions. Isabelle’s Archive of Formal Proofs contains around 800 entries. They are individually useful, but the naming conventions, tactic styles, and structural organization vary across entries and cannot be assumed consistent. Training a language model on a heterogeneous collection of Isabelle entries means learning many overlapping dialects rather than one coherent vocabulary.
Lean 4’s Mathlib was built as a single repository from the beginning, with a deliberately chosen naming convention enforced by CI. The convention is explicit enough to make theorem names predictable from content: a theorem named Nat.add_comm states commutativity of addition on naturals; List.length_append states that the length of a concatenation is the sum of lengths; le_trans expresses transitivity of the ≤ relation. This is not informal documentation. The naming standard is specified in the Mathlib contribution guide and automatically checked. A model trained on Mathlib learns this vocabulary and can generate theorem names likely to match existing lemmas, which is directly useful when the agent needs to invoke a specific premise at a proof step.
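The convention is easiest to see in the statements themselves. A sketch in Lean, with the signatures paraphrased in comments (exact implicit arguments and typeclass assumptions may differ slightly across Mathlib versions):

```lean
-- The name encodes the statement's shape:
#check @Nat.add_comm
-- ∀ (n m : ℕ), n + m = m + n

#check @List.length_append
-- ∀ {α} (as bs : List α), (as ++ bs).length = as.length + bs.length

#check @le_trans
-- ∀ {α} [Preorder α] {a b c : α}, a ≤ b → b ≤ c → a ≤ c
```

Given only a goal like `xs.length + ys.length = (xs ++ ys).length`, the convention lets a reader (or a model) guess `List.length_append` before ever searching the library.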
The library currently contains over 200,000 theorems and definitions, covering real and complex analysis, algebraic geometry, category theory, number theory, topology, and most of undergraduate mathematics. The coverage is not just large; it is interconnected. Theorems cite other theorems with explicit dependency edges, organized into a namespace hierarchy that reflects mathematical structure. The model that trains on this learns proof strategy alongside syntax: which automation tactics handle which goal shapes, how to navigate the library hierarchy, how to decompose a goal using lemmas from two related areas.
How the Library Structure Is Exploited at Inference Time
The LeanDojo project at Caltech and MIT turned Mathlib’s coherence into an explicit inference-time tool. Their ReProver system splits each proof step into two phases: before generating a tactic, a retrieval model ranks Mathlib theorems by likely relevance to the current proof state. The retrieval model is trained on proof co-occurrence: lemmas used near each other in past Mathlib proofs tend to be relevant to similar goals. The generated tactic can then reference retrieved lemmas with high confidence that they exist and have compatible types.
This retrieval step addresses what is called the premise selection problem. At any proof state, the lemma that unlocks progress might be a named result buried several namespaces deep in a part of Mathlib the model rarely encountered during training. Retrieval from a coherent, consistently named library is meaningfully better than retrieval from a noisy corpus, because the naming conventions let the retrieval model operate on semantic content rather than surface string similarity. ReProver reached around 57% on the miniF2F benchmark using a comparatively small model; state-of-the-art systems have since moved toward 65 to 70 percent on the same benchmark.
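The retrieval idea can be sketched in a few lines. This is a toy illustration, not ReProver's actual architecture: the hand-written three-dimensional vectors stand in for embeddings a learned encoder would produce, and the lemma names are the Mathlib examples from earlier.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_premises(goal_vec, premises):
    """Rank candidate lemmas by similarity to the current proof state.

    premises: dict mapping lemma name -> embedding vector.
    Returns lemma names, most relevant first.
    """
    return sorted(premises, key=lambda n: cosine(goal_vec, premises[n]),
                  reverse=True)

# Toy embeddings standing in for a trained encoder's output.
premises = {
    "Nat.add_comm":       [0.9, 0.1, 0.0],
    "List.length_append": [0.1, 0.9, 0.1],
    "le_trans":           [0.0, 0.2, 0.9],
}
goal = [0.8, 0.2, 0.1]  # embedding of the current proof state
print(rank_premises(goal, premises))
```

The real system embeds proof states and lemma statements with a trained encoder and retrieves from all of Mathlib, but the ranking step at inference time is structurally this simple.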
The Liquid Tensor Experiment as a Case Study
The most concrete illustration of what Mathlib enables at scale is the Liquid Tensor Experiment, completed in 2022. Peter Scholze, a Fields Medal recipient, posted a challenge to the formal mathematics community in 2020: formalize a key theorem in his condensed mathematics program, a result he was confident was correct but whose proof he found technically demanding enough that external verification would be meaningful. The Lean community undertook the formalization over roughly two years, and Scholze described the outcome as a genuine external check on his own mathematical work.
The significance for Leanstral is indirect but real. The formalization required not just writing the theorem statement but extending Mathlib with hundreds of new intermediate lemmas in category theory, homological algebra, and condensed mathematics. Each of those extensions is now part of the library, available as training data and as a retrieval target for future agents. The community effort that verified a cutting-edge mathematical result simultaneously expanded the corpus that makes AI-assisted proof search tractable. This compounding is a structural feature of how coherent mathematical libraries grow: significant formalizations leave permanent contributions to the ecosystem.
What AlphaProof Demonstrated About This Infrastructure
DeepMind’s AlphaProof (2024) demonstrated what becomes possible when a model trained on this infrastructure is combined with reinforcement learning using the Lean kernel as the reward signal. Together with AlphaGeometry 2, the system solved 4 of the 6 problems from IMO 2024, a silver-medal score, with AlphaProof handling the three non-geometry solves, including problems never previously formalized in Lean, without human annotation of proof quality. The Lean kernel provided the training signal: the model generated proof attempts, the kernel accepted or rejected them, and the model learned from millions of these binary verdicts.
This is what the coherent library made possible beyond the initial training. The kernel’s binary feedback is unambiguous and cheap to produce, and it requires no human labeling. A model can generate millions of incorrect proof attempts and learn meaningfully from each rejection, because the error reports precisely where the proof failed. This self-improving loop is only tractable because the model’s generated tactics are syntactically valid enough to interact with the kernel in the first place, which in turn requires having been trained on a consistent, large-scale formal library.
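The shape of that loop can be sketched with a mock kernel. Everything here is illustrative: `kernel_check` is a stand-in for the Lean kernel that accepts exactly one tactic, and the multiplicative weight update is a crude surrogate for a real policy-gradient step.

```python
import random

def kernel_check(tactic: str) -> bool:
    # Stand-in for the Lean kernel: in this toy setup, only one
    # candidate tactic actually closes the goal.
    return tactic == "ring"

def train(candidates, rounds=500, seed=0):
    """Toy RL loop: sample a tactic, check it, reweight by the verdict."""
    rng = random.Random(seed)
    weights = {t: 1.0 for t in candidates}
    for _ in range(rounds):
        # Sample a tactic proportionally to its current weight.
        r = rng.uniform(0, sum(weights.values()))
        for t, w in weights.items():
            r -= w
            if r <= 0:
                break
        # The kernel's binary verdict is the entire reward signal.
        if kernel_check(t):
            weights[t] *= 1.1   # reinforce accepted attempts
        else:
            weights[t] *= 0.95  # down-weight rejected attempts
    return max(weights, key=weights.get)

print(train(["simp", "ring", "omega", "decide"]))  # converges to "ring"
```

No human ever labels a proof attempt here; the verifier's accept/reject verdict is the only feedback, which is the property that makes the loop scale to millions of attempts.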
What Open Weights Enable That Cloud APIs Cannot
AlphaProof demonstrated the architecture and remains proprietary. Leanstral’s open-weight release changes what is accessible outside the research-lab setting.
Formal verification is most valuable in environments where sending specifications to a cloud API is not acceptable: aerospace systems under DO-178C certification, cryptographic protocol design for sensitive systems, organizations subject to data residency or export control requirements. A locally deployable model with auditable weights is a categorically different tool from a cloud-hosted service, regardless of how the two compare on raw capability.
More practically, open weights allow fine-tuning on domain-specific proof corpora. A team verifying cryptographic primitives can adapt the model to their specific type definitions and library conventions. The same reinforcement learning loop that trained the base model, generating proof attempts and using kernel verdicts as the signal, can run on a domain-specific corpus. Mathlib covers mathematical structures that rarely appear in production software verification; the base model’s capabilities in algebraic geometry may be less relevant than its grasp of data structure invariants, concurrent system properties, or parser correctness conditions. Domain-specific fine-tuning, made possible by open weights, is the bridge between mathematical benchmark performance and practical software verification utility.
The Specification Boundary
The Lean kernel verifies that a proof term has the type corresponding to a stated proposition. The proposition being correct is outside the kernel’s scope and outside what any current AI system changes. A sorting function proven to return a sorted sequence is not proven to handle the specific invariants relevant to the caller’s data structure. A verified cryptographic protocol with a flawed threat model is kernel-certified and semantically wrong. Writing the right specification is still a human problem, and it has historically accounted for roughly half of formal verification effort even before proof construction begins.
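The sorting example makes the boundary concrete. The following Lean sketch (assuming a Mathlib import for `List.Sorted`; the definition `badSort` is deliberately contrived) shows a kernel-certified proof of an inadequate specification:

```lean
import Mathlib.Data.List.Sort

-- A deliberately useless "sort": it discards its input entirely.
def badSort (l : List Nat) : List Nat := []

-- A sortedness-only specification is still provable, because the
-- empty list is vacuously sorted.
theorem badSort_sorted (l : List Nat) : (badSort l).Sorted (· ≤ ·) := by
  simp [badSort]

-- Nothing above required `badSort l` to be a permutation of `l`.
-- The kernel certifies exactly the stated proposition, no more.
```

A usable specification would conjoin sortedness with a permutation condition; choosing which conditions to state is precisely the part the kernel cannot check.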
This boundary is a fundamental property of what formal verification means, not a limitation of any particular tool or model. What Leanstral changes is the cost of the proof step once a correct specification exists. Lean 4’s 2021 paper describes a language where the specification, the proof, and the executable implementation can coexist in the same file, compiled to native code through Lean’s C backend without an extraction step. Leanstral adds an AI agent to that same file, searching for the proof while the kernel validates each candidate. The infrastructure (Mathlib, the kernel, the compiler backend, and now the open-source agent) is coherent in a way that no other formal methods ecosystem currently is. The community built that infrastructure over a decade, largely without knowing it was also building the training corpus for the systems that would follow.
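A minimal illustration of that coexistence, using only core Lean (the trivial `double` function is chosen for brevity):

```lean
-- Implementation, specification, and proof, all in one file.
def double (n : Nat) : Nat := n + n

theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double
  omega

-- The same definition runs as compiled code, no extraction step.
#eval double 21  -- 42
```

The verified definition and the executable one are the same term; there is no gap between what was proven and what runs.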