How the Lean Community's Naming Conventions Became the Foundation of AI Theorem Proving

Mathlib’s naming conventions predate every significant AI theorem proving effort. The rule that theorems should be named after what they prove, using a Namespace.thing_aspect pattern, was a community decision made for human reasons: consistency, discoverability, the ability to find a lemma by guessing its name. Nat.add_comm is the commutativity of natural number addition. List.length_append is the fact about list length and append. Finset.sum_congr is the congruence lemma for finite sums. A developer working in Lean 4 can often find the right lemma without searching, because the name is inferable from the goal.

That predictability turns out to be precisely what machine learning systems need from a proof corpus.

Leanstral, Mistral’s open-source agent for Lean 4, is the latest beneficiary of infrastructure built for a different purpose. The 2021 CADE paper that introduced Lean 4 describes a theorem prover and programming language; it does not describe a machine learning dataset. But the decisions made by the Lean community in the years that followed, enforcing naming conventions through CI, building Mathlib to over 150,000 theorems, maintaining consistent import structure, all combine into something that reads like it was designed for AI retrieval.

Why naming conventions are machine-learning infrastructure

Premise selection, the problem of finding which existing theorems are relevant to your current proof goal, has been a persistent bottleneck in AI theorem proving. When you are trying to prove that filter preserves list length up to some bound, you need to know that List.length_filter_le exists. When you are trying to show a finite sum is bounded, you need Finset.sum_le_sum. Strong tactic generation fails if the model does not know which lemma to invoke.

In corpora where naming is inconsistent, retrieval requires semantic embedding search. You embed the current goal, embed all 200,000 theorems, find the nearest neighbors, and hope the relevant lemma surfaces. This approach works reasonably well but degrades as the corpus grows, because embedding space becomes crowded and the nearest neighbors are not always the relevant ones.

In Mathlib, the naming is structural. A theorem about Nat.succ and addition will be in the Nat namespace. A theorem about List and length will involve both terms in its name. A model that has internalized the naming conventions can construct candidate lemma names from the goal structure directly, using names as a search index rather than relying entirely on semantic similarity. LeanCopilot, which integrates LLM-based premise retrieval into Lean’s VS Code environment, showed that models trained on Mathlib significantly outperform those using semantic search alone for premise retrieval, for this specific reason.

Isabelle’s Archive of Formal Proofs has around 800 independent entries with varied naming conventions and no central enforcement mechanism. Coq’s ecosystem is distributed across multiple libraries with overlapping and inconsistent naming schemes. At scale, the difference is substantial. A model learning from Lean’s Mathlib internalizes a consistent grammar for mathematical vocabulary; a model learning from a fragmented corpus learns a collection of dialects.

The CI enforcement that keeps the corpus clean

Mathlib is not just large; it is consistently formatted. Every pull request runs a CI pipeline that enforces naming conventions, checks for unused imports, and verifies that all existing theorems still pass the kernel. The pipeline is strict enough that contributors occasionally argue with it, but it has held naming discipline across years of contributions and thousands of theorems.

For a training corpus, this consistency is valuable in proportion to the corpus size. A large but inconsistent corpus requires more data to generalize from, because the model must learn to handle multiple naming patterns simultaneously. A large and consistent corpus lets the model treat naming as a generalizable skill. At 150,000-plus theorems, every new Mathlib contribution adds to a model’s understanding of the underlying structure rather than adding noise.

This is infrastructure built for human reasons. The naming conventions and CI enforcement exist because the Mathlib maintainers wanted a library that a mathematician could navigate by intuition. The ML benefit is a side effect, but it is a consequential one.

The kernel as reward signal

The other piece of the infrastructure is Lean’s kernel. At around 5,000 lines of C++, it is small enough to audit and fast enough to use as a training signal at scale. A proof term either type-checks or it does not, and the kernel tells you immediately. This binary, authoritative feedback is what makes reinforcement learning over proof search tractable. The training loop needs no human evaluation; the kernel’s output is the reward function.

DeepMind’s AlphaProof, which reached silver medal level on IMO 2024 problems, used the kernel as its reward function. The agent generates a candidate proof, the kernel verifies it, the outcome updates the model. No rubric, no rubric ambiguity, no partial credit for almost-proofs. Leanstral inherits this same reward structure. The specific contribution Mistral brings is model quality and the infrastructure to fine-tune a capable base model on Lean 4 tasks. The reward structure that makes fine-tuning useful was built by the Lean community over the preceding years.

The flywheel in practice

There is a compounding dynamic that runs in both directions. AI tools that make Lean 4 more accessible attract developers who would not otherwise engage with formal methods. Some of those developers contribute to Mathlib. More Mathlib theorems, with the same naming discipline and CI enforcement, become training data for the next generation of AI tools. Better tools attract more developers.

LeanDojo, the MIT/Caltech framework that systematized interaction between ML systems and Lean’s kernel, generated a dataset of theorem-tactic pairs from Mathlib by running Lean’s elaborator over existing proofs and recording every intermediate proof state. That dataset has bootstrapped multiple research projects. Every subsequent year of Mathlib growth makes a refreshed version of that dataset more capable.

Leanstral is one point in this cycle. Mistral fine-tuned on the available corpus, built the scaffolding that turns the model into an agent, and released the result as open source. The open-source release matters for the cycle: a closed tool benefits from Mathlib without contributing back to the community infrastructure. An open tool can be improved by the community and embedded in workflows that generate new Lean 4 proof data, feeding back into the next training run.

What the ecosystem produces

The practical output of this infrastructure is worth stating plainly. Formal verification of software has existed for decades. CompCert, the verified C compiler proved correct in Coq, took years of research effort. seL4, the formally verified microkernel, required similar scale of human investment. These are landmark achievements that also demonstrate the cost of formal verification in the absence of good tooling.

The Lean 4 ecosystem, and agents like Leanstral built on top of it, is in the early stages of changing that cost calculation. The direction is reducing the cost of routine proof work: finding the relevant Mathlib lemma, applying the standard tactic for a standard goal type, suggesting the induction principle that fits a recursive function. The parts that require genuine mathematical insight, specifying what correctness means, choosing which invariants to prove, handling the proofs that do not reduce to known patterns, remain human work. The reduction in routine effort changes who can afford to do formal verification at all.

The Lean community did not set out to build a machine learning corpus. They set out to build a usable formal mathematics library with predictable naming and rigorous CI. The AI theorem proving field, from GPT-f in 2020 through AlphaProof in 2024 to Leanstral now, has converged on Lean 4 because those human-centered decisions happened to produce exactly the kind of structured, consistent, verifiable data that makes AI systems work. Leanstral benefits from that foundation. So does the next system, and the one after it.