How AI Proof Systems Bootstrap Their Own Training Data

Training data is the binding constraint in AI theorem proving. Building a model that can generate correct Lean 4 proofs requires examples of correct Lean 4 proofs, and those are scarce by nature. Writing formal proofs requires rare expertise, proceeds slowly, and produces a corpus that is narrow relative to the full space of theorems anyone might want to verify. AlphaProof reached IMO silver medal performance, but it did so by generating synthetic training data in volumes that dwarf the entire Mathlib library, at a compute budget no university group can replicate.

What Mistral’s Leanstral and DeepSeek-Prover demonstrate is that there is a structural property of formal proof systems that makes synthetic data generation unusually powerful here: the Lean 4 kernel is a perfect filter. If you generate a large volume of proof candidates and check them against the kernel, the accepted proofs form a high-quality, reliable training set automatically. The rejected ones tell you exactly where they failed. This self-reinforcing loop explains why AI theorem proving has advanced faster than almost any other AI subfield in the past three years.

What Human-Written Proof Data Looks Like

LeanDojo (Yang et al., NeurIPS 2023) built the first serious infrastructure for extracting training data from human-written Lean 4 proofs. It instruments the Lean elaborator to record, at each tactic invocation, the complete proof state: the list of open goals, the local hypothesis context, and the target type. The result is (proof state, tactic, next state) triples extracted from all of Mathlib.

The extraction produced around 100,000 proof state and tactic pairs. That corpus enabled ReProver and several early systems to achieve meaningful pass rates on miniF2F. However, 100,000 examples is a modest dataset by contemporary machine learning standards. Code models train on hundreds of billions of tokens. A Lean-specific model trained only on human-written Mathlib proofs is working from a narrow distribution.

Coverage is equally limited. Mathlib reflects the preferences of its contributors, which skew toward abstract algebra, analysis, and topology. Competition mathematics problems, the domain of miniF2F, represent a different stylistic distribution, with heavier reliance on norm_num, omega, and decide. Software correctness problems involve yet another distribution: invariant-based reasoning, state machine arguments, data structure properties. Human-written proofs from a general library do not cover these distributions evenly, which constrains how far a model trained on them can generalize.

The Synthetic Data Pipeline

DeepSeek-Prover V1 (2024) described the approach that has since become standard for training competitive proof models. The pipeline has four stages.

First, informal mathematical statements are translated to Lean 4. The DeepSeek team translated problems from the MATH benchmark, competition datasets, and internally generated problems. Translation requires a model that understands both informal mathematics and Lean 4 syntax, but it does not require the translated statements to have proofs yet, only that they be syntactically valid and mathematically meaningful.

Second, a draft model generates proof candidates for each formalized statement. The draft model can be a general code model fine-tuned on Lean 4, or an earlier iteration of the system being trained. Multiple candidates are sampled per problem with diverse temperature settings. The diversity matters: proofs of the same theorem can take very different tactic routes, and sampling temperature lets you explore that space.

Third, every candidate is checked against the Lean kernel. This step has no false positives. A proof either type-checks completely, with no sorry and no unsolved goals, or it does not. The kernel does not hallucinate or produce confidence scores. Its output is binary.

Fourth, only the accepted proofs are kept for training. The filtered dataset is used to fine-tune the next draft model. The improved model then generates higher-quality candidates, which pass the kernel at higher rates, which produces a larger filtered training set. DeepSeek-Prover V1 used this pipeline to generate approximately 8 million proof pairs, nearly eighty times the size of the LeanDojo corpus. The miniF2F pass rate climbed accordingly.

Why the Kernel Is Unusually Powerful as a Filter

Machine learning training typically works with noisy labels. Human-annotated datasets contain errors, crowdsourced labels have inter-annotator disagreement, and web-scraped data contains outright incorrect information. The standard response is scale: average over enough noisy labels and the signal dominates.

For formal proofs, there is no noise. A proof that the Lean kernel accepts is correct with respect to the axioms of Lean’s type theory: propositional extensionality, functional extensionality, the axiom of choice, and quotient types. Any model learning from kernel-accepted proofs is learning from verified correctness, not from human judgment about whether a proof looks plausible.

This is structurally different from code generation training. A code model trained on GitHub repositories learns from code that compiles and passes tests, but compilation does not imply correctness, and test suites do not cover all behavior. The signal is indirect and incomplete. A proof model trained on kernel-accepted proofs is learning from a signal that is mathematically definitive.

The error messages from rejected proofs are also informative in a way that failed tests are not. When a tactic fails, the kernel returns the exact goal state at the point of failure, the type mismatch if there is one, and the hypothesis context that was in scope. These failures are not just discards. They are training data for the repair capability that makes the agent loop useful. A model trained on (failed proof attempt, error message, corrected proof) triples learns to interpret Lean’s feedback and revise accordingly. DeepSeek-Prover V1.5 built exactly this, and the improvement over single-attempt generation was substantial.

The Autoformalization Bottleneck

The weak link in the synthetic pipeline is the first step: translating informal statements to Lean 4. This is the autoformalization problem, and it remains the hardest unsolved piece.

For competition mathematics, there is structured natural language available (problem statements from AMC, AIME, and IMO archives) that models can translate to Lean 4 with moderate reliability. Autoformalization results from 2022 showed that GPT-4 class models could formalize competition problems at non-trivial rates when given few-shot examples, but error rates remained high enough to require human verification on anything nontrivial.

For software correctness, the informal specification is often implicit or absent. A function’s docstring might say “returns the sorted version of the input list” without specifying permutation equivalence, stability under equal keys, or handling of duplicates. Formalizing that description as a Lean 4 type requires decisions about which properties to encode, and those decisions are not recoverable from natural language alone. AlphaProof’s 2024 IMO experiments required human formalization for each problem. The silver-medal result was genuinely impressive, but it carried an asterisk: the formalization step was hand-done, and it was not trivial. Closing that loop automatically would change the data economics of the entire field.

What Open-Source Changes for Domain-Specific Flywheels

AlphaProof’s synthetic data pipeline was described at a high level in DeepMind’s blog post but not released. The training infrastructure and the specifics of the self-play loop are not publicly available. For anyone outside Google DeepMind, the result is a proof of what is possible, not a tool for building on it.

Leanstral’s open-source release gives researchers and engineering teams the model weights and agent scaffolding to run their own version of this pipeline. A team working on verified cryptographic primitives can generate informal statements about their construction, formalize them in Lean 4, use Leanstral to generate proof candidates, filter with the kernel, and fine-tune for their specific domain. The kernel handles quality control. The team provides domain-specific problem statements.

This is the practical value of open weights in this context, and it is distinct from the usual open-source argument about auditability and cost. The general version of the flywheel has already been run by DeepSeek and Mistral. The domain-specific versions require someone who understands the target domain, a Lean 4 installation, and a model to bootstrap from. Leanstral removes the last constraint, which was previously the most expensive one.

The 2021 Lean 4 paper by de Moura and Ullrich described a language where theorem prover and programming language share the same core. That design decision, made for elegance and expressiveness, turned out to also give AI proof systems an unusually clean training signal. The kernel’s binary feedback is not a side effect of the language design. It is a consequence of type theory’s foundational property: a term either has a type or it does not, and checking that property is the kernel’s only job. The synthetic data flywheel works because that checking is fast, reliable, and available to any process that can speak JSON over a socket. Whether you are Mistral training a general-purpose prover or a compiler team verifying their register allocator, the oracle is the same.