Proof Search Is the Hard Part: What Makes a Lean 4 Agent Different from a Code Generator
Source: hackernews
The standard complaint about LLMs and code generation is that models produce plausible-looking output that can be subtly wrong. For most programming languages, catching that wrongness requires test suites, human review, or downstream failures. Lean 4 changes the feedback loop entirely: every tactic application either advances a proof state or produces an error, and the kernel either accepts a completed proof or rejects it, with no probability involved. Mistral’s Leanstral is built around this property, and understanding why it matters requires understanding what proof search actually involves.
One-Shot Generation Does Not Work for Proofs
General code generation models, given a function signature and docstring, produce a function body in one forward pass. The output may be wrong, but the failure mode is tolerable for many tasks: run the tests, see what breaks, iterate. For Lean 4 proofs, one-shot generation fails on anything non-trivial because the proof search space is too large and too sensitive to intermediate decisions.
Consider a moderately sized lemma in Lean 4. The proof might require five to twenty tactic applications, where each tactic is chosen from dozens of available tactics, applied with arguments drawn from the current hypotheses, and where a wrong choice at step three causes failure at step twelve in a way that gives no useful gradient signal if you are only checking the final result. A model that generates a twenty-step proof in one pass faces an exponential combinatorial space; even if each individual step is 80% likely to be correct, a twenty-step chain has roughly a 1% chance of being fully correct end-to-end.
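That back-of-envelope number is easy to check in Lean itself:

```lean
-- Chance that a 20-step chain is fully correct when each step
-- independently succeeds with probability 0.8.
#eval Float.pow 0.8 20  -- ≈ 0.0115, i.e. roughly 1%
```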
This is why every serious AI system for Lean 4 proofs operates as a search procedure rather than a one-shot generator. The distinction between Leanstral-as-agent and a general code completion model is architectural: the agent receives proof state observations, generates tactic candidates, submits them to the Lean server, and branches on the result.
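One round of that loop can be sketched as follows. The names here (`ProofState`, `genCandidates`, `applyTactic`) are illustrative stand-ins, not a real API; an actual agent would talk to the Lean language server over its RPC protocol.

```lean
-- Illustrative type: a proof state is just its list of open goals here.
structure ProofState where
  goals : List String

-- Stand-ins for the model and the Lean server (hypothetical signatures).
opaque genCandidates : ProofState → List String
opaque applyTactic : ProofState → String → Option ProofState

-- One round of the agent loop: observe the state, propose tactics,
-- submit each to the checker, and keep the branches that survive.
def step (s : ProofState) : List (String × ProofState) :=
  (genCandidates s).filterMap fun tac =>
    (applyTactic s tac).map fun s' => (tac, s')
```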
Proof States as Observations
Lean 4 exposes its proof engine through an LSP server that responds to tactic applications with the resulting proof state. A proof state is a structured object: a list of hypotheses, each with a name and type, and a goal, the proposition that remains to be proved. Every tactic transforms one proof state into zero or more new states.
-- Initial state:
-- n m : Nat
-- ⊢ (n + m) * 2 = n * 2 + m * 2
example (n m : Nat) : (n + m) * 2 = n * 2 + m * 2 := by
ring
-- Proof complete. No remaining goals.
-- A harder state that ring cannot close:
-- f : Nat → Nat
-- hf : ∀ x, f x ≤ f (x + 1)
-- ⊢ f 3 ≤ f 7
-- Requires: apply le_trans (hf 3), apply le_trans (hf 4), ...
The ring tactic handles polynomial arithmetic automatically. It either closes the goal or fails immediately. An agent operating on the harder goal needs to recognize that le_trans applied repeatedly, combined with the hypothesis hf, forms a viable path. The proof state encodes enough structure to make this recognizable, but choosing the right path from among the hundreds of Mathlib lemmas involving ≤ is the actual difficulty.
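For the record, the harder goal does go through by chaining le_trans, assuming Mathlib is imported (core Lean's Nat.le_trans works the same way):

```lean
-- Each hf k bridges one step: hf 3 : f 3 ≤ f 4, hf 4 : f 4 ≤ f 5, ...
example (f : Nat → Nat) (hf : ∀ x, f x ≤ f (x + 1)) : f 3 ≤ f 7 :=
  le_trans (hf 3) (le_trans (hf 4) (le_trans (hf 5) (hf 6)))
```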
Premise Selection at Scale
Mathlib, the community mathematics library for Lean 4, contains over 150,000 theorems and definitions. At any proof state, the relevant lemma from Mathlib that unlocks progress might be deeply nested in a namespace the model has not encountered frequently in training. This is called the premise selection problem, and it is a distinct challenge from generating correct tactic syntax.
Caltech’s LeanDojo addressed this with retrieval-augmented generation: before generating a tactic, the ReProver system retrieves potentially relevant premises from Mathlib using a retrieval model trained on proof co-occurrence. The intuition is that lemmas used near each other in past proofs are likely to be relevant to similar goals. This separates the problem into two phases: find plausible premises, then generate a tactic that uses them. The retrieval step significantly improves performance on goals that require non-obvious Mathlib lemmas.
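A toy version of the retrieval phase can be sketched by scoring premises on shared tokens with the goal. This is only a caricature: ReProver uses a trained dense retriever over embeddings, not string overlap.

```lean
-- Score a premise by how many whitespace-separated tokens it shares
-- with the goal statement.
def overlap (goal premise : String) : Nat :=
  let goalToks := goal.splitOn " "
  (premise.splitOn " ").foldl
    (fun acc tok => if goalToks.contains tok then acc + 1 else acc) 0

-- Rank the premise library by that score, highest first.
def rankPremises (goal : String) (premises : Array String) : Array String :=
  premises.map (fun p => (p, overlap goal p))
    |>.qsort (fun a b => a.2 > b.2)
    |>.map (·.1)
```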
Alternatively, automation tactics sidestep premise selection for specific domains. The omega tactic in Lean 4 decides linear arithmetic over integers and natural numbers completely. norm_num handles numeric computations. decide evaluates decidable propositions. For goals that fall within these domains, the AI agent does not need to select premises at all; it only needs to recognize which automation applies. A well-trained agent learns this domain recognition without explicit labeling.
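Concretely, each of these closes its goal in one step (omega and decide ship with recent Lean toolchains; norm_num requires importing Mathlib):

```lean
-- Linear arithmetic over Nat/Int: decided completely by omega.
example (n m : Nat) : n + m + 1 > m + n := by omega

-- Numeric computation: handled by norm_num (from Mathlib).
example : (37 : Nat) * 27 = 999 := by norm_num

-- A decidable proposition: evaluated outright by decide.
example : 123 % 3 = 0 := by decide
```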
How the Kernel Creates a Training Signal
DeepMind’s AlphaProof demonstrated the significance of using the Lean kernel as a reward signal in reinforcement learning. Earlier work in ML for theorem proving relied on supervised learning from human-written proofs, which limits the data distribution to the proofs mathematicians have already written down in machine-readable form. The RL approach generates proof attempts, submits them to the Lean kernel, and uses acceptance or rejection as the reward signal.
This matters because the reward is unambiguous and instantaneous. There is no need to label whether a proof attempt was “on the right track”; the kernel gives a binary answer, and partial credit can be assigned based on how many subgoals the attempt closed before failing. The model can explore proof paths that no human has written down and receive meaningful feedback. DeepMind’s combined system solved 4 of 6 problems from IMO 2024 after extended search, with AlphaProof handling three and AlphaGeometry 2 the geometry problem; the problems had no pre-existing Lean formalizations, so the training signal came entirely from the kernel rejecting incorrect attempts and eventually accepting correct ones.
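A minimal sketch of that reward shaping, assuming the checker reports how many subgoals an attempt closed (the counts and weighting here are hypothetical, not AlphaProof's actual scheme):

```lean
-- Binary kernel verdict, plus partial credit for closed subgoals
-- when the attempt fails overall.
def reward (accepted : Bool) (closed total : Nat) : Float :=
  if accepted then 1.0
  else if total == 0 then 0.0
  else 0.5 * Float.ofNat closed / Float.ofNat total

#eval reward false 3 4  -- partial credit: 0.375
```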
Search Strategies and Their Tradeoffs
Different search strategies impose different tradeoffs on the agent loop. Best-first search, which expands the proof state with the highest model confidence at each step, is efficient when the model’s probability estimates are calibrated but can fail by committing early to plausible-looking dead ends. Monte Carlo tree search, used in AlphaProof and similar systems, explores more broadly by sampling multiple continuations from each state and using the kernel’s binary reward to update value estimates backward through the tree.
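A best-first skeleton looks like the sketch below, with illustrative stand-ins for the model's scoring and for tactic expansion. It keeps a frontier of open states and always expands the highest-scoring one.

```lean
-- A frontier entry pairs a model confidence score with the open goals.
structure Node where
  score : Float
  goals : List String
deriving BEq

opaque expand : Node → List Node  -- apply candidate tactics (stand-in)

-- Pop the highest-scoring node, expand it, and repeat until some node
-- has no goals left or the budget runs out.
partial def bestFirst (frontier : List Node) (budget : Nat) : Option Node :=
  match budget, frontier with
  | 0, _ => none
  | _, [] => none
  | b + 1, n :: rest =>
    let best := rest.foldl (fun a c => if c.score > a.score then c else a) n
    if best.goals.isEmpty then some best
    else bestFirst (expand best ++ (n :: rest).erase best) b
```

The failure mode described above is visible in the structure: once a plausible-looking dead end scores highest, the search keeps expanding its descendants until their scores decay below the rest of the frontier.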
The COPRA system showed that GPT-4 with backtracking, a simple form of depth-limited search where the model retries after failures, achieves competitive results without task-specific fine-tuning. The practical implication is that search strategy matters roughly as much as model quality: a weaker model with effective backtracking can outperform a stronger model in one-shot mode. Leanstral as an agent implies some combination of model quality and search strategy; the open-source release makes both auditable and improvable.
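Depth-limited backtracking fits in a few lines. As before, the types and the `candidates`/`applyTac` functions are hypothetical stand-ins for the model and the Lean server:

```lean
structure ProofState where
  goals : List String

opaque candidates : ProofState → List String
opaque applyTac : ProofState → String → Option ProofState

-- Try each candidate tactic in order; on failure, backtrack and try
-- the next. Returns the tactic script if a proof is found in budget.
partial def backtrack (s : ProofState) (depth : Nat) : Option (List String) :=
  if s.goals.isEmpty then some []      -- proof complete
  else match depth with
    | 0 => none                        -- depth limit hit: give up here
    | d + 1 =>
      (candidates s).findSome? fun tac =>
        (applyTac s tac).bind fun s' =>
          (backtrack s' d).map (tac :: ·)
```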
What Proof Search Cannot Do
Current systems score in the 60-70% range on miniF2F, a benchmark whose test split contains 244 competition-level mathematics problems. The problems that remain unsolved share a pattern: they require long proof chains, non-obvious problem decomposition, or the introduction of auxiliary lemmas that are specific to the proof and not retrievable from Mathlib. These are the steps in a mathematician’s informal proof that read “it suffices to show” followed by a reformulation that seems obvious in hindsight but requires insight to identify in advance.
This is the gap between search and understanding. Given a good enough heuristic for which paths are worth exploring, search can find a path through a proof space. The insight to restructure the problem, to identify the right auxiliary construct, or to recognize that a long proof simplifies dramatically under a different framing remains the part of mathematical reasoning that current agents demonstrate only inconsistently.
Leanstral’s release into this landscape adds a locally-deployable, open-source option to a field previously dominated by closed systems. The agent architecture it implements is the same one that produced AlphaProof’s results, grounded in the property that the Lean kernel provides an exact verifier for every candidate proof. The challenge of writing the right specification and identifying the right proof structure remains a human responsibility. The challenge of filling in the proof given those decisions is increasingly something an agent can handle.