
Leanstral and the Open-Source Turn in AI Formal Verification

Source: hackernews

Mistral AI released Leanstral, an open-source agent for formal proof engineering in Lean 4. It is worth situating this alongside the broader field, because the problems being solved here are genuinely different from standard code generation, and the architectural choices made by different systems matter significantly.

Lean 4 Is Not What You Expect

Lean 4 is simultaneously a theorem prover and a general-purpose functional programming language. Its 2021 CADE paper describes a ground-up rewrite from Lean 3, introducing a novel FBIP (functional but in-place) memory model: values with a reference count of one at runtime are updated in place rather than copied. The practical result is near-imperative performance in functional code. Lean 4 programs compile to native code by way of generated C.
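The FBIP idea is easiest to see with arrays. A toy sketch; the in-place update is a compiler-level optimization, invisible in the source, and happens only when the runtime reference count is one:

```lean
-- `Array.set!` has a pure functional type, but when the incoming
-- array's reference count is 1 at runtime, the write happens in place
-- rather than copying the whole array.
def bumpFirst (a : Array Nat) : Array Nat :=
  a.set! 0 (a.get! 0 + 1)

#eval bumpFirst #[10, 20, 30]  -- #[11, 20, 30]
```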

The theorem-proving side is structured around the de Bruijn criterion: only the kernel needs to be trusted, and the kernel is kept small enough to audit. Every proof term is checked by a type checker of a few thousand lines of C++. Tactics, elaboration, macros, and AI-generated tactic sequences are all untrusted: they produce candidate proof terms, and the kernel accepts or rejects them. That accept signal is not probabilistic or heuristic; it is a decision procedure for a well-defined type theory.
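The division of labor shows up directly in the surface syntax. A minimal illustration: both declarations below prove the same proposition, one through the untrusted tactic machinery and one as a bare proof term, and the kernel type-checks the resulting term in both cases:

```lean
-- Tactic proof: `decide` is untrusted machinery that elaborates
-- to a proof term behind the scenes.
theorem two_add_two : 2 + 2 = 4 := by decide

-- Term proof: `rfl` is the proof term itself; the kernel checks
-- that both sides of the equation reduce to the same value.
theorem two_add_two' : 2 + 2 = 4 := rfl
```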

This matters for AI because it provides any agent with a source of ground truth that requires no human annotation. The model can generate wrong candidates as frequently as it likes; the kernel rejects them before they count as correct. The verification is structural, not statistical.

The Agent Loop

Lean 4 exposes its proof engine through an LSP server. At any point in a proof, the state consists of the current hypotheses, each with a name and type, plus the goal, the proposition remaining to be shown. An agent receives this state, generates a tactic candidate, submits it to the LSP server, and receives either the next proof state or an error.

A typical interaction looks like this:

theorem add_comm (n m : Nat) : n + m = m + n := by
  -- Proof state: n m : Nat ⊢ n + m = m + n
  induction n with
  | zero => simp
  | succ k ih =>
    -- Agent receives new state with ih : k + m = m + k
    simp [Nat.succ_add, ih]

The agent is doing search over a tree of proof states. Each node is a state, each edge is a tactic, and leaf nodes are either complete proofs (kernel accepted) or dead ends. For a 20-step proof where each generated tactic is independently correct 80% of the time, the probability of a one-shot solution is about 1.2% (0.8^20 ≈ 0.0115). Search is not optional; it is load-bearing.
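The arithmetic can be checked in Lean itself:

```lean
-- Probability that all 20 independently sampled tactics are correct
-- at 0.8 each: one-shot success collapses quickly with proof depth.
#eval (0.8 : Float) ^ (20.0 : Float)  -- ≈ 0.0115
```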

Different systems make different choices about how to search:

  • Best-first search uses the model’s probability estimates to prioritize which states to expand. It is efficient when estimates are calibrated but tends to commit early to plausible-looking dead ends.
  • Monte Carlo tree search, used by AlphaProof, samples multiple continuations from each state and propagates the kernel’s binary reward backward through the tree to update value estimates.
  • Backtracking with depth-limited search, used by COPRA, retries from earlier states when a line of reasoning fails. It requires fewer rollouts per step but explores less broadly.
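The first of these strategies can be sketched in a few lines of Lean 4. Everything here is hypothetical scaffolding, not any system's real API: `expand` stands in for one model call plus one LSP round-trip, returning candidate tactics with next states and model scores, and `isDone` stands in for "no goals remain":

```lean
-- Best-first search over abstract proof states. `σ` is a proof state,
-- `τ` is a tactic; the frontier stays sorted by model score, highest first.
partial def bestFirst {σ τ : Type}
    (expand : σ → List (τ × σ × Float))  -- (tactic, next state, score)
    (isDone : σ → Bool)
    (frontier : List (List τ × σ × Float)) : Option (List τ) :=
  match frontier with
  | [] => none  -- search space exhausted without finding a proof
  | (path, s, _) :: rest =>
    if isDone s then
      some path.reverse  -- completed branch: return the tactic script
    else
      -- expand the most promising state, then re-sort the frontier
      let children := (expand s).map fun (t, s', p) => (t :: path, s', p)
      let next := ((children ++ rest).toArray.qsort
        (fun a b => a.2.2 > b.2.2)).toList
      bestFirst expand isDone next

-- Entry point: start from the root state with an empty tactic path.
-- bestFirst expand isDone [([], root, 0.0)]
```

The early-commitment failure mode mentioned above is visible here: a plausible but doomed branch keeps winning the sort until its children's scores finally drop below a sibling's.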

The Systems Doing This Work

AlphaProof from DeepMind is the most prominent result in the field. In 2024 it solved four of six problems from the International Mathematical Olympiad, reaching silver-medal level. The key architectural choice was using reinforcement learning with the Lean 4 kernel’s accept/reject signal as the sole reward. No human annotation of proof correctness was required; the kernel provided the training signal entirely. AlphaProof remains proprietary and cloud-deployed.

LeanDojo from MIT and Caltech provides an open gym-style interface for Lean 4, exposing proof states as observations and tactics as actions. Their ReProver system adds retrieval-augmented premise selection from Mathlib4: at each proof state, a retrieval model suggests Mathlib lemmas likely to be relevant before the tactic generator runs. The premise selection problem is real because at any given proof state, the lemma that unlocks progress might be buried in a namespace the model rarely encountered during training. On the LeanDojo benchmark, ReProver reached roughly 51% in 2023.
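The retrieval step itself is conceptually simple. A hypothetical sketch, assuming precomputed embedding vectors for the proof state and for each Mathlib lemma; the function name, the tiny two-dimensional embeddings, and the lemma choices are all invented for illustration:

```lean
-- Rank lemma embeddings by dot product against the proof-state
-- embedding and keep the k best names to feed the tactic generator.
def topPremises (k : Nat) (state : Array Float)
    (lemmas : List (String × Array Float)) : List String :=
  let dot := fun (v : Array Float) =>
    (state.zip v).foldl (fun acc (a, b) => acc + a * b) 0.0
  let scored := lemmas.map fun (name, v) => (name, dot v)
  ((scored.toArray.qsort fun a b => a.2 > b.2).toList.take k).map (·.1)

#eval topPremises 2 #[1.0, 0.0]
  [("Nat.add_comm", #[0.9, 0.1]), ("Nat.succ_add", #[0.7, 0.2]),
   ("List.length_map", #[0.1, 0.8])]
-- ["Nat.add_comm", "Nat.succ_add"]
```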

COPRA (2023) demonstrated that GPT-4 with a backtracking search strategy and proof-state feedback competes meaningfully without task-specific fine-tuning. The paper’s central finding: search strategy design matters roughly as much as raw model quality, at least in the regime of moderately difficult theorems.

Draft, Sketch, Prove (2023) decomposed the problem differently. An LLM generates an informal proof sketch; a separate step formalizes it tactic by tactic. This mirrors how mathematicians actually work and sidesteps some of the premise-selection difficulty by anchoring the search to a high-level plan before generation begins.

Current state-of-the-art systems score in the 60 to 70 percent range on miniF2F, a benchmark whose test split contains 244 competition-level mathematics problems. The remaining hard cases involve long proof chains, non-obvious decompositions, and introducing auxiliary lemmas not present in Mathlib.

What Leanstral Adds

The Leanstral announcement centers on local deployment. This is not an incidental operational detail. Formal verification is most valuable in settings where sending code to a cloud API is not acceptable: aerospace, cryptographic protocol verification, safety-critical embedded systems, financial settlement logic. A locally runnable model with auditable weights fits into a certification workflow in a way that a cloud API cannot.

Mistral’s open-weight approach also means the model can be fine-tuned on domain-specific proof corpora. A team verifying a specific embedded operating system interface can adapt the model to their particular type definitions and conventions, which is not possible with AlphaProof. The community can inspect the architecture, identify failure modes, and improve the model in ways that a closed system does not permit.

The underlying architecture follows the same pattern as its predecessors: AI-generated tactic sequences validated by the Lean 4 kernel. Mathlib4 contains over 150,000 theorems and definitions, covering most undergraduate and significant graduate mathematics. It serves simultaneously as the primary training corpus and the most demanding benchmark for any Lean 4 agent.

Lean 4’s dual nature as both a programming language and a proof assistant is what makes the software verification use case coherent. A function’s type can encode its specification; the kernel verifies that the implementation satisfies it. The Leanstral vision is verified software synthesis: an AI that produces Lean 4 code for a specification and constructs a proof that the code satisfies it, with the kernel certifying both in a single pass.
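The "type as specification" idea fits in a few lines of core Lean. This is a toy sketch, not Leanstral output; the point is that the subtype in the return position forces whoever writes the body, human or agent, to supply the proof before the kernel will accept the definition:

```lean
-- The return type promises that doubling preserves length.
-- The definition does not type-check without the embedded proof.
def doubleAll (l : List Nat) : { r : List Nat // r.length = l.length } :=
  ⟨l.map (· * 2), by simp⟩

#eval (doubleAll [1, 2, 3]).val  -- [2, 4, 6]
```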

The Part AI Does Not Solve

The kernel guarantee has a boundary worth stating clearly. The kernel verifies that a proof term has the type corresponding to a stated proposition. If the proposition itself is wrong, if the specification misdescribes the threat model, misses an edge case, or formalizes the wrong semantics, the verified proof is vacuously correct. Proving that a sorting function returns a sorted list says nothing about whether it handles the specific edge case relevant to the caller’s invariants. Writing the right specification remains a human problem.
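The gap is easy to exhibit in code. A deliberately bad example, using core Lean's `List.Pairwise` as the sortedness predicate:

```lean
-- A "sort" that satisfies a too-weak specification: the output is
-- sorted, but nothing requires it to be a permutation of the input.
def badSort (_ : List Nat) : List Nat := []

-- The kernel accepts this proof. It is correct, and useless:
-- the empty list is vacuously sorted.
theorem badSort_sorted (l : List Nat) :
    (badSort l).Pairwise (· ≤ ·) :=
  List.Pairwise.nil
```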

The historical argument against formal methods was always that specification plus proof was estimated at 10 to 50 times the effort of just writing the code. AI assistance changes this ratio substantially on the proof side. What it does not change is the cognitive work of deciding what to prove. The Liquid Tensor Experiment, which formalized a central result from Peter Scholze’s condensed mathematics, took well over a year of substantial community effort, and generating the proofs was only part of it: deciding exactly what the formal statement of each lemma should be was at least as demanding.

The productive question for the next several years is how much of the specification problem can be assisted by the same models. A system that generates a plausible formal specification from a natural-language description of intent, verifies the implementation against it, and then surfaces cases where the formal specification diverges from the intent would change the software verification workflow significantly. Leanstral does not claim to solve this, but the trajectory of the field points in that direction. The kernel stays in place as the final judge; everything else is being automated around it.
