
The Kernel as Judge: Why Leanstral's Trustworthiness Claim Is Structural, Not Statistical

Source: hackernews

The field of AI coding assistants has a trust problem that no amount of RLHF solves: a model that looks confident can still produce wrong code, and wrong code that passes review is more dangerous than code that obviously fails. Mistral’s Leanstral approaches this differently by coupling an AI agent to Lean 4’s kernel, a small trusted type-checker that accepts or rejects proof terms regardless of how they were produced.

Most “trustworthy AI” claims are probabilistic, meaning the model is less likely to be wrong than earlier models on some benchmark. Leanstral’s trustworthiness claim is structural: the Lean 4 kernel is the arbiter of correctness, and the AI’s role is to generate candidates for the kernel to judge. A hallucinated proof that doesn’t type-check gets rejected. A convincing-looking but incorrect argument produces a kernel error. The AI can be wrong as often as it likes; the wrong answers don’t pass through.
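
The distinction fits in a few lines of Lean 4. A sketch (illustrative theorems, not Leanstral output): the true statement is accepted, the false one is rejected before anything enters the environment.

-- Accepted: the term rfl type-checks against the stated proposition
theorem two_eq_two : 1 + 1 = 2 := rfl

-- Rejected: 1 + 1 does not reduce to 3, so elaboration fails and no
-- theorem is added, however confidently this line was generated
theorem two_eq_three : 1 + 1 = 3 := rfl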

Lean 4 and the De Bruijn Criterion

Lean 4 is both a theorem prover and a general-purpose programming language, designed by Leonardo de Moura and Sebastian Ullrich and described in their 2021 CADE paper. Its type theory is a variant of the Calculus of Constructions extended with inductive types, universe polymorphism, and a Prop universe for propositions. Every proof in Lean 4 is a term of a type corresponding to the proposition being proved, and the kernel checks that this term has the claimed type.

The kernel is kept small by design, following the de Bruijn criterion: only the kernel needs to be trusted, and it is maintained at a size small enough to audit, a few thousand lines of C++. Tactics, elaborators, macros, and AI-generated tactic sequences are all untrusted. They produce proof term candidates, and the kernel either accepts them or rejects them. A Lean 4 proof accepted by the kernel is correct in the same sense that a program whose types check is type-safe: the checker is the ground truth, not the author.

-- Tactic proof: the kernel verifies the resulting proof term, not the tactic script
theorem add_comm (n m : Nat) : n + m = m + n := by
  induction n with
  | zero => simp
  | succ n ih => simp [Nat.succ_add, ih]

-- Term-mode proof: the kernel checks this directly
theorem and_comm (p q : Prop) : p ∧ q → q ∧ p :=
  fun ⟨hp, hq⟩ => ⟨hq, hp⟩

When an AI generates the tactic script in the first example, the kernel doesn’t care where the tactics came from. It verifies the elaborated proof term. The tactic script is scaffolding; the term is the proof.
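
This is directly inspectable. Assuming the add_comm theorem above is in scope, #print displays the elaborated term the kernel actually checked, and #print axioms reports what the proof ultimately depends on:

-- Display the underlying proof term behind the tactic script
#print add_comm

-- List the axioms this proof relies on
#print axioms add_comm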

The AI Proof Assistant Landscape

Leanstral arrives into an active field. DeepMind’s AlphaProof (2024) demonstrated that reinforcement learning with Lean 4 feedback can reach International Mathematical Olympiad level: paired with AlphaGeometry 2, the system solved 4 of 6 problems from IMO 2024, a silver-medal score, with AlphaProof producing the non-geometry solutions in Lean. The critical architectural choice was using the Lean kernel as the reward signal. The model gets a binary reward when the kernel accepts a proof, requiring no human annotation of correctness. The kernel makes RL training tractable because it provides unambiguous, mechanical verification.

MIT and Caltech’s LeanDojo (2023) provided an open gym-style interface for Lean 4, exposing proof states as observations and tactics as actions. Their ReProver system, which retrieves relevant premises from Mathlib before generating tactics, reached around 51% on their benchmark in 2023; the state of the art across systems has since moved toward 65-70% on the standard miniF2F benchmark.

Copra (2023) showed that GPT-4 with a backtracking search strategy and proof-state feedback could achieve competitive theorem proving without task-specific fine-tuning. Draft, Sketch, Prove (2023) decomposed the problem further, targeting Isabelle rather than Lean: an LLM drafts an informal proof, maps it to a formal sketch, and automated provers fill in the remaining gaps. These approaches all share the same key property: a proof-assistant kernel validates outputs, and AI errors are caught before results are accepted.

What Mistral adds is an open-source model purpose-built for Lean 4, targeting both proof engineering and verified coding. The open-source distinction matters more in this domain than in general coding assistance. Formal verification tends to happen in settings where running an AI through an external API is not acceptable: aerospace, cryptographic protocol verification, safety-critical systems. A locally-runnable model that does useful formal work is a different category of tool than a cloud-hosted code completion service.

How the Agent Loop Works

The typical architecture for an AI Lean agent follows a tactic-by-tactic loop. Lean 4 exposes its proof state through an LSP server: given a partial proof, the model receives the current hypotheses and goal, generates a tactic, and observes either the new proof state or an error. The model searches over tactic sequences using beam search, Monte Carlo tree search, or best-first search depending on the implementation.

Current state:
  n m : Nat
  ⊢ n + m = m + n

Model generates: induction n

New state (two subgoals):
  case zero
    m : Nat
    ⊢ 0 + m = m + 0

  case succ
    n m : Nat
    ih : n + m = m + n
    ⊢ n + 1 + m = m + (n + 1)

Each tactic application is checked immediately by the Lean server. Dead ends are detected before the model has spent further effort pursuing them. This feedback loop is what makes RL training work: the model explores proof space and learns from verified failures without human annotation of which attempts were on the right track.
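
A wrong guess surfaces the same way. A sketch of a dead end (exact error wording varies by toolchain):

-- Model generates: rfl
example (n m : Nat) : n + m = m + n := by
  rfl
-- The server reports an error: the two sides are not definitionally
-- equal, so this branch of the search is abandoned immediately.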

Mathlib, Lean’s community-maintained mathematical library, contains over 150,000 theorems and definitions covering most of undergraduate and much of graduate mathematics. It is simultaneously the largest training corpus for AI proof assistants and the most demanding benchmark for evaluating their output. The Liquid Tensor Experiment formalized a central result of Peter Scholze’s condensed mathematics in Lean 3 using mathlib, a result Scholze himself described as a meaningful external check on his own work, and it illustrates how far the library’s coverage has extended.

The “Coding and Proof” Combination

The framing of Leanstral as an agent for “coding and formal proof engineering” points toward something beyond an automated theorem prover. Lean 4 is a capable functional programming language in its own right, as the 2021 paper emphasizes. Its compiler uses reference counting with FBIP (functional but in-place) optimization: a value with a reference count of 1 at runtime is updated in place rather than copied, achieving performance close to imperative code while maintaining functional semantics. Lean 4 programs compile through C to native code.
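
The programming-language side needs no proof machinery at all. A minimal compiled program (runnable with lean --run on a recent toolchain; the fold is illustrative):

-- An ordinary Lean 4 program: compiled, effectful, proof-free
def main : IO Unit := do
  let total := (List.range 10).foldl (· + ·) 0
  IO.println s!"sum = {total}"  -- prints "sum = 45"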

Writing programs in Lean 4 and writing proofs about those programs are not separate activities. A function’s type can encode its specification, and the kernel verifies that the implementation satisfies it. The combination Leanstral targets is verified software synthesis: an AI that produces Lean 4 code for a specification and constructs a proof that the code satisfies that specification, with the kernel certifying both. The programs run; the proofs check. The gap between informal intent and formal specification remains a human problem, but everything downstream of a correct specification becomes AI-assisted and kernel-verified.
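
A minimal sketch of the shape, assuming a core-Lean toolchain with the omega tactic (the names double and double_eq_two_mul are illustrative, not Leanstral output):

-- The implementation...
def double (n : Nat) : Nat := n + n

-- ...and a kernel-checked property of that implementation
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double
  omega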

Where the Limits Are

The kernel guarantee has a boundary. Lean 4 verifies that a proof term has the type corresponding to a stated proposition. If the proposition is wrong, the verified proof is vacuous. A verified sorting function that sorts in the wrong order, a verified cryptographic protocol with a flawed threat model, a verified compiler for incorrect language semantics: all are kernel-certified wrong answers. The hard part of formal verification has always been writing the right specification, and Leanstral does not change that. What it does is make finding the proof, once a correct specification exists, significantly more automated.
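
The failure mode is easy to reproduce (hypothetical names; the bug is deliberate). The stated property genuinely holds of the buggy function, so the kernel certifies the proof, yet the function is not the minimum:

-- Intended to return the minimum; actually returns the maximum
def minOfPair (a b : Nat) : Nat := max a b

-- A property chosen badly: it is true of this buggy implementation,
-- so the kernel accepts the proof, but it says nothing about
-- returning the minimum
theorem minOfPair_ge_left (a b : Nat) : a ≤ minOfPair a b :=
  Nat.le_max_left a b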

The benchmark numbers for AI proof assistants also reflect a specific distribution. Competition mathematics and algebraic manipulations are well-represented in training data. Large-scale software verification involves specifications that don’t appear in training corpora and requires long proof chains that challenge current context lengths. The progress is real; the distance to full-scale software verification remains large.

The Broader Signal

Mistral releasing Leanstral as open-source adds a capable, locally-deployable tool to a domain previously served mainly by proprietary systems or academic prototypes. The underlying architecture, AI-generated tactic sequences validated by the Lean kernel, is the same architecture that produced AlphaProof’s IMO results, grounded in the de Bruijn kernel design from the 2021 paper. The contribution is making this approach accessible to practitioners who need local deployment, auditable outputs, and a model tuned for Lean 4 rather than general code generation.

For engineers who have looked at formal verification and concluded the tooling overhead is too high, a capable AI agent in the loop changes the calculation. The kernel still does the work of guaranteeing correctness. The AI does the work of finding the proofs. The ratio of proof engineering effort to code writing effort, historically estimated at ten to fifty times higher in expert hands, starts to move toward something workable. Whether it moves far enough to see formal verification in routine software development depends on how well systems like Leanstral handle the distance between mathematical benchmarks and the specifications engineers actually need to write.
