
The Two-Worlds Problem in Formal Verification, and Why Lean 4 Collapses It

Source: hackernews

Most formal verification tools exist in a world adjacent to real software. In Coq, you write a verified algorithm, then extract it to OCaml or Haskell, then compile and run that. In Isabelle/HOL, the gap between a theorem and a deployable artifact involves similar translation steps. The verification lives in one place; the running code lives in another.

Lean 4, the target language for Mistral’s Leanstral, is built on a different premise. Lean 4 is simultaneously a dependently typed theorem prover and a compiled systems-level programming language. A verified function written in Lean 4 is also a Lean 4 function; you can call it, benchmark it, and deploy it without leaving the language. This architectural choice is what gives Leanstral’s “trustworthy coding” framing more weight than most AI product announcements.

The Two-Worlds Problem in Formal Verification

Traditional proof assistants were designed around the goal of mechanizing mathematics. Coq, developed at INRIA from the 1980s onward, remains one of the most powerful tools for this, but using Coq for software verification requires a separate extraction step to produce executable Haskell or OCaml. Isabelle/HOL generates Standard ML or Haskell. The verified specification and the running code are different artifacts, and while the extraction is formally justified, the gap is real enough that most software teams do not use formal verification at all.

Lean 4 was designed, as Leonardo de Moura and Sebastian Ullrich describe in their 2021 paper, to eliminate this separation. The same file contains the specification, the proof, and the implementation. The #eval command runs Lean code directly in the IDE. The lake build system compiles Lean projects to native binaries via LLVM. A sort function proven correct in Lean is a Lean function that can be called from a binary deployed in production, with no intermediate translation step and no separate trusted extraction tool to audit.
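A minimal sketch of this single-artifact workflow (the `double` function and theorem name are invented for illustration):

```lean
-- One file holds the implementation, the specification, and the proof.
def double (n : Nat) : Nat := n + n

-- The specification is a theorem about the same function we will run.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double
  omega

-- #eval executes the verified function directly in the editor;
-- `lake build` compiles the very same definition into a native binary.
#eval double 21
```

There is no extraction step anywhere in this picture: the definition that was proven correct is the definition that executes.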

This design decision has cascading effects on what LLM-assisted proof generation can accomplish in practice, because the output of a model like Leanstral is not a certificate that gets handed off to another system. It is executable code.

What Mathlib Provides as a Training Corpus

The Lean 4 mathematical library, Mathlib, contains over 200,000 formalized theorems and definitions. The count alone understates the value: it represents a coherent, community-maintained body of mathematical knowledge in a single library, with consistent naming conventions and a shared tactic vocabulary spanning algebra, analysis, topology, number theory, and category theory.

For LLM fine-tuning purposes, Mathlib is unusually rich. Each theorem has a proof written by human experts, often using chains of tactics that invoke other lemmas from the library. A model trained on Mathlib learns not just syntax but strategy: which tactics to apply in which situations, how to navigate the library’s hierarchy, and how to decompose complex goals into manageable subgoals.

This is part of why the Lean 4 ecosystem has attracted more LLM proof research than Isabelle or Coq, despite those systems having longer histories. The LeanDojo project uses retrieval from Mathlib to supply relevant premises at each proof step, treating Mathlib as a dynamic knowledge base rather than static training data. Their ReProver model reached roughly 57% on the miniF2F benchmark, which contains 488 formalized math olympiad problems, using a comparatively small open model. Leanstral builds on this same ecosystem with Mistral’s larger modeling capacity.

The Tactic Loop as a Training Signal

Lean 4 proofs are typically written in tactic mode, where the user issues commands that transform a goal state. The proof state at each step is a structured representation: hypotheses in context, a goal to prove, and the available tactics. This structure is well-suited to language models because it provides clear, parseable state at each step.

example (n m : Nat) (h1 : n > 0) (h2 : m ≥ n) : m > 0 := by
  -- At this point, the tactic state looks like:
  -- h1 : n > 0
  -- h2 : m ≥ n
  -- ⊢ m > 0
  -- The model generates the next tactic:
  exact Nat.lt_of_lt_of_le h1 h2

Each tactic is grounded by the kernel: it either produces a valid transformation of the proof state, or Lean rejects it. This feedback loop is what enables reinforcement learning approaches. The model generates a tactic; Lean evaluates it; the result is a training signal that requires no human labeling.

Meta AI’s HyperTree Proof Search applied this loop with Monte Carlo Tree Search, combining model-generated tactics with online training on Lean proof states. DeepMind’s AlphaProof scaled it further with full reinforcement learning, contributing (alongside AlphaGeometry 2) to a silver-medal equivalent score at IMO 2024, with four of six problems solved. Both systems treat Lean’s kernel as the final arbiter: it returns a verdict with no room for argument or approximation. Leanstral is Mistral’s open-source entry into this lineage, making the capability accessible beyond a small set of research labs.

The Trustworthiness Claim, Precisely

The claim that Leanstral produces “trustworthy” code rests on Lean’s kernel, a small trusted core implementing a version of the Calculus of Inductive Constructions. Every proof, whether written by a mathematician or generated by a language model, must pass through this kernel. If it does, the proof is correct within the axiom system. The model’s confidence score, the quality of its training data, and the sophistication of its search procedure are all irrelevant to the kernel’s verdict.
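This trust story is also auditable per theorem: Lean’s #print axioms command reports exactly which axioms a proof ultimately depends on. A small sketch (the theorem name is invented):

```lean
theorem one_add_one : 1 + 1 = 2 := rfl

-- Lists the axioms this proof relies on. A proof by `rfl`
-- depends on none; a proof using classical reasoning would
-- report `Classical.choice` here.
#print axioms one_add_one
```

The same command works on model-generated proofs, so a reviewer can check not only that a proof passed the kernel but under which axioms it did so.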

This is a structurally stronger guarantee than any testing-based approach can provide. Tests cover a finite set of inputs; a formal proof covers all inputs satisfying the preconditions. The limits of this guarantee are well-defined: Lean proves code matches a specification, not that the specification captures what the programmer intended. Writing correct specifications is still a human responsibility.

Consider a sorting function. You can write:

def sort_correct (sort_fn : List Nat → List Nat) : Prop :=
  ∀ l, List.Sorted (· ≤ ·) (sort_fn l) ∧ sort_fn l ~ l

Lean can prove an implementation satisfies this. If you drop the permutation constraint (sort_fn l ~ l), you have formally verified a function that returns an empty list for every input. The kernel confirms the proof; it has no access to your intent. This is not a gap that better models will close; it is a fundamental property of what formal verification means, and it predates LLMs entirely.
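The degenerate case is easy to exhibit. A sketch, assuming Mathlib’s List.Sorted and using an invented badSort name:

```lean
import Mathlib.Data.List.Sort

-- Satisfies the sortedness half of the spec for every input...
def badSort (_ : List Nat) : List Nat := []

-- ...and the kernel happily accepts the proof, because the
-- empty list is vacuously sorted.
theorem badSort_sorted (l : List Nat) :
    List.Sorted (· ≤ ·) (badSort l) := by
  simp [badSort]

-- No proof of `badSort l ~ l` exists for nonempty `l`,
-- which is exactly the constraint that was dropped.
```

The proof above is genuinely valid; the failure is entirely in the specification, not in the verification.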

What Changes for Working Engineers

The practical barrier to adopting Lean 4 for production code remains high. The language has a steep learning curve, Lean-based libraries for most infrastructure concerns do not exist, and the tooling is still maturing relative to mainstream ecosystems. None of this changes quickly because a new model appeared.

What does change with models like Leanstral is the cost of the proof-writing step. Writing Lean proofs currently requires significant expertise in tactic selection and library navigation. If Leanstral can handle a large fraction of proof obligations automatically, given a correct specification, the remaining task is writing good specifications, which is a more tractable skill to teach and audit than manual proof construction.

The plausible near-term application is narrower than full infrastructure replacement: verifying critical algorithms such as cryptographic primitives, parser logic, and numeric kernels in Lean, then either deploying the Lean-compiled output or using the proofs as specifications to generate comprehensive test suites for code running in another language. That is a realistic scope for teams willing to invest in the tooling.

Lean 4’s dual identity as a general-purpose compiled language and a proof assistant is the architectural bet that makes this path coherent. Other formal verification tools require translation layers between proof and execution. Lean 4 collapses that gap. Leanstral is the first broadly available open-source model built around that collapsed gap, and the benchmark numbers it achieves matter less than the infrastructure they represent: an accessible entry point to a verification workflow that does not require a dedicated research team to set up and maintain.
