Beyond the Borrow Checker: Lean 4, Leanstral, and the Verification Spectrum

Every type system makes promises. TypeScript promises that the variables you labeled as strings will be strings at the point where you use them. Rust’s borrow checker promises that you will not use memory after it has been freed. Haskell’s type system promises that if a value is typed as Maybe Int, you will handle the Nothing case before you can operate on the inner value. These are increasingly strong guarantees, and as you move along that spectrum, the compiler is doing an increasing share of the work of proving your program correct.

Lean 4, the theorem prover and programming language described in de Moura and Ullrich’s 2021 CADE paper, sits at the far end of that spectrum. Mistral’s Leanstral, an open-source agent for Lean 4 coding and formal proof engineering, is a bet that AI tooling can make that end of the spectrum practical for developers who are not formal methods researchers.

What Types Actually Prove

To understand where Lean 4 sits, it helps to be precise about what different type systems are actually proving.

In TypeScript, a type annotation like function getUser(id: string): User is a claim about the function’s interface: it accepts a string and returns a User. The type checker verifies this claim structurally at compile time. But TypeScript makes no claims about the behavior of the function, only its shape. Whether getUser returns the right user, or throws an exception in some cases, or silently returns stale data, is entirely outside the type system’s scope.

Rust’s borrow checker goes further. When the compiler accepts a Rust program, it has proven that no two mutable references to the same memory location exist simultaneously, and that no reference outlives the data it points to. This is a genuine safety theorem, not just an interface contract. Rust programs can still be wrong in many ways, but the class of memory safety errors is formally excluded by construction.

Haskell with DataKinds and GADTs allows types to carry value-level information. A vector type parameterized by its length, Vec n a, lets the type checker reject out-of-bounds accesses at compile time. This is a step toward what dependent types offer in full, but still limited to the specific patterns the extension was designed to handle.

Lean 4’s dependent type system generalizes this without limit. Types can depend on arbitrary values, propositions are types, and proofs are programs. A function signature can state not just what types the inputs and outputs have but precisely what mathematical relationship the output must have to the inputs. The kernel does not just check interface compatibility; it checks the proof that the stated property holds.

-- Dependent types allow specifications inside type signatures
def findIndex (arr : List α) [DecidableEq α] (target : α) :
    Option { i : Fin arr.length // arr.get i = target } :=
  -- If this returns Some ⟨i, proof⟩, the kernel has verified
  -- that arr.get i = target; it is not a runtime assertion
  sorry

The sorry placeholder is rejected in verified builds. The implementation must prove the stated property, or the code does not compile. There is no runtime assertion fallback, no unit test checking specific inputs. The kernel has checked every possible input.

The Gap That AI Tooling Fills

The problem with the Lean 4 end of this spectrum has always been cost. Writing dependent types and proofs requires familiarity with the Calculus of Constructions, Lean 4’s tactic system, and the Mathlib4 library of over 200,000 proven mathematical theorems. For most developers, this barrier has made Lean 4 a research tool rather than a practical one.

AI agents for formal proof generation work because of a specific structural property of Lean 4: the kernel that checks proof validity is fast, reliable, and produces structured feedback. This enables an interaction loop where the agent proposes a tactic step, the kernel either accepts it or explains why it failed, and the agent revises accordingly. The kernel acts as an oracle giving perfect, immediate feedback on each incremental proof step.

OpenAI’s GPT-f work in 2020 established that language models could navigate this loop productively, generating Metamath and Lean proofs that were verifiably correct by the respective kernels. DeepMind’s AlphaProof extended this with reinforcement learning against the Lean kernel as the reward signal, producing a system that solved four of six problems from IMO 2024. LeanCopilot, from researchers at Caltech, integrated the same loop into Lean’s interactive editing environment, offering tactic suggestions and premise retrieval while a developer writes proofs in VS Code.

Leanstral enters this space from Mistral, with an open-source model and an explicit framing of “trustworthy coding” alongside formal proof engineering. The distinction matters. Systems like AlphaProof target autonomous proof search over mathematical statements. A tool aimed at “coding” implies integration into the development workflow: writing specifications alongside implementations, suggesting proofs for functions as they are written, and navigating Mathlib4 to find relevant lemmas the developer would not have known to look for.

The Premise Selection Problem

One aspect of AI-assisted formal verification that receives less attention than tactic generation is premise selection. When a proof requires applying an existing theorem from Mathlib4, the system must identify which of 200,000 theorems is relevant to the current proof goal. This is a retrieval problem over a very large, highly structured knowledge base.

In the LeanCopilot system, premise selection uses an embedding model to find theorems whose statements are semantically similar to the current goal. The retrieved theorems are fed into the tactic generator as additional context. The improvement over naive tactic generation is substantial, because many proofs that appear difficult become straightforward once the right lemma is identified. Knowing that List.length_append exists and applies to the current goal is the entire proof of a statement about list lengths.

This is where a model trained by an organization with strong code and mathematics capabilities has an opportunity to make real improvements. The training corpus for a Lean-specific model includes all of Mathlib4, synthetic proof data generated through interaction with the kernel, and any curated proof engineering examples. The quality of premise retrieval is largely a function of how well the model has internalized the structure of mathematical knowledge encoded in Mathlib, which is itself a function of training data quality and the specificity of fine-tuning.

Where the Spectrum Ends

The TypeScript-to-Lean-4 spectrum corresponds to a fundamental tradeoff between expressiveness and the amount of work the programmer must do to satisfy the compiler. TypeScript is permissive and fast to write. Lean 4, without assistance, requires significant proof engineering effort for each function whose correctness you want to establish formally.

What AI agents change is the effort-to-correctness curve. Tactic generation and premise selection handle the mechanical parts of proof search, leaving the developer responsible for the conceptually hard parts: identifying what property is worth proving, writing the specification, and deciding which guarantees justify the overhead.

For security-critical code, cryptographic implementations, financial arithmetic, or parser logic, the cost has always been worth paying for organizations with the expertise. The formally verified C compiler CompCert and the seL4 microkernel demonstrate what formal verification produces at the far end of the spectrum: guarantees that no amount of testing can match.

Leanstral represents a bet that AI tooling shifts the cost calculation enough to make those guarantees accessible to a wider class of software. The kernel does not care how the proof was found. If an AI agent navigates the tactic search and Mathlib4 retrieval efficiently, the guarantee at the end is identical to one a human expert would have produced: the code is correct with respect to the stated specification, and the kernel has checked every step of the argument.