· 6 min read ·

Why Leanstral Is an Agent, Not a Completion Model

Source: hackernews

Mistral’s Leanstral landed with 746 points and 182 comments on HackerNews, a notable signal for formal verification tooling. The space does not typically generate that level of engagement. The interest reflects something genuine: AI-assisted formal proof has been approaching a meaningful threshold for a few years, and Leanstral is a concrete example of what it looks like when that threshold is crossed at the open-source level.

The positioning as “trustworthy coding” rather than “theorem proving” is deliberate, signaling intended scope: not just mathematical competition problems, but software verification as an engineering practice, writing specifications alongside implementations, keeping proofs valid under refactoring, navigating a large library for applicable lemmas. The architecture reflects that scope. This is not a code completion model with Lean 4 tokens mixed into its training corpus; the core design is an agent loop built around Lean’s kernel, and that architectural difference determines what the system can and cannot do.

Lean 4 Built the Right Machine Interface

Every significant AI theorem proving effort has converged on Lean 4, and the convergence is structural rather than accidental. The 2021 CADE paper by Leonardo de Moura and Sebastian Ullrich describes the design decision that matters most for AI integration: in Lean 4, the tactic language is the term language. Tactics are ordinary Lean 4 programs of type TacticM Unit, using the same do notation, typeclass dispatch, and pattern matching as the rest of the codebase. There is no separate tactic metalanguage, no DSL with its own syntax and semantics sitting alongside the proof term language. A model fine-tuned on Lean 4 code is simultaneously fine-tuned on the tactic language, because they are the same language.

The contrast with Coq’s Ltac is instructive. Ltac is embedded in Coq but separate from Gallina, the underlying term language; writing proofs in Coq requires fluency in two distinct syntactic and semantic systems. Isabelle’s machine interface has historically lagged its development cycle in ways that slowed agent construction. Lean 4’s language server protocol implementation was designed from the start to be machine-readable. After each tactic, the LSP returns the current hypotheses and goal state as structured data, with no REPL output parsing and no text extraction from terminal output. The feedback loop between model and kernel uses clean JSON over a stable protocol.

Mathlib4 compounds the advantage. Over 200,000 theorems and definitions, all kernel-verified, with CI-enforced naming conventions on every pull request. The conventions are not stylistic guidelines; Nat.add_comm, List.length_append, and Finset.sum_comm are predictable from their content. A model retrieving relevant lemmas for a proof goal can construct candidate names directly from the goal structure. Isabelle’s Archive of Formal Proofs has around 800 independent entries with varied conventions and no shared naming system, meaning retrieval requires semantic similarity search over embeddings. At the scale of 200,000 entries, the ability to reason about names rather than only about meaning is a significant practical advantage.

Lean 4’s trusted kernel is approximately 5,000 lines of C++, following the LCF architecture: proof terms are an abstract type whose values can only be constructed through primitive inference rules. User tactics are untrusted; everything they produce is audited by the kernel. A model that generates an incorrect tactic sequence receives a kernel rejection, and that binary, authoritative verification is what makes reinforcement learning from proof feedback tractable.

The Agent Loop in Practice

The interaction between Leanstral and Lean’s kernel is a generate-verify-revise cycle that looks roughly like this:

state = lean_lsp.open(proof_goal)
while not state.is_closed:
    tactic = model.generate(state.hypotheses, state.goal, state.history)
    result = lean_lsp.apply_tactic(tactic)
    if result.is_error:
        state = state.with_error(result.error_message)
    else:
        state = result.new_goal_state

This pseudocode simplifies what is practically a tree search, with the agent exploring multiple tactic branches and using the kernel’s responses to prune or extend the search. The essential property is that the kernel mediates every step. Neither the model nor the scaffolding decides when a proof is complete; only the kernel closes a goal. A weaker model produces a slower, less efficient search, but a correct tactic cannot be rejected and an incorrect one cannot pass through.

A concrete example from Lean 4 shows the feedback structure. Proving that list append is length-additive:

theorem length_append (l₁ l₂ : List α) :
    (l₁ ++ l₂).length = l₁.length + l₂.length := by
  induction l₁ with
  | nil  => simp
  | cons hd tl ih => simp [ih, Nat.add_assoc]

Each simp call and each induction branch produces a distinct goal state that the LSP reports back to the agent. If the model generates a wrong tactic at the cons case, the LSP returns the remaining unsolved goal rather than silently accepting it. The loop terminates only when every subgoal is closed by the kernel.

LeanDojo, the MIT/Caltech framework that formalized the agent interaction model for Lean proofs, established much of the infrastructure that makes this kind of system buildable. LeanCopilot, which integrates the agent loop into Lean’s VS Code environment, demonstrated that interactive tooling built on this model is practical for real proof development.

The Sorry Problem

The hardest design problem for any proof agent is not generating correct tactics; it is avoiding sorry.

In Lean 4, sorry is a tactic that closes any goal without a proof. It compiles, produces valid object code, and generates only a warning. Anything downstream that depends on a sorry-filled proof also compiles. The kernel marks any compiled artifact as axiom-dependent on sorry, which propagates through the dependency graph, but in a codebase with hundreds of modules this is easy to miss.

An agent that inserts sorry to move past a difficult proof state is worse than no agent at all. The codebase appears verified while the actual verification coverage is incomplete. The gap between apparent and actual status is invisible until something downstream fails for reasons the sorry was masking. Sorry hygiene is the requirement that a proof agent must either close a goal correctly or report failure explicitly, never use sorry as a progress placeholder.

Meeting this requirement is a scaffolding problem as much as a model quality problem. The scaffolding must treat sorry insertion as a failure mode, surface those failures to the developer as explicit open obligations, and refuse to count a proof as complete until every leaf is kernel-verified. A capable model that falls back to sorry under pressure is still producing unsafe output.

What Benchmarks Measure

The standard evaluation dataset is miniF2F, a set of competition mathematics problems formalized in Lean 4. Top systems reach the mid-60s pass rate; Google DeepMind’s AlphaProof solved four of six problems at IMO 2024 at a silver medal level, combining language model generation with Monte Carlo tree search over the Lean proof environment.

These numbers track model progress, but they undercount what real software verification requires. miniF2F presents goals with relevant imports already in scope. It does not measure navigating a codebase to understand which lemmas exist and how they relate to the current goal, keeping proofs valid as type signatures evolve, interpreting Lean’s elaboration error messages, or knowing when to search Mathlib for an existing result versus constructing a sublemma.

The scenario benchmarks miss entirely is proof repair. A type signature changes in a core module, and 50 downstream proofs break. Each presents a different error. Some need a tactic swap; some need structural rethinking; some reveal the original proof relied on a coincidental corollary of the old type. The agent must identify which obligations changed, determine which subgoals are still reachable with the existing tactic sequence, generate patches for the rest, and do this without introducing sorry anywhere. This is the central workflow for software verification as an engineering practice, and no current benchmark evaluates it.

Why Open Weights Matter Here

AlphaProof was a closed research demonstration. Leanstral releases open weights, which is relevant precisely for the contexts where formal verification is most valuable. Cryptographic protocol verification, safety-critical embedded systems, and formally verified compilers all involve proprietary code that cannot be sent to a cloud API. A model that runs locally, on hardware controlled by the organization doing the verification, removes that constraint.

The open release also enables domain-specific fine-tuning. A team verifying operating system components has different proof patterns than a team working on cryptographic primitives or financial contracts. The base model can be adapted to those patterns rather than replaced.

Where This Points

The structural foundations for AI-assisted formal proof are in place: Lean 4’s machine-readable LSP, Mathlib’s naming discipline and scale, the unified tactic and term language, the authoritative kernel. The remaining gaps, sorry hygiene under pressure, elaboration error interpretation, proof repair across evolving codebases, are harder than miniF2F performance suggests. Leanstral’s framing as “trustworthy coding” is more ambitious than the benchmarks currently support, and that is the honest state of the field. The kernel-centered agent architecture is the right approach to closing those gaps; the question is how much model quality and search efficiency are needed before the distance between demonstration and engineering practice narrows.

Was this interesting?