· 6 min read ·

Beyond Next-Tactic Prediction: How Leanstral Changes AI-Assisted Formal Proof

Source: hackernews

Lean 4’s type checker has a property that turns out to be extraordinarily useful for LLM-assisted proof engineering: it fails precisely. Every tactic in a proof either transforms the current goal state successfully, or produces an error message that names exactly what went wrong, which hypothesis is unavailable, or why the types do not unify. That behavior was designed for human mathematicians working interactively, but it doubles as something an AI agent can actually use: a deterministic oracle that tells you not just that a proof attempt failed, but why, and where.

Leanstral, Mistral’s recently released open-source agent for Lean 4, is built around this observation. It represents something structurally different from the LLM fine-tunes that have dominated academic work on neural theorem proving for the past few years, and understanding that structural difference is the more interesting story here.

What Next-Tactic Models Cannot See

When researchers first applied language models to formal mathematics, the dominant framing was next-tactic prediction. Given a proof state (the current set of goals and hypotheses), predict the next tactic. ReProver, built on top of LeanDojo, is the clearest expression of this approach: a retrieval-augmented transformer that selects tactics given the local proof context. It performs reasonably well on benchmarks like miniF2F, the competition mathematics testbed, and on LeanDojo’s own benchmark derived from mathlib.

Next-tactic prediction has obvious appeal. It maps cleanly onto language model pretraining, and the training signal is abundant since every theorem in mathlib is a sequence of (proof state, tactic) pairs. The problem is that it treats formal proof as a Markov process where each step depends only on the current state. Real proofs are not like that. Choosing a tactic that works right now but closes off the path you need three steps later is a common failure mode, one that a single-step predictor cannot anticipate because it has no representation of where the proof needs to end up.

Whole-proof generation addresses that myopia but introduces a different problem: the feedback signal is binary. The proof either compiles or it doesn’t, and when it doesn’t, you have to start over or make a new guess without systematic knowledge of which part of the proof was wrong. For anything beyond short proofs, this is a poor use of the model’s inference budget.

The Proof-Repair Loop

What Leanstral does is closer to what a working Lean programmer actually does: generate a proof attempt, identify exactly where it fails, and revise that specific piece. Lean’s elaborator runs incrementally, which means you can feed a partial proof to the Lean process and receive back not just “this failed” but the precise error location, the unsolved goal at that position, and the current hypothesis context. An agent can exploit this to perform targeted repair rather than full regeneration.

This is not a new idea in the research literature. Work on Draft, Sketch, and Prove (Jiang et al., 2023) explored a decomposition where an LLM generates high-level proof sketches and a separate model fills in the formal steps. Later work treated failed proofs as direct inputs for a correction model. What Leanstral does is operationalize this as a coherent agent system: the model generates a proof attempt, the Lean process evaluates it and returns structured feedback, and the agent uses that feedback to guide the next attempt. The loop continues until the proof compiles or a budget is exhausted.

The agent framing also enables tool use in a way that pure generation approaches cannot. A proof assistant interaction is inherently conversational: you state a goal, Lean responds with the updated state, you send another tactic, Lean responds again. An agent running in a loop is a natural fit for this protocol. A completion model generating a fixed-length sequence is not, because it cannot observe Lean’s responses as they arrive.

Why Lean 4 Specifically

The choice of Lean 4 as the target language is not arbitrary, and it matters more than it might appear on the surface.

Lean 4 was redesigned from scratch relative to Lean 3, with a particular emphasis on programmatic extensibility. The elaborator exposes hooks that allow custom tactics to be written in Lean itself. The language server (LSP) implementation is first-class and machine-accessible. There is a subprocess-based REPL interface that allows external programs to send tactic commands and receive proof state responses in a structured format, which is precisely what an agent needs.

Coq and Isabelle are older and have larger bodies of formalized mathematics in absolute terms, but their tooling is considerably less friendly to the kind of agent loop Leanstral implements. Coq’s SerAPI provides REPL-like access, but the ecosystem around LLM integration has developed more slowly. Isabelle has the PIDE framework but agent integration requires substantially more scaffolding. Lean 4 has become the substrate of choice for AI theorem proving research in part because the infrastructure was purpose-built to support external interaction, as documented in the 2021 Lean 4 paper by de Moura and Ullrich.

Mathlib4, the community mathematics library, is the other half of the explanation. It contains over 100,000 definitions and theorems spanning undergraduate and graduate mathematics, all written in a consistent, idiomatic style that gives language models a coherent training target. The coverage includes algebra, topology, analysis, number theory, and combinatorics, which is why benchmarks like miniF2F and ProofNet are now typically evaluated against Lean 4 rather than other systems. The combination of a machine-friendly REPL and a large, well-structured theorem library is what makes Lean 4 the right substrate for this kind of work.

Lean 4’s other design decision that matters here is the unification of the proof language and the programming language. In Lean 4, programs and proofs are terms in the same dependent type theory. This means an agent that can generate and verify proofs is not limited to abstract mathematics; it can, in principle, verify properties of actual programs written in Lean.

The Open-Source Gap

AlphaProof, Google DeepMind’s system that achieved a silver medal at the 2024 International Mathematical Olympiad, demonstrated that LLM-guided formal proof can work at a level nobody expected this soon. The result was significant: four out of six IMO problems solved formally, with the geometry problem being the exception. But AlphaProof is closed-source, available to nobody outside DeepMind, and designed around a specific application (competition mathematics) rather than general software development workflows.

Leanstral being open-source is not just a licensing detail. It means the proof-repair loop can be embedded in a CI pipeline, run locally during development, integrated into an editor, or extended with new retrieval tools and search strategies by the community. The difference between “a research result showing AI can prove IMO problems” and “a tool developers can use to verify software properties” is largely a matter of whether you can actually run it, instrument it, and adapt it to your environment.

There is also a training data dimension to this. Closed systems can be trained on proprietary corpora without disclosure. Open-source systems create pressure toward open training pipelines and reproducible benchmarking, which benefits the broader community working on formal verification infrastructure.

What Trustworthy Coding Actually Means Here

The phrase “trustworthy coding” in Leanstral’s framing points toward something beyond pure mathematics. Lean 4 can be used to state and prove properties of software: that a function terminates on all inputs, that an array access is always within bounds, that a cryptographic protocol satisfies a particular security definition. The Lean 4 standard library and tools like Batteries provide data structures with machine-verified properties attached.

An agent that can generate and repair formal proofs could, given an appropriate specification, verify software components in ways that type systems and static analysis cannot reach. Rust’s ownership system rules out a class of memory errors but says nothing about algorithmic correctness. A Lean proof that a sorting function produces a sorted permutation of its input is a different kind of guarantee. An agent that can close such proofs with minimal human intervention lowers the cost of obtaining them substantially.

That is a longer road than solving competition mathematics, because the specifications themselves are hard to write and the formalization of real software semantics is an open research problem. The tooling for specifying and verifying Rust or C code in Lean exists but is incomplete. Still, the infrastructure that Leanstral represents, a working agent loop over a modern proof assistant with machine-readable feedback, is the right foundation to build on.

The formal methods community has spent decades building tools that most software developers never use because the interaction model is too demanding. Writing Coq proofs requires learning a specialized tactic language, understanding the type theory well enough to guide the elaborator, and tolerating a development cycle that is substantially slower than writing tests. LLM agents that can handle much of the tactic-level work lower that barrier. Whether developers will adopt formal verification workflows at scale is a separate question. But the tooling gap has been the main obstacle, and it is getting smaller.

Was this interesting?