· 7 min read ·

Leanstral Fills the Gap AlphaProof Left: Open-Source AI Proof Engineering for Lean 4

Source: hackernews

The release of Leanstral by Mistral AI puts a major model provider directly into the Lean 4 formal verification ecosystem, which has been building momentum mostly without industry backing. The question worth asking is not whether AI can assist with theorem proving, since the answer to that has been clear for several years, but why Lean 4 specifically keeps emerging as the platform where this work converges, and what changes when a capable open-source agent finally commits to it seriously.

What Lean 4 Offers That Other Proof Assistants Don’t

Lean 4 is the product of a complete rewrite from Lean 3, described in the 2021 paper by Sebastian Ullrich and Leonardo de Moura. The rewrite was justified partly on performance grounds, but the architectural decision that matters most for AI integration is the unification of the proof and programming layers. In Lean 4, a proof is a term, and a term is potentially a program. The type elaborator operates uniformly on both. When an AI agent generates a Lean 4 proof, it is generating code in a language that the kernel can evaluate immediately, at each incremental step.

This property is not unique to Lean 4 among dependently-typed languages. Coq and Agda share the underlying Curry-Howard correspondence. What Lean 4 adds is ergonomics: a well-designed tactic language that makes the structure of a proof explicit and machine-readable, and a REPL that exposes the full proof state after each tactic. The proof state after each step is structured as a list of remaining goals, each with named hypotheses and a target type. An LLM interacting with Lean 4 gets feedback that looks like this:

Goals remaining:
⊢ n + 0 = n
Context:
n : Nat

This is feedback a language model can use. The goal is typed and fully elaborated. The context is explicit. A generation step that produces simp will either close the goal (success, move on) or fail with a message explaining precisely why simp could not make progress. The tight feedback loop is what makes agentic proof search tractable. Compare this to generating Python: the only structured feedback you get is a runtime error or a test failure, both of which require the whole program to execute. In Lean 4, every tactic is a discrete judgment with an immediate yes-or-no verdict.

The LeanDojo Baseline

The infrastructure for LLM-Lean integration was formalized by LeanDojo, presented at NeurIPS 2023. LeanDojo does two things: it extracts a training dataset from Lean 4 repositories, capturing proof states, tactic invocations, and premise dependencies, and it provides a Python API for interacting with Lean 4 interactively during inference. ReProver, the model trained on LeanDojo’s data, demonstrated that retrieval-augmented tactic generation could close non-trivial goals in Mathlib without human guidance.

LeanCopilot extended this into a usable tool inside the Lean 4 editor: a developer working through a proof can invoke an LLM to suggest or complete the next tactic. This is tactic-level assistance, not full proof generation. The developer still drives the overall proof strategy and decides when a suggested tactic makes sense given the broader goal.

Leanstral appears to push further toward autonomous proof completion. The framing as an agent, rather than a copilot, suggests a system that takes a theorem statement and attempts to produce a complete proof, running the interaction loop without requiring constant user intervention. This is closer to what AlphaProof demonstrated with DeepMind’s 2024 IMO work: given a formal problem statement, run a search over proof strategies until something compiles. AlphaProof was never publicly released. Leanstral being open-source is, in this respect, more significant than any claim about the underlying capability.

Why “Trustworthy Coding” Is a Different Claim Than Theorem Proving

Formal proof of mathematical theorems is established territory. Mathlib contains over a hundred thousand formal theorems covering real analysis, algebra, number theory, and topology. The harder challenge is software: proving properties of programs rather than abstract mathematical claims.

Lean 4 is capable of being a software implementation language, not just a proof substrate. It compiles to efficient native code, supports a C FFI, and has been used to write verified implementations of non-trivial algorithms. A sort function written in Lean 4 can be accompanied by a proof that the output is a permutation of the input and that every adjacent pair satisfies the ordering relation:

theorem sorted_insert (a : α) (l : List α) (h : l.Sorted (· ≤ ·)) :
    (l.insertSorted a).Sorted (· ≤ ·) := by
  induction l with
  | nil => simp
  | cons hd tl ih =>
    simp [List.insertSorted]
    split_ifs with hlt
    · exact List.sorted_cons.mpr ⟨fun x hx => le_trans hlt (List.sorted_cons.mp h).1 x hx, h⟩
    · exact List.sorted_cons.mpr ⟨fun x hx => (List.sorted_cons.mp h).1 x (ih hx ▸ hx), ih h⟩

That proof is not a separate artifact or an external annotation; it lives in the same file and is checked by the same kernel that typechecks the implementation. This is qualitatively different from writing tests or adding runtime assertions.

Leanstral’s trustworthy coding framing aims at this use case: an agent that generates a Lean 4 implementation and then attempts to prove its correctness properties, using the proof repair loop to iterate when the first proof attempt fails. The loop can go in either direction. If the proof reveals a bug, the implementation can be revised. This bidirectional feedback is harder to achieve with external verification tools like Dafny, where the implementation and specification live in different languages and the verification backend is an SMT solver with coverage limitations.

SMT-based verification is faster and more automatic but cannot handle properties that fall outside decidable arithmetic or linear arithmetic fragments. Lean 4’s interactive kernel is slower and requires more guidance but covers the full power of higher-order logic with inductive types. An agent that can navigate this space, choosing when to apply decide (which calls an SMT-like decision procedure) versus when to construct a manual inductive proof, is qualitatively more capable than one that defers everything to a fixed solver.

The Historical Context: Why This Took So Long

Research in AI-assisted theorem proving has a longer history than most coverage suggests. GPT-f from OpenAI in 2020 applied neural sequence-to-sequence models to Metamath, demonstrating that LLMs could generate valid proof steps in a formal system. Hypertree Proof Search from DeepMind in 2022 combined learned policy and value functions with Monte Carlo Tree Search over proof states. Each advance was real but remained disconnected from practical development workflows.

The missing piece was infrastructure rather than model capability. Building a system that interacts with a proof assistant requires solving non-trivial engineering problems: how do you extract proof states in a machine-readable format, how do you batch queries to the kernel efficiently, how do you manage the branching factor of proof search without exhausting compute. LeanDojo solved these problems for Lean 4 and published the solutions. The community of researchers who can now build on that infrastructure is larger than the community that could build everything from scratch.

What Mistral brings is a capable model with apparent fine-tuning on Lean proof corpora, packaged as an open-source agent. Earlier tools required researchers to bring their own model and integrate it with LeanDojo’s Python API. Leanstral is closer to a deployable system, which changes the kind of practitioner who can adopt it.

What the Open-Source Release Changes

Research in AI theorem proving has historically fragmented around model access. The best-performing systems tend to be closed, because the underlying model is proprietary or because the infrastructure was never published. AlphaProof is the clearest example: the results were remarkable, but the system was not released, and the community could not build on it.

An open-source release from Mistral changes this dynamic concretely. Researchers who want to extend proof search strategies, fine-tune on domain-specific theorem libraries, integrate Leanstral into CI pipelines for verified software, or adapt it to Mathlib subfields can do so without waiting for API access or reproducing the infrastructure from scratch. The ecosystem effect matters here: more open-source tooling around Lean 4 proof automation increases the pace of Mathlib formalization, which increases available training data, which improves future models in the space.

The connection to Mistral’s code models (Codestral being their prior entry in structured code generation) suggests that Leanstral builds on a model already optimized for syntax-sensitive generation, with additional fine-tuning on Lean-specific corpora. A strong code generation base combined with formal feedback from the Lean kernel is a defensible hypothesis for what gives the agent its capability without requiring entirely novel architecture.

The Remaining Gap

Formal verification remains hard even with an agent doing the heavy lifting. Proving non-trivial properties of real software requires expressing those properties as Lean 4 types, which is still an expert task. Legacy code in C, Java, or Python requires manual encoding before any agent can reason about it formally. A proof the Lean kernel accepts can still correspond to incorrect software if the specification was wrong to begin with, a risk that Lean addresses no better than any other proof assistant.

None of this diminishes what Leanstral represents as a step in the trajectory. The cost of formal verification has been the primary barrier to adoption outside cryptography and operating system kernels. Agents that can absorb part of that cost, closing routine proof obligations automatically and flagging the ones that require human attention, make formal verification viable for a broader class of software than it currently reaches.

The path from tactic suggestion to autonomous proof completion to formally verified code generation has been visible as a direction for years. What has shifted is the availability of open infrastructure, a capable model from a credible lab, and the cumulative weight of Mathlib’s community-built training corpus. Leanstral arrives at a moment when all three are finally in place.

Was this interesting?