Mistral AI’s Leanstral is an open-source agent for formal proof engineering in Lean 4, and its headline capability is proof search: given a formal theorem statement, the agent finds a sequence of tactics that closes the proof, with the Lean 4 kernel validating each step. Most coverage of this kind of system focuses on proof search mechanics, kernel architecture, and benchmark performance on competition mathematics. The less examined question is where formal propositions come from in the first place, which leads to a research problem that proof search systems don’t address: converting informal descriptions of what software should do into formal Lean 4 specifications.
The Front End of the Verification Pipeline
When DeepMind’s AlphaProof solved four of six problems at IMO 2024, achieving the equivalent of a silver medal, each problem’s starting point was a formal Lean 4 statement formalized by DeepMind researchers from the competition text. The proof search system operated on that formalized input; the formalization itself was human work.
For mathematical research and competition problems, this division of labor is manageable. A small number of expert researchers can formalize the statements; the AI handles the proof search. For software verification, the arithmetic changes. The correctness specification for a software function is a formal Lean 4 proposition describing its behavior over all inputs. Writing that proposition from a docstring, a set of tests, or an informal requirements document is the autoformalization problem, and it scales with the number of functions to verify rather than the number of available researchers.
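To make the target concrete, here is a hedged sketch of what a software correctness specification looks like as a Lean 4 proposition. `Sorted` and `sortSpec` are illustrative names invented for this example, not from Mathlib or any existing library; `List.Perm` is the standard permutation relation on lists.

```lean
-- Illustrative only: an ordering predicate on lists of naturals.
def Sorted : List Nat → Prop
  | [] => True
  | [_] => True
  | a :: b :: rest => a ≤ b ∧ Sorted (b :: rest)

-- The proposition a verifier must prove about a candidate `sort`:
-- for all inputs, the output is ordered and is a permutation of the input.
def sortSpec (sort : List Nat → List Nat) : Prop :=
  ∀ xs : List Nat, Sorted (sort xs) ∧ (sort xs).Perm xs
```

Writing `sortSpec` from the docstring "sorts a list" is exactly the translation step that scales with the number of functions, not the number of researchers.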
Mathlib, Lean 4’s community mathematics library, contains over 150,000 formally stated theorems written by human experts over years. It is simultaneously Leanstral’s primary training corpus and its most demanding benchmark. The software verification analog of Mathlib, a library of formally stated correctness specifications for common software functions, does not yet exist. Building it requires either human experts writing formal specifications or models that can formalize informal ones reliably.
What Autoformalization Research Has Found
The Draft, Sketch, Prove paradigm addressed autoformalization at the proof level: an LLM generates an informal proof sketch from a formal theorem statement, then a separate step converts that sketch into a Lean tactic sequence. The informal sketch carries structure that guides tactic selection, reducing the effective search space. This is autoformalization operating narrowly, translating a structured natural-language argument into formal tactics, rather than translating a natural-language specification into a formal proposition.
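A hedged toy illustration of that shape: the comment plays the role of the informal sketch, and the sketch step is discharged by concrete library lemmas. The goal here is invented for illustration; real Draft, Sketch, Prove instances operate on far larger proofs.

```lean
-- Sketch-guided proof: the natural-language step narrows tactic choice.
example (a b c : Nat) (h : a ≤ b) : a ≤ b + c := by
  -- Sketch: b ≤ b + c because adding c cannot decrease b;
  -- then chain with h by transitivity.
  exact Nat.le_trans h (Nat.le_add_right b c)
```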
Direct autoformalization of theorem statements from natural language has been evaluated in the context of competition mathematics and undergraduate-level problems. The ProofNet benchmark, which pairs informal mathematics statements from textbooks with their formal Lean 4 counterparts, found that even large models produced correct formalizations only a fraction of the time on undergraduate problems. The failure modes are systematic rather than random: models consistently struggle with quantifier scope that is ambiguous in natural language but must be made precise in Lean 4, with implicit domain conventions that need to be stated explicitly, and with edge cases not mentioned in the informal statement but required by the formal semantics.
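The quantifier-scope failure mode is easy to exhibit. The informal claim "every element of the list is below some bound" has two non-equivalent readings that natural language does not distinguish but Lean 4 must; the definitions below are illustrative names, not library code.

```lean
-- Reading 1: a bound may be chosen per element (trivially true: b = x + 1).
def boundedEach (xs : List Nat) : Prop :=
  ∀ x ∈ xs, ∃ b, x < b

-- Reading 2: one bound must work for the whole list (a real constraint).
def boundedUniform (xs : List Nat) : Prop :=
  ∃ b, ∀ x ∈ xs, x < b
```

A model that picks the wrong reading produces a well-typed proposition that the kernel will happily accept proofs of, which is precisely why these errors are hard to catch downstream.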
These failures matter because they are the same failures that arise in software specification autoformalization. A function described as “returns the largest element” has implicit assumptions about non-empty input, the ordering used, and the return behavior on ties, none of which need to be stated in natural language but all of which must be fixed in a Lean 4 proposition.
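Two hedged, non-equivalent formalizations of that informal description, for a hypothetical function `f`; both names are invented for illustration. They disagree on empty input and on whether the result must itself occur in the list.

```lean
-- Weak reading: the result bounds every element.
-- Holds vacuously when xs = [], so it says nothing about empty input.
def specUpperBound (f : List Nat → Nat) : Prop :=
  ∀ xs, ∀ x ∈ xs, x ≤ f xs

-- Stronger reading: the result must also be a member of the list,
-- which forces the empty case to be excluded explicitly.
def specMember (f : List Nat → Nat) : Prop :=
  ∀ xs, xs ≠ [] → f xs ∈ xs ∧ ∀ x ∈ xs, x ≤ f xs
```

Note that tie-breaking never appears in either proposition: over `Nat` with `≤`, "largest" is unambiguous, but for a function returning an index or a record, a third decision would be needed.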
The Specification Alignment Problem
Even a correctly formalized proposition may not capture the programmer’s intent, which is the deeper version of the specification problem. Consider specifying a memory allocator: an informal statement like “the allocator returns memory that does not overlap with existing allocations” is precise enough to be useful but requires multiple formal decisions before it becomes a Lean proposition. What does overlap mean at the byte level versus the object level? What happens after deallocation? What assumptions hold about concurrent access?
Each of these decisions produces a different Lean proposition, and each is kernel-verifiable once chosen. None can be derived from the informal statement alone. A system that autoformalized the allocator specification into any one of these propositions and proved it would have produced a kernel-certified proof of the wrong thing, in a form that passes every formal check while missing the programmer’s intent.
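A hedged sketch of how two of those decisions surface as Lean definitions. `Block`, `disjoint`, and `allocSpec` are invented for illustration; a real memory model would also have to fix deallocation and concurrency semantics.

```lean
-- A toy model of an allocation as a half-open address range.
structure Block where
  base : Nat
  size : Nat

-- Decision 1: "overlap" means byte-level overlap of [base, base + size).
def disjoint (a b : Block) : Prop :=
  a.base + a.size ≤ b.base ∨ b.base + b.size ≤ a.base

-- Decision 2: "existing allocations" means exactly the live blocks;
-- the freshly returned block must be disjoint from each of them.
def allocSpec (live : List Block) (fresh : Block) : Prop :=
  ∀ b ∈ live, disjoint fresh b
```

Substituting object-level overlap for `disjoint`, or a history of allocations and frees for `live`, yields different propositions that are each kernel-checkable once chosen.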
This is not a failure mode specific to AI. It is why formal methods have historically required significant domain expertise: deciding exactly what to prove is as hard as proving it. What changes with AI assistance is the cost profile. If a system can autoformalize an informal specification and then explain the formalization back in natural language, the programmer can catch misalignments before proof search begins. That bidirectional workflow, informal to formal and formal to informal, is what would make the verification pipeline practical at engineering scale.
The Informalization Direction
There is a complementary research direction that receives less attention: informalization, generating natural language explanations from formal Lean 4 proofs and propositions. This matters for specification alignment because a programmer who cannot read Lean has no way to verify that an autoformalized specification captures their intent. A system that produces both a formal proof and a readable explanation of what was proved gives the programmer something to audit.
LeanCopilot, an AI assistant for Lean 4 developed at Caltech, includes features for explaining proof steps in natural language. The explanations are useful for simple lemmas; for complex proofs involving deep Mathlib dependencies, they can be as opaque as the formal proof itself. Informalization is in some ways harder than formalization: the model must understand what a proof is doing well enough to describe it at an abstraction level that matches the programmer's mental model rather than the kernel's.
A practical verification assistant needs both directions. You describe intent informally; the AI formalizes it into a Lean proposition, explains that proposition back in terms you can check, and then searches for the proof. The kernel certifies that the proof is valid for the stated proposition; the informalization step is how you verify the stated proposition is the one you wanted.
Where Leanstral Fits in This Picture
Leanstral addresses proof search, the middle segment of the verification pipeline. The input is a formal Lean 4 proposition; the output is a proof term accepted by the kernel. The 2021 Lean 4 design gives the kernel a clear role as arbiter: anything it accepts is correct within the axiom system, regardless of how it was produced. This makes proof search tractable for AI because the reward signal is unambiguous, and it is why AlphaProof could train via reinforcement learning using only the kernel’s binary accept/reject signal without human annotation of proof correctness.
The front end, converting informal software specifications into formal propositions, has no equivalent arbiter. Whether an autoformalization correctly captures intent requires the programmer’s judgment. The back end, explaining formal proofs to programmers who need to verify specification alignment, is limited by the same gaps that limit autoformalization.
Mistral’s open-source release makes proof search locally deployable for organizations that require it, particularly in aerospace, cryptographic verification, and safety-critical systems where cloud API access is excluded on certification or confidentiality grounds. Lean 4 also compiles to native binaries via LLVM, eliminating the extraction gap in Coq and Isabelle/HOL, where a verified specification must be translated into a separate runtime language: the verified output is a deployable artifact rather than a certificate handed off to another toolchain.
Current state-of-the-art proof search systems score in the 65-70% range on the miniF2F test split, a set of 244 competition-level mathematics problems with carefully formalized statements. The benchmark measures what these systems do well. Whether AI-assisted autoformalization reaches the point where formal specifications can be drafted from informal requirements with acceptable correction overhead is what determines whether systems like Leanstral move from research infrastructure to routine engineering tools. Proof search is increasingly a solved problem for well-stated propositions. The problem of stating them well is where the work is.