Programs Are Proofs: The Type Theory Behind Leanstral's Trustworthiness
Source: hackernews
Mistral AI’s Leanstral is described as an agent for “trustworthy coding and formal proof engineering.” The word trustworthy has been diluted by marketing, but here it carries a specific technical meaning rooted in Lean 4’s type theory. Understanding what that theory actually proves, and what it does not, is the most useful frame for evaluating what Leanstral delivers.
The Curry-Howard Correspondence, Concretely
The mathematical foundation of Lean 4 is the Calculus of Constructions with inductive types, described in de Moura and Ullrich’s 2021 CADE paper. The key insight this type theory encodes is the Curry-Howard correspondence: propositions and types are the same thing, proofs and programs are the same thing. In Lean 4, this is not a metaphor. A proof is literally a term of a type, and the kernel that checks correctness is the same typechecker that checks programs.
Consider the proposition “if P and Q then Q and P.” In type theory, this is the function type P ∧ Q → Q ∧ P. A proof of this proposition is a function of that type:
```lean
theorem and_swap (P Q : Prop) : P ∧ Q → Q ∧ P :=
  fun ⟨hp, hq⟩ => ⟨hq, hp⟩
```
This is syntactically almost indistinguishable from a function definition. The theorem keyword is sugar for def with the return type living in Prop, Lean 4’s universe of propositions. The kernel does not run a separate proof-checking algorithm; it runs the same typechecker it uses for everything else. Checking correctness of a proof and checking type safety of a program are literally the same operation.
The logical connectives follow the same pattern. The universal quantifier ∀ x : α, P x is a dependent function type: a function that takes an x : α and returns a proof of P x. The existential quantifier ∃ x : α, P x is an inductive type of dependent pairs: a value x : α packaged with a proof that P x holds. Equality a = b is an inductive type with a single constructor rfl for reflexivity. Every logical connective is either a function type or an inductive type in Lean 4.
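A short sketch of each connective as a term, using trivial arithmetic facts chosen purely for illustration:

```lean
-- ∀ is a dependent function type: the proof is a function of the witness.
theorem add_zero_all : ∀ n : Nat, n + 0 = n :=
  fun n => rfl  -- n + 0 reduces to n by the definition of Nat.add

-- ∃ is introduced by pairing a witness with a proof about that witness.
theorem two_doubles_to_four : ∃ n : Nat, n + n = 4 :=
  ⟨2, rfl⟩

-- Equality's single constructor rfl proves definitionally true equations.
example : 1 + 1 = 2 := rfl
```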
Verified Software in a Single Language
In Coq, the classical workflow for verified software is: write a function, prove theorems about it, then extract the verified function to OCaml or Haskell for execution. Verification and runtime live in different places. Lean 4 eliminates this split. Functions compile to native code (the compiler emits C, which is then compiled to machine code), and theorems about those functions are checked by the same kernel. There is no extraction step; the verified program is the program.
Here is what a verified implementation of insertion into a sorted list looks like:
```lean
def insertSorted (x : Nat) : List Nat → List Nat
  | [] => [x]
  | y :: ys =>
    if x ≤ y then x :: y :: ys
    else y :: insertSorted x ys

theorem insertSorted_length (x : Nat) (xs : List Nat) :
    (insertSorted x xs).length = xs.length + 1 := by
  induction xs with
  | nil => simp [insertSorted]
  | cons y ys ih =>
    simp [insertSorted]
    split <;> simp_all [Nat.succ_add]
```
insertSorted is a real callable function; it compiles and runs. insertSorted_length is a proof, checked by the kernel, that inserting any element into any list produces a list exactly one element longer. If the implementation changes in a way that breaks this property, the proof fails at compile time. There is no separate verification step and no separate test run; the guarantees are checked whenever the code compiles.
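Because there is no extraction step, the definition the theorem talks about is itself directly executable:

```lean
-- The insertSorted function defined above, evaluated as-is:
#eval insertSorted 3 [1, 2, 4, 5]  -- [1, 2, 3, 4, 5]
```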
This is what “coding and formal proof engineering” means in practice when both live in Lean 4. Lean 4’s compiler also applies the FBIP (functional but in-place) optimization described in its reference-counting work: values with a reference count of one are updated in place rather than copied. This means the functional style in which you write verified Lean 4 code can compile down to near-imperative performance, further collapsing the traditional gap between verified specifications and production code.
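A sketch of where this matters: Lean 4’s Array update reuses the existing memory cell when the value is unshared, so the functional loop below avoids per-iteration copying. (The function name is illustrative, not from the source.)

```lean
-- Functional in the source, in-place at runtime: when `acc` holds the only
-- reference to the array at each step, Array.set! mutates it destructively
-- instead of copying.
def zeroOut (xs : Array Nat) : Array Nat :=
  (List.range xs.size).foldl (fun acc i => acc.set! i 0) xs
```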
What Prop Does at Runtime
One mechanical detail matters for understanding the efficiency of this approach: Prop is proof-irrelevant and computationally erased. The kernel distinguishes between Sort 0 (which is Prop, the universe of propositions) and Sort 1 and above (which are Type, the universes of data). Proof terms in Prop are erased at runtime. This means that carrying proofs alongside programs does not impose a runtime cost; the proofs exist at compile time for the kernel to check, and then they are gone.
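A sketch of erasure using a subtype, which bundles a piece of data with a proof about it:

```lean
-- A Nat bundled with a Prop-level proof that it is positive.
def three : { n : Nat // n > 0 } := ⟨3, by decide⟩

-- At runtime only the Nat survives; the proof component is erased,
-- so carrying it imposes no cost.
#eval three.val  -- 3
```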
The decide tactic exploits this directly. A Decidable proposition can be evaluated by the kernel at compile time:
```lean
#eval (10 : Nat) < 20           -- true
example : 10 < 20 := by decide  -- kernel evaluates this at compile time
```
For propositions with a Decidable instance, the proof is a computation the kernel runs. For more complex goals, tactics like omega (linear arithmetic over integers and naturals, in core Lean 4) and Mathlib’s ring (polynomial arithmetic) and norm_num (numeric computations) dispatch entire goal shapes automatically. An AI agent operating over Lean 4 proof states learns to recognize which automation applies to which goal shapes, reducing the proof search problem to premise selection and decomposition for the parts these tactics do not cover.
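For example, omega closes linear-arithmetic goals in one step that would otherwise need a chain of hand-picked lemmas:

```lean
-- omega decides linear arithmetic over Nat and Int, so goals of this
-- shape need no manual lemma selection.
example (a b : Nat) (h : a ≤ b) : 2 * a + 1 ≤ 2 * b + 1 := by omega
example (x : Int) : x < x + 1 := by omega
```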
The Scope of the Guarantee
The kernel guarantees that a well-typed term has the type it claims. For theorems, this means: if the kernel accepts a proof of proposition P, then P holds in Lean 4’s type theory under its assumed axioms: propositional extensionality, quotient soundness, and the axiom of choice (functional extensionality is derived from the quotient axiom). These are uncontroversial for software verification.
For pure total functions, this amounts to total correctness. A proof that a sort function returns a sorted permutation of its input holds on all inputs with no exceptions, no undefined behavior, and no runtime assumptions. The kernel verified it statically.
The limits are worth stating precisely:
- Termination: Lean 4 requires structural recursion, well-founded recursion, or a termination_by annotation for recursive functions. A function marked partial def is not verified to terminate; it compiles and runs, but its definition is opaque to the logic, so nothing about its behavior can be proved.
- IO: A function of type IO Unit can execute arbitrary side effects. The type system tracks that effects are present but does not constrain their behavior. Proving properties of IO programs requires modeling the operations in the type theory, which is possible but substantially more work.
- External state: External systems, hardware, network behavior, and the operating system exist outside the type theory. Properties that depend on them require explicit modeling in the specification.
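The termination caveat, sketched (loopUntil is a hypothetical name):

```lean
-- Compiles and runs, but because it is marked `partial`, the kernel does
-- not verify termination and the definition cannot be unfolded in proofs.
partial def loopUntil (p : Nat → Bool) (n : Nat) : Nat :=
  if p n then n else loopUntil p (n + 1)
```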
The specification gap sits at the base of all this. The kernel verifies that a proof term has the type corresponding to a stated proposition. If the proposition misdescribes the intent, the proof is a certified wrong answer. Verifying that a parser correctly handles all valid inputs does nothing to check whether “valid input” in the formal specification matches what the protocol actually produces. That translation from informal requirements to formal types is not something the kernel touches.
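A minimal sketch of the gap, with isValid as a deliberately oversimplified, hypothetical specification:

```lean
-- Suppose the formal spec of a "valid packet" is just nonemptiness.
def isValid (packet : List UInt8) : Bool := !packet.isEmpty

-- Kernel-checked and true, but it certifies the spec as written. If real
-- validity also requires a header or checksum, this proves the wrong thing.
theorem valid_packets_nonempty (p : List UInt8) (h : isValid p = true) :
    p ≠ [] := by
  intro heq
  subst heq
  simp [isValid] at h
```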
Where Leanstral Sits in This Architecture
Every AI proof system, from DeepMind’s AlphaProof to LeanDojo’s ReProver to COPRA, occupies the same position in this architecture: the AI generates tactic sequences, and the kernel accepts or rejects the resulting proof term. Leanstral is the same, with Mistral’s open-weight model doing the generation.
The Curry-Howard correspondence is what makes the kernel’s feedback trustworthy as a training and verification signal. The kernel is not running heuristics. It is checking whether a term has a type, which is a decidable question with a deterministic answer in Lean 4’s type theory. When the kernel accepts, the proof is correct. When it rejects, the model gets a precise error at the exact proof state where the reasoning broke down. That precision is what makes reinforcement learning with kernel feedback work: the reward signal is unambiguous and comes with diagnostic information.
Mistral’s release of Leanstral as open source adds a locally deployable, auditable model to a space previously served by AlphaProof (proprietary, cloud-only) and academic systems. The formal verification use cases where this matters most (cryptographic protocol verification, safety-critical embedded systems, compiler correctness) tend to require exactly those properties: local inference, auditable weights, and no dependency on an external API.
The type theory is the fixed point in all of this. Models change, training data improves, search strategies evolve. The kernel stays the same, running the same typechecking algorithm it has always run. That is the architecture that makes the trustworthiness claim structural rather than a benchmark number.