
The LCF Lineage: Why Milner's 1972 Design Makes Leanstral Trustworthy by Construction

Source: hackernews

Mistral’s Leanstral carries the word “trustworthy” in its product framing. In most AI announcements, that word signals a benchmark claim: this model is less likely to produce wrong outputs than the previous one. In Leanstral’s case the word is precise in a way that is worth tracing carefully, because the precision does not come from Mistral. It comes from Robin Milner, working in Edinburgh in 1972.

Edinburgh LCF and the Invention of the Abstract Proof Type

The Logic for Computable Functions system, developed by Milner and colleagues at Edinburgh and published in its full form in 1979, was built to mechanize Dana Scott’s logic of domain theory. The core engineering problem was straightforward to state and hard to solve: how do you write a proof assistant where arbitrary user-supplied code can attempt to construct proofs, without that code being able to produce an invalid proof by accident or by malice?

Milner’s solution was to define thm as an abstract data type. In the ML programming language, which Milner invented as the metalanguage for LCF (hence the name ML), a value of type thm can be constructed only by calling the inference rule functions that the kernel exports. There is no other constructor. The type system enforces this: you cannot build a thm value by hand, pattern-match on an existing one and assemble a fake, or cast from another type. The only way to obtain a thm is to call a rule that actually checks the logical step. Every thm that exists in a running LCF session is a certificate that the corresponding proposition was derived validly.
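The shape of the pattern can be sketched outside ML. The toy below is Python, which cannot actually enforce the abstraction the way ML's type system does, so take it only as an illustration: a kernel-private capability guards the Theorem constructor, and only the exported inference rules can mint certificates. All names here are invented for the sketch.

```python
# Illustrative sketch of the LCF abstract-thm pattern (not real LCF code).
# A Theorem can only be built by the kernel's inference rules, which hold
# a private capability token. Untrusted code can call the rules but
# cannot forge a Theorem directly.

class Theorem:
    """A certificate: holding one means the kernel derived it."""
    _key = object()  # capability token, conceptually private to the kernel

    def __init__(self, prop, key):
        if key is not Theorem._key:
            raise TypeError("Theorems are made only by inference rules")
        self.prop = prop

# --- the trusted kernel: the only code meant to touch _key ---

def axiom_refl(t):
    """Derive |- t = t."""
    return Theorem(("eq", t, t), Theorem._key)

def rule_trans(ab, bc):
    """From |- a = b and |- b = c, derive |- a = c."""
    _, a, b1 = ab.prop
    _, b2, c = bc.prop
    if b1 != b2:
        raise ValueError("middle terms differ: rule rejected")
    return Theorem(("eq", a, c), Theorem._key)

# --- untrusted tactic code: free to try anything, via the rules only ---
t1 = axiom_refl("x")
t2 = rule_trans(t1, axiom_refl("x"))  # |- x = x, validly derived
```

The point of the sketch is the boundary: buggy tactic code can get a ValueError from rule_trans, but every Theorem that exists was produced by a rule that checked its step.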

User-written tactic code, which could be arbitrarily complex, operated by calling these inference rules. If the tactic code was buggy or malicious, it could call rules with wrong arguments and receive errors. It could not produce a thm representing a false proposition. The type system was the enforcement mechanism. The trusted computing base was the small set of rule implementations.

This is the LCF architecture: a small trusted kernel, an abstract type that can only be constructed through that kernel, and an arbitrary amount of untrusted code outside the kernel that is free to attempt anything. Invalid proof attempts are rejected at the type boundary. The kernel does not need to audit the surrounding code. The surrounding code cannot compromise the kernel’s invariants.

ML, the language Milner invented for this purpose, is the direct ancestor of Standard ML, OCaml, and F#. The tactic language of a 1972 proof assistant sits at the root of a significant fraction of the functional programming used in production today.

HOL and the Reduction to Minimal Trusted Code

Mike Gordon at Cambridge took the LCF architecture and applied it to higher-order logic in 1988, producing the HOL proof assistant. The architecture was identical in principle: abstract thm type, small inference kernel, arbitrary user-written tactic code outside. Gordon’s insight was that higher-order logic was expressive enough to encode most of the hardware verification problems his group cared about, while being simple enough to keep the kernel genuinely small.

John Harrison later built HOL Light, a reimplementation of HOL in OCaml with an emphasis on minimizing the trusted core. The HOL Light kernel is approximately 400 lines of OCaml. Those 400 lines are the entirety of what must be audited to trust any proof produced by the system, regardless of how many thousands of lines of tactic code, automation libraries, and user proofs sit above them. Harrison used HOL Light to participate in the Flyspeck project, the machine-assisted verification of Thomas Hales’s proof of the Kepler conjecture, a proof that in informal form had been notoriously difficult for human reviewers to fully check.

The HOL family demonstrated something important: the LCF kernel architecture does not impose a ceiling on the complexity of proofs that can be produced above it. You can verify competition mathematics, hardware circuits, and the packing of spheres in three-dimensional space, all with a trusted core small enough to read in an afternoon.

The Independent Tradition: de Bruijn and AUTOMATH

Parallel to the LCF lineage, Nicolaas de Bruijn at Eindhoven developed AUTOMATH starting in 1967. Where LCF used the ML type system to enforce proof validity, AUTOMATH used the Curry-Howard correspondence: terms are proofs, types are propositions, and type-checking is proof-checking. The separation of concerns was explicit in what came to be called the de Bruijn criterion: proof checking should be separated from proof finding, and the trusted core responsible for checking should be kept as small as possible.
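The Curry-Howard reading is easiest to see in a modern descendant of this tradition. The Lean 4 snippet below is shown only to illustrate the correspondence (AUTOMATH's own notation differed): the proposition is a type, the proof is a term of that type, and checking the term is checking the proof.

```lean
-- Proposition-as-type: proving p ∧ q → q ∧ p means exhibiting
-- a term of that type. The type-checker that validates this
-- term is, by the Curry-Howard correspondence, the proof-checker.
theorem and_swap (p q : Prop) : p ∧ q → q ∧ p :=
  fun h => ⟨h.right, h.left⟩
```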

De Bruijn’s influence on the contemporary ecosystem is pervasive. The de Bruijn index representation for bound variables, which eliminates naming issues under substitution, appears in the kernels of Coq, Lean 4, and most serious proof assistants. The criterion itself is a design constraint that the Lean 4 team explicitly cites.
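A minimal sketch of the representation, with illustrative Python names: a bound variable is stored as the number of binders between its occurrence and the lambda that binds it, so alpha-equivalent terms are syntactically identical and substitution needs only index shifting, never renaming.

```python
# De Bruijn-indexed lambda terms: Var(n) refers to the binder n
# lambdas out from the occurrence. Binders carry no names at all.

from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    idx: int

@dataclass(frozen=True)
class Lam:
    body: object

@dataclass(frozen=True)
class App:
    fn: object
    arg: object

def shift(t, d, cutoff=0):
    """Add d to every free variable index >= cutoff."""
    if isinstance(t, Var):
        return Var(t.idx + d) if t.idx >= cutoff else t
    if isinstance(t, Lam):
        return Lam(shift(t.body, d, cutoff + 1))
    return App(shift(t.fn, d, cutoff), shift(t.arg, d, cutoff))

def subst(t, j, s):
    """Replace variable j in t with s; shifting makes it capture-avoiding."""
    if isinstance(t, Var):
        return s if t.idx == j else t
    if isinstance(t, Lam):
        return Lam(subst(t.body, j + 1, shift(s, 1)))
    return App(subst(t.fn, j, s), subst(t.arg, j, s))

def beta(app):
    """Beta-reduce an application whose function is a lambda."""
    return shift(subst(app.fn.body, 0, shift(app.arg, 1)), -1)

# \x. x and \y. y are literally the same term: no renaming needed.
assert Lam(Var(0)) == Lam(Var(0))
```

This is the naming discipline the article attributes to Coq's and Lean 4's kernels: alpha-equivalence becomes structural equality, and substitution is a pair of mechanical index adjustments.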

Coq and the Calculus of Constructions

INRIA’s Coq emerged from Thierry Coquand and Gérard Huet’s work on the Calculus of Constructions in the mid-1980s, drawing from Per Martin-Löf’s type theory. The first version appeared in 1989. Coq’s kernel implements the Calculus of Inductive Constructions, and it is kept small enough to audit. Proofs in Coq are terms; the kernel type-checks terms; user-written tactics in Ltac generate those terms. The architecture is the same LCF pattern: small trusted type-checker, arbitrary untrusted tactic code, abstract proof certificates.

Coq was used to verify the CompCert C compiler and the four-color theorem. Both results are kernel-certified in precisely the LCF sense: the validity does not depend on the correctness of the tactic scripts or automation that found the proofs. It depends only on the kernel’s type-checker, which is small enough to reason about independently.

Lean 4: The LCF Architecture at Scale

Leonardo de Moura and Sebastian Ullrich’s 2021 CADE paper describes Lean 4 as a ground-up redesign combining the proof assistant and programming language into a single system. The kernel is a few thousand lines of C++. It implements a variant of the Calculus of Constructions with inductive types and universe polymorphism. Every proof term, no matter how it was constructed, passes through this kernel before being accepted. Tactics in Lean 4 are written in the TacticM monad and are themselves Lean 4 programs, making them subject to the same type-checking as everything else. They are still untrusted in the LCF sense: they produce proof term candidates that the kernel accepts or rejects.
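The division shows up directly in the surface language. In the small example below, the `by` block is tactic code, untrusted search that runs during elaboration; what the kernel actually checks is the proof term the tactics produce (which `#print` would display).

```lean
-- The `by` block is untrusted: it runs in the elaborator and
-- produces a candidate proof term. The kernel then type-checks
-- that term; only kernel acceptance makes this a theorem.
theorem add_comm' (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```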

Lean 4’s Mathlib contains over 200,000 theorems, all kernel-verified. The LSP-first design means a running Lean 4 session exposes proof state as structured data that any external process can consume and respond to. This interface is what makes AI agents composable with the proof engine without requiring any changes to the trusted kernel.

Where the LLM Fits in This Architecture

The LCF architecture was designed to accommodate untrusted ML code. The ML code could be arbitrarily sophisticated, could use global state, could call external oracles, could do anything a program can do. As long as it could only produce thm values by going through the inference rules, it could not compromise the kernel’s guarantees. The kernel was the boundary between trusted and untrusted.

An LLM generating tactic sequences for Lean 4 is, in architectural terms, untrusted ML code. It is exactly the entity that LCF was designed to accommodate. The LLM can propose any tactic it likes. It can generate a hundred wrong tactic sequences in a row. It can hallucinate lemma names, produce syntactically valid tactics that are semantically useless, or confidently suggest a proof strategy that terminates in a dead end. None of this matters to the kernel’s guarantee. The kernel rejects invalid proof terms and accepts valid ones. The LLM’s confidence score, its benchmark performance, and the quality of its RLHF training are all irrelevant to whether a specific proof is accepted. The accept signal is a decision procedure, not a probability estimate.
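The resulting division of labor can be sketched as a loop. In the toy below, `propose` and `checker` are invented stand-ins for the LLM and the kernel respectively (nothing here is Lean's actual API): the proposer is stochastic and fallible, the checker is a decision procedure, and only checked candidates survive.

```python
# Generate-and-check: an untrusted stochastic proposer paired with a
# deterministic checker. The proposer's quality affects only how long
# the search takes, never what gets accepted.

import random

def checker(candidate):
    """Decision procedure standing in for the kernel.
    Toy rule: a 'proof' is valid iff its steps sum to the goal."""
    goal, steps = candidate
    return sum(steps) == goal

def propose(goal, rng):
    """Untrusted, stochastic proposer (the LLM's architectural slot)."""
    return (goal, [rng.randint(1, goal) for _ in range(rng.randint(1, 4))])

def search(goal, attempts=10_000, seed=0):
    rng = random.Random(seed)
    for i in range(attempts):
        cand = propose(goal, rng)
        if checker(cand):           # binary verdict, no confidence score
            return i + 1, cand      # only accepted candidates survive
    return None

tries, proof = search(7)
assert checker(proof)               # acceptance is exact, not probable
```

The failed attempts before the accepted one are discarded; as the article notes, they are not recorded anywhere that matters to the guarantee.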

This is the precise sense in which Leanstral’s trustworthiness claim is structural rather than statistical. DeepMind’s AlphaProof used this property as the reward signal for reinforcement learning: the kernel’s binary accept/reject verdict required no human annotation of proof correctness. The RL training loop worked because the kernel is a total function with no gray area. Leanstral inherits this same property. So does every AI system built on any proof assistant that follows the LCF architecture.

The Comparison That Clarifies the Difference

Consider the tools that software teams use to gain confidence in code. Static analysis tools report findings that carry false positive rates; a finding from a SAST tool is evidence, not proof. Test suites cover a finite set of inputs; a green test suite is a probabilistic claim about behavior on untested inputs, not a guarantee. Code review finds many bugs and misses some; the miss rate is nonzero and not precisely known. RLHF reduces the frequency of bad outputs in a language model; the reduction is real but the floor is not zero.

Every one of these approaches provides probabilistic confidence that degrades gracefully as coverage and quality improve. They are useful precisely because they push probabilities toward acceptable levels. The difference between a good test suite and a bad one is a matter of degree.

The Lean kernel does something different in kind. It decides whether a proof term has a given type. The answer is yes or no, computed deterministically, with no rate parameter. The question of whether a proof is valid is not a question the kernel answers probabilistically. The 50-year-old insight from Edinburgh is that you can build a system where the AI or the user tactic code operates entirely in the probabilistic regime, generating candidates, exploring search trees, trying things, while the acceptance of any result operates in the decision-procedure regime. The two regimes coexist because the kernel is the boundary between them.

A Lean proof accepted by the kernel is correct within the axiom system. An LLM that generates 99 wrong tactic sequences before finding one the kernel accepts has produced exactly one correct proof. The 99 failures are not recorded anywhere that matters. This is not a property that RLHF can replicate, not because RLHF is inadequate, but because RLHF is solving a different problem. RLHF makes generation better. The LCF kernel makes acceptance exact.

LeanDojo from MIT and Caltech built an open gym-style environment around this property, exposing proof states as observations and treating tactic acceptance as the reward. The miniF2F benchmark of 488 competition mathematics problems has seen state-of-the-art systems reach the 65 to 70 percent range. Every accepted proof in those benchmarks is kernel-certified in precisely Milner’s 1972 sense. The LLM is doing what tactic code always did in the LCF tradition: attempting proofs, being rejected most of the time, and occasionally producing a term the kernel accepts.
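The gym-style shape is simple to state even with a toy stand-in for the kernel. The environment below is purely illustrative and is not LeanDojo's API: observations stand in for proof states, actions for tactics, and the reward is the binary accept signal that needs no human annotation.

```python
# Gym-style interface around a kernel-like verdict. Observations are
# 'proof states', actions are 'tactics', reward is binary acceptance.
# Toy goal: reduce n to 0 using the tactics 'dec' and 'halve'.

class ToyProofEnv:
    def reset(self, goal=12):
        self.state = goal
        return self.state                    # the observed 'proof state'

    def step(self, tactic):
        if tactic == "dec":
            self.state -= 1
        elif tactic == "halve" and self.state % 2 == 0:
            self.state //= 2
        # any other tactic is rejected: state unchanged, no reward
        done = self.state == 0               # kernel-style accept
        return self.state, (1 if done else 0), done

env = ToyProofEnv()
obs = env.reset(12)
for tactic in ["halve", "halve", "dec", "dec", "dec"]:
    obs, reward, done = env.step(tactic)
assert done and reward == 1
```

The reward here has the property the article attributes to AlphaProof's training signal: it is computed by a total decision procedure, so the RL loop needs no labeled proof-correctness data.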

Mistral’s contribution with Leanstral is making this architecture locally deployable and open-weight. The LCF design pattern is 50 years old. The insight that an AI occupies the same architectural slot as untrusted ML code is not a metaphor; it is a literal description of how the type theory works. The trustworthiness guarantee that results from this is the same guarantee that Milner built in Edinburgh, extended through HOL, Coq, and into Lean 4’s kernel. The LLM is new. The guarantee is not.
