
The Reward Function That Can't Be Hacked: What Lean 4's Kernel Changes About AI Training

Source: hackernews

The reward model is RLHF’s persistent weakness. In standard reinforcement learning from human feedback, a separate model learns to approximate human preferences from pairwise comparisons. That learned reward model is then used to train the policy. The problem is a version of Goodhart’s law: optimizing hard against any proxy for the thing you want tends to produce a policy that scores well on the proxy while drifting from the underlying intent. Models produce confident explanations of incorrect facts because the reward model learned that confident explanations feel correct to annotators, not that confident explanations are correct. Scaling RLHF adds computation and annotation budget to the same fundamentally noisy signal.

Lean 4’s kernel is not a learned approximation of anything. It is a decision procedure for a specific type theory. Leanstral from Mistral AI is built around this property, and it changes what “trustworthy AI output” means in a technically precise way that the phrase usually does not carry.

The Kernel as a Formal Reward Function

The Lean 4 kernel, described in Leonardo de Moura and Sebastian Ullrich’s 2021 CADE paper, is a few thousand lines of C++ implementing a type checker for a variant of the Calculus of Constructions. Every proof term, whether produced by a human mathematician or an AI agent, must pass through this kernel. The verdict is binary: the term either has the claimed type or it does not. There are no edge cases, no probabilistic scores, no cases where the kernel is mostly confident. The theorem n + m = m + n is proved or it is not.
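The binary verdict is easiest to see on the kernel's own terms. A minimal illustration in Lean 4 (the theorem names are invented for this example; `Nat.add_comm` is in Lean's core library):

```lean
-- The kernel accepts this: the term `Nat.add_comm n m` has exactly the claimed type.
theorem add_comm_example (n m : Nat) : n + m = m + n :=
  Nat.add_comm n m

-- The kernel would reject this outright: `rfl` does not have the claimed type
-- for arbitrary n and m. There is no partial credit, only an error.
-- theorem add_comm_wrong (n m : Nat) : n + m = m + n := rfl
```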

This makes the kernel a reward function in the reinforcement learning sense, a mapping from an agent’s action to a scalar reward. For AI-assisted proof generation, the reward for producing a proof term the kernel accepts is 1; the reward for any rejected term is 0. That reward function has properties learned reward models never achieve:

  • Zero annotation cost: querying the kernel requires no humans and costs only computation
  • Perfect accuracy on its domain: the kernel is never wrong about whether a proof term type-checks
  • Unbounded scalability: it can be queried arbitrarily many times on arbitrarily many generated candidates without degrading
  • Complete robustness to distribution shift: a proof term from a novel proof path no human has ever taken is checked with the same reliability as a standard lemma
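As a sketch, the entire reward function fits in a few lines. This is an illustrative Python wrapper, not any project's actual API: it assumes a `lean` executable on the PATH, and the `checker_cmd` parameter is a hypothetical hook added here for testability.

```python
import pathlib
import subprocess
import tempfile

def kernel_reward(proof_source: str,
                  checker_cmd: tuple[str, ...] = ("lean",),
                  timeout_s: float = 60.0) -> int:
    """Binary RL reward from the Lean 4 kernel: 1 if the file
    type-checks, 0 otherwise. Sketch only: assumes a `lean`
    binary on PATH; `checker_cmd` is an illustrative hook."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(proof_source)
        path = f.name
    try:
        result = subprocess.run([*checker_cmd, path],
                                capture_output=True, timeout=timeout_s)
        # Exit code 0 means every declaration passed the kernel.
        return 1 if result.returncode == 0 else 0
    finally:
        pathlib.Path(path).unlink(missing_ok=True)
```

No annotator appears anywhere in the loop: the only cost of a query is the compute to run the checker.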

For training an AI system, this combination is genuinely unusual. Most reward signals are either human judgments (accurate but expensive and finite), heuristics (cheap but approximate and gameable), or simulator outputs (domain-specific and sometimes inconsistent). The Lean kernel gives something closer to a mathematical oracle: exact, cheap, and complete within its stated domain.

What RL with a Formal Reward Produces

DeepMind’s AlphaProof demonstrated what becomes possible when you train against a formal reward at scale. The system generated proof attempts for formalized mathematical problems, submitted them to the Lean 4 kernel, and used the binary accept/reject signal as the sole reward for reinforcement learning. No human annotated which attempts were on the right track. No reward model was trained to approximate proof correctness. The kernel provided the ground truth directly.

The result was a silver-medal equivalent on IMO 2024, solving four of six competition problems, including problems that had never been formalized before the competition. AlphaProof generated proof attempts that no human had written, received kernel verdicts on them, and learned from those verdicts without any intermediate human judgment. The training scale required for this result, billions of kernel queries across months of exploration, would have been unreachable with human-labeled data at any budget a research lab could afford.
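The shape of that training loop is simple to state, even though the engineering behind it is not. A schematic sketch, emphatically not DeepMind's implementation: the policy and the kernel are stand-in callables, and the real system adds search, curricula, and distributed verification on top.

```python
def expert_iteration(problems, sample_proof, kernel_accepts,
                     attempts_per_problem=64):
    """Schematic AlphaProof-style loop. `sample_proof` stands in
    for the policy model, `kernel_accepts` for the Lean 4 kernel.
    Only kernel-accepted attempts become new training data."""
    accepted = []
    for goal in problems:
        for _ in range(attempts_per_problem):
            candidate = sample_proof(goal)
            if kernel_accepts(goal, candidate):  # binary verdict: reward 1
                accepted.append((goal, candidate))
                break                            # rejected attempts earn nothing
    return accepted  # fine-tune the policy on these, then repeat
```

The point of the sketch is what is absent: no reward model, no human ranking of attempts, no label noise in the data the policy is trained on.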

Leanstral takes a related but distinct path. Mistral’s open-source release is built on supervised fine-tuning from Mathlib4, the Lean 4 community mathematics library, which contains over 150,000 theorems with human-written proofs. Every proof in Mathlib is kernel-verified. This makes Mathlib a trusted training corpus where the labels, the tactic sequences and proof steps written by expert mathematicians, have been formally certified. No annotation pipeline is needed to verify that the training data is correct; the kernel already checked it. Supervised fine-tuning on kernel-verified data is a different category of training from fine-tuning on GitHub code, where correctness is inferred from tests that may themselves be wrong.
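A training pair from such a corpus might look like the following (an illustrative example in core Lean 4, not an actual Mathlib entry): the statement is the model's input, and the tactic proof is a label the kernel has already certified.

```lean
-- Statement (model input) paired with a tactic proof (kernel-verified label).
example (a b : Nat) : a + b + 0 = b + a := by
  rw [Nat.add_zero, Nat.add_comm]
```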

Goodhart’s Law and Type Checkers

The failure mode of RLHF reward models is well-characterized in the literature. Gao et al.’s 2023 work on scaling laws for reward model overoptimization showed that beyond a certain ratio of policy improvement to reward model training data, performance on the true objective stops improving while the reward model score continues to rise. The policy learns to satisfy the reward model rather than the underlying preference. This is reward hacking, and it is a structural property of any system where the reward function is a learned approximation of something harder to measure.

The Lean kernel cannot be overoptimized in this way. There is no gap between the reward function and the thing the reward function measures. A proof that type-checks is correct in the type theory; a proof that does not, is not. A trained policy cannot learn to fool the kernel while failing to prove theorems, because fooling the kernel and proving theorems are identical operations. This makes formal proof one of the very few domains in AI training where Goodhart’s law is structurally inapplicable to the core reward.

The specification-writing problem is where Goodhart’s law reappears, but at a different level of abstraction. A model trained on Mathlib proofs learns to generate proofs for the kinds of specifications that appear in Mathlib. If a user’s verification goal is qualitatively outside that distribution, say, a behavioral specification for a distributed protocol rather than an algebraic identity, the model may produce confidently wrong formalizations. The kernel catches incorrect proofs. No formal mechanism catches incorrectly stated goals.

What the Binary Verdict Cannot Express

The kernel’s reliability is real and bounded. It answers whether a proof term has a given type. It does not answer whether that type correctly captures what the programmer intended.

```lean
-- This specification is well-formed, and proofs against it are
-- formally correct:
def sort_correct (f : List Nat → List Nat) : Prop :=
  ∀ l, List.Sorted (· ≤ ·) (f l)

-- But it says nothing about permutation preservation.
-- A function returning [] for every input satisfies it.
-- The kernel will verify both the proof and the gap.
```

A sorting function verified against only the sortedness predicate, without the constraint that the output is a permutation of the input, is kernel-certified and wrong for any real use. Writing specifications that capture full intent requires domain expertise and careful thought. This is not a gap that better models will close. It is a fundamental property of what formal verification means, and it predates language models entirely. What changes with Leanstral is that finding the proof, once a correct specification is written, becomes substantially automated. The cognitive work of deciding what to prove remains human.
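The fix lives at the specification level, not the proof level. A sketch of a fuller specification, assuming Mathlib's `List.Sorted` and `List.Perm`:

```lean
-- A specification closer to intent: the output is sorted AND a
-- permutation of the input. The empty-list shortcut no longer works.
def sort_correct_full (f : List Nat → List Nat) : Prop :=
  ∀ l, List.Sorted (· ≤ ·) (f l) ∧ List.Perm (f l) l
```

Deciding that permutation preservation belonged in the specification at all is exactly the human judgment the kernel cannot supply.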

The Broader Implication for Trustworthy AI

The difficulty of finding reward functions that are both reliable and scalable appears across essentially every serious deployment of RL to language models. Safety-critical applications need systems that cannot be reward-hacked; they rarely have formal reward functions available. Formal verification is one of the very few domains where a reliable, scalable, formally specified reward function exists by construction, and this is part of why Lean 4 has attracted disproportionate AI research attention relative to the size of its practitioner community.

AlphaProof and LeanDojo’s ReProver both exploit kernel verdicts as training signals. What Leanstral adds is a locally deployable, open-source model that practitioners in regulated environments (aerospace, cryptographic protocol verification, safety-critical embedded systems) can run without routing proof attempts through an external API. In those contexts, local deployment is a certification requirement, not a preference. The open weights also mean the model can be fine-tuned on domain-specific proof corpora: a team verifying an seL4-style microkernel interface can adapt Leanstral to its own type definitions in a way no proprietary system permits.

The Lean 4 ecosystem, Mathlib’s formal infrastructure, and the kernel-as-oracle pattern together represent something scarce in AI development: a domain where the evaluation function is exact, the cost of evaluating is near zero, and the training data is formally certified rather than weakly labeled. The 173-comment Hacker News thread on Leanstral’s release reflects the field recognizing that combination. The benchmark numbers matter. The training infrastructure behind them matters more.
