From Eiffel to Idris: The Long Arc of Specs That Run

Gabriel Gonzalez makes the argument on haskellforall.com that a sufficiently detailed specification is code. The claim is compact, but the history behind it is long, and the history is worth tracing because each tool and tradition that embodies this idea discovered a different failure mode and a different practical payoff.

The gradient runs from prose requirements through types, property tests, refinement types, formal models, and finally dependent types. These are not competing schools of thought; they are different positions on the same continuum, each suited to different stakes and different domains. Where you sit on that gradient should be a deliberate choice, not a default.

The Eiffel Starting Point

Bertrand Meyer gave the idea its most influential early form in the mid-1980s with Design by Contract in Eiffel. The language lets you attach preconditions and postconditions directly to methods:

put (x: G; key: STRING)
    require
        key_not_empty: not key.is_empty
    do
        -- implementation
    ensure
        inserted: item(key) = x
    end

The require and ensure clauses are executable specifications. In debug builds they check at runtime; external tools can analyze them statically. Meyer’s key observation was that a contract layer tight enough eventually collapses the implementation space to a single reasonable option. The spec and the code converge.

Design by Contract influenced C#‘s Code Contracts, Java’s JSR-305 annotations, and assertion idioms in most modern languages. It did not become universal, partly for tooling reasons and partly because runtime contract checking costs something, and most teams were not willing to pay it uniformly. The idea was sound; the operational friction killed adoption outside safety-critical domains.

TLA+ and the Model Checker Tradition

Leslie Lamport’s TLA+ took a different path. A TLA+ spec describes system behavior as state transitions and temporal properties:

VARIABLES count

Init == count = 0
Increment == count' = count + 1
Spec == Init /\ [][Increment]_count

This is unambiguously a specification. It runs through TLC, which exhaustively explores the reachable state space and checks temporal properties. AWS has publicly documented using TLA+ to find bugs in distributed protocols that code review and testing both missed. The spec is not documentation; it is a checked artifact.

Alloy works similarly for structural properties. You write a model of your system’s state space, and the Alloy analyzer checks invariants over it. Neither TLA+ nor Alloy outputs production code directly. The gap between spec and implementation is real and intentional: the abstraction is the point. You can reason about “no two nodes simultaneously believe they are the leader” without implementation details getting in the way. The cost is that translating the spec into working code remains a manual, error-prone step.

Haskell Types as Lightweight Specs

Haskell narrows the spec-to-code distance by making types more expressive than in most production languages. A function signature is already a partial specification:

sort :: Ord a => [a] -> [a]

This rules out a large class of incorrect implementations. The function must be polymorphic over any ordered type, must accept a list, and must return a list of the same element type. It cannot insert new elements, cannot change the element type, and the Ord constraint gives it access to comparison. The type is a partial specification, and the compiler enforces it mechanically.

You can encode more structure in the types. A function typed:

insert :: Ord a => a -> Set a -> Set a

communicates that insertion is defined over a set, returns a set, and requires the element type to be comparable. The Set type from Data.Set additionally guarantees uniqueness by construction, so the type itself encodes the invariant that sets contain no duplicates. You did not have to write that invariant down; the type carries it.

QuickCheck: Specs That Test

QuickCheck, published by Claessen and Hughes in 2000, is the practical midpoint on the gradient for most working developers. Instead of writing tests against specific inputs, you write properties that any correct implementation must satisfy:

import Test.QuickCheck
import Data.List (sort, (\\))

-- sorting twice is the same as sorting once
prop_sortIdempotent :: [Int] -> Bool
prop_sortIdempotent xs = sort (sort xs) == sort xs

-- sorted output is non-decreasing
prop_sortOrdered :: [Int] -> Bool
prop_sortOrdered xs =
  let ys = sort xs
  in and (zipWith (<=) ys (drop 1 ys))

-- sort preserves all elements
prop_sortPermutation :: [Int] -> Bool
prop_sortPermutation xs = sort xs `sameElements` xs
  where
    sameElements a b = null (a \\ b) && null (b \\ a)

These are executable specifications. QuickCheck generates hundreds of random inputs and checks each property mechanically. The spec is no longer a document in a wiki; it is a test suite that rejects incorrect implementations. If you pass those three properties, you have stated almost everything important about what sort must do, without specifying how.

The insight is that unit tests describe behavior at specific points; properties describe behavior universally. A unit test that says sort [3,1,2] == [1,2,3] is a single data point. prop_sortOrdered is a claim about all lists. The two are different things, and the property carries far more information about the specification.

QuickCheck has spawned equivalents across every major language: Hypothesis for Python, fast-check for TypeScript, PropEr for Erlang, jqwik for Java. The pattern has spread because it works, and it works because properties encode spec-level invariants rather than implementation-specific examples.

Dependent Types: The Logical Endpoint

The most direct version of the spec-as-code thesis comes from dependent type systems, where types can depend on values. In Idris 2, you can encode the non-emptiness invariant directly in the type:

data Vect : Nat -> Type -> Type where
    Nil  : Vect 0 a
    (::) : a -> Vect n a -> Vect (S n) a

head : Vect (S n) a -> a
head (x :: _) = x

The type Vect (S n) a is a vector with at least one element, where S n is the successor of some natural number n. Calling head on an empty vector is a type error; the program does not compile. The specification that head requires a non-empty input is encoded in the type itself, with no runtime check, no assertion, no guard clause. The compiler enforces the contract during type checking.

You can push further. A correctly specified sort in a dependently typed language carries its correctness in the return type:

sort : {A : Set} -> (A -> A -> Bool) -> List A
     -> Σ (List A) (λ ys -> IsSorted ys × IsPermutationOf ys)

This Agda type signature says: given a comparison function and a list, return a list ys together with a proof that ys is sorted and a proof that ys is a permutation of the input. Any implementation satisfying this type is correct by construction. The specification is the type, and the compiler is the specification checker.

The Curry-Howard correspondence is the theoretical backbone here. Propositions in logic correspond to types; proofs correspond to programs. A type annotation is a logical proposition; a well-typed program is a proof of that proposition. When you write a type, you state a theorem. When you write a function satisfying that type, you prove the theorem. Under this correspondence, specification and implementation are the same mathematical object at different levels of precision.

Lean 4 takes this to its operational conclusion as both a proof assistant and a programming language. The Mathlib library contains formally verified mathematics where the spec and the proof occupy the same source file. Coq (now officially Rocq) has been used to verify compilers, operating system kernels, and cryptographic protocols. The distance between specification and implementation in these systems is not zero by accident; it is zero by design.

Dhall as a Worked Example

Gonzalez’s own work provides a concrete, production-grade case. The Dhall configuration language, which he created, takes the position that configuration files should be typed and total: no imports from arbitrary URLs, no Turing-complete scripting, just expressions that normalize to a predictable value. The type system is the specification for what a valid configuration looks like, and the interpreter enforces it.

A Dhall type for a server configuration:

let Config =
  { host    : Text
  , port    : Natural
  , tls     : Bool
  , workers : Natural
  }

If a Dhall file typechecks, you know a great deal about its structure before any code that consumes it runs. The type is the specification, and the typechecker is the specification verifier. This is the thesis in product form: Gonzalez built a language where the spec-to-code collapse is the central design principle, not a theoretical possibility.

The LLM Era Changes the Stakes

All of this has a contemporary dimension that was not available to Meyer or Lamport. When you write code yourself, a vague spec is absorbed by programmer judgment. You know your system; you fill in the gaps from context and experience. When you are directing an LLM to generate code, a vague spec produces plausible-looking output that satisfies the letter of the prompt while silently missing the intent.

This is the same gradient under a new load. If your spec is precise enough to be unambiguous to a language model, it is probably precise enough to be automatically checked. A function signature plus documented invariants plus a few QuickCheck properties is a dramatically better spec than a comment saying “sort the list and return only active items.” The precision you were previously deferring to programmer judgment now needs to be made explicit for the tool to produce correct output.

The strongest version of this workflow is already in use: write a TLA+ or Lean specification, use it to drive test generation, use the tests to verify LLM-generated implementations against the spec. The formal spec becomes the ground truth. The model generates candidates; the spec rejects incorrect ones. At that point you have used the model as a compiler from informal description to code, with the formal spec doing the work that a type system or model checker would do in a more traditional setting.

The LLM context makes the cost of imprecision visible in a way that writing code yourself did not. An experienced developer who writes a vague spec produces code that mostly works because they carry the missing precision in their head. A model generating from the same vague spec produces code that compiles and runs and silently does the wrong thing. The feedback loop is faster and more painful, which turns out to be useful.

Where the Gradient Flattens

The thesis has real limits worth naming.

Performance is one. A spec that says “return the shortest path between two nodes” is precise about correctness but says nothing about whether to use Dijkstra, A*, or Bellman-Ford, or how to handle graphs with millions of nodes. Correctness and efficiency are orthogonal specifications, and the efficiency specification is considerably harder to write formally. Liquid Haskell and Dafny can encode complexity bounds in some cases, but this remains research territory for most practical applications.

Concurrency is another. The Go memory model and the C++ memory model are both precise and published specifications. Writing code that correctly implements them is still genuinely difficult, partly because the specifications describe guarantees the hardware provides under conditions that are hard to reason about compositionally. TLA+ handles this better than most tools, which is exactly why AWS uses it for distributed protocol design.

There are also requirements that resist formalization: “the interface should feel responsive,” “the error message should be helpful.” Converting these into formal specs requires making choices about what “responsive” and “helpful” mean, choices that embed assumptions about users and contexts. Some of those choices can be stated explicitly; others remain irreducibly informal.

The Practical Upshot

The useful takeaway from Gonzalez’s argument is not that everyday software needs proof assistants. It is that the gap between specification and implementation is a gradient, and you can move along it deliberately. The tools span a wide range of cost and rigor: a TypeScript interface, a Zod schema, a QuickCheck property suite, a Liquid Haskell annotation, a TLA+ model, a Lean proof. Each step up that gradient makes more of your specification machine-checkable and less dependent on programmer attention at review time.

When you write a comment describing what a function should do, you can ask whether that comment could be a type annotation, a property test, or a precondition instead. Not every informal description translates cleanly, but more do than most codebases reflect. Each conversion moves the spec closer to the code and leaves something that can be checked mechanically rather than re-read carefully at each change.

The threshold where spec becomes code is not fixed; it depends on the tools available. Better type systems, better property testing frameworks, and better verification tooling all lower it. The argument has been available since Eiffel; the tools to act on it have been getting steadily better. The gradient has always been there. What changes is how far up it you can afford to go, and with LLM-generated code, how much it costs when you stop too early.