How LLMs Make the Case for Property-Based Testing That Formal Methods Never Could
Source: martinfowler
Rebecca Parsons brings programming language theory to Martin Fowler’s January 2026 conversation with Unmesh Joshi, and her background in denotational semantics gives the discussion a precision that informal treatments of the same problem lack. The central concept is the what/how loop: the iterative process of specifying intent at one abstraction level and generating concrete implementation at the next level down. Parsons frames this in formal terms. The “what” corresponds to a program’s denotational semantics: what it means. The “how” corresponds to its operational semantics: how those meanings execute.
That framing points at a problem the field has been wrestling with formally since the 1960s. Formal methods researchers spent decades trying to make “what” specifications machine-verifiable before any implementation existed. They built Z notation, VDM, TLA+, Alloy, and Design by Contract. Most of these tools never reached mainstream software development. LLMs are creating economic conditions where the reasons for that failure are worth revisiting, and where property-based testing, a more pragmatic descendant of the formal methods tradition, may finally find the adoption it never achieved on its merits alone.
The Formal Methods Attempt
Z notation, developed at Oxford beginning in the late 1970s, expresses specifications as mathematical sets and relations. A specification of a user lookup function describes the pre-condition (the user ID exists in the known user set), the post-condition (the returned user record satisfies the membership relation), and the invariants (the user set remains unchanged). These constraints are complete and unambiguous. They can be reviewed for correctness independently of any implementation.
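The pre-condition, post-condition, and invariant just described can be rendered as runtime checks. A minimal sketch in Python, with illustrative names (`lookup_user`, `users`) that are not part of Z notation:

```python
# Hedged sketch: the Z-style constraints for a user lookup, rendered as
# executable checks. Names and data shapes are illustrative assumptions.

def lookup_user(users: dict, user_id: str) -> dict:
    # Pre-condition: the user ID exists in the known user set
    assert user_id in users, "pre-condition violated: unknown user ID"
    before = dict(users)  # snapshot for the invariant check

    record = users[user_id]

    # Post-condition: the returned record satisfies the membership relation
    assert record == users[user_id]
    # Invariant: the user set remains unchanged
    assert users == before
    return record
```

The difference from real Z is that these checks run only on the inputs that actually occur, whereas a Z specification can be analyzed over all possible states before any code exists.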
TLA+, Leslie Lamport’s specification language developed in the 1990s, takes a temporal logic approach. You specify system behavior as state transitions and verify that safety properties hold across all reachable states. Amazon Web Services used TLA+ to find bugs in DynamoDB’s distributed transaction logic before any implementation; the specification found errors that testing did not catch.
The adoption problem with both tools is overhead. Z notation requires mathematical fluency most software developers lack. TLA+ has a learning curve that narrows its user base to engineers with formal verification backgrounds. Both produce specifications that can be analyzed but not executed. The workflow is: write the specification, analyze the specification, write the implementation separately, then verify that the implementation matches the specification. That workflow adds a substantial layer of work whose payoff is only visible when the specification catches errors that testing would have missed.
Most software teams never ran that cost-benefit calculation in favor of adoption. The tools stayed in aerospace, defense, and financial systems where specification failures carry consequences that justify the investment.
Design by Contract as the Accessible Middle Ground
Bertrand Meyer’s Design by Contract, built into the Eiffel language in the 1980s, made a different trade-off. Pre-conditions, post-conditions, and class invariants are expressed in executable code syntax rather than mathematical notation. They are checked at runtime during development and testing, and can be compiled out for production. The specification and the implementation share the same language:
withdraw (amount: REAL)
    require
        amount > 0
        amount <= balance
    do
        balance := balance - amount
    ensure
        balance = old balance - amount
        balance >= 0
    end
Design by Contract influenced Clojure’s spec library, Python’s icontract, and Kotlin’s built-in require and check functions. But even the accessible version never became mainstream. Writing contracts for every function in an existing codebase is labor-intensive, and the benefit, finding specification violations earlier, competes with the simpler alternative of writing more tests.
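The Eiffel contract above can be emulated in plain Python with assertions standing in for require and ensure. This is an illustrative sketch, not the icontract library’s decorator API:

```python
# Hedged sketch: Eiffel-style contracts emulated with plain assertions.
# The Account class is illustrative; icontract expresses the same idea
# with decorators instead of inline asserts.

class Account:
    def __init__(self, balance: float):
        self.balance = balance

    def withdraw(self, amount: float) -> None:
        # require: pre-conditions checked before the body runs
        assert amount > 0, "pre-condition: amount > 0"
        assert amount <= self.balance, "pre-condition: amount <= balance"

        old_balance = self.balance  # Eiffel's 'old' snapshot

        self.balance -= amount

        # ensure: post-conditions checked after the body runs
        assert self.balance == old_balance - amount
        assert self.balance >= 0
```

As in Eiffel, the asserts can be stripped for production (Python’s `-O` flag removes them), so the contract costs nothing outside development and testing.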
Property-Based Testing as the Pragmatic Descendant
The tool that did gain significant traction is property-based testing. QuickCheck, developed by Koen Claessen and John Hughes for Haskell in 1999 and described in their ICFP 2000 paper, generates hundreds of random inputs to verify that a property holds across the range. You specify what should be true of a function’s outputs, and the framework tries to find inputs that falsify the claim.
import Data.List (sort)

-- What sorting should do, expressed as properties rather than examples
prop_sort_idempotent :: [Int] -> Bool
prop_sort_idempotent xs = sort (sort xs) == sort xs

prop_sort_permutation :: [Int] -> Bool
prop_sort_permutation xs = length (sort xs) == length xs
These properties are statements about what a function should do, not specific test cases for how it does it. They express invariants at the “what” level, constraints that should hold regardless of implementation details.
Hypothesis brought property-based testing to Python in a form that gained real adoption outside academia. fast-check did the same for JavaScript and TypeScript. Both provide structured shrinking: when a property fails, the framework minimizes the failing input to the simplest case that reproduces the failure, which makes debugging tractable.
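The two Haskell sort properties translate almost line for line into Hypothesis. A minimal sketch (function names are mine):

```python
from hypothesis import given, strategies as st

# The idempotence and length-preservation properties of sorting,
# restated for Python's built-in sorted(). Illustrative sketch.

@given(st.lists(st.integers()))
def test_sort_idempotent(xs):
    # Sorting an already-sorted list changes nothing
    assert sorted(sorted(xs)) == sorted(xs)

@given(st.lists(st.integers()))
def test_sort_permutation(xs):
    # Sorting preserves the number of elements
    assert len(sorted(xs)) == len(xs)
```

When either property fails, Hypothesis shrinks the generated list to the smallest counterexample before reporting it, which is what makes the failure actionable.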
The Economics Change With LLMs
Property-based testing was always valuable, but it competed against writing more unit tests, which are faster to write initially and easier to understand when they fail. The common objection from practicing developers: “I’d rather write ten unit tests than one property, because I know exactly what the unit tests are covering.”
LLMs change this calculation in a specific way. When a developer writes the implementation, there is an implicit mental model of edge cases and invariants, which surfaces through test cases. When an LLM generates the implementation, that model does not exist. The test cases generated alongside the implementation tend to reflect the same distributional assumptions as the implementation, which means they are well-positioned to catch the cases the model expected to handle and poorly positioned to catch violations of invariants that the prompt never stated.
Consider the failure mode the Fowler conversation identifies: an LLM generating a plausible-but-wrong implementation from an underspecified prompt. Suppose the prompt asked for a get_top_users function returning the n highest-activity users.
from hypothesis import given, strategies as st
from hypothesis.strategies import builds

# User and get_top_users are the generated implementation under test.

@given(st.lists(
    builds(User,
           id=st.integers(min_value=1),
           activity_score=st.integers(min_value=0, max_value=1000)),
    min_size=0, max_size=100
))
def test_top_users_properties(users):
    result = get_top_users(users, n=5)

    # What: result is bounded by both n and input size
    assert len(result) <= min(5, len(users))

    # What: every returned user came from the input
    assert all(u in users for u in result)

    # What: result is ordered by activity score, descending
    scores = [u.activity_score for u in result]
    assert scores == sorted(scores, reverse=True)

    # What: no excluded user has a higher score than the lowest included one
    if result and len(users) > len(result):
        min_returned = result[-1].activity_score
        excluded = [u for u in users if u not in result]
        assert all(u.activity_score <= min_returned for u in excluded)
This property test encodes what get_top_users should return across the full input space. It catches an implementation that sorts ascending instead of descending (the third property fails), one that silently filters out admin users the prompt never mentioned (the second property fails whenever admins appear in the input), and subtle edge cases like empty lists or tied scores that example-based tests rarely exercise.
When you regenerate the implementation with a refined prompt or a different model, the property test survives. It tests the new “how” against the specified “what” without modification. The specification outlasts any particular implementation.
This is the economic shift. Before LLMs, investing time in property tests competed with investing the same time in writing the implementation. With LLMs, implementation time approaches zero and verification time stays where it was. Writing properties that will survive repeated regeneration becomes a better investment than writing example tests tied to one implementation’s specific outputs.
The Gap Property Tests Cannot Close
Property-based testing is not a complete answer to the what/how specification problem, and the Fowler conversation’s framing around cognitive load points at why.
Properties are suited to expressing structural invariants: order, membership, size bounds, type constraints. They are less suited to semantic domain constraints, the kind Rebecca Parsons’s formal semantics background is most concerned with. A property test can verify that a function returns a non-empty list with correctly sorted scores. It cannot verify that “active customer” means what the billing domain’s bounded context requires it to mean, because that definition lives in the team’s shared understanding rather than in any formal artifact the test can access.
Architecture fitness functions, as developed by Neal Ford, Parsons, and Pat Kua in Building Evolutionary Architectures (O’Reilly, 2017), cover structural constraints: dependency direction, module coupling thresholds, layer boundary enforcement. ArchUnit for Java and Kotlin makes these executable in CI. They catch the architectural “what” violations that property tests cannot reach.
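A fitness function of this kind can be sketched in a few lines of Python rather than ArchUnit. The layer names and directory layout here are assumptions for illustration: the rule enforced is that nothing in a domain layer may import from an infrastructure layer.

```python
import ast
from pathlib import Path

# Illustrative fitness-function sketch: enforce dependency direction by
# failing when domain-layer code imports from the infrastructure layer.
# The layer name and layout are assumptions, not an ArchUnit feature.

FORBIDDEN_PREFIX = "infrastructure"

def forbidden_imports(source: str) -> list:
    """Return the imported module names that violate the dependency rule."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            violations += [a.name for a in node.names
                           if a.name.startswith(FORBIDDEN_PREFIX)]
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.startswith(FORBIDDEN_PREFIX):
                violations.append(node.module)
    return violations

def check_layer(layer_dir: str) -> list:
    """Scan every .py file under the layer; return (file, bad_import) pairs."""
    results = []
    for path in Path(layer_dir).rglob("*.py"):
        for name in forbidden_imports(path.read_text()):
            results.append((str(path), name))
    return results
```

Run `check_layer("domain")` in CI and fail the build on a non-empty result; like a property test, the rule survives any regeneration of the modules it governs.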
The practical picture is layered:
- Types and interfaces encode the structural “what” that the LLM reads before generating code
- Property-based tests encode behavioral invariants that survive implementation regeneration
- Architecture fitness functions encode system-level structural constraints that run in CI
- Domain documentation and ubiquitous language encode semantic constraints that no tool fully automates
Each layer closes a different slice of the gap between natural language intent and machine-verifiable specification. None closes it completely. Together they extend the distance the “what” specification reaches into the system, which is exactly what the Fowler conversation argues needs to happen as LLMs take over more of the “how” generation.
What the Tradition Gets Right
The formal methods tradition has been saying for sixty years that the “what” specification needs to be as precise as the “how” implementation it drives. Every practical failure the Fowler conversation discusses (underspecified prompts generating plausible-but-wrong implementations, architectural “what” information absent from the prompt context, evaluation difficulty when reviewers lack domain depth) maps onto problems that formal methods researchers had formal names for decades ago.
The tools from that tradition did not fail because the problems they addressed were not real. They failed because the overhead was too high relative to the benefit under the cost structure of hand-written development. LLMs do not change what the specification gap is. They change who pays for closing it and when, shifting cost out of implementation and into verification in a way that makes the trade-offs come out differently.
Property-based testing is not the only answer to that shift, but it is the most accessible tool from the formal specification tradition that working software teams can adopt without the mathematical prerequisites that kept Z notation and TLA+ on the shelf. The Fowler conversation points at a need; sixty years of specification research points at a set of tools worth reconsidering under changed economic conditions.