Properties as Specifications: The Verification Mindset in Property-Based Testing
Source: lobsters
Adolfo Ochagavía’s recent post on property-based verification uses a specific word deliberately. Not testing. Verification. The choice is not accidental, and unpacking why it matters reveals something important about how to think about this family of tools and when they genuinely earn their keep.
Property-based testing and property-based verification use exactly the same frameworks. The difference is intent, permanence, and what the properties are treated as. In the testing mindset, you write a property to find bugs; once you find and fix them, the property is a regression guard. In the verification mindset, the property is the specification, and the implementation exists to satisfy it. The tests are not there to guard a known-correct implementation. They are the authoritative statement of what correctness means for the system.
Where the Tools Came From
Koen Claessen and John Hughes introduced QuickCheck at ICFP 2000 as a lightweight tool for random testing of Haskell programs. The core idea was elegant: write a predicate over your data, let a generator produce random inputs, and let the framework find counterexamples. The key technical contribution was shrinking: once a counterexample is found, the framework automatically reduces it to the smallest failing case, making the bug dramatically easier to debug.
The properties themselves are ordinary Haskell functions returning Bool:
prop_reverseReverse :: [Int] -> Bool
prop_reverseReverse xs = reverse (reverse xs) == xs
prop_sortLength :: [Int] -> Bool
prop_sortLength xs = length (sort xs) == length xs
Neither of these fully specifies the behavior of reverse or sort. They constrain it. Together with more properties, you can narrow down the specification considerably, but you cannot pin it completely without dependent types. This is where PBT sits on the precision spectrum: it is far more machine-checkable than comments or documentation, but it falls well short of formal proof.
The concept spread across every language ecosystem over the following two decades. Hypothesis brought it to Python with a richer strategy model and integrated shrinking. fast-check brought it to TypeScript. jqwik to Java. In Rust, the proptest crate provides composable strategy combinators that integrate cleanly with Cargo’s test runner:
use proptest::prelude::*;
proptest! {
#[test]
fn sort_is_idempotent(mut xs: Vec<i32>) {
let first_pass = { let mut v = xs.clone(); v.sort(); v };
let second_pass = { let mut v = first_pass.clone(); v.sort(); v };
prop_assert_eq!(first_pass, second_pass);
}
#[test]
fn sort_preserves_length(xs: Vec<i32>) {
let mut sorted = xs.clone();
sorted.sort();
prop_assert_eq!(xs.len(), sorted.len());
}
}
The ergonomics differ across languages, but the underlying machinery is the same everywhere: generators, shrinking, and falsifiable predicates.
The Verification Claim
Ochagavía’s framing is specific. He is not using properties to find bugs during development and then moving on. He is encoding what the system must do as properties and treating those properties as permanent infrastructure that gates all future changes. This is closer to how formal methods researchers use the word “verify” than how most engineers use “test.”
The practical consequences are significant. When a property fails, it does not mean “a bug was found.” It means “the system no longer satisfies its specification.” That reframing changes how you respond. You cannot argue that the property is overconstrained or that the failing input is unrealistic. If the property encodes what the system is supposed to do, and the system fails it, the system is wrong.
This also changes what properties you write. In the testing mindset, properties tend to be narrow and implementation-adjacent: “does this function return the same result as last time?” In the verification mindset, properties encode behavioral invariants: membership preservation, ordering guarantees, conservation laws, commutativity, idempotence, round-trip conditions. These are the same properties that would appear in a formal specification written in Z notation or TLA+, just expressed as executable predicates instead of mathematical formulae.
The Techniques That Make It Practical
Shrinking is the feature that makes PBT viable in production code. Without it, a counterexample might be a list of 400 integers with a subtle structural property that triggers the bug. With shrinking, QuickCheck or Hypothesis reduces that to a list of two integers that demonstrates the same failure. The reduction is automatic. Hypothesis’s integrated shrinking, where the strategy itself shrinks rather than the generated values being shrunk afterward, produces particularly clean counterexamples for complex structured inputs.
Oracle testing is another pattern worth naming. When you have a simpler reference implementation alongside the production one, you can write a property that runs both on the same generated input and asserts their agreement:
from hypothesis import given, strategies as st
@given(st.lists(st.integers()))
def test_sort_agrees_with_reference(xs):
assert my_sort(xs) == sorted(xs)
This sidesteps the challenge of specifying the exact output. You cannot always say what the output of a complex function must be in closed form, but you can compare it to a trusted simpler version. This pattern is common in cryptographic implementations and compiler backends.
Metamorphic testing covers cases where neither the exact output nor a reference implementation is available. You specify relationships between outputs for related inputs: if you sort a list and then add a larger element and sort again, the prefix of length n should be unchanged. If you encrypt and then decrypt, you should recover the original. These relational properties are often much easier to state than direct correctness conditions.
Where PBT Cannot Reach
The honest accounting is that property-based verification, however disciplined, is not proof. It is falsification. You can run a million inputs and find no counterexample; you have not proven the property holds for all inputs, only that it held for all the inputs you tried.
This is where the spectrum continues past QuickCheck. LiquidHaskell uses the Z3 SMT solver to verify refinement types at compile time, within the decidable fragment of quantifier-free linear arithmetic. Dafny lets you annotate imperative programs with preconditions, postconditions, and loop invariants that Z3 discharges statically. Lean 4 goes all the way: types are propositions, programs are proofs, and the type checker verifies that your implementation constitutes a valid proof of its specification.
The gap between these tools and property-based testing is not that the tools are better. It is that they are vastly more expensive to use. A Lean 4 proof of a sort algorithm’s correctness is possible; it also requires a level of mathematical sophistication and engineering time that most production codebases cannot justify. CompCert, the formally verified C compiler developed by Xavier Leroy, took years of research effort. It is genuinely correct; it is also not maintained by a startup team.
Property-based verification occupies the productive middle ground. You invest in precise behavioral specifications expressed as predicates, you get counterexample-finding and shrinking for free, and you get persistent machine-executable documentation of what the system must do. The probability that a subtle bug survives a well-designed property suite with millions of test runs is low, even if it is not zero.
The Mindset Is the Discipline
The word choice in Ochagavía’s title is doing real work. The tools for property-based testing have been available since 2000, in increasingly polished form, across virtually every language. The thing that has been scarce is the discipline to treat properties as specifications rather than test generation shortcuts.
That discipline means writing properties before or alongside the implementation, not after. It means keeping properties in the codebase permanently, not deleting them once the immediate bug is fixed. It means designing generators that cover the domain thoroughly, including edge cases, boundary values, and structurally complex inputs. And it means reading a failing property as a specification violation rather than as a test failure.
None of this requires new tools. It requires a shift in how you think about what you are writing and why it should live alongside your code forever.