From Scenarios to Specifications: What Property-Based Testing Demands in Practice

A real-world case study from Adolfo Ochagavia recently made the rounds, showing property-based verification applied to an actual production problem rather than a tutorial toy. That framing, “verification” rather than “testing,” is worth sitting with. The distinction is doing real work, and understanding it helps explain why property-based testing finds bugs that years of unit tests do not.

The Fundamental Difference

Unit tests are scenarios. You construct a specific input, run the code, and assert a specific output. This is valuable, but it encodes your assumptions about what cases matter. The inputs you choose reflect what you thought to think of. Bugs live in the cases you did not.

Property-based testing inverts the process. Instead of writing assert encode(decode("hello")) == "hello", you write: for all strings s, decode(encode(s)) == s. The framework then generates hundreds or thousands of values of s and checks whether the property holds for each one. When it fails, you get a counterexample, and then the framework tries to shrink that counterexample into the smallest possible input that still triggers the failure.

The shift from scenario to specification sounds modest. In practice it is a significant change in how you think about correctness. You have to articulate what your code promises to do for all inputs of a given shape, not just the inputs you happened to verify by hand.

Useful Properties Are Not Hard to Find

A common objection to property-based testing is that coming up with useful properties requires knowing too much about the implementation. This is mostly unfounded. Several structural patterns cover the majority of interesting cases.

The roundtrip property is the most broadly applicable: encode followed by decode should return the original value, and vice versa. Any serialization format, compression algorithm, or data transformation pipeline has this property. A codec that passes a hundred hand-written unit tests can still violate roundtrip on inputs containing specific byte sequences, particular Unicode code points, or values at the boundary of length-prefixed fields.

The oracle property compares a fast production implementation against a slow reference implementation. If you have a SIMD-optimized sort and a naive reference sort, they must agree on every input. The oracle approach has found bugs in BLAS libraries, sorting implementations, and numeric routines that no unit test suite caught, because the bugs were specific to certain value distributions.

The idempotence property covers any normalization function: applying it twice should produce the same result as applying it once. Canonicalization, URL normalization, whitespace trimming, and schema migration functions all carry this obligation.

Model-based or stateful properties are the most powerful form, where a sequence of operations against a real implementation is compared against an abstract model of what those operations should do. John Hughes and his team at Quviq used this approach to find over one hundred bugs in Ericsson’s distributed telephony software in two days, bugs that years of system testing had not surfaced. The same approach found subtle consistency violations in the Riak KV store under specific operation sequences that no integration test had exercised.

Shrinking Is the Feature

Random testing without shrinking is marginally useful. The counterexamples it produces are typically large, complex, and opaque. A 50-field struct with nested collections that triggers a failure tells you very little. Shrinking is what makes property-based testing a practical engineering tool.

Koen Claessen and John Hughes built shrinking into QuickCheck when they published it at ICFP 2000, and it remains the most important idea in the paradigm. The mechanism is this: once a failing input is found, the framework systematically generates “smaller” variations of that input and checks whether each one still fails. For integers, smaller means closer to zero. For lists, it means fewer elements. For structs, it means shrinking each field independently. The framework keeps the smallest failing input it has found and continues shrinking until no further reduction is possible.

Classic QuickCheck-style shrinking requires each type to implement a shrink function that enumerates smaller candidates. This puts burden on the programmer: you must write shrink logic for every custom type, and composed shrink logic can be incomplete.

Hypothesis, David MacIver’s Python library, resolved this with integrated shrinking. Instead of shrinking the generated value, Hypothesis records all random choices as a bytestream. Shrinking operates on the bytestream itself, and because the generator is deterministic over the bytestream, any valid simpler bytestream produces a simpler value automatically. No per-type shrink logic is required. The proptest crate in Rust uses a similar strategy-tree approach to achieve the same result.

The practical consequence is that when proptest finds a failure on a complex nested input, it returns the minimal input that still fails. A bug that manifests on a 300-element vector gets shrunk to a two-element vector. A failing string of arbitrary Unicode gets shrunk to the single character that triggers the problem. Debugging becomes tractable.

use proptest::prelude::*;

proptest! {
    #[test]
    fn roundtrip_holds(input in prop::collection::vec(any::<u8>(), 0..1024)) {
        let encoded = encode(&input);
        let decoded = decode(&encoded).expect("decode should not fail");
        prop_assert_eq!(input, decoded);
    }
}

This test, as written, will run 256 times by default with varying input lengths and contents. If it fails, proptest saves the minimal failing case to a proptest-regressions/ directory so it replays on every future run, even when the random seed changes. The regression file is committed to version control and becomes a permanent part of the test suite.

The Ecosystem and Its Lineage

QuickCheck’s influence is visible across nearly every language ecosystem. Hypothesis in Python, ScalaCheck, FsCheck for F#, PropEr for Erlang, fast-check for TypeScript, jqwik for Java, gopter and rapid for Go, and proptest plus quickcheck for Rust all trace their design to the 2000 paper. The ideas have proven durable because they address a real structural problem in how software gets tested.

The differences between implementations are instructive. Hypothesis’s byte-stream model makes integrated shrinking universal and removes the need to write shrink implementations. PropEr’s proper_statem module makes stateful model testing a first-class feature, which is particularly important for testing protocols and concurrent systems. The proptest crate’s persistence of regressions addresses a real pain point: random tests that fail on one run but not the next are difficult to debug and tend to get ignored.

Quviq’s commercial Erlang QuickCheck, built by Hughes after the original research, added automatic race condition detection and used generated operation sequences to find bugs in distributed systems that no other testing approach had found. The talk Hughes gave on this work remains one of the clearest arguments for the approach.

Verification or Testing

The word “verification” in the framing of property-based testing needs careful handling. Property-based testing does not prove that a property holds for all inputs. It provides probabilistic evidence that no counterexample exists in the space explored. With the default 256 or 1000 test cases, you are sampling the input space, not exhausting it.

True formal verification, using tools like Coq, Lean, Isabelle, or TLA+, produces a mathematical proof that a property holds universally. That guarantee is substantially stronger but carries a proportionally higher cost in expertise and time.

Property-based testing occupies the middle position. It is far more thorough than hand-written examples and requires no formal methods expertise. The confidence it provides is probabilistic but calibrated: a property that survives ten thousand generated inputs without failure is much more likely to hold than one that survives ten hand-picked cases. For production software where formal verification is not economical, this middle position is often the right trade-off.

Ochagavia’s case study illustrates what this looks like in a real codebase: an actual correctness invariant, specified as a property, verified against generated inputs, with shrinking producing actionable counterexamples when failures occur. The gap between knowing about property-based testing and applying it to real problems is mostly one of practice. The tools are mature, the patterns are learnable, and the failure modes that get found are the ones that matter.