· 7 min read ·

Measuring How Code Scales: The Case for Empirical Complexity Testing in Rust

Source: lobsters

There are two separate questions you can ask about a piece of code. The first is how fast it runs at a given input size. The second is how its runtime grows as input size increases. Benchmarking tools like criterion answer the first question very well. They run your function many times, apply statistical analysis, and give you a precise latency estimate with confidence intervals. What they do not tell you is whether your function is O(n) or O(n²). Those are different instruments measuring different things, and conflating them is how you end up with production systems that perform fine on small inputs and collapse under real load.

bigoish is a Rust crate designed to answer the second question, directly inside cargo test. The premise is simple: run your function at a geometric sequence of input sizes, record how long each run takes, and then fit the resulting data points to a set of candidate complexity classes. The class with the best fit wins, and the result is a typed value you can assert on in a test.

What the Crate Does

The API surface is small. You construct a Bench, configure the input sizes and number of samples, and pass a closure that accepts a usize representing the input size and performs whatever work you want to measure:

#[test]
fn sort_scales_correctly() {
    let complexity = bigoish::Bench::new()
        .with_sizes(&[100, 200, 400, 800, 1600, 3200])
        .with_samples(5)
        .run(|n| {
            let mut v: Vec<u64> = (0..n as u64).rev().collect();
            v.sort();
        });

    assert_eq!(complexity, bigoish::Complexity::ONLogN);
}

That test will fail if the sort implementation’s observed growth does not match the expected O(n log n) class. The Complexity enum covers the standard complexity classes:

pub enum Complexity {
    O1,     // constant
    OLogN,  // logarithmic
    ON,     // linear
    ONLogN, // linearithmic
    ON2,    // quadratic
    ON3,    // cubic
    O2N,    // exponential
}

The crate runs inside cargo test without requiring a separate --bench profile. It does not depend on criterion or any external harness. You get algorithmic regression testing wired into the same pipeline as your unit tests.

The Math Behind the Classification

The core technique is least-squares curve fitting on log-transformed data. For each candidate complexity class O(f(n)), the model is:

t(n) = c · f(n)

Taking the logarithm of both sides:

log(t) = log(c) + log(f(n))

This transforms the problem into ordinary linear regression: fit a line through the points (log(f(n_i)), log(t_i)). The slope should be 1 and the intercept is log(c). The sum of squared residuals from that line tells you how well the model fits the observed data.

For O(n log n), f(n) = n · log(n), so the transformed feature is log(n · log(n)), which is still a scalar per data point. The regression is the same structure. For O(n²), f(n) = n², and the log transform becomes log(n²) = 2 · log(n), which again fits the linear regression framework.

The complexity class with the lowest residuals wins. The approach is intentionally simple: no Bayesian inference, no iterative optimization, just fast closed-form least squares that can complete in milliseconds during a test run. The tradeoff is that you need reasonably clean data to get a reliable classification, which is why the crate takes multiple samples at each size and uses robust statistics to suppress outliers.

Prior Art: Python Got Here First

This idea is not new. The Python big-O package, installable via pip install big-O, has offered the same capability for years:

from bigO import BigO

lib = BigO()

def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        for j in range(n - i - 1):
            if arr[j] > arr[j+1]:
                arr[j], arr[j+1] = arr[j+1], arr[j]

complexity, _ = lib.test(bubble_sort, "random")
print(complexity)  # O(n^2)

The Python library uses numpy.polyfit on log-transformed (n, t) pairs, which is the same fundamental approach as bigoish. It also supports test harness mode where you assert complexity in a unittest context. The algorithmic idea has been validated; bigoish brings it to Rust with native test integration and a typed enum result instead of a string.

The Java ecosystem has no equivalent that has gained traction. JMH, the standard Java microbenchmark harness, operates at the same level as criterion: it measures one thing very precisely, but does not classify scaling behavior. Google’s Caliper project approached this territory but focused on throughput measurement rather than complexity classification. Some research tools in the performance testing literature do curve fitting, but they never made it into the standard developer workflow.

JavaScript has nothing comparable in wide use. Ad-hoc implementations using performance.now() appear occasionally in blog posts, but no library has established itself.

Why This Matters More in Rust

The appeal of complexity testing in Rust specifically has to do with what Rust code promises. Rust’s zero-cost abstractions mean that the language does not impose overhead between your algorithm and the machine. When you write an O(n log n) sort in Rust, you get O(n log n) runtime behavior; there is no GC pause, no interpreter overhead, no hidden allocation that adds a constant factor large enough to mask quadratic growth at small scales.

In Python, constant factors are so large that distinguishing O(n log n) from O(n²) at small input sizes is genuinely difficult. The interpreter overhead dominates until n is large enough that the asymptotic difference becomes visible. In Rust, the constant factors are small and predictable, which means the empirical signal at sizes like 100 to 3200 elements is cleaner.

This makes algorithmic complexity assertions more meaningful in Rust. An O(n²) test failure in a Rust test suite represents a real problem with your algorithm, not an artifact of interpreter overhead or GC behavior.

Where Criterion Leaves Off

It is worth being precise about what criterion does and does not do. Criterion applies Welch’s t-test to determine whether a performance change between two runs is statistically significant. It applies linear regression to estimate per-iteration cost at a single input size. These are both genuinely useful operations.

But criterion has no mechanism for classifying complexity across input sizes. If you run criterion benchmarks at n=100 and n=200, you get two separate benchmark numbers. Comparing them by eye and noticing that the time quadrupled while n doubled is manual work that falls outside the tool. Criterion’s regression is a model of (sample_count, total_time) to estimate cost per sample, not a model of (input_size, iteration_time) to estimate asymptotic growth.

bigoish occupies a different position in the toolchain. Use criterion when you need precise, statistically rigorous latency numbers for a fixed-size benchmark. Use bigoish when you want to assert that an algorithm’s scaling behavior matches its documented or expected complexity class.

Practical Applications

The clearest use case is algorithmic regression testing. Suppose you have a data structure with documented O(log n) lookup complexity. Over time, someone refactors the implementation, perhaps accidentally introducing an O(n) scan in a code path that runs for certain key patterns. Your unit tests pass because they check correctness, not performance. Your benchmarks still look fast because they run with small datasets. The bigoish test catches the regression:

#[test]
fn lookup_stays_logarithmic() {
    let complexity = bigoish::Bench::new()
        .with_sizes(&[64, 128, 256, 512, 1024, 2048, 4096])
        .run(|n| {
            let tree = build_tree(n);
            black_box(tree.lookup(n / 2));
        });

    assert_eq!(complexity, bigoish::Complexity::OLogN);
}

This kind of test is not a substitute for code review, but it provides a safety net that activates at the right moment: when the behavior changes, not before.

Another use case is documentation verification. When a library documents that an operation is O(1) amortized or O(n) worst case, an empirical test gives you a machine-checked claim that the documentation and the implementation agree, at least over the range of input sizes you tested.

Limitations Worth Understanding

Empirical complexity testing is inference, not proof. A test that passes does not establish that an algorithm is O(n log n); it establishes that over the tested range of input sizes, the timing data fits an O(n log n) curve better than the alternatives. The distinction matters.

For small n, cache effects dominate. An O(n²) algorithm operating on data that fits in L1 cache will appear faster than an O(n log n) algorithm with worse cache behavior. The curve fitting can misclassify if the tested range does not extend far enough for asymptotic behavior to dominate.

Machine state affects results. Running on a loaded CI machine where processes compete for CPU time introduces noise that can blur the classification. The crate’s recommendation to use RUST_TEST_THREADS=1 during complexity tests helps, but does not eliminate the problem. Dedicated CI jobs or quiet machines produce cleaner results.

The candidate set is fixed to a handful of standard classes. An algorithm with O(n · sqrt(n)) complexity, or O(n² / log n), will be classified as whichever standard class happens to fit best over the tested range, which may not accurately represent the true growth function.

These are not fatal objections. They are constraints that define the tool’s effective range. For the common case of distinguishing O(n) from O(n²) in a real-world algorithm over input sizes that exercise asymptotic behavior, the approach works well.

The Gap It Fills

Most performance testing in Rust lives at two extremes. At one end, microbenchmarks measure single operations at fixed input sizes with high precision. At the other end, load tests and integration tests exercise full system behavior with realistic workloads. Between those extremes, there is limited tooling for the structural question of whether an algorithm’s scaling behavior matches its specification.

bigoish fills that gap by making complexity assertions first-class test citizens in cargo test. The implementation is not sophisticated, and the tool does not claim to be. What it does is take an idea that the Python ecosystem has used for years, implement it in idiomatic Rust with a clean typed API, and make it trivial to add to any test suite without pulling in a benchmarking framework or changing your test runner.

For a Rust project that cares about algorithmic correctness alongside functional correctness, that is a useful addition to the standard test vocabulary.

Was this interesting?