The Missing Complexity Regression Test: How bigoish Fills a Gap Criterion Leaves Open

Criterion is excellent at catching magnitude regressions. Run a benchmark, store the baseline, and get notified when the same operation takes 20% longer next week. What it does not catch is a change in complexity class: the O(n log n) sort that becomes O(n²) after a well-intentioned refactor, the O(n) lookup that acquires a hidden O(n) scan inside a helper. These regressions are invisible to fixed-input benchmarks until the load grows large enough to surface them.

bigoish approaches this from a different angle. It measures how your function scales across a range of input sizes, fits the resulting timing data against candidate complexity models, and returns a Complexity enum that you assert on directly inside a standard #[test] function.

use bigoish::*;

#[test]
fn test_sort_complexity() {
    let complexity = measure(|n| {
        let mut v: Vec<i32> = (0..n as i32).rev().collect();
        v.sort();
    })
    .with_sizes([100, 1_000, 10_000, 100_000])
    .run();

    assert_eq!(complexity, Complexity::NLogN);
}

The test passes or fails through normal cargo test mechanics. There is no special test runner, no separate CLI, no reporting dashboard to check afterward. If a commit turns your O(n log n) function into O(n²), this test fails in CI the same way a broken unit test does.

How the Measurement Works

The statistical approach inside bigoish combines two complementary techniques.

The first is log-log linear regression. For polynomial-time algorithms, runtime follows roughly T(n) = c * n^p. Taking the logarithm of both sides gives log(T) = log(c) + p * log(n), which is a straight line in log-log space. Ordinary least squares finds the slope p, and you read off the complexity class: a slope near 1 is O(n), near 2 is O(n²), near 3 is O(n³). The R² goodness-of-fit score selects the best candidate from the set {O(1), O(log n), O(n), O(n log n), O(n²), O(n³)}.

The second technique is the doubling ratio. Compute T(2n) / T(n) for successive doubled input sizes and observe where the ratio converges:

Ratio	Complexity
~1.0	O(1)
~2.0	O(n)
~2.0+ (declining)	O(n log n)
~4.0	O(n²)
~8.0	O(n³)

O(n log n) is the tricky case. It is not a pure power law, so log-log regression can conflate it with nearby exponents. The doubling ratio converges toward 2 from above as n grows, since 2n * log(2n) / (n * log(n)) = 2 * (1 + log(2)/log(n)). At large n the correction term vanishes, but at modest n it is visible and distinguishable. Combining both techniques gives bigoish a more reliable signal than either method would provide alone.

Both techniques require geometrically spaced input sizes, not linearly spaced ones. The .with_sizes([100, 1_000, 10_000, 100_000]) API nudges you toward the right choice: each step is a 10x jump spanning four orders of magnitude. Linear spacing like [100, 200, 300, 400] covers only 4x total, which is rarely enough for asymptotic behavior to dominate over constant-factor noise.

The Prior Art

The idea of inferring complexity class from timing data is not new. Catherine McGeoch formalized empirical algorithm analysis in A Guide to Experimental Algorithmics, and the academic methodology was established through the 1990s.

The Python big-O library, available on PyPI since 2013, implements this approach cleanly. You pass a callable and a data generator and get back the best-fit complexity class:

import big_o

best, others = big_o.big_o(sorted, big_o.datagen.integers, min_n=100, max_n=100_000)
print(best)  # <NLogN: time = 0.28 * n*log(n)>

Google Benchmark in C++ added complexity estimation around 2015. You annotate a benchmark with a candidate complexity class, and the framework prints the inferred scaling alongside timing results:

BENCHMARK(BM_StringCompare)
    ->RangeMultiplier(2)
    ->Range(1<<10, 1<<18)
    ->Complexity(benchmark::oN);

The key architectural difference between bigoish and both of these predecessors is where the result lands. Python’s big_o returns a value for you to inspect; Google Benchmark prints a metric in benchmark output. Neither naturally becomes a CI pass/fail gate without additional glue code. bigoish produces a value that goes directly into assert_eq!, which means the complexity assertion lives alongside your unit tests, fails in the same way unit tests fail, and requires no tooling beyond cargo test.

Rust-Specific Engineering Concerns

Two engineering decisions in bigoish are specific to Rust.

The first is std::hint::black_box. The Rust compiler is aggressive about eliminating dead computations. If the closure passed to measure produces a value that is never observed by the outside world, the optimizer can legally elide the work entirely, producing false O(1) measurements regardless of input size. bigoish wraps timed computations with black_box to prevent this, the same technique used by Criterion. C++ benchmarks require the same discipline with benchmark::DoNotOptimize, but Rust’s optimizer is thorough enough that skipping black_box consistently produces wrong results.

The second is the closure API itself. The measure(impl FnMut(usize)) signature accepts any callable that takes an input size, with no trait objects or runtime dispatch overhead in the hot path. Rust’s ownership model also prevents a common testing mistake: reusing mutable state across iterations without resetting it. Since each closure invocation owns its data independently, the timing signal reflects actual per-size work rather than accumulated side effects from prior iterations.

Criterion and bigoish Are Complementary

Criterion tracks timing magnitude; a run that used to take 1ms and now takes 5ms triggers a regression. bigoish tracks complexity shape; a run whose timing curve shifts from c * n * log(n) to c * n² triggers a different kind of regression.

These are different failure modes. A constant-factor regression from an inefficient allocation path is Criterion’s domain. A complexity regression from accidentally introducing a nested loop or a hidden O(n) scan inside a hot path is bigoish’s domain. They are not competing tools; a well-tested performance-sensitive codebase benefits from both.

Criterion has richer statistical machinery: confidence intervals, outlier detection, warm-up periods, and comparison against stored baselines. bigoish is narrower. Its goal is a single classification, not a detailed timing profile. The two tools ask different questions about the same code, and the answers are independently useful.

Where the Approach Has Limits

The most important limitation is input distribution dependence. Empirical complexity testing characterizes behavior for the specific inputs you provide. An algorithm with O(n log n) average-case complexity may have O(n²) worst-case behavior on adversarial inputs. If your with_sizes closure always generates random data, worst-case behavior is invisible. You can test multiple input distributions in separate test functions, but the tool cannot enumerate what you have not tested.

The second limitation is range sensitivity for nearby complexity classes. O(n log n) and O(n^1.1) are nearly indistinguishable at n between 10³ and 10⁵. The log factor grows slowly enough that several orders of magnitude of input size are needed for the difference to emerge clearly. This is inherent to the measurement approach, not specific to bigoish. If your algorithm can only be tested at small n due to memory or runtime constraints, the classifier will produce noisier results.

Constant-factor masking can also fool the classifier at small input sizes. An O(n²) insertion sort with small constants can exhibit O(n log n) timing behavior in the range n = 100 to n = 1000 simply because cache effects and lower-order terms dominate. Choosing appropriate minimum sizes is part of writing a reliable complexity test, and there is no substitute for picking sizes where asymptotic behavior has room to emerge.

None of this undermines the practical value of the tool. Catching gross complexity regressions (O(n) becoming O(n²), O(n log n) becoming O(n³)) in CI is worth doing even when subtle superlinear drift remains undetectable. These are the regressions that cause production incidents at scale, and they produce unambiguous statistical signal across a wide input range.

A Useful Addition to the Rust Testing Toolkit

bigoish fills a genuine gap: complexity class is a meaningful property of an algorithm, it can change without triggering any existing Criterion regression, and catching that change in CI before it reaches production is worthwhile. The tool takes an empirical algorithm analysis methodology that has existed since the 1990s, with implementations in Python since 2013 and C++ since around 2015, and integrates it into cargo test as a first-class pass/fail assertion.

The statistical foundation is log-log regression combined with doubling-ratio analysis. The API is a closure over input size returning a Complexity enum that slots into assert_eq!. The scope is narrow by design: it catches complexity shape regressions and nothing else. For anyone maintaining performance-sensitive Rust code, the bigoish crate is worth adding alongside Criterion rather than instead of it.