· 7 min read ·

Empirical Complexity Testing in Rust: What bigoish Does That Benchmarks Cannot

Source: lobsters

There is a persistent gap in most performance testing workflows. On one side you have tools like Criterion and Divan, which tell you with statistical confidence that a particular operation takes 47 microseconds at a fixed input size. On the other side you have theoretical analysis, which tells you a sorting algorithm is O(n log n). What neither gives you is an automated answer to the question that matters most during development: as my input grows, is this code actually behaving the way I think it is?

bigoish addresses that gap. It is a Rust crate for empirical complexity inference, which is a fancy way of saying it runs your function at several input sizes, measures the time at each, and fits those measurements to a set of known growth functions to determine which complexity class best describes the observed behavior.

How the Inference Works

The mathematical approach is not novel, but it is sound. For each candidate complexity class, the crate performs a least-squares regression between the measured timings and the corresponding growth function evaluated at each input size. The fit quality for each class is measured with the coefficient of determination, R², where a value closer to 1.0 indicates better fit.

Working in log space makes this tractable. For a power-law relationship like T(n) = a * n^k, taking logarithms gives log T = log a + k * log n, which is linear in log n. That means standard linear regression over log-transformed (n, T(n)) pairs cleanly separates polynomial complexity classes by their slope. An O(n) function has slope 1 in log-log space; O(n²) has slope 2; O(log n) has a slope that declines as n grows.

The complexity classes the crate evaluates are the standard suspects: O(1), O(log n), O(n), O(n log n), O(n²), O(n³). After fitting all candidates, the class with the highest R² above a threshold is reported.

The name telegraphs the epistemic status of that report. “Bigoish” is Big-O-ish: the result is an empirical approximation, not a proof. That hedging is honest and important, for reasons covered below.

The API Design

The core function takes a closure with a specific two-level structure:

use bigoish::complexity;

let c = complexity(|n| {
    // Setup: runs outside the timer
    let mut v: Vec<i32> = (0..n as i32).rev().collect();
    Box::new(move || {
        // This closure is what gets timed
        v.sort();
    })
});

println!("{:?}", c); // Complexity::Linearithmic

The outer function receives the input size n and returns a closure. Only the inner closure is timed. This separation is the central design decision. Allocating a Vec, generating random data, or constructing a tree takes time that is itself a function of n. If that setup were included in the measurement, the fitter would be analyzing allocation cost rather than algorithm cost. By separating them explicitly in the type signature, the API makes it structurally impossible to accidentally include setup time in your measurements.

This is the same insight that Criterion communicates through its b.iter(|| ...) pattern, but here it is load-bearing: the outer closure needs to run once per input size to generate the data, and the inner closure runs once (or a small number of times) for the timing measurement.

What This Is Not

Criterion and Divan are precision instruments for a fixed point in a function’s domain. They answer: “how long does sorting 10,000 integers take, and how confident are we in that number?” Criterion’s bootstrap sampling and Welch t-test machinery is designed to give you tight, statistically defensible measurements at that one size.

bigoish sacrifices that per-size precision entirely. It runs each size fewer times in order to cover a wide range of sizes. The precision loss is deliberate. You are not trying to know whether sorting 10,000 integers takes 47.2 or 47.4 microseconds; you are trying to know whether the time grows as n log n or n² as you move from 1,000 to 1,000,000.

The two tools are complementary. Run Criterion or Divan to benchmark your hot path. Run bigoish to verify that the hot path is not secretly quadratic.

Prior Art in Other Languages

The technique is well-established outside Rust. Python’s big-O library has been available since around 2011 and works similarly, fitting observed timings to candidate growth functions using NumPy’s least-squares solver. Its API requires a separate data generator function:

import big_o

best, others = big_o.big_o(
    lambda n: sorted(list(range(n, 0, -1))),
    big_o.datagen.n_,
    min_n=100,
    max_n=100_000,
    n_measures=20,
)
print(best)  # Linearithmic: time = -0.00026 + 2.4E-06*n*log(n)

The Python library returns the full fitted equation, not just the class, which is useful for comparing constant factors across implementations. bigoish currently reports the class; whether it surfaces the fitted coefficients depends on the API version.

In other ecosystems, the technique exists mainly as a folk practice. Go developers running testing.B benchmarks will sometimes write separate benchmark functions for n=100, n=1000, n=10000, observe the ratios manually, and reason about the class from the doubling relationship: if time roughly quadruples each time n doubles, the algorithm is O(n²). This works, but it is manual, non-reproducible, and easy to get wrong. JavaScript has no widely-used equivalent library; most JS performance analysis stops at Benchmark.js-style microsecond measurements.

The academic lineage goes back further. The Sofya tool (University of Maryland, early 2000s) automated empirical complexity analysis for Java programs with enough sophistication to handle multi-dimensional complexity like O(n * m). Most of that work stayed in research settings. bigoish brings the core idea into idiomatic Rust with minimal ceremony.

Why It Is “Ish”

The crate’s name is an honest disclosure of several hard limits.

Input distribution matters. A naive quicksort is O(n log n) on random input and O(n²) on sorted input. If your test function generates random data, you measure average-case complexity. The crate cannot tell you what complexity class your algorithm falls into for adversarial inputs unless you write a separate test with adversarial inputs. Every empirical complexity result is implicitly conditioned on the input distribution you chose.

Cache effects create false phase transitions. For small n, the working set fits in L1 or L2 cache and operations are fast. At some threshold, cache misses begin to dominate, and the time-per-operation jumps. In a log-log plot this looks like a kink that can shift the apparent complexity class depending on where in the size range the kink falls. An O(n) algorithm with a large constant can look quadratic over a narrow size range that spans a cache boundary.

Amortized complexity is not worst-case complexity. Pushing to a Vec is O(1) amortized but O(n) on occasional reallocation. If your timed closure runs many push operations, amortization averages out and you see O(n) total time. If it runs one operation, you might catch a reallocation. Empirical tests implicitly average over whatever cases your input size generates.

The model set is finite. If your algorithm is O(n^1.5) or O(n log² n), those are not in the candidate set. The fitter will select whichever standard class minimizes residuals, which might be O(n log n) or O(n²), both of which would be technically wrong. The confidence score should reflect this (a lower R² indicates a poor fit), but interpreting a poor fit requires understanding why the fit was poor.

None of these limitations are unique to bigoish; they apply to all empirical complexity analysis. The name just refuses to pretend they do not exist.

A Concrete Use Case: Complexity Regression Testing

The most compelling reason to add this to a CI pipeline is catching accidental complexity degradation. Consider a lookup table that starts as a HashMap with genuine O(1) expected lookup. A later refactor introduces a subtle bug where the hash function returns a constant, causing all keys to land in the same bucket and turning lookups into O(n) linear scans. Criterion will not catch this unless you happen to benchmark at a size where the slowdown is already obvious. bigoish will catch it: O(1) will have a high R² at one size but terrible fit across a range, while O(n) will fit cleanly.

#[test]
fn lookup_complexity_is_constant() {
    let complexity = bigoish::complexity(|n| {
        let map: HashMap<usize, usize> = (0..n).map(|i| (i, i)).collect();
        let key = n / 2;
        Box::new(move || {
            std::hint::black_box(map.get(&key));
        })
    });
    assert_eq!(complexity, bigoish::Complexity::Constant);
}

This test passes for a working hash map and fails the moment lookup degrades to linear. That is a regression test with algorithmic semantics rather than just timing semantics, and it is more robust to machine speed differences across CI environments than raw microsecond thresholds.

Where It Fits

The Rust benchmarking ecosystem is well-stocked at the absolute-performance end. Criterion handles statistical precision. Divan handles ergonomics and fast compile times. iai-callgrind handles deterministic instruction-count measurement for CI environments where timing is noisy. bigoish occupies a different position entirely: it does not care how fast your code is, only how its speed scales. That makes it more of a correctness tool than a performance tool, which is perhaps why the community has been slow to develop it. Correctness and performance are both worth testing; the latter just tends to get more attention.

For any data structure or algorithm where the complexity class is part of the contract, bigoish offers a way to turn that contract into a test that runs.

Was this interesting?