Putting Your Algorithms on Trial: Empirical Complexity Testing in Rust

Formal algorithmic complexity analysis gives you a proof. Empirical complexity testing gives you evidence. They answer different questions, and Rust’s testing ecosystem has historically been strong on benchmarking raw performance while leaving the second question largely unaddressed. bigoish is a crate that fills that gap, letting you write tests that assert an algorithm runs in O(n log n) and fail if it drifts toward O(n²).

The distinction between benchmarking and complexity testing is worth being precise about. A benchmark tells you that your sort function processes 10,000 integers in 1.2 milliseconds on your machine today. A complexity test tells you that runtime scales roughly as n log n across a range of input sizes, regardless of the absolute numbers. Both are useful. They catch different regressions.

The Math Underneath

The core technique in empirical complexity testing is curve fitting on a log-log scale. If a function runs in O(n^k) time, then log(t) = k·log(n) + c, meaning the log of runtime should be linear in the log of input size. You run the function at several input sizes, collect (n, t) pairs, take logarithms, and fit a line. The slope approximates the exponent.

For the full set of common complexity classes, you try each candidate model and pick the best fit:

O(1): runtime is flat across input sizes
O(log n): runtime grows proportional to log(n)
O(n): linear slope of 1 on a log-log plot
O(n log n): slightly above linear, distinguishable from O(n) if your input range is wide enough
O(n²): slope of 2 on a log-log plot
O(n³): slope of 3
O(2^n): exponential, identifiable by extremely rapid growth

The fit is measured by residual error, and the complexity class with the lowest residual wins. bigoish does this fitting for you and exposes the result as something you can assert_eq! against in a test.

A typical usage looks like:

use bigoish::{measure, Complexity};

#[test]
fn sort_is_nlogn() {
    let complexity = measure(|n| {
        let mut v: Vec<u32> = (0..n as u32).rev().collect();
        move || v.sort()
    });
    assert_eq!(complexity, Complexity::NLogN);
}

The closure structure is deliberate: the outer closure receives n and sets up the input, and the inner closure is the actual work being timed. This separation is critical. You want to measure only the algorithm, not the data generation. bigoish times only the inner closure, across a range of n values it controls.

Why Criterion Does Not Do This

Criterion is the standard Rust benchmarking library, and it has a throughput API that lets you associate input size with a benchmark. You can run the same benchmark at different sizes and inspect the plots it generates. But Criterion stops there. It produces graphs; it does not classify. Interpreting whether the curve looks like O(n) or O(n log n) is left to the developer reading the output.

This is a reasonable design choice for a benchmarking tool. Criterion’s goal is precision in wall-clock measurement, and classification requires making statistical judgments about the shape of a curve that could be misleading if the measurement noise is high. Criterion surfaces data and lets humans interpret it.

bigoish makes the opposite tradeoff. It accepts some imprecision in exchange for a machine-readable classification that can live in your test suite. The name itself acknowledges this: it is Big-O-ish. The measurement will not catch the difference between O(n) and O(n·α(n)) where α is the inverse Ackermann function. It will catch an accidental O(n²) nested loop introduced by a refactor.

Prior Art Across Languages

This pattern exists in several other ecosystems. The Python big-O library has been around since the early 2010s. Its interface is a function from int to a callable, which maps almost directly to bigoish’s outer/inner closure pattern:

import bigO

def my_sort(arr):
    arr.sort()
    return arr

lib = bigO.BigO()
complexity, _ = lib.test(my_sort, "random")
print(complexity)  # O(n log n)

The Java ecosystem has big-o-test, which integrates with JUnit and provides assertion-style complexity checks:

BigOAssert.assertLinearOrBetter("myAlgorithm", n -> {
    // set up and run with n elements
});

The JavaScript ecosystem has multiple packages with varying levels of maintenance, though empirical complexity testing is less culturally embedded in JS test suites than in Java or Python ones.

What makes bigoish interesting relative to these is that Rust’s type system and closure semantics make the input-generation/measurement separation cleaner than in most other languages. The move semantics on the inner closure guarantee the timing harness is measuring only the intended work, with no shared mutable state leaking execution time across iterations.

The Regression Testing Use Case

The most practical application for a tool like this is catching algorithmic regressions in CI. Consider a data structure that starts with a linear scan for lookups and is later refactored to use a hash map. A benchmark might still pass because the constant factor improvements elsewhere mask the complexity improvement. But a complexity test makes the contract explicit:

#[test]
fn lookup_is_constant_time() {
    let complexity = measure(|n| {
        let map: HashMap<u64, u64> = (0..n as u64).map(|i| (i, i * 2)).collect();
        move || {
            let _ = map.get(&(n as u64 / 2));
        }
    });
    assert_eq!(complexity, Complexity::Constant);
}

If someone later replaces the HashMap with a Vec and a linear scan, this test fails, and the failure message is expected Constant, got Linear rather than a mysterious slowdown in production under load.

This also works as living documentation. A test that asserts Complexity::NLogN tells future maintainers something about the intended contract of the function that neither a doc comment nor a benchmark communicates quite as directly.

Where Empirical Testing Falls Short

Empirical complexity measurement has real limitations, and a tool that pretended otherwise would be misleading.

The input range matters enormously. If you test from n=100 to n=1000, an O(n log n) algorithm and an O(n) algorithm may be statistically indistinguishable. The logarithm grows slowly enough that you need a wide range, often several orders of magnitude, to reliably separate the two. bigoish controls the input sizes it uses, but the default range may not always be wide enough for your algorithm’s characteristics.

Constant factors can confuse the measurement. An O(n²) algorithm with a very small constant, run on inputs small enough to fit in L1 cache, might time faster than an O(n) algorithm with a large constant across the same range. The measured exponent reflects what the hardware actually does, which is not always what asymptotic analysis predicts at small n.

Measurement noise is always present. Context switches, garbage collection in other processes, thermal throttling, and branch predictor state all affect timing. The fitting procedure has to be robust to outliers, and on a heavily loaded CI machine, the measurements can be noisier than expected. This is another reason the “ish” framing is appropriate.

For tight numerical code, cache hierarchy effects can dominate. An O(n) algorithm that is cache-friendly may empirically appear faster-than-linear on small inputs due to prefetcher behavior, then degrade as working set exceeds cache size. The measured complexity across different size ranges can look inconsistent.

The Right Tool for the Problem

None of these limitations mean empirical complexity testing is not worth doing. They mean it occupies a specific niche: coarse-grained complexity regression testing, where catching a refactor that accidentally introduces a quadratic loop is more valuable than distinguishing O(n) from O(n log n) with statistical precision.

For that use case, a crate like bigoish fits naturally into the existing #[test] workflow. It does not require a separate benchmark harness, does not demand a quiet machine for stable results, and produces assertions that fail clearly. The source is available on docs.rs and the interface is minimal enough to understand in a single reading.

Rust’s ecosystem has excellent tools for the performance measurement half of the problem. Having a lightweight option for the complexity classification half rounds out the toolkit in a practical way.