Multi-Start NFA Simulation and the Overrun Argument That Lexers Were Missing

Russ Cox’s 2007 writeup on regular expression matching did a lot to rehabilitate Thompson NFA simulation in an era of backtracking engines. The argument is clean: represent the current “live” NFA states as a deduplicated set, advance that set one character at a time, and you get O(n * m) time with no exponential blowup, where n is input length and m is pattern size. RE2 was built on exactly this foundation, and the Rust regex crate followed with its PikeVM.

But that argument covers finding a single match, or at most one match per call. What lexers and tokenizers actually need is different: scan left to right, find the longest possible match at the current position, emit it, advance past it, repeat until end of input. This is the maximal munch rule, and it is what Flex, re2c, and the Rust logos crate implement. The assumption that this also runs in O(n * m) when done with an NFA has been floating around as folklore. A recent article by Eerik Muuli writes the proof down for the first time.

The gap is subtle enough that it is worth understanding why the single-match proof does not immediately transfer.

Why All Matches Is a Different Problem

A naive all-matches scanner works like this: start an NFA simulation at position i, run it forward until the active state set empties, record the last accepting position j, emit the match [i, j], then restart at j + 1. The worry is what happens between j and the position where the simulation died. Call the death position k. Characters from j+1 through k were examined during the failed attempt to extend the match, then discarded. The next simulation, starting at j+1, has to re-examine those same characters.

If this “overrun” of k - j characters happens at every match boundary, and both overruns and matches can be O(n) long, the naive bound becomes O(n^2 * m). That would make the all-matches approach quadratic even with Thompson-style NFA simulation, which is why DFA-based lexer generators precompute a combined DFA at build time to avoid the problem entirely.

The Overrun Monotonicity Argument

The key observation in the article is that overruns cannot accumulate. Suppose simulation i starts at position p_i, last accepts at j_i, and dies at k_i. Simulation i+1 starts at j_i + 1. Its own overrun begins where it last accepts, which must be at or after k_i. The overrun of simulation i covers [j_i+1, k_i]. The overrun of simulation i+1 begins at or after k_i. The two overrun ranges do not overlap.

Phrased differently: each character position is re-examined by at most one simulation beyond the one that first scanned it. Every character appears in at most two consecutive simulations. Total character-steps across all simulations: at most 2n. With O(1) work per step (DFA), you get O(n). With O(m) work per step (NFA state-set advancement), you get O(n * m).

This is overrun monotonicity. It relies on one property that might not be immediately obvious: the next simulation’s death position must be at least as far right as the previous simulation’s death position. If it were not, the previous simulation’s overrun would have extended past the next simulation’s entire run, meaning the previous simulation already established that no match could start in that range at all, which would be a contradiction with the maximal munch rule. The argument is tight.

State-Set Boundedness Under Multiple Concurrent Starts

The overrun argument handles the character re-examination problem, but there is a second concern. A concurrent implementation might inject a new NFA start thread at every character position, running multiple match attempts simultaneously. If you have O(n) concurrent threads, each with O(m) states, the state set blows up to O(n * m) and the per-character work becomes O(n * m) rather than O(m), giving O(n^2 * m) total.

The article handles this with a “latest start wins” deduplication rule. The data structure is a map from NFA state to the latest match-start position that has reached that state. When two threads converge on the same NFA state, keep the entry with the larger start position and discard the other. The map is keyed by NFA state, so it has at most O(m) entries regardless of how many concurrent match attempts are in flight.

This is correct for non-overlapping leftmost-longest semantics. If NFA state q is reachable from both start position i and start position j, where i < j, then the thread from j will produce a match that starts further right, and any match from i that overlaps with an already-committed match is inadmissible anyway. Keeping only the later-starting thread is not an approximation: it produces exactly the same set of non-overlapping longest matches.

In pseudocode:

state_map = {}  # nfa_state -> latest start_pos reaching it
last_accept = None

for i, ch in enumerate(text):
    for s in epsilon_closure(nfa.start):
        if s not in state_map or i > state_map[s]:
            state_map[s] = i

    for state, start_pos in list(state_map.items()):
        if nfa.is_accepting(state):
            last_accept = (start_pos, i)

    next_map = {}
    for state, start_pos in state_map.items():
        for t in nfa.step(state, ch):
            if t not in next_map or start_pos > next_map[t]:
                next_map[t] = start_pos
    state_map = next_map

    if not state_map and last_accept:
        yield last_accept
        last_accept = None

The deduplication on line if t not in next_map or start_pos > next_map[t] is the entire mechanism. State-set size stays O(m).

How This Compares to What Real Tools Do

Flex and re2c compile all token patterns into a single combined DFA at build time using subset construction. DFA simulation is O(1) per character: one table lookup, one state transition. Total: O(n). The cost is O(2^m) preprocessing and potentially large tables. re2c generates direct goto-based C code rather than a table, which makes it fast enough that this is rarely a practical concern. logos does the same via a Rust procedural macro.

RE2 uses a lazy DFA: NFA state sets are frozen into DFA states on demand and cached. For common patterns this approaches O(1) per character. When the DFA cache overflows, RE2 falls back to NFA simulation. RE2’s FindAll returns all non-overlapping matches and is guaranteed linear. Its RE2::POSIX mode enforces true leftmost-longest semantics at the cost of tracking submatch boundaries, which requires a POSIX DFA and is slower.

The Rust regex crate’s find_iter returns all non-overlapping leftmost-longest matches and is linear. The underlying regex-automata crate exposes the PikeVM directly, which is structurally similar to what the article describes.

Python’s re.findall, PCRE-based engines, and most backtracking engines restart a backtracking search at each position. No linear-time guarantee, and adversarial inputs like (a+)+ on a string of repeated characters followed by a non-matching character produce exponential behavior. This is well-documented and was the original motivation for RE2.

What the Article Adds Beyond Folklore

Russ Cox’s 2007 series established that Thompson NFA simulation gives O(n * m) for single match. Aho and Corasick’s 1975 paper established O(n + m + z) for multi-literal matching, where z is match count. Laurikari’s 2000 paper on tagged NFA simulation established O(n * m^2) for POSIX submatch tracking, implemented in the TRE library. Each of these has a written proof.

The all-longest-matches case, which is precisely what every lexer needs, sat between them with an understood algorithm and no written proof. Practitioners assumed it was O(n * m) because the overrun argument seemed obvious once you thought about it, and tools like Flex validated the assumption empirically. The article writes the proof down: overrun monotonicity for the character re-examination bound, latest-start-wins deduplication for the state-set size bound, both together giving O(n * m) time with O(m) space.

The scoping is important. The proof covers match boundaries only, not internal capture group positions. Full POSIX submatch semantics for all matches would require tagged NFA simulation and is O(n * m^2) by Laurikari’s result. For lexer use cases, match boundaries are all you need.

Practical Relevance

For anyone writing a lexer with a build step, the DFA approach from Flex and re2c is still the right default. O(n) at runtime and the exponential preprocessing is bounded in practice by sensible pattern design.

The NFA approach matters when patterns are dynamic or the build-step DFA is too large. Runtime regex libraries that want to guarantee linear behavior on all-matches queries need exactly this proof to know their implementation is correct, not just fast in practice. Libraries like RE2 and the regex crate already implement the right algorithm; the article provides the theoretical grounding that was previously missing.

The Aho-Corasick connection is worth noting. The overrun monotonicity argument for NFA simulation maps directly onto failure-link traversal in an Aho-Corasick automaton: you never backtrack past a position you have already passed. State deduplication maps onto trie prefix-sharing. The NFA case is strictly more general, and the same structural intuition applies.

That this proof gap existed for nearly sixty years after Thompson’s original paper is not particularly alarming. The algorithm was correct; the formal argument just had not been written out. What makes the article valuable is that it does write it out, cleanly, with the two claims separated and proved independently. That is the kind of work that lets implementors say with confidence that their linear-time guarantee is not an approximation.