· 7 min read ·

Multi-start NFA Simulation and the Linear-Time Guarantee That Lexers Actually Need

Source: lobsters

The claim sounds obvious once you hear it: yes, you can find all longest regex matches in an input string in linear time. Of course you can. Regex matching is O(n), and it has been since Ken Thompson’s 1968 paper in CACM. But “all longest matches” is a subtler problem than “does this string match a pattern,” and the distinction is exactly where lexer design lives.

What “all longest matches” means

When you write a lexer, you are not asking whether some string matches some pattern. You are asking something more specific: scan left to right through the input; at each point where a match begins, take the longest possible match, then resume scanning from just after it. This is the POSIX leftmost-longest rule applied repeatedly across an entire string, and it is the semantic contract that flex, lex, and every serious tokenizer generator has implemented since the 1970s.

This is what a recent article by Eerik Muuli formalizes and proves is achievable in O(n) with a pure NFA simulation, without the exponential preprocessing overhead that DFA-based tools like flex typically carry.

The distinction matters in practice. Consider a simple tokenizer for a language with these rules:

FLOAT  ::= [0-9]+\.[0-9]+
INT    ::= [0-9]+
IDENT  ::= [a-z]+

Input: x123 45.6 789

A single call to “find match” returns the first token and stops. Tokenization means finding all tokens across the whole input, one after another, each as long as possible. The regex engine sees this input character by character across multiple match attempts, and the naive approach restarts the simulation from scratch after each match. That restart cost, accumulated across n characters, breaks the linear-time guarantee.

Thompson’s algorithm and what it actually guarantees

Ken Thompson’s NFA simulation converts a regular expression of size |r| into an NFA with O(|r|) states, then simulates that NFA on input by tracking the set of currently active states at each position. Each character step transitions the entire set in O(|r|) time, so the total cost is O(n * |r|): linear in the input length, with the regex size as a constant factor.

Russ Cox’s 2007 series on regex matching revived this approach and explained why most production engines (Perl, Python, Ruby) use backtracking instead, despite its exponential worst case. Cox went on to build RE2 around NFA and lazy DFA simulation, which guarantees O(n) time for all inputs.

But Thompson’s guarantee and RE2’s guarantee are for finding one match. They answer: starting at position 0 (or the first matching position), where does the longest match end? The full tokenization problem is different. After finding that match, you need to find the next one, and the next, until the input is exhausted. If you restart the NFA simulation from scratch after each match, you visit each character multiple times, and the total cost becomes O(n * average_token_length) in the best case.

For most practical inputs, that is fine. For adversarial or pathological inputs with very short tokens, it is not.

The DFA solution and its tradeoff

Tools like flex sidestep this problem by compiling all token patterns into a single combined DFA before the program runs. DFA simulation is O(1) per character, one table lookup per step, so scanning the entire input costs exactly O(n). This is the gold standard for runtime performance.

The cost is in preprocessing. Converting an NFA to a DFA via the powerset construction can produce O(2^|r|) states. flex historically hit this limit with complex token sets; modern tools use lazy DFA construction (building states on demand) to manage it, but the worst case remains exponential in the regex size. For a fixed set of tokens compiled at build time, this is acceptable. For dynamic regex compilation at runtime, it is not.

The two-thread trick

The key insight in Muuli’s formulation is that you can achieve linear time with an NFA simulation, not a DFA, by running two interleaved simulation threads simultaneously:

  • Thread A starts at position 0 and extends as far as possible, tracking the last accepting state it passes through.
  • Thread B starts at position 1, running one step behind, ready to become the new primary thread when Thread A’s match is committed.

In pseudocode:

thread_a = epsilon_closure({start})
thread_b = epsilon_closure({start})
last_accept = -1
last_accept_pos = -1
match_start = 0
b_start = 1

for i, ch in enumerate(input):
    if accepting_state in thread_a:
        last_accept = i

    thread_a = step(thread_a, ch)
    thread_b = step(thread_b, ch)

    if thread_a is empty:
        if last_accept >= 0:
            emit_match(match_start, last_accept)
            thread_a = thread_b   # thread B becomes the new primary
            thread_b = epsilon_closure({start})  # fresh thread starts
            match_start = b_start
            b_start = last_accept + 1
            last_accept = -1
        else:
            match_start += 1
            thread_a = thread_b
            thread_b = epsilon_closure({start})
            b_start = match_start + 1

The critical property is that each character is processed by at most two NFA simulations: once by the current primary thread and once by the background thread. Since each simulation step costs O(|r|), the total cost is O(2 * n * |r|) = O(n). The “restart” is not actually a restart; it is a handoff between two threads that were already running in parallel.

This structure is not coincidentally similar to Aho-Corasick, which solves multi-pattern string matching in O(n + |patterns| + output) by building failure links into a trie. Aho-Corasick’s failure links let the automaton transition to the next potential match without rescanning characters. Muuli’s contribution is showing this same idea generalizes naturally to full NFA simulation for arbitrary regex, not just literal string patterns.

POSIX semantics and where complexity actually accumulates

For match boundaries, the two-thread NFA approach gives correct POSIX leftmost-longest semantics at O(n) cost. Where things get harder is subgroup positions: the positions within a match where capture groups start and end.

The state of the art for POSIX subgroup semantics is Ville Laurikari’s tagged NFA (TNFA), implemented in the TRE library. Each NFA transition carries tags that record subgroup boundary positions, and POSIX ordering rules determine which thread wins when multiple threads reach the same state. This gives correct semantics at O(n * |r|^2) time in the worst case. RE2 deliberately skips full POSIX subgroup semantics for performance; it approximates or punts on them entirely.

Muuli’s article focuses on match boundaries rather than subgroup positions, which is the right scope. Most practical applications of all-longest-match scanning (lexers, tokenizers, scanners) care about which token pattern matched and where it starts and ends, not the internal structure of capture groups within each token. The O(n) guarantee applies cleanly to this common case.

Why this matters for implementations that cannot use flex

The NFA simulation approach is practically relevant in several situations where DFA compilation is too expensive or unavailable.

Dynamic regex at runtime. A rules engine that lets users define patterns, or a WAF that loads signatures from a database, cannot afford exponential DFA preprocessing for each new pattern set. NFA simulation with the two-thread trick gives linear-time scanning without a compilation step.

Large combined pattern sets. Hyperscan, Intel’s SIMD-accelerated regex library for network packet scanning, handles thousands of simultaneous patterns by combining NFA simulation, specialized DFA construction for tractable patterns, and literal pre-filters. The all-longest-match semantics, done correctly at NFA level, fits naturally into that architecture for cases where DFA construction is too expensive.

Language implementation without a build step. Lexer generators like flex and re2c compile token patterns to DFA tables at code-generation time. A lexer implemented in a scripting language, an interpreter, or a proof assistant, where you cannot run flex as a build step, needs a runtime approach. NFA simulation with correct longest-match semantics covers this case without requiring external tooling.

Education and verification. There is genuine value in a clean proof that the result is achievable. The flex manual says “longest match wins” and leaves the complexity analysis implicit. A formal account of the two-thread NFA simulation makes the linear-time guarantee explicit and auditable, which matters if you are implementing this in a context where correctness needs to be demonstrated, not just asserted.

The broader point

Linear-time regex has been folklore since 1968, but folklore papers over the details. “Linear time” for a single match is straightforward. “Linear time for all longest matches across an entire input” requires the two-thread structure, and that structure is close enough to obvious that most people assume it without proving it. Muuli’s article does the work of making it explicit and correct.

That kind of formalization might seem academic until you are building a new lexer in a language that lacks a flex equivalent, or debugging a tokenizer that exhibits O(n^2) behavior on certain inputs and cannot figure out why. The two-thread NFA simulation is the right reference implementation to have, and now there is a clean writeup to point to.

The result also reinforces something worth keeping in mind about automata theory in general: the gap between “provably achievable in theory” and “clearly implemented correctly in practice” is often bridged by one careful paper. This is one of those papers.

Was this interesting?