
The Lexer Gap: Why Thompson NFA Simulation Has One Linear-Time Blind Spot

Source: lobsters

Ken Thompson published his regex matching algorithm in 1968, and it guarantees matching time linear in the input: O(n) for any regular expression against any input of length n, with the pattern size as a constant factor. The algorithm never backtracks, never revisits a character, and cannot be made to blow up by adversarial input. It remains the definitive answer to the catastrophic backtracking that afflicts engines like Perl's and Python's.

The catch is that Thompson’s algorithm, as typically presented, answers a specific question: does this pattern match anywhere in the string? Once you need to find all matches, and specifically all non-overlapping longest matches, the standard formulation introduces a subtle O(n²) trap. A recent article by Egbert Teeselink closes that gap for pure NFA simulation, and the result matters because this variant is precisely the one every lexer and tokenizer needs.

What Thompson’s Algorithm Guarantees

The core of Thompson NFA simulation is state-set deduplication. Rather than tracking one thread through the NFA and backtracking on failure, which is the approach that causes exponential blowup, Thompson simulation tracks the full set of active NFA states simultaneously. On each input character, you advance every state in the current set along matching transitions and take the union. The set has at most |NFA| members, which is O(|pattern|), so each character costs O(|pattern|) time and the total is O(n * |pattern|).

The deduplication is the critical property. If two threads reach the same NFA state by different paths, their future behavior is identical. Only one needs to be kept. This is what prevents exponential blowup: no matter how many ways the pattern could be trying to match at a given position, the simulation collapses them into a state set of bounded size.
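The simulation itself fits in a few lines. Here is a minimal sketch (not the article's code) that checks whole-string acceptance, assuming a hypothetical NFA layout where eps[s] lists epsilon successors and delta[(s, c)] lists successors on character c:

```python
def eps_closure(states, eps):
    """Expand a state set through epsilon transitions."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in seen:          # deduplication: each state kept once
                seen.add(t)
                stack.append(t)
    return seen

def thompson_match(start, accept, eps, delta, text):
    """Whole-string acceptance in O(len(text) * |NFA|), never backtracking."""
    current = eps_closure({start}, eps)
    for c in text:
        nxt = set()
        for s in current:
            nxt.update(delta.get((s, c), ()))
        current = eps_closure(nxt, eps)
    return accept in current
```

For an NFA for ab* (state 0 goes to 1 on a, 1 loops on b, 1 accepts), thompson_match(0, 1, {}, {(0, 'a'): [1], (1, 'b'): [1]}, "abbb") returns True, and no input can make the loop take more than len(text) × |NFA| steps.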

Russ Cox’s series on regex matching revived awareness of this algorithm after backtracking engines dominated for decades. His articles include the famous plot where Perl’s time grows exponentially while Thompson simulation runs flat. The RE2 library he built at Google makes the linear-time guarantee contractual: no input, regardless of how it is constructed, can cause RE2 to run in superlinear time.

The Lexer Problem

Compilers, log parsers, tokenizers, and network DPI systems all need something slightly different from “does the pattern match somewhere.” They need: given a pattern (or a set of patterns), scan the input and produce every non-overlapping occurrence, where each occurrence is the longest possible match starting from the current scan position. After a match ending at position j, the next scan begins at j and applies the same rule.

This is the maximal munch rule. Flex, re2c, ANTLR, and Ragel all implement it. It ensures that >= is tokenized as a single token rather than > followed by =, and that a greedy repetition like [a-z]+ consumes the longest run of letters before the engine moves on.

The standard approach for lexers compiles all token patterns into a combined NFA, converts it to a DFA via subset construction, and runs the DFA with the following logic:

DEAD = -1  # sink state: no token can continue from here

def maximal_munch(dfa, text):
    """Scan text, emitting non-overlapping longest matches.
    dfa: .start state, .transition[state] maps char -> state, .accept set."""
    matches = []
    pos = 0
    while pos < len(text):
        state = dfa.start
        last_accept_pos = -1
        i = pos
        while i < len(text):
            state = dfa.transition[state].get(text[i], DEAD)
            if state == DEAD:
                break
            if state in dfa.accept:
                last_accept_pos = i
            i += 1
        if last_accept_pos != -1:
            matches.append((pos, last_accept_pos + 1))  # half-open [pos, end)
            pos = last_accept_pos + 1
        else:
            pos += 1  # no token starts here; skip one character
    return matches

Each character is visited at most twice in this loop: once during a match attempt, and once as the start of the next scan. So, as long as failed overruns past the last accepting position stay short, which holds for realistic token sets, the total work is O(n). The DFA transition table makes each character cost O(1). This is the textbook solution and it works.

The problem is DFA construction. A pattern with k NFA states can produce a DFA with up to 2^k states. For simple token sets this is fine. For patterns with large character classes, especially Unicode, the DFA can be enormous. RE2 addresses this with a lazy DFA that builds states on demand and flushes the cache when it grows too large, giving amortized O(n) behavior. The Rust regex crate uses the same approach with additional SIMD literal pre-filters.
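The lazy idea can be sketched in a few lines (hypothetical helper names, not RE2's API): a DFA state is a frozenset of NFA states, each transition is materialized only when the input actually takes it, and the whole cache is discarded if it outgrows a budget.

```python
def make_lazy_dfa_step(eps, delta, max_cache=4096):
    """Return a step(dstate, c) function that builds DFA transitions on
    demand. eps/delta are the NFA's epsilon and character transition maps."""
    cache = {}  # (dstate, char) -> dstate

    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for t in eps.get(s, ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return frozenset(seen)

    def step(dstate, c):
        key = (dstate, c)
        if key not in cache:
            if len(cache) >= max_cache:
                cache.clear()  # flush and rebuild; cost stays amortized linear
            nxt = set()
            for s in dstate:
                nxt.update(delta.get((s, c), ()))
            cache[key] = closure(nxt)
        return cache[key]

    return step
```

Starting from the closure of the NFA start state and calling step once per character visits only the DFA states the input actually exercises, never the full exponential subset space.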

But suppose you want to avoid DFA construction entirely and use pure NFA simulation, perhaps because the DFA blows up or because you need POSIX submatch semantics that require tracking thread history. This is where the O(n²) trap appears.

The Restart Problem

With a DFA, the advance-until-dead-then-restart-at-last-accept logic is O(n) because DFA transitions are O(1). With an NFA, each step costs O(|NFA|). The outer loop runs until a match ends at position j, then restarts at j. In the worst case, those restarts happen at every position, and each restart runs a simulation that takes O(n) steps. Total: O(n²).

Concretely, consider the pattern a*b against an input of n a's with no b anywhere. From each start position the simulation consumes the entire remaining run of a's without ever accepting, so the scan restarts one character later and does it all again. That is n restarts of O(n) steps each, O(n²) character steps in total, with each step costing O(|NFA|) on top.
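Counting steps makes the blow-up concrete. For a pattern like a*b on an all-a input, every attempt consumes the whole remaining input without accepting, then restarts one position later. A toy counter simulating exactly that adversarial case (not a general matcher):

```python
def restart_scan_steps(text):
    """Count character steps taken by the restart strategy for a pattern
    like a*b on an all-'a' input: each attempt consumes the rest of the
    input without accepting, then the scan restarts one position later."""
    steps = 0
    for pos in range(len(text)):
        i = pos
        while i < len(text) and text[i] == 'a':
            i += 1
            steps += 1
        # no 'b' was found, so there is no accept: restart at pos + 1
    return steps
```

On "a" * n this performs n + (n-1) + ... + 1 = n(n+1)/2 steps, quadratic in the input length.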

The naive fix is to run the NFA continuously without restarting. The grep-style all-overlapping-matches simulation does exactly this: at every character position, inject the NFA’s start state (with epsilon closure) into the active set. This finds every position where a match ends, including overlapping ones, and is the style of simulation Cox’s articles describe for grep-like search.

But all-overlapping is not what you want for tokenization. If the pattern aa is applied to aaaa, overlapping matches give you matches at positions 0, 1, and 2. Non-overlapping longest gives you one match at 0 and one at 2. The overlap-injecting approach does not directly implement the non-overlapping longest semantics, because it starts new threads at every position including positions that are already inside an in-progress match.
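Python's re module implements non-overlapping scanning, so the distinction is easy to see in two lines:

```python
import re

# Non-overlapping scan: after a match ends, the next search resumes there.
spans = [m.span() for m in re.finditer("aa", "aaaa")]
print(spans)  # [(0, 2), (2, 4)] -- the overlapping match at position 1 is skipped
```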

The article at iev.ee works out the bookkeeping needed to get non-overlapping longest matches from a continuous NFA simulation. The key ideas are three pieces of bookkeeping layered onto a single continuous pass:

First, run the NFA simulation continuously, injecting the start state at every character position as in the overlapping case. This creates conceptually many concurrent threads, one started per position. State deduplication collapses threads that reach the same NFA state into one, keeping the one with the better leftmost-longest claim. At any character position, the number of live threads is bounded by |NFA|, so the per-character cost stays O(|NFA|).

Second, track for each active state set the earliest pending match start position that is still alive. When all threads that started at some position s have either accepted or died, and the best acceptance from s ended at position e, you can commit the match [s, e). This is the lazy commit: you do not emit a match as soon as the NFA accepts; you continue simulating to ensure no longer match is possible from the same start.

Third, after committing a match ending at e, suppress new thread injections for positions between s and e, since those positions are already covered by the committed match. The next injection happens at e, which is where the next scan begins. This implements the non-overlapping constraint without restarting the simulation.
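The three pieces can be combined into one pass. What follows is a rough reconstruction of that bookkeeping, not Teeselink's code: the eps/delta adjacency maps are the same hypothetical layout as above, threads are deduplicated per state keeping the leftmost start (which dominates under leftmost-longest semantics), and the sketch assumes the pattern cannot match the empty string.

```python
def lex_all(start, accept, eps, delta, text):
    """Non-overlapping leftmost-longest matches in one continuous NFA pass."""

    def closure(threads):
        # threads: state -> start position; dedup keeps the leftmost start
        stack = list(threads.items())
        while stack:
            s, b = stack.pop()
            for t in eps.get(s, ()):
                if t not in threads or threads[t] > b:
                    threads[t] = b
                    stack.append((t, b))
        return threads

    matches = []
    active = {}   # NFA state -> earliest start of any thread occupying it
    pending = {}  # start position -> longest accept end recorded so far
    pos = 0       # scan position: the next match must start here

    def settle(limit):
        # Lazy commit: a match [pos, end) becomes final only once no live
        # thread that started at pos could still extend it.
        nonlocal pos, active, pending
        while pos <= limit:
            if any(b == pos for b in active.values()):
                return                  # longest match from pos still open
            if pos in pending:
                end = pending.pop(pos)
                matches.append((pos, end))
                pos = end               # next scan begins at the match end
                active = {s: b for s, b in active.items() if b >= pos}
                pending = {b: e for b, e in pending.items() if b >= pos}
            else:
                pos += 1                # no match starts at pos

    for i, c in enumerate(text):
        if i >= pos:                    # suppress injection inside a match
            for s, b in closure({start: i}).items():
                active.setdefault(s, b)
        nxt = {}
        for s, b in active.items():     # advance every thread on c
            for t in delta.get((s, c), ()):
                if t not in nxt or nxt[t] > b:
                    nxt[t] = b
        active = closure(nxt)
        for s, b in active.items():     # record accepts, but do not emit yet
            if s in accept and pending.get(b, 0) < i + 1:
                pending[b] = i + 1
        settle(i)
    active = {}                         # end of input: nothing can extend
    settle(len(text))
    return matches
```

With an NFA for ab|b (0 goes to 1 on a, 1 to accepting 2 on b, 0 to accepting 3 on b), lex_all returns [(0, 2), (2, 3)] on "abb": the ab match is committed lazily once no thread starting at 0 survives, and injection resumes at its end.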

The complexity argument follows from state deduplication. Even though many threads conceptually coexist, their state sets remain bounded by |NFA|. The lazy commit and injection suppression add O(1) bookkeeping per character. Total: O(n * |NFA|) = O(n * |pattern|), the same as Thompson’s algorithm for single-match queries.

This is the same asymptotic bound as the DFA approach, which achieves O(n) total with O(1) per-character transitions but potentially exponential preprocessing. Here there is no DFA construction at all. For patterns where DFA construction is cheap, the DFA approach is faster in practice. For patterns where it is not, the NFA approach with this bookkeeping gives the same asymptotic guarantee.

Why This Matters in Practice

The re2c lexer generator compiles token patterns to a combined DFA and generates O(n) C or Go code directly. It sidesteps the NFA-restart problem by always working at the DFA level, and for most practical lexers the DFA is small. Ragel does the same for protocol parsers. These tools are unaffected by the O(n²) trap because they never use NFA simulation at runtime.

For runtime regex engines that cannot afford DFA construction, the problem is live. The Rust regex crate’s find_iter method implements non-overlapping leftmost matches, and its documentation notes that the DFA may be bypassed for certain patterns. The tagged NFA work by Ville Laurikari, which is the basis for the TRE library, solves the submatch capture problem with O(n * |pattern| * |tags|) time using tag transitions to track group boundaries without a DFA. Teeselink’s result sits in the same space: submatch-free but solving the all-matches variant that tagged NFAs often leave as an exercise.

Bioinformatics is worth noting as a concrete scale argument. Genome sequences run to billions of characters, and motif search needs to find every non-overlapping occurrence of a pattern. The difference between O(n) and O(n²) at that scale is not a constant factor; it is the difference between minutes and days. Tools like seqkit and custom sequence scanners that apply regex-like patterns to DNA already use DFA-based approaches for this reason. A clean NFA-based solution with a proved linear-time bound gives implementers another option when the DFA is impractical.

Network intrusion detection is the other demanding case. Systems like Hyperscan (Intel’s high-performance regex library used in Snort and Suricata) match hundreds of patterns simultaneously against network packets at 10 to 100 Gbps line rates. Hyperscan uses bitset-parallel NFA simulation where each NFA state maps to one bit in a SIMD register, and transitions become bitwise operations on 64- or 256-bit vectors. The asymptotic complexity is O(n * |NFA| / word_size). That kind of SIMD parallelism is only possible because the NFA simulation model, with its bounded state sets, maps cleanly onto vector registers. The all-longest-matches variant Teeselink describes is compatible with this bitset approach.
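The classic small-scale form of this idea is the Shift-And (bitap) scan for a literal pattern: the pattern's NFA is one bit per position, and a whole transition step is a shift plus a mask. A textbook sketch, not Hyperscan's code (Python ints are unbounded; the C version assumes the pattern fits in one machine word):

```python
def shift_and_ends(pattern, text):
    """All (overlapping) end positions of a non-empty literal pattern via
    Shift-And: one bit per NFA state, each step is O(1) word operations."""
    masks = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << j)
    accept = 1 << (len(pattern) - 1)
    state = 0
    ends = []
    for i, c in enumerate(text):
        # the shift advances every live state, | 1 injects the start state,
        # and the mask kills states whose next expected character is not c
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            ends.append(i + 1)
    return ends
```

On the earlier example, shift_and_ends("aa", "aaaa") reports ends at 2, 3, and 4, exactly the all-overlapping match set; the non-overlapping bookkeeping has to be layered on top.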

The Broader Picture

The Thompson NFA simulation framework is one of the more elegant results in practical algorithmics. State deduplication solves the backtracking problem by making the question of whether you have visited a particular state from a particular position irrelevant: you track all states simultaneously and the bounded set size does the work. What Teeselink’s article shows is that the same deduplication argument extends cleanly to the all-longest-matches variant, once the bookkeeping for lazy commit and injection suppression is made explicit.

The result probably exists in folklore; some form of it is implicit in how grep finds all matches and how lexer DFAs implement maximal munch. The value of the article is in the explicit proof that the NFA simulation version achieves the same O(n * |pattern|) bound, without DFA state explosion. The algorithm also handles the POSIX leftmost-longest requirement correctly, which matters because most POSIX regex implementations have bugs in this area, including glibc’s own regexec().

For anyone building a regex engine that must handle this variant without falling back to a DFA, that explicit proof is the piece that was missing from the standard literature. Thompson’s 1968 paper closed the single-match problem. Teeselink’s article closes the all-longest-matches problem, and the two results use exactly the same underlying mechanism.
