Multi-Start NFA Simulation and the Linear-Time Guarantee Lexers Were Missing
Source: lobsters
The piece at iev.ee starts with a deceptively simple claim: yes, you can find all longest regex matches in linear time. The “yes” frames the article as a reply to skepticism, and the skepticism is reasonable, because the problem the article solves is not quite the problem that most linear-time regex literature addresses.
The problem lexers must solve
Ken Thompson’s 1968 paper introduced the NFA simulation technique that underlies every safe regex engine of the last fifty years. The core insight is that instead of backtracking, you maintain the set of all NFA states reachable on the input seen so far. Because that set has at most |Q| elements, where Q is the set of NFA states, each input character costs O(|Q|) to process, and the total over an input of length n is O(n · |Q|). Russ Cox’s 2007 writeup made this technique widely known again, showing pathological inputs on which PCRE takes exponential time while Thompson NFA finishes in milliseconds.
But Thompson NFA answers a specific question: does regex R match string T, and if so, where is the first (leftmost) match? That is not the question a lexer needs to answer.
A lexer needs something more demanding: starting from each position where a previous match ended, find the longest possible match, emit it as a token, and repeat until the input is consumed. This is the maximal munch rule, and it underlies every serious tokenizer. It is why -> is one token rather than - followed by >, why >= takes precedence over >, why 3.14 parses as a float literal rather than an integer, a dot, and another integer. The formal version, which the iev.ee article addresses directly, is: given a regex R and input T of length n, find the sequence of non-overlapping substrings T[s₁..e₁], T[s₂..e₂], …, where each sᵢ₊₁ = eᵢ (left-to-right, no gaps), each substring matches R, and each match is maximal in that no longer match exists from the same start position. This is the “all longest matches” problem.
Why the naive extension is quadratic
Extending Thompson NFA to lexer semantics looks straightforward at first. After emitting a match ending at position e, restart the NFA at position e and scan forward, continuing past any accept states because longer matches may still exist, until the NFA has no valid transitions. Record the last accept position as the match end. Repeat.
Each restart costs O(|Q|) per character scanned, and in the worst case you may scan O(n) characters from each of O(n) start positions. Total cost: O(n² · |Q|). For inputs with many short matches, this is a real problem, not a theoretical one. A tokenizer processing megabytes of source code cannot afford quadratic behavior.
The standard solution has been to pre-compute a DFA. Tools like flex and re2c generate a deterministic finite automaton from the regex at scanner-generation time. The generated scanner has O(1) transitions per character via table lookup, and handles longest-match semantics with a backup register: it records the last accept state seen, keeps advancing, and when the DFA enters an error state, rolls back to the backup position. Because the DFA is pre-computed, the potential O(2^|Q|) state explosion cost is paid once offline. The runtime scanner runs in O(n) time.
This is what makes flex-generated scanners fast in practice, and it has been the canonical answer for decades. But it has constraints. DFA construction can blow up exponentially for certain pattern combinations. More importantly, it requires building the DFA before the scanner runs, which is unavailable if you are embedding a scanner in a library that accepts user-provided patterns at runtime, or building a query engine where tokenization rules are part of the schema rather than the compiler.
The multi-start NFA simulation
The argument in the iev.ee article is that you can solve the all-longest-matches problem with a modified Thompson NFA simulation in O(n · |Q|) total time, without pre-computing a DFA.
The mechanism is a multi-start simulation. At each input position i, a new NFA “thread” is spawned at the start state, representing a potential match beginning at position i. All active threads advance in parallel on each character, exactly as in standard Thompson NFA. Several modifications carry the all-longest-matches semantics:
Each thread tracks its start position and its last-seen accept position. When a thread enters an accept state, it records the current position as its latest successful match endpoint but does not stop, because a longer match may still be possible.
Threads are killed only when the NFA’s transition function returns an empty state set. At that point, if the thread has recorded an accept position, a match is emitted.
When the earliest-started thread dies and commits a match from position s to position e, all threads that started after s and before e are discarded, since their matches would overlap with the committed match. Simulation continues from position e with any threads that started at or after e.
The critical observation for complexity is that threads share NFA state sets. At any input position, the combined state space across all active threads is bounded by |Q|, not by the number of active threads. Each input character is processed once. Total work is O(n · |Q|), matching the complexity of single-match Thompson NFA.
The proof is the non-trivial part
The algorithm above is not hard to state. What requires careful argument is the correctness proof: why does committing to the earliest thread’s match never cause you to miss a longer valid match from an earlier position?
The argument turns on the structure of the leftmost-longest rule. Because matches must be non-overlapping and produced left to right, the match starting at the earliest unmatched position takes priority by definition. If thread s₁ (starting at position s₁) commits to a match ending at e, then any thread s₂ started at s₁ < s₂ < e would produce a match overlapping with s₁’s match. Under the leftmost rule, s₁ wins, and discarding s₂ is correct.
The edge case that requires care is when thread s₂, started after s₁, continues to make progress past position e, where s₁’s simulation died. This is not a problem: s₂ is now looking at characters at or after position e, which is precisely where the next match should start. Thread s₂ becomes the new earliest active thread, and the simulation proceeds normally. The commit rule is conservative in the right direction; it never kills a thread that could produce a non-overlapping match from a position not yet covered.
This is the part the article emphasizes, and rightly so. The algorithm may seem like a straightforward generalization of Thompson NFA, but the correctness argument for the thread interaction is distinct from anything in Thompson’s original paper. Single-match NFA simulation does not need to reason about multiple threads with different start positions, so the invariant that makes multi-start work correctly does not appear in Thompson’s original formulation. It has to be established from scratch, and the iev.ee article does that work.
DFA construction versus NFA simulation
Both approaches, DFA pre-computation and multi-start NFA simulation, produce linear-time scanners, but they occupy different points in the design space.
The DFA approach pays construction cost upfront, potentially O(2^|Q|) but manageable in practice for typical lexer patterns. The runtime cost is O(1) per character, making the scanner very fast for fixed, known-ahead-of-time pattern sets. re2c generates particularly tight code and is used in PHP’s lexer and many other production parsers. The limitation is that the DFA must exist before the scanner runs.
The NFA simulation approach pays O(|Q|) per character at runtime, with no construction cost. This tradeoff fits when patterns are determined at runtime, or when the pattern set is large enough that DFA construction would itself be too slow. Hyperscan (now maintained as Vectorscan) takes a related approach for multi-pattern matching at network speeds, using bit-parallel NFA simulation with SIMD instructions to reduce the per-character constant significantly.
Go’s regexp package, which wraps re2, uses a hybrid: it starts with NFA simulation and lazily builds DFA states for frequently-visited NFA state sets, bounding the DFA cache size to control memory. This gives near-O(1) per character for repeated patterns while preserving linear-time guarantees throughout.
The complexity table looks roughly like this:
| Approach | Construction | Runtime per char | Space |
|---|---|---|---|
| Backtracking (PCRE) | O(|P|) | O(2^n) worst | O(n) |
| Thompson NFA (single match) | O(|P|) | O(|Q|) | O(|Q|) |
| DFA (flex, re2c) | O(2^|Q|) | O(1) | O(2^|Q|) |
| Multi-start NFA (all longest) | O(|P|) | O(|Q|) | O(|Q|) |
What this adds to the practical picture
Most textbooks treat DFA-based lexers as the canonical solution to tokenization and mention NFA simulation only as a stepping stone toward DFA construction. The iev.ee article fills a gap in the other direction: it shows that NFA simulation is a complete solution, not an intermediate one, for the full lexer semantics including all-longest-matches. The O(n · |Q|) bound holds for the complete problem, not just for the single-match subproblem.
For anyone building a parsing library, query engine, or protocol decoder that embeds regex-based tokenization without a code-generation step, this result matters. You do not need to implement DFA construction and the associated state explosion management to get linear-time lexing. The Thompson NFA technique, extended to multi-start simulation with the commit rule described above, is sufficient, and the proof that it is sufficient is now written down in a form you can verify and build on.