Multi-Start NFA Simulation: The Algorithmic Gap Between Linear-Time Regex and Real Lexers
Source: lobsters
Most developers who have spent time with performance-sensitive text processing know that RE2 and the Rust regex crate guarantee linear-time matching while PCRE does not. Russ Cox’s foundational series on regex matching made this idea widely accessible: Thompson’s 1968 NFA simulation processes each character once per NFA state, giving O(n * m) time for input length n and pattern length m, with no exponential blowup on adversarial inputs.
But linear-time matching for what operation, exactly? The standard API is something like find: given a pattern and an input, return the first match. That is a meaningful guarantee. It is not, however, the guarantee that a lexer needs. A lexer needs to scan an entire input from left to right, emitting the longest possible token at each position, then advancing past that token and repeating. This is called the all-longest-matches problem, or maximal munch in parser terminology. And the two problems are more distinct than they appear.
Ivo Vatne’s recent article formalizes this distinction and gives a clean NFA-based algorithm that solves the all-longest-matches problem in linear time. The result itself is not entirely surprising, but the treatment is careful and the gap it fills is real.
What Maximal Munch Actually Requires
Consider the pattern [a-z]+ applied to the input foo bar. A single call to find gives you foo. That is one match. But a lexer scanning the full input needs foo, then bar, with the space skipped. Naive iteration: call find at position 0, get foo (length 3), advance to position 3, call find again starting at position 3, get nothing (space), advance to position 4, call find at position 4, get bar.
For a simple pattern like [a-z]+ this is fine. The NFA for this pattern has a handful of states, and restarting it at each position costs little. But in a real lexer, the pattern is a union of many token regexes: keywords, identifiers, numeric literals, operators, string literals. The combined NFA may have hundreds of states. And the restart-at-each-position approach costs O(|NFA|) per input character per restart, giving O(n * |NFA|) per match attempt and O(n² * |NFA|) overall when there are O(n) matches. For large inputs or complex token sets, this is a real cost.
More precisely: when matching [a-z]+ at position 4 in foo bar, you are re-examining a character the NFA already saw during the match attempt that started at position 0. Those characters at positions 4..6 were processed once during the first match, and they get processed again. That double-counting is the source of the quadratic behavior.
How Flex and re2c Solve It
The standard answer to this problem is the one that Flex and re2c have used for decades: compile all token regexes into a single combined DFA at build time. A DFA has no epsilon-transitions and no state sets; each character causes a deterministic state transition in O(1). The DFA runs forward until it hits a dead state (no valid transitions), and the last accepting state it passed through determines the token and its length. Then the DFA resets to its initial state at the new position.
The key insight in the DFA approach is that the DFA’s initial state implicitly handles the restart. After emitting a token, you drop back to state 0 of the DFA. No character is re-examined. Total time is O(n) with O(1) per character. re2c in particular generates direct C code for the DFA transitions rather than table lookups, making it extremely fast in practice.
The problem is the DFA itself. Subset construction, the standard algorithm for converting an NFA to a DFA, can produce a DFA with 2^|NFA| states in the worst case. For practical lexer grammars this is rarely catastrophic, but it is a real constraint. Complex Unicode character classes, overlapping patterns, or large keyword sets can produce DFAs that are unwieldy. Flex handles this with DFA minimization and table compression; re2c uses state merging and similar techniques. They work, but they are build-time costs and the resulting tables can be large.
The NFA approach has the opposite trade-off: O(|NFA|) space, O(n * |NFA|) time. For a fixed pattern, space is constant and time is linear. The catch is that no widely-used tool implements NFA-based all-longest-matches directly.
The Multi-Start Concurrent Simulation
The insight in Vatne’s algorithm is to run multiple NFA simulations concurrently across a single linear pass over the input. At any position i in the input, the algorithm maintains a list of active NFA threads, where each thread tracks a start position and the NFA’s current state set for a match attempt beginning at that position.
On each input character:
- Every active thread advances: each thread’s state set is updated by taking transitions on the current character.
- A new thread is added for a match attempt starting at the current position (initialized to the NFA’s start state).
- Any thread whose state set has become empty, meaning no NFA states remain and no further match is possible, is retired. If that thread ever reached an accepting state, the span from its start position to its last accepting position is a completed match, and that token is emitted.
The non-overlapping semantics are enforced by priority: a thread started at an earlier position takes precedence over one started later. If two threads would produce overlapping matches, the earlier-starting one wins. When a thread is retired having seen an accept state, the algorithm advances the “scan position” past that match, suppressing any threads that started within the matched span.
Each input character is processed by each active thread, and each thread dies when the NFA can no longer advance from that start position. The number of simultaneously active threads is bounded by the length of the longest possible match for the given pattern, which is a constant for a fixed pattern. Total work: O(n * |NFA|). No character is processed more than |NFA| times, and |NFA| is proportional to the pattern length.
This is essentially the observation that Thompson NFA simulation already implicitly explores all continuations from a given state set. When you start a new thread at position i, you are adding the NFA’s initial state to the current simulation. The simulation then naturally carries this state forward alongside all the other states already active from earlier start positions. The “multi-start” framing is almost a restatement of what Thompson simulation already does; the contribution is recognizing that this is exactly the structure needed to solve the all-longest-matches problem, and formalizing when to emit a token.
Where This Sits in the Ecosystem
The Rust regex crate exposes this behavior through find_iter, which returns an iterator over all non-overlapping leftmost-longest matches for a single pattern. Under the hood it uses a lazy DFA with a Thompson NFA fallback. The regex-automata sub-crate exposes a lower-level PikeVM that is closer to the multi-start simulation described here. For multiple patterns, the picture gets more complicated: RegexSet reports which patterns match but not positions; aho_corasick handles the multi-literal case efficiently; a full multi-pattern lexer still typically needs DFA construction.
Hyperscan, Intel’s regex library designed for network packet inspection, takes yet another approach: it reports all end positions where any pattern ends, using SIMD acceleration and a mix of NFA and DFA representations. Its semantics are different (report all match endpoints, not just longest), but it demonstrates that the space of regex execution models is larger than the PCRE vs. RE2 binary.
For people writing lexers in languages without a mature lexer generator, or with regex patterns complex enough to stress DFA construction, an NFA-based all-longest-matches engine is a cleaner primitive than either “run find in a loop” or “build a possibly-enormous DFA.” The algorithm is implementable in a few hundred lines and requires no build-time compilation step.
The Broader Point
The “linear-time regex” story that Cox popularized is important and underappreciated. But it describes a specific operation, find, under specific semantics (leftmost match). Real text processing pipelines need more: scanners that partition input into tokens, multi-pattern matchers that emit different results for different patterns, engines that handle Unicode grapheme clusters or normalization. Each additional requirement is a separate algorithmic problem that may or may not be solved by the “standard” linear-time guarantee.
Vatne’s article is worth reading precisely because it is careful about this. It does not claim to generalize the linear-time guarantee to all regex operations; it identifies one specific gap, the all-longest-matches case, proves that it can be filled within the NFA simulation framework, and shows the algorithm. That kind of precise contribution, without overstating what it achieves, is the kind of writing that makes theoretical results actually useful to practitioners.