Why Lexer-Semantics Regex Matching Is Harder to Prove Linear Than to Implement
Source: lobsters
Every lexer generator in widespread use, from Flex to Ragel to re2c, produces scanners that run in O(n) time on n characters of input. For DFA-based scanners, this guarantee is obvious: each character causes exactly one DFA state transition, and the total work is bounded directly. For NFA-based scanners, the situation is subtler, and a recent post on iev.ee provides the formal proof that was previously absent.
Thompson’s Baseline
Ken Thompson’s 1968 CACM paper established the foundation. Given a regex with m operators, Thompson’s construction produces an NFA with O(m) states and O(m) epsilon transitions. The simulation maintains a current set of active NFA states; at each input character, every active state advances simultaneously, producing the next active set. The generation counter trick, a single integer incremented per step with a lastlist[s] field per state, makes membership testing O(1) and prevents any state from appearing in the set twice.
The total work per character is O(m), bounded by the number of distinct NFA states. For an n-character input, the total is O(n × m). No character is revisited; no exponential blowup is possible. Russ Cox made this accessible in 2007, contrasting it with the catastrophic backtracking in PCRE and Python re, where patterns like a?ⁿaⁿ matched against aⁿ produce O(2ⁿ) behavior.
Thompson’s analysis covers whether a match exists. A lexer needs more: it must find all longest non-overlapping matches, scanning the entire input without restarts.
Three Problems That Share a Name
“Does the regex match this string” and “tokenize this input” sound like the same problem but have meaningfully different algorithmic requirements.
Existence matching: simulate the NFA, stop at the first accepting state. O(n × m). The NFA starts once at position 0 and runs to completion.
Leftmost-longest matching (POSIX semantics): among all matches starting at the leftmost position, find the longest. This requires continuing the simulation past the first accepting state, recording the last accepted position, and stopping only when the active state set empties. For a single match, this is O(n × m). Finding all such matches naively requires restarting the NFA at each new position after a match completes, which is O(n² × m) in the worst case.
All non-overlapping longest matches (lexer semantics): after reporting a match at position i with length k, the next scan begins at position i + k. The tokenizer never re-examines consumed characters. This is the problem every lexer generator solves, and the case the iev.ee proof addresses directly.
The DFA Solution and Its Limits
Flex and Lex solve this with a DFA. The combined NFA for all token patterns is converted to a DFA via subset construction. At runtime, the scanner maintains two extra variables: the last accepting DFA state and the last accepting input position. When the DFA enters a dead state, it reports the token recorded at the last accepting state, resets the input position pointer to last-accepted-position + 1, and continues.
Each character causes one DFA table lookup. The position reset is a pointer operation, not a re-scan. Every character in the input is read exactly once across the entire tokenization. The O(n) proof is immediate: n characters, O(1) work each, no revisits.
The cost is the DFA itself. Subset construction produces at most 2^m DFA states from an m-state NFA. Re2c uses extensive state minimization and table compression; Flex employs equivalence classes, mapping characters with identical transition behavior to the same column in the transition table. For a realistic lexer covering a programming language’s token set, the DFA remains manageable. For use cases where the pattern set is large, dynamic, or frequently updated, DFA construction is impractical. Protocol dissectors, network intrusion detection systems, and streaming pipelines matching thousands of patterns cannot afford the offline build. These systems simulate the NFA directly, and that is where the proof gap lives.
The NFA Case: Where the Proof Does Its Work
The multi-start simulation approach keeps the NFA running continuously across the entire input. At each position, the start state is added to the active set, representing a potential new match beginning there. Existing active states carry forward threads that started at earlier positions. When any active state is an accepting state, the simulation records a candidate match.
A rough sketch of the per-character step:
active = {start_state} // seed: new thread starting at position 0
for i = 0 to n-1:
// advance all threads on input[i]
next = ε_closure({ delta(s, input[i]) for s in active })
// seed a new thread for any match starting at i+1
next = next ∪ {start_state}
// record any matches found
for s in next where s is accepting:
record candidate match ending at i, for the thread that birthed s
active = next
Each step is O(m). The total is O(n × m) because state set size is bounded by m and no character is re-processed.
The priority ordering invariant is what makes this produce correct lexer output rather than just all possible matches. The active state list is ordered by thread priority: threads that started earlier rank higher. When two threads would occupy the same NFA state, the lower-priority thread is discarded. This is safe because any future accepting state the lower-priority thread would reach is either dominated by the higher-priority thread (which started earlier and so wins the leftmost rule) or corresponds to a token that will be captured by an independent thread started after the current match completes.
With this invariant maintained, the simulation reports the same tokens a DFA-based scanner would, in the same order, without re-scanning.
The State Handoff Problem
The formal challenge is the transition between one token and the next. When a match is finalized and reported, the simulation needs to continue with a state set appropriate for the next token’s scan. The threads that accumulated during the previous match’s extension, some of which processed characters now inside the previous token, form the starting point for the next match.
Showing that this handoff is correct requires proving that the accumulated state set is exactly equivalent to what you would get by starting the NFA fresh at the correct next-scan position. For the DFA case, this is immediate: the DFA state at any position is fully determined by the transition function and the input, so the handoff is just a pointer reset. For the NFA case, the state set was built incrementally over multiple input characters, carrying threads that have processed varying amounts of the input. The priority ordering invariant is what ensures that the incrementally built set, after discarding dominated threads, matches the from-scratch result.
This is the specific gap the iev.ee article fills with a rigorous invariant statement and proof. The algorithm itself is not new; the formalization is.
POSIX Submatching: Where This Does Not Apply
POSIX submatching, where each capture group must independently satisfy the leftmost-longest rule recursively, is a strictly harder problem. Ville Laurikari’s tagged NFA approach assigns tags to NFA transitions corresponding to capture group boundaries. The active state set carries tag vectors recording where each group was opened and closed. Disambiguation follows the POSIX priority rules applied recursively to each group. The complexity is O(n × m × k) where k is the number of tags, two per capture group.
The iev.ee proof covers full-pattern longest match without capture group tracking, which is the standard lexer case. Token rules in a lexer grammar are typically flat patterns, not nested submatch structures. For that case, the O(n × m) bound is achievable cleanly.
RE2 handles the submatch case with a two-pass approach: a forward lazy DFA scan to locate the match end, then a backward NFA pass to extract submatch boundaries. The lazy DFA builds states on demand and caches them, giving amortized O(n) for the forward scan. Hyperscan, Intel’s multi-pattern matching engine used in network security applications, uses a hybrid NFA and DFA approach to handle large pattern sets where neither pure technique is dominant.
Why the Proof Matters
For engineers using Flex, Ragel, or re2c, none of this changes their workflow. DFA-based lexer generators already carry the linear-time guarantee implicitly, and the tools are mature and well-understood.
For anyone implementing a custom tokenizer using NFA simulation directly, which includes most streaming pattern matchers and some embedded parsers that avoid DFA construction for space reasons, the iev.ee proof provides the formal backing that was previously missing. The gap it closes is between “this runs fast in practice” and “this is provably O(n) in the worst case, with a precise invariant you can check your implementation against.”
Thompson’s 1968 insight made linear-time regex matching possible. Cox’s 2007 article made the argument accessible and drove a new generation of correct implementations. The iev.ee formalization closes the specific lexer case that the tokenizer community has relied on for fifty years, making the invariant explicit and the proof checkable.