Multi-Start NFA Simulation and the Linear-Time Guarantee That Lexers Were Missing
Source: lobsters
The single-match case for regular expression matching has been thoroughly documented since Ken Thompson’s 1968 paper. Compile the pattern to an NFA, simulate all active states simultaneously rather than backtracking, and you get O(n·m) time with O(m) space, where n is the input length and m is the pattern size. Russ Cox’s 2007 series of articles on this topic brought the ideas back into mainstream awareness and eventually produced RE2. That part of the story is well told.
What has not been told as carefully is the “all matches” variant: given a regex and an input string, find every non-overlapping longest match from left to right, covering the whole input. This is exactly what a lexer does. flex, re2c, and ragel all do it correctly and efficiently; anyone who has written a language frontend takes for granted that tokenization is linear. But the proof that it must be linear, spelled out from first principles, turns out to be harder to find than you might expect.
The article on iev.ee provides that proof. The claim itself is not surprising. The value is in making the argument precise.
What POSIX Leftmost-Longest Actually Means
POSIX specifies two rules for resolving match ambiguity. First, leftmost: among all possible match start positions, prefer the earliest one. Second, longest: among all matches starting at the same position, prefer the one that consumes the most input. These two rules, applied in that order, define the “correct” answer for any POSIX-compliant regex engine.
For lexing, the operational reading of these rules is: scan the input left to right; at each position, find the longest match starting there; emit it; advance past it; repeat. The “leftmost” rule is enforced by the scanning order. The “longest” rule is enforced by continuing the NFA simulation as long as new accepting states are reachable and taking the last accepting position seen before the simulation dies.
This seems simple, but the subtlety is that “all matches” means repeatedly restarting the simulation at the end of each match. Each restart looks like a new O(n) scan. If n characters remain after each match, you seem to be doing O(n²) work in the worst case, for example if the pattern matches only a single character at a time.
The reason this does not actually happen is straightforward once stated: each character in the input is processed by exactly one simulation run. When a match of length k is found starting at position i, the next simulation starts at position i+k. The characters from i to i+k-1 have been consumed; they are never revisited. Across all simulation runs, the total characters processed sums to n, not n times the number of matches. Multiplying by the O(m) cost per character gives O(n·m) overall.
The Algorithm, Concretely
The core procedure looks like this:
def all_longest_matches(nfa, text):
matches = []
pos = 0
while pos < len(text):
# Start NFA simulation at 'pos'
active = epsilon_closure({nfa.start})
last_match_end = None
i = pos
while i < len(text):
active = step(active, text[i])
i += 1
if nfa.accept & active:
last_match_end = i # record furthest accepting position
if not active:
break # simulation is dead, no point continuing
if last_match_end is not None:
matches.append((pos, last_match_end))
pos = last_match_end # advance past the match
else:
pos += 1 # no match here; skip one character
return matches
The step function advances every state in active by one input character and returns the union of successor states. epsilon_closure expands epsilon transitions. Both operations are O(m) using standard bitset representations of state sets.
The critical line is pos = last_match_end. That assignment ensures the outer loop’s next iteration begins exactly where the current match ended. No character before last_match_end will ever be fed to an NFA simulation again. The inner loop invariant is that i advances monotonically. Between the two loops, the total increments of i across all outer iterations equals len(text).
The linearity proof is just that accounting argument. Each unit of input work is charged to one simulation run and is never charged again. The simulation itself never rewinds.
Where the Literature Leaves a Gap
Thompson’s original paper and Cox’s RE2 articles both focus on the single-match problem. They prove that the NFA simulation finds one match in O(n·m). The natural extension to iterating all matches is usually handled by saying “call the engine repeatedly” without proving that the total call cost is linear. For non-overlapping matches this is fine but deserves to be written down, because it is not obvious from “each call is O(n)” that repeated calls over a shared input sum to O(n) rather than O(n²).
The confusion is partly notational. When people say RE2’s FindAll runs in linear time, they mean the total time for all matches is O(n·m), not that each individual call is O(n). That distinction gets lost when documentation just says “linear time.” The iev.ee article forces you to confront the accounting explicitly.
There is also a subtler issue involving the empty-string case. If the pattern can match the empty string, naive iteration will loop forever at a position where no progress is made. The standard fix is to advance by one character after an empty match, but this interacts with leftmost-longest semantics in ways that need careful handling. Any correct all-matches implementation has to deal with this; most discussions skip it.
How Existing Tools Handle It
re2c and flex compile regex patterns into a DFA at build time. The DFA approach sidesteps the per-character O(m) factor: DFA transitions are O(1), giving O(n) total per scan, though the DFA itself may be exponentially larger than the NFA. Both tools output a while loop with a state variable, exactly implementing the “scan, accept, restart” pattern described above. The linearity is baked into the generated code structure; there is no runtime proof needed because the DFA has already been materialized.
RE2 takes the NFA simulation approach at runtime, avoiding DFA explosion for pathological patterns. Its FindAll method advances through the input non-overlappingly, but the proof that the cumulative cost is O(n·m) rather than O(n²·m) is not spelled out in the RE2 documentation or the Cox articles. It follows from the same accounting argument, but you have to reconstruct it yourself.
PCRE2 in its default mode uses a backtracking NFA with worst-case exponential time. Its DFA mode (pcre2_dfa_match) avoids backtracking but does not implement POSIX leftmost-longest semantics by default; it finds the longest match rather than the POSIX-correct one for subgroup captures. Using it to find all matches requires careful position management and an understanding of what “longest” means in each mode.
Hyperscan, Intel’s multi-pattern matching library, handles overlapping matches across a stream of input, which is a harder problem than non-overlapping leftmost-longest. It uses a combination of DFA, NFA, and SIMD acceleration to achieve throughput measured in gigabytes per second. For lexing use cases, Hyperscan is overkill; for network traffic inspection where all overlapping matches matter, it addresses a genuinely different problem class.
Why the Proof Matters Beyond Theory
A rigorous proof that all-longest-matches runs in O(n·m) has practical value for a few reasons.
First, it gives you confidence when implementing lexers in environments where you cannot use a compiled DFA. If you are embedding a scripting language in a resource-constrained system and need runtime regex evaluation without precompiled tables, knowing that NFA simulation with the restart pattern gives a linear-time guarantee lets you reason about latency bounds. Without the proof, you are trusting intuition.
Second, the proof structure guides implementation decisions. The no-rewind property of the outer loop, combined with the dead-simulation early exit in the inner loop, is not just an optimization but a correctness requirement for linearity. An implementation that rescans already-consumed input breaks the guarantee. Understanding why linearity holds tells you which parts of the code cannot be simplified without changing the asymptotic behavior.
Third, there is a class of practitioners, particularly those working on protocol parsers, query engines, and embedded language runtimes, who need to justify performance bounds to stakeholders or safety reviewers. Pointing to a clean proof is more useful than pointing to benchmarks, which can always be dismissed as artifacts of the specific input distribution tested.
Viliam Búr’s article on iev.ee is not the first place this result has appeared, but it may be the most self-contained treatment of the all-matches case that is freely available online. The related work in the Laurikari 2001 paper on tagged NFAs, which extends Thompson simulation to capture groups while preserving the O(n·m) bound, addresses a harder problem; the iev.ee article is narrower in scope and therefore easier to follow for the unadorned all-longest-matches case.
The result itself is not new. What is new is that it is now written down cleanly enough to cite.