The Linear Time Guarantee for All Longest Regex Matches, and Why It Took This Long to State Clearly
Source: lobsters
Regex engines have a well-known split personality. On one side you have backtracking engines like PCRE, where features like lookahead and backreferences can send matching time into exponential territory on adversarial inputs. On the other side you have automata-theoretic engines built on NFA simulation, where matching time is bounded by the length of the input times a constant that depends only on the regex. The classic result, going back to Ken Thompson’s 1968 paper, is that you can simulate an NFA in O(n·m) time where n is the input length and m is the regex size. For a fixed regex, that’s linear in n.
But that result is specifically about deciding whether a string matches a pattern. The question that a recent article addresses is different: can you find all non-overlapping longest matches of a regex across a string, in linear time? The “all” and “longest” parts are both doing work in that sentence, and they interact in ways that make the problem less obvious than it first appears.
What Longest Match Actually Means
Most programming contexts use what is sometimes called leftmost-longest or POSIX matching semantics. The rule is: among all possible matches starting at a given position, take the longest one. Then advance past that match and repeat. This is the “maximal munch” rule that every lexer generator uses. When flex encounters >= in a C source file, it doesn’t stop at > and call it a match for the comparison-operator rule; it consumes as many characters as possible before committing.
This is distinct from what Perl-compatible regex engines do by default. In PCRE, (a|ab) applied to ab matches a, because alternation is ordered: the engine tries the first branch, succeeds, and stops. Under POSIX semantics, the same pattern matches ab, because that’s the longest possible match from position zero.
The distinction matters for scanning applications. A lexer that tokenizes === as three separate = tokens, because alternation happened to list = before ==, is a broken lexer: maximal munch demands == followed by =. Leftmost-longest semantics exist precisely to make this kind of pathology impossible.
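The ordering sensitivity is easy to demonstrate from Python, whose standard re module uses leftmost-first alternation:

```python
import re

# Python's re, like PCRE, uses leftmost-first alternation: the first branch
# that succeeds wins, even when a later branch would match more text.
assert re.match(r"a|ab", "ab").group() == "a"

# Reversing the branches changes the answer -- a sign that the result depends
# on pattern layout, not on the language the pattern denotes.
assert re.match(r"ab|a", "ab").group() == "ab"
```

A POSIX leftmost-longest engine (grep -E, or Go’s regexp after calling Longest()) returns ab for both orderings.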
The Single-Match Case
Finding the single longest match starting at a given position with NFA simulation is straightforward. Build the NFA using Thompson’s construction, which produces a machine with O(m) states. Initialize the active state set to the epsilon-closure of the start state. Then process input characters one at a time, advancing the active state set and recording the last position at which the state set contained an accepting state. Stop when the active state set becomes empty (no thread can continue), or when the input is exhausted. The last recorded accepting position is the end of the longest match.
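As a concrete sketch, here is that loop over a hand-built Thompson-style NFA for the pattern ab|a; the state numbering and function names are invented for illustration:

```python
# Hand-built Thompson-style NFA for the pattern "ab|a" (state numbers invented):
#   0 --eps--> 1 --a--> 2 --b--> 3 (accept)
#   0 --eps--> 4 --a--> 5 (accept)
TRANS = {(1, "a"): {2}, (2, "b"): {3}, (4, "a"): {5}}
EPS = {0: {1, 4}}
START, ACCEPT = 0, {3, 5}

def eps_closure(states):
    """Grow a state set along epsilon edges until it stops changing."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in EPS.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def longest_match_end(text, start=0):
    """End offset of the longest match beginning at `start`, or None."""
    active = eps_closure({START})
    last_accept = None
    i = start
    while True:
        if active & ACCEPT:
            last_accept = i          # text[start:i] is a match; keep going
        if i == len(text) or not active:
            break                    # input exhausted, or every thread died
        # advance all threads on the next character, then close over epsilon
        active = eps_closure({t for s in active
                              for t in TRANS.get((s, text[i]), ())})
        i += 1
    return last_accept

assert longest_match_end("ab") == 2     # prefers "ab" over "a"
assert longest_match_end("ac") == 1     # falls back to the shorter match
assert longest_match_end("b") is None   # no match at all
```

The same loop works for any NFA produced by Thompson’s construction; only the transition tables change.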
The time cost for this is O(n·m): you process at most n characters, and each step takes O(m) work to advance and close the state set over epsilon transitions. The key property that makes this fast is the representation of the state set as a subset of the NFA’s states: there are only 2^m possible subsets, but you never need to enumerate them all, and each step is computed incrementally.
Russ Cox’s 2007 survey on this approach, motivated in part by PCRE’s vulnerability to catastrophic backtracking, became the canonical reference for this style of regex engine. The RE2 library, which Cox wrote at Google, implements NFA simulation and gives a linear-time guarantee for all patterns it accepts (which excludes backreferences, since those require lookahead into captured text and are not regular).
Why “All Matches” Raises a Question
Once you have a single-match procedure, the naive approach to finding all non-overlapping matches is: find the first match, advance the input pointer past the end of the match, find the next match from the new position, repeat. Each match requires a fresh scan starting from the new position.
For the worst case, consider how long each scan can take. If the regex is a+ and the input is aaa...a (n repetitions), the algorithm finds one match covering the entire string in O(n) time and you’re done. But consider a? applied to the same input: each scan finds a single a (length 1), advances by one, and the total work is O(n) scans each taking O(1) time, so O(n) total. That’s fine.
Now consider a seemingly pathological case: a regex where the NFA has many states and where matches are short relative to the scan length. Suppose matches have average length k and each scan takes O(k·m) work because the NFA runs for k steps before dying. The number of matches is O(n/k), so total work is O(n/k · k·m) = O(n·m). The k terms cancel. Each character of the input contributes to exactly one scan (the one that started at or before it and ended at or after it), so the total work across all scans is bounded by n times the per-character work, which is O(m). The naive approach is already O(n·m) total, i.e., O(n) for a fixed regex.
But this argument has a gap. Consider what happens when the scan at position i fails to find any match. You advance to position i+1 and try again. If matches are sparse, you might scan past the same characters many times: once in the failing scan at position i, once in the failing scan at i+1, and so on. A regex like x*y applied to an input of n copies of x keeps every failing scan alive until the end of the input: the x* loop happily consumes x’s while the thread waits for a y that never comes. That’s O(n) characters scanned per starting position, across O(n) starting positions: O(n²).
This is the crux. Finding all matches (including handling positions that don’t match at all) is where the naive algorithm breaks down.
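A small instrumented toy makes the quadratic blow-up measurable: restarting a hand-rolled scan of x*y at every position of an all-x input inspects a quadratic number of characters (the matcher below is illustrative, not any engine’s real code):

```python
def naive_scan_work(text):
    """Character inspections done by restart-per-position scans of x*y."""
    reads = 0
    for start in range(len(text)):
        i = start
        while i < len(text):
            reads += 1
            if text[i] == "y":   # x*y matches text[start:i+1]; scan would stop
                break
            if text[i] != "x":   # thread dies on any other character
                break
            i += 1               # still inside the x* loop, waiting for a y
        # falling off the end of the input is a failed scan
    return reads

# On n copies of x every scan survives to the end of the input,
# so the total is n + (n-1) + ... + 1 = n(n+1)/2.
assert naive_scan_work("x" * 100) == 100 * 101 // 2
```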
The Fix: Interleaved NFA Threads
The resolution mirrors how Aho-Corasick works for multi-pattern string matching. Instead of restarting the NFA from scratch at each position, you run all possible starting points simultaneously.
At each position in the input, you maintain an NFA state set that represents the union of all active match attempts, including a fresh thread that starts at the current position. Processing a character c produces a new state set: take the epsilon-closure of all states reachable from the current set on c, then add the epsilon-closure of the start state (to account for attempts beginning here). The state set remains bounded in size by the number of NFA states.
When the state set contains an accepting state, you record a potential match. The key bookkeeping is tracking which input offset each accepting thread originated from, so you can report the match boundaries. In practice, you also need a pending best candidate to implement longest-match semantics: you don’t emit a match as soon as you see an accepting state; you wait until no surviving thread started at or before the candidate’s start position (at that point nothing can produce an earlier-starting or longer match), then emit the candidate and discard any threads that overlap it.
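Here is one way to realize that bookkeeping for the toy pattern ab|a, tracking each thread’s start offset; this is a sketch of the idea, not the article’s exact algorithm, and it assumes the pattern cannot match the empty string:

```python
# Toy NFA for "ab|a": 0 --a--> 1 (accept) --b--> 2 (accept). A real engine
# would also close each set over epsilon edges; this NFA has none.
TRANS = {(0, "a"): {1}, (1, "b"): {2}}
START, ACCEPT = 0, {1, 2}

def find_all(text):
    """All non-overlapping leftmost-longest matches, in one pass over text."""
    matches = []
    active = {}   # NFA state -> earliest start offset of any thread in it
    best = None   # pending (start, end) candidate match
    for i in range(len(text) + 1):
        if i < len(text) and (START not in active or active[START] > i):
            active[START] = i          # inject a fresh attempt starting at i
        while True:
            # accepting threads have matched text[offset:i]; keep the
            # leftmost start, and for equal starts the later (longer) end
            starts = [active[s] for s in ACCEPT if s in active]
            if starts and (best is None or min(starts) <= best[0]):
                best = (min(starts), i)
            # the candidate is final once no surviving thread started at or
            # before it (or the input is exhausted): emit and prune overlaps
            blocked = (best is not None and active and i < len(text)
                       and min(active.values()) <= best[0])
            if best is None or blocked:
                break
            s, e = best
            matches.append(text[s:e])
            best = None
            active = {st: off for st, off in active.items() if off >= e}
        if i < len(text):              # advance every thread on text[i]
            new = {}
            for st, off in active.items():
                for t in TRANS.get((st, text[i]), ()):
                    if t not in new or new[t] > off:
                        new[t] = off
            active = new
    return matches

assert find_all("abaab") == ["ab", "a", "ab"]
```

On abaab this yields ab, a, ab: the first match is extended to ab rather than cut off at a, and scanning resumes exactly where each match ends.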
With this structure, each character of the input is processed exactly once. The per-character work is O(m) to advance and close the state set. Total: O(n·m), linear in input length for a fixed regex.
The POSIX Complication
The reason this result still warrants a clear writeup in 2024 is that true POSIX leftmost-longest semantics are surprisingly hard to implement correctly and efficiently at the same time. Many engines claim POSIX semantics but implement them with backtracking, which gives the right answer on simple cases but degrades poorly. The POSIX standard requires not only the longest overall match but also a specific choice when the same overall match can be produced by different paths through the pattern (as with (a|aa)* applied to aa, which can be parsed as one iteration matching aa or as two iterations each matching a): subexpressions must themselves match as much as possible, with earlier subexpressions taking priority. This is a stronger requirement than just “take the longest match.”
For whole-match testing, there is a clean NFA simulation that implements true POSIX semantics using a “leftmost-biased” state set ordering, as described in research by Kuklewicz and later formalized by others. Extending this cleanly to the all-matches scanning case, with proof that linearity is preserved, requires careful argument. The contribution of the original article is doing this rigorously.
Practical Implications
Lexer generators like flex, re2c, and ragel have been doing this correctly for decades, but they work by compiling the regex to a DFA first. A DFA has at most one active state at any time, so simulation is O(n) unconditionally. The tradeoff is construction cost: DFA construction is O(2^m) in the worst case, which makes it impractical for regex patterns that are computed at runtime.
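The exponential construction cost is easy to exhibit with the classic family (a|b)*a(a|b){k}, “the (k+1)-th character from the end is an a”; running the subset construction over a hypothetical hand-coded NFA for it shows the blow-up:

```python
def nfa_delta(state, c, k):
    """Transitions of a hand-coded NFA for (a|b)*a(a|b){k}."""
    out = set()
    if state == 0:
        out.add(0)            # (a|b)* self-loop on either letter
        if c == "a":
            out.add(1)        # guess that this a is the marked one
    elif state <= k:
        out.add(state + 1)    # count k more characters of either letter
    return out                # state k+1 accepts and has no successors

def reachable_dfa_states(k):
    """Number of state sets reached by the subset construction."""
    start = frozenset({0})
    seen, stack = {start}, [start]
    while stack:
        subset = stack.pop()
        for c in "ab":
            nxt = frozenset(t for s in subset for t in nfa_delta(s, c, k))
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen)

assert reachable_dfa_states(8) == 2 ** 9   # a 10-state NFA needs 512 DFA states
```

The NFA has k+2 states; the reachable DFA has 2^(k+1) of them, one for each possible pattern of a’s among the last k+1 characters read.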
For runtime-compiled regexes with NFA simulation, the O(n·m) bound is what you get. RE2 and the Go regexp package both provide this guarantee. The grep family of tools uses DFA-based matching when possible (GNU grep switches between NFA simulation and DFA simulation opportunistically), but the theoretical basis for the all-matches case in NFA simulation is exactly what the original article addresses.
The other practical domain is streaming text processing. If you’re scanning a gigabyte of log data for all occurrences of a pattern, the linearity guarantee means your runtime is proportional to the file size, not the file size squared. For most patterns this is what naive implementations deliver anyway, but the guarantee is what makes you confident that no input, however structured, will cause a regression.
What the Result Settles
The value of a clear proof here is not primarily the algorithm itself, which has been implicit in implementations for a long time. It’s the formal statement: for any regular expression and any input string, all non-overlapping longest matches can be found in time O(n·m) using NFA simulation, without backtracking, with no pathological inputs. That’s the kind of claim that a lexer author or a library maintainer can point to when someone asks why catastrophic backtracking isn’t a concern for their tool.
The distinction between “this is how it’s been implemented” and “this is provably correct and linear” is not pedantic. PCRE2 has documented worst-case inputs that cause polynomial or exponential time. Knowing which class your engine falls into is load-bearing information for anyone building a system that processes untrusted text.