· 7 min read ·

Lexer Semantics and the Linear Time Proof You Actually Need

Source: lobsters

The Problem Behind Every Tokenizer

Every lexer generator, from Flex to modern Rust crates like logos, solves the same fundamental problem: given a string, partition it into tokens by repeatedly finding the longest match for any of a set of patterns. This is the maximal munch rule. Most programmers take its O(n) performance for granted.

Proving it rigorously turns out to be more interesting than it looks, which is what this recent post by Ieva Viliums sets out to do. The claim is that you can find all longest non-overlapping regex matches in linear time. The proof is more subtle than you might expect, and it fills a gap in the literature that has mostly been covered by informal arguments and compiler folklore.

A Brief History of O(n) Regex

The theoretical foundation goes back to Ken Thompson’s 1968 paper in Communications of the ACM, which showed how to compile a regular expression into a non-deterministic finite automaton and simulate it in O(n·m) time, where n is the text length and m is the NFA size. Thompson’s key insight was simulating all possible NFA states in parallel, rather than backtracking through them one at a time.

Russ Cox revived and popularized Thompson’s ideas in his 2007 series on regular expression matching. Cox showed that the catastrophic exponential-time behavior that plagues Perl-compatible regex engines comes from allowing backtracking, which Thompson’s original NFA simulation never does. The pattern (a+)+b applied to a string like aaaaaaaaaaaaaaac causes exponential blowup in backtracking engines but is O(n) in Thompson-style NFA simulation. Re2, the regex library Google built on these ideas, processes any input in O(n) time.

But O(n) for what, exactly? For checking whether a regex matches a string, or for finding the leftmost match. The problem of finding all non-overlapping longest matches, repeatedly, across an entire input, is a different problem and requires a separate argument.

Where the Argument Gets Subtle

The naive approach to finding all longest matches works as follows:

  1. Run the NFA or DFA from the current position
  2. Track the last accepting state seen and the position at which it was seen
  3. When the automaton has no active states, report the match ending at the last accepted position
  4. Restart from last_accept + 1
  5. Repeat until end of input

This is exactly how Flex works internally. Flex compiles patterns into a single DFA and uses a table-driven simulation with a “last accept” register. Each character is a single table lookup.

The question is whether this process is actually O(n) in the length of the input. The subtlety lies in step 3. When the automaton finally dies, you may have scanned past the last accepted position. If the match ended at position j but the automaton kept running until position k before dying, you restart at j+1, which means characters in [j+1, k] get rescanned in the next iteration.

This “overrun” looks like it could accumulate. If every match is short but the automaton consistently runs far past the match end before dying, you might process some characters many times. The question is: how many times can any given character be rescanned?

The Linear Time Argument

The core insight is that overruns cannot chain. Here is why, stated concretely.

Simulation 1 starts at position 0, accepts last at position 5, and overruns to position 8. Characters 0 through 8 are scanned. We restart at position 6.

Simulation 2 starts at position 6, accepts last at position 9, and overruns to position 12. Characters 6 through 12 are scanned. Characters 6, 7, and 8 appear in both simulation 1’s overrun and simulation 2’s main scan. We restart at position 10.

Simulation 3 starts at position 10. Characters 10, 11, and 12 were in simulation 2’s overrun and will be in simulation 3’s main scan.

The pattern is consistent: each character position appears in at most two consecutive simulations, once as the overrun of one simulation and once as the main scan of the next. The overrun of simulation i covers positions [j_i+1, k_i]. Simulation i+1 begins at j_i+1, so that entire overrun is re-scanned as its main scan. The overrun of simulation i+1 begins at j_{i+1}+1, which is strictly later than where simulation i+1 started. Overruns therefore advance monotonically and cannot wrap back to positions that have already been overrun.

Total character-steps across all simulations: at most 2n. Combined with the O(1) per-step cost of DFA simulation, this gives O(n) total work.

For NFA simulation (where each step costs O(m) rather than O(1)), the bound becomes O(n·m), which is linear in n for fixed pattern size. The distinction matters when m is large relative to n, which is why DFA-based lexer generators remain preferable for performance-critical tokenizers.

Making this argument fully rigorous requires handling edge cases: what happens when no match is found at a position? The automaton dies immediately without an overrun; you advance by one. What about empty matches? Empty matches must be handled carefully to avoid infinite loops, typically by advancing at least one character after each empty match even if the pattern could match again at the same position.

Why This Matters Despite Flex Existing

If Flex has been doing this in O(n) for decades, why does a formal proof matter?

Flux uses a precomputed DFA, which can be exponentially larger than the corresponding NFA. The DFA for a complex set of lexer rules might have millions of states. Flex accepts large compilation times and large tables for large grammars. Newer lexer generators like logos in Rust and RE/Flex in C++ use similar DFA-based approaches with various compression strategies.

But DFA construction is not always feasible. For patterns that are compiled at runtime, for Unicode-aware patterns where character classes expand to large transition tables, or for applications where the set of patterns changes dynamically, direct NFA simulation is more practical. Whether NFA-based simulation can find all longest matches in O(n) for variable pattern sizes, with a clean proof, is a question worth answering.

There is also a distinction between the lexer use case and full POSIX regex semantics. POSIX specifies leftmost-longest for a single match, which requires additional bookkeeping in the NFA simulation to correctly disambiguate among alternative states. The Ville Laurikari paper from 2000 on tagged NFA simulation addresses the single-match POSIX problem formally. The all-matches case used in lexing is related but different, and it deserves its own treatment.

The Connection to Other Linear Time Results

For the special case of literal string patterns (no regex metacharacters), the Aho-Corasick algorithm published in 1975 finds all occurrences in O(n + m + z) time where z is the number of matches. This is essentially a precomputed DFA built from a trie of patterns, augmented with failure links that prevent backtracking. It is used in grep with the -F flag, fgrep, and many network intrusion detection systems.

The regex generalization, with alternation, repetition, and character classes, requires the full machinery of NFA or DFA simulation and a more careful argument. The Aho-Corasick case is a sanity check: if the result holds for literal patterns, we should expect it to hold more generally, but the proof needs to account for the additional structure of general regex patterns.

Hyperscan, Intel’s high-performance regex library (now maintained as Vectorscan by the community), takes a different approach: it decomposes complex patterns into simpler components and uses SIMD instructions to process multiple bytes at once. Hyperscan can sustain throughput in the tens of gigabytes per second on appropriate hardware, but it does not always target linear time in the classical sense. Its performance depends on pattern structure and text characteristics in ways that the theoretical bounds do not capture.

Practical Takeaways

For engineers choosing regex infrastructure, the linear time property is not just a theoretical nicety. It is a security property: a lexer or parser with linear-time guarantees cannot be made to hang by adversarial input, which matters when processing untrusted data from networks or files.

PCRE and Python’s re module can suffer catastrophic backtracking on certain inputs. This is not a corner case limited to deliberately adversarial patterns; it appears in production code written by developers who are not thinking about regex complexity. The engineering consequence is that if you care about worst-case behavior, you should use:

  • A lexer generator that builds a DFA (Flex, logos, RE/Flex)
  • A library based on NFA simulation without backtracking (re2, Rust’s regex crate)

The regex crate in Rust uses Thompson NFA simulation and provides the same linear-time guarantee. Its design notes explicitly document the O(n·m) bound and the trade-offs involved in avoiding DFA precomputation.

What the article at iev.ee adds is a clean formal statement and proof covering the all-matches case, the exact case that lexers require. It is the kind of result that textbooks on formal language theory should include but often skip, because the textbook treatment focuses on language recognition rather than tokenization.

Folklore into Proof

Regular expression theory is old. What remains interesting is the gap between the theoretical results and their precise applicability to real engineering problems. The linear time result for single-match NFA simulation is well-known; its extension to all-matches lexing has circulated as folklore among compiler writers without being written down rigorously.

The proof is not complicated. Its core fits in a paragraph: each character position is visited at most twice across all simulations, because overruns advance monotonically and cannot overlap. The details of making this precise for both DFA and NFA simulation, and for patterns with empty matches, are what make the formal treatment worthwhile. But the simplicity of the central argument is a good sign. Clean proofs for practical results mean the result is real and the intuition behind it is sound.

Was this interesting?