· 7 min read ·

All Longest Regex Matches in Linear Time: Why the Proof Is the Hard Part

Source: lobsters

The standard framing of linear-time regex matching starts with Ken Thompson’s 1968 paper and a clean result: simulate an NFA over a string of length n with a regex of size m, and you get O(n × m) time with no backtracking. This is the foundation that Russ Cox later popularized through his RE2 engine and his “Regular Expression Matching Can Be Simple And Fast” series.

But this framing answers a narrower question than it might appear. Thompson’s algorithm tells you whether a string matches a pattern. It can be extended to find the leftmost-longest match. What it does not immediately address is finding all non-overlapping longest matches in a single pass over the input, which is the problem every lexer and tokenizer must solve. This article on iev.ee works through why that extension is non-trivial and, importantly, proves it is still achievable in linear time.

What Thompson NFA Simulation Actually Gives You

Thompson’s key insight was to track sets of NFA states rather than individual states. A backtracking engine commits to one path through the NFA and retreats when it fails. Thompson simulation maintains all possible paths simultaneously: at each input character, every currently-active NFA state transitions forward, and the resulting set of states becomes the input to the next step.

The deduplication invariant makes this efficient. Each NFA state appears in the current set at most once, enforced by a generation counter: when adding state s to the current set, check if s->lastlist == current_generation. If yes, skip it; if no, set s->lastlist = current_generation and add it. Since the NFA has O(m) states, the current set never exceeds O(m) entries regardless of input length. Each character consumes O(m) work. Total: O(n × m).

The canonical C implementation from Cox’s writeup looks like this:

void addstate(List *l, State *s) {
    if (s->lastlist == listid)
        return;
    s->lastlist = listid;
    if (s->c == Split) {
        addstate(l, s->out);
        addstate(l, s->out1);
        return;
    }
    l->s[l->n++] = s;
}

void step(List *clist, int c, List *nlist) {
    listid++;
    nlist->n = 0;
    for (int i = 0; i < clist->n; i++) {
        State *s = clist->s[i];
        if (s->c == c)
            addstate(nlist, s->out);
    }
}

No heap allocation per step, no backtracking, no worst-case blowup on adversarial inputs.

The Harder Problem: All Matches

A lexer does not ask whether the entire input matches a pattern. It consumes the input left to right, at each position finding the longest possible match, emitting a token, advancing past the match, and repeating. This is “all non-overlapping longest matches,” and the naive extension of Thompson simulation does not automatically give linear time.

The naive approach: for each position i, restart an NFA simulation from position i over the remaining input, continue until the NFA can make no more progress, record the last accepting position seen. This produces the correct leftmost-longest match starting at i. Then advance i past the match and repeat.

The problem is that this can visit each input character multiple times. If the longest match from position 0 covers positions 0 through 5, you scan characters 1 through 5 during that first match attempt, then advance to position 6. In the worst case, with many overlapping failed attempts, you end up doing quadratic work.

The question the iev.ee article addresses is whether you can do all of this in a single forward pass, with each character examined a bounded number of times.

The Key Deduplication Argument

The central insight is an extension of the generation-counter trick, now applied across match start positions rather than just within a single NFA simulation.

Consider running an NFA simulation where each state-set entry is annotated not just with an NFA state, but with the input position at which the current match attempt started. At position p in the input, you inject the NFA start state with start-position = p. You also continue advancing all existing threads from earlier start positions. The state set grows to include entries (nfa_state, start_pos).

Naively, this allows the state set to grow without bound: O(m) states per start position, O(n) start positions, giving O(n × m) state-set size and thus O(n² × m) total work. The non-overlapping constraint eliminates most of this.

The critical observation: if NFA state q is currently reachable from two different start positions i < j, only the entry for start position j matters. For non-overlapping longest-match semantics, any match starting at j that passes through state q is preferred over a match starting at i that passes through state q, because the match starting at j begins later and will not overlap with whatever match was already committed before position j.

This means you can collapse: for each NFA state q, keep only the entry with the latest start position. The state set is again bounded by O(m). Each input character requires O(m) work. The total is O(n × m).

In pseudocode:

# state_map: nfa_state -> latest start_position that reached it
state_map = {}

for pos, ch in enumerate(input_string):
    # Inject new match start at this position
    start = nfa.start
    if start not in state_map or state_map[start] < pos:
        state_map[start] = pos

    # Record any accepting states as potential match endpoints
    for st, match_start in state_map.items():
        if nfa.is_accepting(st):
            update_best_match(match_start, pos)

    # Advance on current character
    new_map = {}
    for st, match_start in state_map.items():
        for next_st in nfa.transitions(st, ch):
            if next_st not in new_map or new_map[next_st] < match_start:
                new_map[next_st] = match_start

    state_map = new_map

    # If state_map is empty, emit the pending match and reset
    if not state_map:
        emit_best_match()

The state_map never exceeds m entries because each NFA state appears at most once, with the latest start position that reached it. This is the linearity guarantee.

POSIX Semantics and Why They Matter Here

POSIX defines leftmost-longest matching: among all possible matches, prefer the one starting earliest; among matches starting at the same position, prefer the longest. This is the semantics that lexers use and that the above algorithm implements.

Most popular regex libraries do not implement POSIX semantics. Python’s re, PCRE, Perl, and Ruby all use leftmost-first semantics: a|ab on the string "ab" matches "a", not "ab". POSIX requires "ab" because it is the longer match. For general text searching this distinction is often invisible, but for lexers it is fundamental. A lexer rule for >= must not be defeated by an earlier rule for >.

RE2 supports a POSIX mode (RE2::POSIX) that implements true leftmost-longest semantics, but it is slower than RE2’s default mode because it must track submatch boundaries precisely. The overall match length, distinct from submatch assignment, can be computed with the NFA simulation described above without full POSIX submatch overhead.

How Real Tools Approach This

Flex and most traditional lexer generators avoid the problem by compiling all patterns into a single combined DFA via subset construction. DFA simulation is O(|S|) per character (one table lookup per step), and longest match falls out naturally: run the DFA forward, stop when it enters a dead state, emit the last accepting state seen. The tradeoff is that DFA construction is O(2^m) in the worst case. For typical lexer grammars this is acceptable, but it precludes regex patterns with large state spaces.

Ragel similarly compiles state machines to DFAs and emits C, C++, or Go code with tight loops over input. It handles multiple patterns with declared priorities and produces output that is often faster per character than any NFA-based approach.

RE2 uses lazy DFA construction: DFA states (frozen sets of NFA states) are built on demand and cached. This gives near-O(|S|) performance on common patterns, with a fallback to NFA simulation when the DFA cache overflows. RE2’s FindAll returns all non-overlapping matches and is guaranteed linear time by its NFA foundation.

Python’s re.findall and PCRE-based engines restart a backtracking engine at each position. This is not guaranteed linear. Crafted inputs can cause quadratic or worse behavior; the standard example is (a+)+ on a string of a characters followed by a non-matching character.

Why the Proof Matters

The iev.ee article’s contribution is demonstrating that the “latest start wins” deduplication gives a valid non-overlapping longest-match output, not merely a smaller state set. These are separable claims: you can have an O(m)-bounded state set algorithm that nonetheless does not correctly implement lexer semantics. The correctness proof, that collapsing start positions to the latest one preserves the non-overlapping longest-match output, is the actual work the article does.

For anyone building a tokenizer in a language without a mature lexer generator, or implementing one from scratch for performance or portability reasons, understanding this algorithm means you can avoid DFA subset construction with its exponential preprocessing cost, while also avoiding a backtracking engine with unpredictable performance. The NFA simulation with latest-start deduplication is implementable in a few hundred lines and provably correct and linear.

Thompson’s 1968 paper established the NFA simulation foundation. Cox’s 2007 writeup brought it back into common knowledge. The iev.ee article fills a specific gap between the two: it addresses the variant of the problem that real tokenizers must solve and shows it fits into the same linear-time framework.

Was this interesting?