The Field That Keeps Getting Left Out of the BPF Verifier

The BPF verifier accepts or rejects eBPF programs before they run. Its guarantee is that an accepted program cannot crash, deadlock, or corrupt kernel memory, regardless of what input it processes. For most of what it tracks, this holds. For spinlocks, the same class of bug has surfaced repeatedly across kernel releases, and the article by Rohan Varma documents a concrete instance of it. The mechanism behind the bug is state pruning, and the fix is always a one-line addition to a comparison function. The reason it keeps happening is structural.

What the Verifier Is Doing

The verifier performs abstract interpretation over the program’s control flow graph. It simulates execution across an abstract domain: register types, value ranges, pointer provenance, stack contents, reference tracking. It visits every reachable instruction on every reachable path, checking at each step that the abstract state satisfies the verifier’s safety invariants.

Without optimization, this is exponential in the number of branch points. The verifier’s complexity limit is one million instructions processed (BPF_COMPLEXITY_LIMIT_INSNS). To stay within that budget for real programs, it uses state pruning: a cache of previously-verified abstract states indexed by instruction. When the verifier reaches an instruction it has visited before, it checks whether the current abstract state is subsumed by a cached safe state. If it is, the current path is pruned, on the assumption that the already-verified continuation covers it. The comparison that determines subsumption is states_equal(), a function in kernel/bpf/verifier.c.

Pruning is sound only if states_equal() considers every component of abstract state that affects downstream safety. Any field it omits is invisible to the comparison, meaning two states that differ only in that field will be treated as equivalent even when they are not.

How bpf_spin_lock Works

bpf_spin_lock arrived in Linux 5.1 via commit d83525ca62cf by Alexei Starovoitov. The API is minimal: a struct bpf_spin_lock { __u32 val; } embedded in a BPF map value, and two helpers, bpf_spin_lock() and bpf_spin_unlock(). On non-RT kernels, the helpers call __bpf_spin_lock_irqsave, which disables local IRQs on the acquiring CPU. Both helpers carry the notrace attribute to prevent re-entrance from ftrace or kprobes while a lock is held.

A typical usage looks like this:

struct value {
    struct bpf_spin_lock lock;
    int counter;
};

SEC("xdp")
int count_packets(struct xdp_md *ctx)
{
    struct value *v = bpf_map_lookup_elem(&my_map, &key);
    if (!v)
        return XDP_DROP;
    bpf_spin_lock(&v->lock);
    v->counter++;
    bpf_spin_unlock(&v->lock);
    return XDP_PASS;
}

The verifier enforces the following on all execution paths: a lock cannot be acquired twice without an intervening release; a lock cannot be released without being held; no lock can be held at program exit; no bpf_tail_call is permitted while a lock is held. These checks run through process_spin_lock() in kernel/bpf/verifier.c. Lock state is tracked in bpf_verifier_state via a field called active_spin_lock, which records the register ID of the currently-held lock pointer, or zero if no lock is held.

The check for held-at-exit is explicit:

if (env->cur_state->active_spin_lock) {
    verbose(env, "BPF program cannot exit with a held lock\n");
    return -EINVAL;
}

This check fires correctly when the verifier actually visits the exit instruction. The problem arises when it does not.

The Pruning Bug

The original states_equal() did not include active_spin_lock in its comparison. Consider two paths that converge at the same instruction:

Path A reaches instruction N with active_spin_lock == 0. The verifier continues, finds no held-at-exit violation, and caches this state as safe.
Path B reaches instruction N with active_spin_lock != 0, meaning it still holds a lock.

When the verifier processes Path B, states_equal() compares register types, stack contents, reference counts, and everything else it knows about, but not active_spin_lock. It returns true. Pruning fires. The verifier does not continue from instruction N on Path B; it assumes the cached result covers this path.

The program is accepted. At runtime, Path B executes, reaches exit with the lock held, and on a non-RT kernel this means IRQs remain disabled on that CPU. The CPU stops responding to interrupts. The result is a kernel hang or soft lockup with no recovery short of reboot.

The fix is a single comparison added to states_equal():

static bool states_equal(struct bpf_verifier_env *env,
                         struct bpf_verifier_state *old,
                         struct bpf_verifier_state *cur)
{
    if (old->active_spin_lock != cur->active_spin_lock)
        return false;
    /* ... register and frame comparisons ... */
}

In kernels where active_spin_lock was later refactored into a struct carrying both a pointer and an ID, the comparison expands accordingly:

if (!!cur->active_lock.ptr != !!old->active_lock.ptr)
    return false;
if (cur->active_lock.ptr &&
    cur->active_lock.id != old->active_lock.id)
    return false;

Adding the check costs verification time: paths that would have been pruned now continue, and programs with complex branching near lock boundaries sit closer to the complexity limit. But without it, the safety guarantee is wrong.

The Recurring Pattern

This is not a one-off oversight. The same omission has appeared with every significant new stateful feature added to bpf_verifier_state:

Feature	Kernel version	State field
`bpf_spin_lock`	5.1	`active_spin_lock`
Referenced objects	~4.18-5.0	reference tracking
BPF timers	5.15	timer ownership
RCU read locks	5.18	`active_rcu_lock`
Typed kernel pointers (kptrs)	5.19	typed reference tracking
`bpf_dynptr`	5.19	initialized/consumed state
BPF iterators	6.4	open/exhausted state

The pattern is: implement the feature, add the state field to bpf_func_state, implement enforcement in the relevant process_* function, ship it, discover a pruning interaction that states_equal() missed, fix it, add a regression test to tools/testing/selftests/bpf/. Eduard Zingerman has done substantial work on tightening these comparisons systematically across recent kernel versions.

Why It Keeps Happening

The root cause is that the invariant, every field in bpf_verifier_state that affects downstream safety must appear in states_equal(), is a manually maintained convention in a 25,000-line file with many contributors. No compiler check, no type system mechanism, no CI gate enforces it. A developer adds a new resource type, updates enforcement, and ships. The states_equal() update is a separate step that requires understanding the full abstract interpretation architecture, not just the feature being added.

This contrasts with approaches that prove safety structurally. The Rust borrow checker, for instance, encodes resource ownership and lifetime rules in the type system itself. Adding a new owned resource type means the type system either accepts or rejects uses of it based on existing rules; there is no separate comparison function to update. The proof is built into the structure of the language.

The BPF verifier proves safety procedurally: by enumerating paths and checking constraints at each step. Pruning makes this tractable, but it means the correctness of the pruning decision must be maintained explicitly as the abstract domain grows. In a codebase this size, with this many contributors and this rate of new feature additions, gaps appear.

The PREEMPT_RT Case

There is a structurally harder variant of the spinlock problem on CONFIG_PREEMPT_RT kernels. On RT kernels, bpf_spin_lock() does not disable IRQs; it acquires a sleeping lock backed by rtmutex. A BPF program holding this lock can be preempted. If another BPF program on the same CPU then attempts to acquire the same lock, the result is a potential deadlock.

The verifier cannot catch this. Its safety proofs are constructed against an execution model that assumes the spinlock is non-preemptible. The verifier has no model of the runtime scheduler, and the bpf_spin_lock API looks identical regardless of kernel configuration. The fix here cannot be in states_equal(); it requires reworking the underlying lock primitive on RT kernels to preserve non-preemptibility without IRQ-disable, so that the verifier’s model remains sound. The BPF plus PREEMPT_RT integration effort has been active on lore.kernel.org/bpf/ since 2021 and remains ongoing.

What This Means in Practice

For anyone writing BPF programs that use bpf_spin_lock, keep locked sections minimal and avoid helper calls inside them. Never cross a BPF-to-BPF subprogram call boundary while holding a lock; the verifier enforces this statically, and the constraint reflects the boundaries of what abstract interpretation can track cleanly without creating pruning surface.

For distributions: spinlock pruning fixes are backported to stable trees, but coverage varies. LTS kernels in cloud environments can lag by several verifier revisions. If you are running eBPF programs that use spinlocks on older kernels, the verifier’s acceptance of the program is a weaker guarantee than it appears.

The broader takeaway is that the BPF verifier’s safety guarantee is “the verifier is correct code we maintain carefully,” not a mechanically verified proof of the verifier itself. CAP_BPF and CAP_SYS_ADMIN restrict who can load programs, which limits exposure significantly. Proposals to expand unprivileged BPF raise the stakes for every pruning gap that exists.

The PREVAIL verifier, an academic project that applies polynomial-time abstract interpretation to BPF programs, has explored formalizing parts of the abstract domain more rigorously. But the upstream verifier is what ships, and it will keep acquiring new features faster than any formal treatment can cover them. That means the convention of updating states_equal() alongside every new state field will keep being the line of defense, and it will keep being missed occasionally, until the structure of the problem changes.