The Manually Maintained Contract at the Heart of BPF Verification

The BPF verifier is one of the most sophisticated pieces of safety infrastructure in the Linux kernel. Its job is to prove, before execution, that a BPF program cannot crash the kernel, access out-of-bounds memory, or violate any of the kernel’s assumptions. This article by Rohan Varma details one instance of a recurring failure mode that undermines these guarantees: a missing comparison in the verifier’s state equivalence check.

Understanding why this specific bug matters, and why it keeps reappearing in different forms, requires understanding how the verifier balances completeness against cost.

The Verification Problem

The BPF verifier performs abstract interpretation over a program’s control flow graph. For a program with N branches, the number of distinct paths is theoretically exponential. The verifier caps itself at 1,000,000 instructions processed (BPF_COMPLEXITY_LIMIT_INSNS), and it uses a technique called state pruning to stay well within that limit under typical workloads.

State pruning works like a cache. At each instruction, the verifier maintains an explored_states table keyed by instruction index. When it reaches an instruction for the second time via a different control flow path, it compares the current state against previously verified states. If the current state is subsumed by a previously verified safe state, verification stops for that path. The cached proof is considered to cover it.

This is both necessary and sound, provided that the equivalence check is complete. The soundness argument is: if state A was safe, and state B is equivalent to or more constrained than state A in every dimension that affects downstream behavior, then state B is also safe. The conclusion holds only if every dimension of state that matters is included in the comparison.

How bpf_spin_lock Works

bpf_spin_lock was introduced in Linux 5.1 to give BPF programs a way to protect shared map values from concurrent modification. The API is minimal: a 32-bit opaque struct embedded in a map value, with two helpers.

struct bpf_spin_lock {
    __u32 val;
};

Typical usage looks like this:

struct value {
    struct bpf_spin_lock lock;
    int counter;
};

SEC("xdp")
int my_prog(struct xdp_md *ctx)
{
    struct value *v = bpf_map_lookup_elem(&my_map, &key);
    if (!v)
        return XDP_DROP;
    bpf_spin_lock(&v->lock);
    v->counter++;
    bpf_spin_unlock(&v->lock);
    return XDP_PASS;
}

On non-PREEMPT_RT kernels, acquiring a BPF spinlock disables local IRQs via ___bpf_spin_lock. The runtime implementation carries the notrace attribute to prevent re-entrant instrumentation from ftrace or kprobes while a lock is held:

notrace BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock *, lock)
{
    ___bpf_spin_lock(lock);
    return 0;
}

If a BPF program holds a spinlock when it exits, the CPU stays with IRQs disabled until something else re-enables them. No interrupt handler can fire on that CPU, no scheduler tick can run, no watchdog can bark. The result is a soft lockup with no recovery short of reboot.

To prevent this, the verifier enforces that no lock can be held at any program exit point. The relevant state tracking lives in active_spin_lock inside bpf_func_state:

// On bpf_spin_lock():
if (cur->active_spin_lock) {
    verbose(env, "Illegal nesting of bpf_spin_lock\n");
    return -EINVAL;
}
cur->active_spin_lock = reg->id;

// On program exit:
if (env->cur_state->active_spin_lock) {
    verbose(env, "BPF program cannot exit with a held lock\n");
    return -EINVAL;
}

This check works correctly when the verifier actually visits the exit instruction on every path. When state pruning fires first, it does not.

The Bug

The problem is in states_equal(), the function inside kernel/bpf/verifier.c that decides whether a current execution state is covered by a cached safe state. When bpf_spin_lock was introduced, its tracking field active_spin_lock was not added to the comparison logic.

The soundness gap works as follows. Two paths reach the same instruction N through different control flow:

Path A arrives at N with no lock held. The verifier continues, reaches the exit instruction without detecting a held lock, and records the state at N as safe in explored_states.

Path B arrives at N with a lock held. states_equal() compares Path B against the cached Path A state. Register types match. Stack slots match. Reference counts match. active_spin_lock is not compared. The function returns true. Pruning fires. The verifier accepts Path B without analyzing what happens after instruction N.

Path B’s program exits with a spinlock held. IRQs remain disabled on the executing CPU. In a multi-tenant environment where CAP_BPF is available to tenants, this becomes a denial-of-service vector: a crafted program can permanently wedge any CPU it runs on.

The fix is a comparison that was never added:

if (old->active_spin_lock != cur->active_spin_lock)
    return false;

In later kernel versions where active_spin_lock was refactored into a struct with a pointer and register identity:

if (!!cur->active_lock.ptr != !!old->active_lock.ptr)
    return false;
if (cur->active_lock.ptr &&
    cur->active_lock.id != old->active_lock.id)
    return false;

The disparity between the consequences (CPU wedge, potential kernel hang) and the fix (two lines of comparison code) is notable. The difficulty was entirely in knowing that the comparison was missing, not in writing it.

A Recurring Failure Mode

What makes this more than a single bug report is the pattern it represents. The BPF subsystem has added stateful resource types at a steady pace across recent kernel versions, and each of them needed to be added to states_equal() for pruning to remain sound:

Feature	Approximate kernel version	State field
Referenced objects (sockets, iterators)	~4.18	reference tracking
`bpf_spin_lock`	5.1	`active_spin_lock`
BPF timers	5.15	timer ownership
`bpf_rcu_read_lock`	5.18	`active_rcu_lock`
Typed kernel pointers (kptrs)	5.19	typed reference
`bpf_dynptr`	5.19	initialized/consumed state
BPF iterators	6.4	open/exhausted state

The invariant is: every field in bpf_verifier_state that affects downstream safety must appear in states_equal(). This is a manually maintained convention with no compiler enforcement, no type system mechanism, and no CI gate that would flag a new state field added without a corresponding update to the comparison function.

The kernel’s test suite for BPF verifier behavior is extensive, but constructing a test case that specifically exercises a pruning soundness gap requires understanding state pruning well enough to craft control flow paths that trigger cache hits under exactly the wrong conditions. Standard coverage testing misses this class of bug almost by definition, because the buggy path looks like it was already verified.

Eduard Zingerman has done systematic work on auditing and closing these gaps across multiple recent kernel releases. But the structural problem remains: each new resource type added to BPF is another instance of the same contract, and each one has to get states_equal() right at the time it ships, or wait for someone to find the gap empirically.

A related research project, PREVAIL (Polynomial-Runtime EBPF Verifier using an Abstract Interpretation Layer, ~2019), examined this class of issue directly. Properties that require temporal reasoning, tracking that a resource acquired at point A must be released before point B, fall outside what standard polynomial-time abstract interpretation can express without explicit domain extensions. This is not a gap that better tooling can close incrementally. It is a structural property of how the verifier reasons about programs.

The PREEMPT_RT Complication

There is a related but structurally different problem on PREEMPT_RT kernels. On CONFIG_PREEMPT_RT, bpf_spin_lock does not disable IRQs. It becomes a sleeping lock backed by rtmutex. A BPF program holding this lock can be preempted by the scheduler.

The verifier’s safety proof for spinlocks was built on the assumption that IRQ disable enforces non-preemptibility. On PREEMPT_RT, that assumption is false. A program the verifier accepted as safe can be preempted while holding a lock, and if another BPF program on the same CPU then tries to acquire the same lock, a deadlock follows that the verifier never predicted.

The verifier has no model of the runtime scheduler. Its abstract interpretation reasons about program state as a sequence of instructions, not about what the operating system does between them. Fixing this class of issue requires either reworking the lock primitive on PREEMPT_RT kernels to restore non-preemptibility through other means, or restricting bpf_spin_lock on those configurations. The BPF and PREEMPT_RT integration work has been active in the kernel community since 2021 and remains ongoing.

What the Pattern Reveals

The BPF verifier is often described as providing hard safety guarantees. For any given version of the kernel, with any given set of features, the verifier’s proofs are internally consistent. The issue is that the model the verifier reasons about has to be manually kept consistent with the feature set as that set grows.

The spinlock gap is one instance. The RCU lock gap is another. The dynptr gap is another. Each one followed the same path: implement the feature, add state tracking, ship, discover that states_equal() was not updated, fix with a small patch, add a regression test. The regression test catches regressions to the specific fix. It does not catch the next feature that goes through the same cycle.

The soundness of BPF’s safety story depends on a contract that has no enforcement mechanism beyond the attention of whoever reviews each new feature’s patch series. That is a reasonable place to be for a subsystem moving this fast, but it is worth being clear-eyed about what it means. The verifier proves what it was told to prove. Getting the telling right is the hard part.