When One RCU Is Not Enough: The Corner-Case Implementations Inside Linux

Read-Copy-Update has been part of the Linux kernel since 2002. In the decades since, it has earned a reputation as one of the most effective synchronization primitives ever built for a production OS, and also as one of the most conceptually tricky to get right. Paul McKenney, the engineer most responsible for RCU in Linux, has spent much of his career writing about both of those qualities. His ongoing “Stupid RCU Tricks” series covers the corner-case implementations that exist alongside the main Tree RCU, and the collection says something interesting: general-purpose OS kernels are not general-purpose enough to get away with one implementation of anything fundamental.

What RCU Actually Does

The basic contract of RCU is asymmetric. Readers acquire access to shared data with almost zero overhead, no locking, and no memory barriers on the read path in the common case. Writers update data by creating a new version, atomically publishing a pointer to it, and then waiting for a “grace period” before freeing the old version. The grace period ends when every CPU has passed through a quiescent state, meaning a point where it cannot hold a reference to the old data.

This asymmetry is the source of RCU’s performance. On a read-heavy workload, the cost of synchronization approaches zero for readers and is pushed entirely onto writers. Networking subsystems, filesystem caches, device driver tables, and the virtual filesystem layer all use RCU heavily because their access patterns are overwhelmingly read-dominated.

The challenge is defining what “quiescent state” means in a given execution context. For the main Tree RCU implementation, a quiescent state is any point where a CPU is in user space, in the idle loop, or in the kernel without holding an RCU read-side critical section. The kernel’s scheduler provides these points naturally; you just have to wait for each CPU to reach one. Tree RCU builds a hierarchical tree of CPUs and propagates quiescent state reports up that tree, which scales reasonably well to systems with hundreds of cores.

That definition works for the common case. The corner cases are where it breaks.

Tiny RCU: The Uniprocessor Special Case

The simplest corner case is also the most elegant. For uniprocessor builds (CONFIG_SMP disabled), the entire Tree RCU machinery is overkill. On a single-CPU system, any context switch is a quiescent state for every possible RCU read-side critical section, because there is only one CPU and it just stopped running reader code. There are no other CPUs to wait on.

The result is Tiny RCU, implemented in kernel/rcu/tiny.c. The complete implementation fits in roughly 200 lines of C. Grace periods complete at the next context switch. call_rcu() callbacks are queued and invoked at that point. There are no kthreads, no hierarchical trees, no per-CPU tracking structures. The entire complexity of Tree RCU collapses into an almost-trivial state machine when you remove the constraint of multiple independent CPUs.

Tiny RCU exists for embedded systems, build environments, and single-processor virtual machines. It keeps the RCU API consistent across configurations while avoiding the overhead of Tree RCU where it would be pointless.

SRCU: When Readers Need to Sleep

Tree RCU forbids sleeping inside a read-side critical section. This is not an arbitrary restriction. If a reader could sleep, it could hold up grace periods indefinitely, blocking writers and preventing memory from being freed. The quiescent-state detection model depends on readers being bounded in duration.

Some subsystems legitimately need readers to block. Filesystem operations, for example, may need to wait for I/O inside code that requires RCU-like protection of data structures. SRCU (Sleepable RCU) handles this by abandoning the CPU-based quiescent state model and replacing it with a counter-based model.

SRCU requires callers to declare an srcu_struct:

DEFINE_SRCU(my_srcu);

/* reader */
int idx = srcu_read_lock(&my_srcu);
/* can sleep here */
srcu_read_unlock(&my_srcu, idx);

/* writer */
synchronize_srcu(&my_srcu);

The idx value is important. SRCU maintains two sets of per-CPU counters, and the index tells the system which set the reader is counted in. Writers wait for all readers counted in the older set to complete, then flip to the newer set and wait again. This two-phase approach allows writers to make progress even when new readers are constantly arriving, without requiring any reader to reach a CPU-level quiescent state.

The cost is significant. SRCU does not scale as well as Tree RCU because it must track individual reader counts rather than just detecting CPU quiescent states. It also requires the srcu_struct to be passed explicitly everywhere, which prevents the “just call rcu_read_lock()” ergonomics of the main API. But for the subsystems that need it, sleeping inside a critical section is a correctness requirement, not a preference.

Tasks RCU: Grace Periods Without CPU Cooperation

Tree RCU works by waiting for CPUs to report quiescent states through the scheduler. But some code in the kernel runs in contexts where the scheduler cannot be easily relied upon for this signaling, or where the things being protected are not per-CPU data structures but rather functions and tasks.

Tasks RCU is actually a family of three distinct implementations, each targeting a different execution model.

Tasks RCU proper targets kernel functions that can be called from user-return or voluntary-scheduling paths. It defines a quiescent state as a voluntary context switch or an explicit call to rcu_tasks_qs(). The implementation uses a workqueue-based mechanism to periodically check whether all pre-existing tasks have passed through a quiescent state. This works for protecting things like function-level trampolines and security hooks, where the “readers” are entire tasks rather than individual critical sections.

Tasks-Rude RCU takes a more aggressive approach. It sends inter-processor interrupts (IPIs) to all CPUs to force them to report quiescent states. This is expensive but necessary in contexts where voluntary scheduling cannot be guaranteed quickly enough. “Rude” is an apt name: it interrupts whatever is running on each CPU to get the quiescent state reported immediately. The use cases are narrow but the mechanism is reliable.

Tasks-Trace RCU is the most technically interesting of the three. It was built specifically to support BPF (Berkeley Packet Filter) programs in the kernel. BPF programs can execute in NMI (non-maskable interrupt) handlers, which is a context where almost nothing is safe to call. Traditional RCU read-side critical sections are not NMI-safe because they touch per-CPU variables in ways that can race with NMI delivery.

Tasks-Trace RCU solves this by using a reader-side mechanism based on a per-task counter that can be updated atomically even from NMI context. The writer side uses a combination of memory barriers and IPI-based mechanisms to detect when all pre-existing readers have completed. The result is an RCU variant that BPF programs can use safely even when running in the most restricted interrupt contexts the kernel has.

Expedited RCU: Latency Over Throughput

Tree RCU optimizes for throughput. It waits for quiescent states to arise naturally, batching multiple grace period requests together, and does not disturb CPUs that are running normally. This produces excellent average-case performance but variable latency: a grace period might complete in microseconds or take tens of milliseconds depending on what the CPUs are doing.

For some operations, particularly module unloading and certain hot-path kernel updates, that latency variance is unacceptable. synchronize_rcu_expedited() solves this by sending IPIs to all CPUs immediately, forcing them to report quiescent states without waiting. Grace periods complete in microseconds rather than milliseconds.

The cost is proportional to the number of CPUs. On a 256-core server, an expedited grace period generates 255 IPIs, which disrupts whatever was running on each of those CPUs. The kernel tries to mitigate this by automatically converting expedited requests into normal ones when the system is under heavy load, and by batching multiple expedited requests together when they arrive simultaneously.

The Poll API: Non-Blocking Grace Period Detection

The standard synchronize_rcu() call blocks until a grace period completes. That is fine for most callers, but some lock-free data structures need to check whether a grace period has elapsed without blocking.

The poll API addresses this:

unsigned long cookie = start_poll_synchronize_rcu();

/* do other work */

if (poll_state_synchronize_rcu(cookie)) {
    /* grace period has elapsed, old data is safe to free */
    kfree(old_ptr);
}

start_poll_synchronize_rcu() returns a cookie representing the current grace period sequence number. poll_state_synchronize_rcu() checks whether a grace period has elapsed since that cookie was issued. No blocking, no waiting, no callbacks. The caller is responsible for eventually freeing the memory, but can do so at a time of its choosing.

This API is newer than the others and reflects a pattern that has emerged in lock-free kernel data structures: sometimes you have work to do anyway, and you want to check grace period completion opportunistically rather than structuring your code around a blocking call or a callback.

What the Collection Reveals

The existence of five or more distinct RCU implementations is not a design failure. It is an accurate reflection of how many distinct execution contexts a general-purpose kernel has to handle. Uniprocessor builds, sleepable readers, BPF in NMI context, latency-sensitive writers, and non-blocking reclaim are genuinely different problems that share the same conceptual framework but not the same implementation.

Paul McKenney has written extensively about this in his freely available book on parallel programming, co-authored with Mathieu Desnoyers and others. The central lesson across all of it is that synchronization primitives cannot be designed in isolation from their execution context. Tree RCU is not “the right implementation” with the others being compromises; each implementation is correct for its context, and the context determines everything.

For anyone building systems software outside the kernel, the takeaway is the same. The RCU pattern, read-copy-update with deferred reclaim, is portable to userspace and to other languages. Libraries like Folly’s RCU and crossbeam-epoch in Rust implement the core idea. But the diversity of Linux kernel RCU implementations is a reminder that “what counts as a quiescent state” depends entirely on the environment, and that getting that wrong in any direction produces either incorrect synchronization or unnecessary overhead.

The corner cases are where correctness actually lives.