· 8 min read ·

The Design Space Hidden Inside RCU's Simple API

Source: lobsters

Read-Copy-Update is one of those synchronization primitives that looks elegant until you read closely enough. The basic idea fits in two sentences: readers access data with no locking overhead, and writers publish new versions by swapping a pointer, then wait for all pre-existing readers to finish before freeing the old version. Simple. Except the Linux kernel ships multiple implementations of this concept, each with its own constraints, trade-offs, and failure modes. Paul McKenney’s ongoing “Stupid RCU Tricks” series catalogs the corner cases, and the corner cases are where the interesting engineering lives.

Why One Implementation Is Not Enough

The API a caller sees is deceptively uniform:

rcu_read_lock();
p = rcu_dereference(gp);
/* use p */
rcu_read_unlock();

On the write side:

new = kmalloc(sizeof(*new), GFP_KERNEL);
*new = *old;
new->field = updated_value;
rcu_assign_pointer(gp, new);
synchronize_rcu(); /* wait for readers */
kfree(old);

The synchronize_rcu() call is where the implementations diverge. What does “wait for all pre-existing readers” mean? It depends entirely on what constitutes a reader, what constitutes a quiescent state, and what guarantees the execution environment makes about preemption and sleeping.

The classic formulation works because in a non-preemptible kernel, any context switch is a quiescent state. A CPU that has context-switched cannot be in the middle of a read-side critical section, because rcu_read_lock() disables preemption. So you wait until every CPU has context-switched at least once, and you are done. The implementation is essentially a two-phase batch: collect callbacks after a counter increment, wait for all CPUs to report a quiescent state, invoke the callbacks.

That formulation breaks down in several places.

Tiny RCU: Uniprocessor as a Degenerate Case

On a uniprocessor system, the grace period detection problem collapses. There is only one CPU. Any time the caller of synchronize_rcu() gets to run at all, all previous readers have already finished. A context switch to the synchronize_rcu() caller is by definition a quiescent state for any prior read-side critical section.

The Tiny RCU implementation in kernel/rcu/tiny.c reflects this: the grace period logic is extraordinarily simple, running to only a few hundred lines total. The key function is synchronize_rcu(), which calls cond_resched() twice. That is enough. The scheduler must run between the two calls; any prior readers complete before the second call returns.

Tiny RCU exists because embedded systems and specialized configurations that compile for CONFIG_TINY_RCU save significant kernel memory and code complexity. The full Tree RCU implementation is thousands of lines supporting CPU hotplug, hierarchical grace period tracking, and expedited paths. None of that is needed on a UP system.

Tree RCU: Scaling Grace Period Detection

The main SMP implementation uses a hierarchical tree of rcu_node structures to track quiescent states across many CPUs without turning grace period detection into an O(N) walk of every CPU on the system. Each leaf node covers a set of CPUs; parent nodes aggregate quiescent state reports up the tree. The root node’s report means all CPUs have passed through at least one quiescent state.

The tree depth scales logarithmically with CPU count, which matters on systems with hundreds of CPUs. A naive flat scan of 512 CPUs holding a single global lock would serialize badly. The tree allows parallel quiescent state reporting with per-node locking.

CPU offline and hotplug introduce a class of bugs here. An offline CPU will never report a quiescent state. Tree RCU must not wait for a CPU that is not running. The handling for this involves marking offline CPUs as extended quiescent, a conceptually simple fix with non-trivial implementation because the offline transition itself races with grace period initialization.

Preemptible RCU: When Readers Can Be Preempted

The non-preemptible assumption is what lets rcu_read_lock() be a no-op beyond disabling preemption. In a fully preemptible kernel (CONFIG_PREEMPT_RT or CONFIG_PREEMPT), that assumption fails. A task can be preempted in the middle of a read-side critical section, context-switch away, and the CPU is now available for other work. The CPU context switch is no longer a quiescent state indicator; you need to track readers explicitly.

Preemptible RCU adds per-task state to track whether a task is currently in an RCU read-side critical section. When a task is preempted inside rcu_read_lock(), it gets added to a per-rcu_node linked list of “blocked readers.” The grace period cannot complete until all those blocked readers have either exited their critical sections or themselves been accounted for.

The read side cost goes from “disable preemption” to a slightly more involved path that checks and conditionally updates per-task state. Still fast, but no longer completely free. The write side cost increases because grace periods take longer when readers can sleep on a CPU.

The kernel documentation on RCU usage describes these trade-offs explicitly: PREEMPT_RCU is required when code runs in a preemptible context and needs RCU protection, but the latency properties change.

SRCU: Sleepable Read-Side Critical Sections

Preemptible RCU still requires that readers not sleep voluntarily. There are contexts where you need RCU-like semantics but the reader legitimately needs to block: waiting on I/O, taking a mutex, calling a function that might sleep. SRCU (Sleepable RCU) handles this case.

The API looks similar:

struct srcu_struct my_srcu;

/* reader */
idx = srcu_read_lock(&my_srcu);
p = srcu_dereference(gp, &my_srcu);
/* may sleep here */
do_something_that_blocks();
srcu_read_unlock(&my_srcu, idx);

/* writer */
synchronize_srcu(&my_srcu);

The key difference is that SRCU is domain-specific. You instantiate a struct srcu_struct for each independently managed domain, and grace periods are tracked per-domain rather than globally. This has an important implication: two separate SRCU domains can have overlapping grace periods, which is impossible with regular RCU where there is one global grace period state machine.

The implementation uses per-CPU counters with two “buckets.” Readers increment the current bucket’s counter on lock and decrement on unlock. The grace period flips the active bucket and waits for the old bucket’s counter to reach zero on all CPUs. Because readers can sleep, “waiting for the counter to reach zero” might take a long time; SRCU typically uses a workqueue-based mechanism to avoid stalling the writer.

SRCU is significantly more expensive than RCU on both the read and write sides, but it is the right tool for file system notification infrastructure, security module hooks, and other contexts where the reader may legitimately block. The SRCU source is worth reading alongside the Tree RCU source as a contrast in what the sleepable requirement costs.

Tasks RCU: Waiting for Voluntary Context Switches

Kernel tracing infrastructure has a different requirement again. Live patching and tracepoints need to ensure that no task is currently executing a particular function before it gets replaced. The relevant quiescent state is not “this CPU context-switched” but “this task voluntarily scheduled.”

Tasks RCU defines quiescent states as voluntary context switches (cond_resched(), schedule(), and similar calls) plus user-mode execution. A task that has called any of those since the start of a grace period is known to not be executing a patched function. This allows the tracing infrastructure to wait for all tasks to pass through a safe point without requiring them to actually context-switch off CPU.

The userspace-RCU library (liburcu) provides analogous implementations for userspace programs. It has its own set of variants: signal-based, membarrier-based, and QSBR (Quiescent-State-Based Reclamation), where threads explicitly call rcu_quiescent_state() to report their own quiescent states. QSBR is the most performant but requires thread cooperation; the signal-based variant requires no code changes in the reader threads but has higher overhead.

Expedited RCU: Paying More for Speed

Sometimes a grace period needs to complete quickly, even at significant CPU cost. Kernel module unloading, memory hotplug, and device removal need synchronous completion of in-flight RCU readers before they can proceed safely.

synchronize_rcu_expedited() forces each CPU to run a scheduler IPI, which causes a context switch on CPUs that have not yet reported a quiescent state. This accelerates grace period completion from potentially hundreds of milliseconds down to single-digit milliseconds, at the cost of a broadcast IPI and the associated inter-processor interrupt overhead.

Expedited RCU is intentionally not the default because the IPI cost at scale is substantial. On a 512-CPU system, a broadcast IPI interrupts all 511 other CPUs, causing cache thrash and scheduler disruption. The kernel tracks expedited versus normal grace period counts separately, and there is a mechanism to “deexpedite” by converting an expedited request into a normal grace period if the system is under heavy load.

Polled Grace Periods: Non-Blocking Writes

A more recent addition to the RCU API is polled grace period detection, introduced to support RCU usage in contexts that cannot block:

unsigned long cookie = get_state_synchronize_rcu();
/* ... time passes ... */
if (poll_state_synchronize_rcu(cookie)) {
    /* grace period has elapsed since cookie was taken */
    kfree(old_ptr);
}

This allows code that runs in atomic context or in a loop that cannot call synchronize_rcu() to still participate in RCU reclamation. The cookie captures a snapshot of the grace period counter; polling checks whether a full grace period has elapsed since that snapshot. No blocking, no callback registration overhead.

The kernel documentation for polled grace periods describes it as useful for reference-count-free reclamation in performance-critical paths where the overhead of callback registration is measurable.

The Actual Lesson

What makes McKenney’s catalog of corner-case implementations interesting is not the implementations themselves but what they reveal about the design space. RCU is not a single algorithm; it is a contract: readers pay almost nothing, writers pay to wait for readers. Every variant honors that contract under a different set of environmental constraints.

Tiny RCU honors it on UP by recognizing the grace period is free. Preemptible RCU honors it when the kernel scheduler can interrupt readers. SRCU honors it when readers may sleep. Tasks RCU honors it when the relevant quiescent state is voluntary scheduling. Expedited RCU honors it when latency matters more than throughput.

The common thread is that the hard part is always defining what a quiescent state means in context. Once you have a correct definition, the grace period detection follows mechanically. Getting the definition wrong, or applying the wrong variant to a context, produces subtle use-after-free bugs that are nearly impossible to reproduce under normal conditions.

The Linux kernel’s approach of shipping multiple named variants with different contracts, rather than a single universal mechanism with runtime configuration, is the right call. It forces callers to be explicit about what they are actually guaranteeing. When you reach for srcu_read_lock(), you are stating that you need sleepable read-side semantics and you accept the cost. When you reach for synchronize_rcu_expedited(), you are stating that you need fast completion and you accept the IPI cost. The type system enforces the contract more tightly than a single overloaded API could.

McKenney’s perfbook (Is Parallel Programming Hard, And If So, What Can You Do About It?) covers all of this and more, free online, and is the most thorough treatment of real-world concurrent systems programming available anywhere.

Was this interesting?