Profiling Shows You Half the Story: Understanding the Off-CPU Blind Spot

Source: lobsters

A sampling profiler tells you where CPU time goes. That is a meaningful thing to know, but it is not the same as knowing where time goes. The distinction is the gap between on-CPU profiling and wall-clock reality, and many production performance problems live in that gap.

The Red Hat Performance Engineering post on profiler usage addresses something that is easy to miss when profiling feels like it is working: the tool is giving you accurate answers to a narrower question than you think you are asking. Understanding the structural constraints of your profiler changes what you do with its output.

How Sampling Profilers Work

A statistical sampling profiler interrupts the CPU at a fixed frequency, records the current instruction pointer and call stack, and accumulates those samples into a histogram. After enough samples, functions that consume more CPU time appear more often, and the histogram is a reasonable approximation of where cycles went.
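The accumulation step can be illustrated with a toy model. This is a minimal sketch, not how perf is implemented: the timeline, the function names, and the 7 ms sampling interval are all made up, and the "interrupt" is just a loop over timestamps.

```python
from collections import Counter

# Hypothetical timeline of one thread: (function name, milliseconds on CPU).
# The timeline repeats every 100 ms. All numbers are invented for illustration.
TIMELINE = [("parse", 30), ("compute", 60), ("serialize", 10)]

def function_at(t_ms, timeline):
    """Return the function that is running at time t_ms."""
    period = sum(duration for _, duration in timeline)
    phase = t_ms % period
    for name, duration in timeline:
        if phase < duration:
            return name
        phase -= duration
    raise AssertionError("unreachable: phase is always inside the period")

def sample(timeline, total_ms, interval_ms):
    """Interrupt every interval_ms and count which function was running."""
    histogram = Counter()
    for t in range(0, total_ms, interval_ms):
        histogram[function_at(t, timeline)] += 1
    return histogram

histogram = sample(TIMELINE, total_ms=10_000, interval_ms=7)
total = sum(histogram.values())
for name, count in histogram.most_common():
    print(f"{name:10s} {count:5d} samples  {100 * count / total:5.1f}%")
```

With enough samples the printed shares approach the true 30/60/10 split, even though no individual function call was ever timed.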

perf record on Linux samples on hardware performance monitoring unit (PMU) counter overflows, falling back to a software clock event where no PMU is available. It is conventionally run at 99 Hz (perf record -F 99), well below perf's built-in default rate. 99 is not a prime; it is chosen because it sits just off 100 Hz, so samples do not fall in lockstep with periodic work driven by 100 Hz system timers, which would systematically bias the profile toward or away from whatever happens to align with those intervals. At 99 Hz, overhead is typically under 1% for most workloads. Raise it to 10 kHz and overhead climbs to 1-5%; at 100 kHz, the sampling machinery itself starts consuming a noticeable fraction of a CPU core.
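The aliasing between a 100 Hz timer and the sampling rate can be reproduced with integer arithmetic. The sketch below assumes a hypothetical periodic task that fires every 10 ms and runs for 10% of each period, and a sampler whose first sample happens to coincide with the start of the task (the worst case).

```python
# Time is measured in ticks of 1/99000 s so both sampling periods are integers:
# 100 Hz -> 990 ticks, 99 Hz -> 1000 ticks.
TICKS_PER_SECOND = 99_000
WORK_PERIOD = TICKS_PER_SECOND // 100   # periodic task fires every 10 ms
WORK_LENGTH = WORK_PERIOD // 10         # and runs for 10% of each period (assumed)

def hit_fraction(sample_hz, n_samples):
    """Fraction of samples that land inside the periodic task."""
    sample_period = TICKS_PER_SECOND // sample_hz
    hits = sum(
        1 for k in range(n_samples)
        if (k * sample_period) % WORK_PERIOD < WORK_LENGTH
    )
    return hits / n_samples

print(f"true share of CPU time: {WORK_LENGTH / WORK_PERIOD:.1%}")
print(f"sampled at 100 Hz:      {hit_fraction(100, 990):.1%}")
print(f"sampled at  99 Hz:      {hit_fraction(99, 990):.1%}")
```

At exactly 100 Hz every sample lands at the same phase of the task, so the task looks like it owns the CPU (or, with a different starting offset, like it never runs). At 99 Hz the sample point drifts through the period and the estimate settles near the true 10%.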

Stack unwinding is where the hidden cost lives. Frame-pointer-based unwinding is O(stack depth), fast, and reliable. DWARF-based unwinding, required when the compiler omits frame pointers (which GCC and Clang both do at -O2 by default), consults .eh_frame sections and runs 5-20x slower per sample. This is why the perf documentation consistently recommends compiling profiling builds with -fno-omit-frame-pointer. At high sampling rates, the cost shows up as inflated counts for interrupt handler code rather than application code.

The Blind Spot: Off-CPU Time

On-CPU sampling cannot see time spent not running on a CPU. A thread blocked on a read() call waiting for disk, sleeping on a mutex through futex(), sitting in epoll_wait(), or preempted by the scheduler: none of these contribute samples to a standard perf record or py-spy capture. The profiler has no information about the thread during those intervals.

In latency-sensitive network services, particularly anything making database calls or external HTTP requests, the majority of wall-clock time is off-CPU. A Discord bot waiting for a database query to return is invisible to an on-CPU profiler for the entire duration of that wait. Optimize the JSON serialization code that perf identifies as a hotspot, and P99 latency stays the same because serialization was never the constraint.
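The gap is easy to observe from inside a process by comparing a wall clock with a CPU-time clock. This is a rough sketch: fake_database_query and serialize are hypothetical stand-ins, and the 200 ms sleep is an assumed round-trip time, not a measurement.

```python
import time

def measure(fn):
    """Return (wall_seconds, cpu_seconds) spent in fn()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()   # counts CPU time only, not time asleep
    fn()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start

def fake_database_query():
    # Stand-in for a blocking network round trip (assumed 200 ms).
    time.sleep(0.2)

def serialize():
    # Stand-in for CPU-bound work such as JSON encoding.
    return sum(i * i for i in range(200_000))

def handle_request():
    fake_database_query()
    serialize()

wall, cpu = measure(handle_request)
print(f"wall {wall * 1000:.0f} ms, cpu {cpu * 1000:.0f} ms, "
      f"off-CPU share {100 * (1 - cpu / wall):.0f}%")
```

An on-CPU profile of this request would attribute essentially all of its samples to serialize, because that is the only part during which the thread was running.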

Brendan Gregg documented this pattern and built tooling to address it. The offcputime tool from BCC uses BPF to attach to scheduler sched_switch events, recording the full call stack at the moment each thread is descheduled and measuring how long it remains off-CPU:

# Record off-CPU stacks for a specific process for 30 seconds
offcputime-bpfcc -df -p $(pgrep myapp) 30 > off_stacks.txt
# Generate a flame graph from the results
flamegraph.pl off_stacks.txt > offcpu_flame.svg

The resulting flame graph shows the time dimension that on-CPU profiling misses entirely. A tall futex_wait column rooted at a specific lock acquisition site is the off-CPU answer to a question the on-CPU profile could not answer. Lock contention, I/O waits, and scheduler delays each leave distinct call-stack signatures in off-CPU profiles.
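Folded stack files are plain text and can be summarized without rendering an SVG. The sketch below assumes the common folded layout (semicolon-separated frames, a space, then a value in microseconds) and uses invented sample data; real output will contain different frames, typically with kernel scheduler functions at the innermost position, which is why it sums inclusive time per frame rather than looking only at the leaf.

```python
from collections import Counter

# Made-up sample in folded format. Real output will differ.
SAMPLE = """\
myapp;main;handle_request;db_query;read;vfs_read 1800000
myapp;main;handle_request;cache_get;pthread_mutex_lock;futex_wait 950000
myapp;main;handle_request;db_query;read;vfs_read 700000
myapp;main;event_loop;epoll_wait 300000
"""

def parse_folded(text):
    """Yield (frames, microseconds) for each non-empty folded line."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        stack, _, value = line.rpartition(" ")
        yield stack.split(";"), int(value)

def off_cpu_by_frame(text):
    """Sum off-CPU microseconds for every frame that appears in a stack."""
    totals = Counter()
    for frames, micros in parse_folded(text):
        for frame in set(frames):      # count each frame once per stack
            totals[frame] += micros
    return totals

totals = off_cpu_by_frame(SAMPLE)
for frame in ("vfs_read", "futex_wait", "epoll_wait"):
    print(f"{frame:12s} {totals[frame] / 1e6:5.2f} s off-CPU")
```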

perf also exposes sched:sched_switch tracepoints for this purpose:

perf record -e sched:sched_switch -a -g -- sleep 30
perf report

This is coarser than BPF-based off-CPU analysis but requires no BCC installation, only a kernel with tracepoints compiled in.

For JVM workloads, async-profiler’s -e wall mode takes a different approach: it samples all threads at wall-clock intervals, including sleeping ones. This does not give precise off-CPU durations, but it reliably surfaces threads that are consistently parked, directing attention toward lock contention, GC pauses, and I/O waits that on-CPU profiling would miss.
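The idea of wall-clock sampling can be shown by analogy in Python, which exposes the current frame of every thread through sys._current_frames(). This is only a sketch of the concept, not of async-profiler's implementation; wait_for_reply and its 0.5 s sleep are hypothetical.

```python
import sys
import threading
import time
from collections import Counter

def wait_for_reply():
    # Stand-in for a thread parked on I/O or a lock (assumed 0.5 s wait).
    time.sleep(0.5)

def sample_all_threads(n_samples, interval_s):
    """Record the innermost Python function of every other thread,
    whether or not that thread is currently running."""
    counts = Counter()
    me = threading.get_ident()
    for _ in range(n_samples):
        for thread_id, frame in sys._current_frames().items():
            if thread_id != me:
                counts[frame.f_code.co_name] += 1
        time.sleep(interval_s)
    return counts

worker = threading.Thread(target=wait_for_reply)
worker.start()
time.sleep(0.05)                      # give the worker time to reach its wait
counts = sample_all_threads(n_samples=20, interval_s=0.01)
worker.join()
print(counts.most_common(3))
```

The sleeping thread shows up in nearly every sample even though it consumes no CPU, which is exactly the signal an on-CPU sampler cannot produce.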

The JVM Safepoint Problem

The JVM adds a profiling pathology beyond the on-CPU versus off-CPU split. Most JVMTI-based Java profilers, including older versions of JProfiler and YourKit, can only take stack samples at safepoints. Safepoints are positions in bytecode where all threads can be safely paused for garbage collection. They occur at method call sites, loop back-edges, and a handful of other locations, but they do not exist inside tight compiled loops that the JIT has optimized to native machine code.

The consequence is that the hottest code in a CPU-bound Java application can be completely invisible to a safepoint-biased profiler. A loop running for 300 milliseconds without encountering a safepoint contributes zero samples. The profiler overrepresents instructions adjacent to safepoint checks and produces what looks like a coherent profile of a program that differs from the one actually running.
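A toy simulation makes the distortion concrete. It is a deliberately simplified model, not a description of HotSpot internals: a 300 ms loop with no safepoints alternates with a hypothetical 10 ms helper method that can be sampled anywhere, and a biased sample is simply deferred until the next safepoint.

```python
from collections import Counter

HOT_MS = 300      # loop with no safepoint polls (figure from the text)
SMALL_MS = 10     # hypothetical helper method containing safepoints
PERIOD = HOT_MS + SMALL_MS

def running_at(t_ms):
    """Which code is executing at time t_ms."""
    return "hot_loop" if t_ms % PERIOD < HOT_MS else "small_method"

def next_safepoint(t_ms):
    """Earliest time >= t_ms at which the thread reaches a safepoint."""
    phase = t_ms % PERIOD
    return t_ms if phase >= HOT_MS else t_ms + (HOT_MS - phase)

def profile(total_ms, interval_ms, safepoint_biased):
    counts = Counter()
    for t in range(0, total_ms, interval_ms):
        when = next_safepoint(t) if safepoint_biased else t
        counts[running_at(when)] += 1
    return counts

unbiased = profile(3100, 10, safepoint_biased=False)
biased = profile(3100, 10, safepoint_biased=True)
print("unbiased:", dict(unbiased))
print("biased:  ", dict(biased))
```

In this model the biased profile assigns every sample to small_method, while the unbiased one puts roughly 97% of samples in hot_loop.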

Aleksey Shipilëv at Red Hat quantified the gap: safepoint-biased profilers can attribute 0% of samples to the actual hottest code path while correctly implemented profilers show it consuming over 90% of CPU time. The paper “Evaluating the Accuracy of Java Profilers” (Mytkowicz et al., PLDI 2010) found that four widely used Java profilers gave contradictory results on identical benchmarks, all due to safepoint bias interacting with different JIT optimization decisions.

async-profiler exists specifically to solve this. It uses AsyncGetCallTrace, an unofficial HotSpot API that is not part of any public specification but has remained available across JDK releases, combined with perf_event_open PMU interrupts. This allows samples at arbitrary execution points regardless of safepoint state, producing accurate attribution to JIT-compiled hot loops:

# Profile a JVM process for 30 seconds, output HTML flame graph
./asprof -d 30 -f output.html $(pgrep java)

The difference between async-profiler’s output and a safepoint-biased profiler’s output on CPU-bound Java code can be the difference between optimizing the right function and optimizing one that the profiler hallucinated into prominence.

The Observer Effect

Any profiler changes the program it measures. For sampling tools this is usually small, but not always negligible. The interrupt handler and stack walker execute instructions that evict application code and data from L1 and L2 caches. At high sampling rates, this inflates apparent cache miss counts in the profile, making memory-bound code look worse than it actually is during unobserved execution.

JVM profilers have a more severe version of this. Tiered compilation in HotSpot promotes methods from interpreted to C1 to C2 compilation based on invocation counts and observed behavior. A profiler that adds overhead to method calls can change the JIT’s optimization decisions: a method that would otherwise be promoted to C2 stays in C1 because the overhead makes it appear less hot. The profiled run is slower for a reason entirely unrelated to the code’s actual performance characteristics.

Kalibera and Jones documented in their 2014 paper that adding a 1 kHz sampling profiler to a tight Java benchmark degraded throughput by 8-15%, enough to change which of two competing implementations appeared faster in the comparison. The profiler changed the answer to the question it was supposed to help answer.

Practical mitigation follows from understanding the mechanism: keep sampling rates as low as they can be while still giving statistically useful signal (99 Hz is sufficient for identifying hot paths in most workloads), profile on a separate production replica rather than a development build, and verify conclusions across multiple independent runs before acting on them. Hardware PMU event-based sampling (perf, Intel VTune) tends to disturb the OS scheduler less than timer-based SIGPROF delivery, which is one concrete reason to prefer it for latency-sensitive profiling.
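Whether a low rate gives enough signal is a matter of sample count. As a back-of-the-envelope sketch, treating samples as independent Bernoulli trials (a simplifying assumption that real workloads only approximate), the standard error of a function's measured share is sqrt(p(1-p)/n):

```python
import math

def sampling_error(share, hz, seconds):
    """Standard error of a function's sampled share of one thread's CPU time,
    under the independent-samples assumption."""
    n = hz * seconds
    return math.sqrt(share * (1 - share) / n)

for hz in (99, 9_900):
    err = sampling_error(share=0.05, hz=hz, seconds=30)
    print(f"{hz:5d} Hz for 30 s: a 5% hot spot is measured to within about "
          f"+/-{100 * 2 * err:.2f} percentage points")
```

Under these assumptions a 30-second capture at 99 Hz already pins a 5% hot spot to within roughly one percentage point, and raising the rate a hundredfold only tightens that by a factor of ten.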

Matching Tool to Problem

For CPU-bound work where a known algorithm runs slower than expected, on-CPU sampling with perf or async-profiler is the right starting point. The hot path is on-CPU and the profiler will find it.

For latency problems in services that wait on network or disk, off-CPU profiling is the correct first step. offcputime or perf sched will show what threads are waiting on and for how long. Lock contention appears as futex_wait call chains; I/O appears as vfs_read or io_submit chains sleeping in the off-CPU trace.
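Those signatures lend themselves to a first-pass triage of off-CPU stacks. In the sketch below the frame names come from the text, while the category labels, the substring matching, and the sample stacks are this example's own assumptions.

```python
from collections import Counter

# (category label, frame-name markers). Labels are illustrative only.
SIGNATURES = [
    ("lock contention", ("futex_wait",)),
    ("I/O wait", ("vfs_read", "io_submit")),
    ("idle event loop", ("epoll_wait",)),
]

def classify(frames):
    """Assign a stack to the first category whose marker appears in it."""
    for label, markers in SIGNATURES:
        if any(marker in frame for frame in frames for marker in markers):
            return label
    return "other"

def breakdown(stacks):
    """stacks: iterable of (frames, microseconds). Returns share per category."""
    totals = Counter()
    for frames, micros in stacks:
        totals[classify(frames)] += micros
    grand_total = sum(totals.values())
    return {label: micros / grand_total for label, micros in totals.items()}

# Made-up stacks for illustration.
STACKS = [
    (["main", "db_query", "read", "vfs_read"], 2_500_000),
    (["main", "cache_get", "pthread_mutex_lock", "futex_wait"], 950_000),
    (["main", "event_loop", "epoll_wait"], 300_000),
    (["main", "nanosleep"], 250_000),
]

shares = breakdown(STACKS)
for label, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{label:16s} {share:6.1%}")
```

A breakdown like this only points at where to look first; the full call chains are still needed to find the specific lock or I/O call responsible.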

For JVM workloads requiring CPU-bound optimization, async-profiler in -e cpu mode is more reliable than any JVMTI-based tool because it avoids safepoint bias. For microarchitecture questions, whether a loop is stalling on cache misses, branch mispredictions, or instruction fetch bandwidth, Intel VTune’s Top-down Microarchitecture Analysis Method (TMAM) provides a systematic breakdown that perf stat can approximate but cannot match in depth. VTune has been free since 2019.

The thread running through all of these cases is that a profiler’s model of what it samples shapes what it can tell you. A flame graph that looks complete can be missing the majority of the latency picture if the application spends most of its time waiting rather than running. Reading profiler output without knowing which events the tool actually captures leads to confident conclusions about the wrong things, and optimizing based on those conclusions produces changes that do not show up in production metrics.