The On-CPU Illusion: Why Your Profiler Shows You Half the Story

The Model Behind the Measurement

A sampling profiler operates on a simple contract: at regular intervals, interrupt the running program and record where execution is. After thousands of samples, patterns emerge. Functions that appear frequently are consuming CPU time. The resulting flame graph is a compressed portrait of your program’s hot path, and it is genuinely useful.

The problem is not that this portrait is wrong. It is that it only renders what is happening while the CPU is executing your code. Everything else (time spent waiting for a lock, stalled on a disk read, or runnable but not yet scheduled) does not appear. On a CPU-bound program, that is fine. On everything else, you are looking at a picture of the right wall in the wrong room.

Red Hat’s performance engineering team documented this directly while working through a real optimization case. The profiler showed a clear hot path. The hot path was real. It was not the bottleneck. That gap, between what the profiler confidently shows and what actually limits your program’s throughput, is worth understanding in detail.

What Sampling Profilers Actually Measure

perf record on Linux, Instruments on macOS, and async-profiler on the JVM all share the same underlying model. At a configured frequency, typically 99 Hz or 999 Hz, a timer or performance-counter interrupt fires and the profiler records the current instruction pointer and call stack. After collection, those stacks are aggregated, and you see the functions that were running during the most samples.

The key phrase is “were running.” A sample is only taken when the CPU is executing your process. When your process is not scheduled, no samples accumulate. This is the on-CPU vs off-CPU distinction, and it determines whether a profiler is even capable of diagnosing your problem.

# Standard CPU sampling with perf - collects only on-CPU samples
perf record -g -F 99 -p $(pgrep myservice)
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu_profile.svg

If your service spends 80% of its time waiting on a database response and 20% processing the result, the flame graph will show you the processing code in exquisite detail. The 80% will be absent. Not distorted. Absent.
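The same split can be observed from inside a process by comparing wall-clock time with consumed CPU time. The sketch below is illustrative only: `handle_request` and its parameters are hypothetical stand-ins for a request that blocks on a database and then processes the result, and the sleep merely simulates the wait.

```python
import time


def handle_request(wait_s: float, work_iters: int) -> int:
    """Hypothetical request: a blocking wait followed by some computation."""
    time.sleep(wait_s)  # stands in for a blocking database call (off-CPU)
    total = 0
    for i in range(work_iters):  # stands in for result processing (on-CPU)
        total += i * i
    return total


def measure(fn, *args):
    """Return (wall_seconds, cpu_seconds) for a single call to fn."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # process CPU time; excludes time asleep
    fn(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


wall, cpu = measure(handle_request, 0.3, 100_000)
off_cpu = max(wall - cpu, 0.0)
print(f"wall={wall:.3f}s  on-CPU={cpu:.3f}s  off-CPU={off_cpu:.3f}s")
print(f"share of wall time invisible to a CPU profiler: {off_cpu / wall:.0%}")
```

A CPU sampler attached to this process could only ever land in the loop, because the sleep contributes wall-clock time but no CPU time.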

The Categories of Off-CPU Time

Off-CPU time falls into several categories with different causes and different diagnostic approaches.

I/O wait is the familiar case. When a thread issues a blocking read or write and the data is not in the page cache, the kernel puts the thread to sleep until the operation completes. The thread is off-CPU for the entire duration. This shows up as high iowait in tools like vmstat or iostat, but a CPU profiler will suggest your code is extremely efficient.

Lock contention is more insidious. When a thread tries to acquire a mutex held by another thread, it either busy-waits (on-CPU but doing nothing useful) or parks itself (off-CPU entirely). Futex-based locks on Linux take the latter path once any brief spinning phase gives up. The parked thread vanishes from CPU profiles, and the holding thread may not show the lock either, because its own acquisition was fast.

Scheduler latency is subtler still. Even after a thread becomes runnable, it has to wait for the scheduler to actually run it. On a heavily loaded system with many runnable threads, this can add significant latency without appearing anywhere in a CPU profile.

Page faults straddle the line. Minor page faults are usually fast, but major page faults that require disk access put the thread to sleep. You can surface these with perf stat -e major-faults,minor-faults to at least quantify the problem before tracing its source.
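A process can also read some of these counters about itself. As a rough sketch, assuming a Unix-like system where Python's `resource` module is available, the snippet below snapshots page-fault and context-switch counts before and after a sleep and a burst of page touches. The workload is contrived for illustration, and the exact numbers depend on the OS and allocator.

```python
import resource
import time


def snapshot():
    """Read fault and context-switch counters for the current process."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "major_faults": ru.ru_majflt,          # faults that required I/O
        "minor_faults": ru.ru_minflt,          # faults served without I/O
        "voluntary_switches": ru.ru_nvcsw,     # thread gave up the CPU (blocked or slept)
        "involuntary_switches": ru.ru_nivcsw,  # thread was preempted by the scheduler
    }


def delta(before, after):
    """Per-counter difference between two snapshots."""
    return {name: after[name] - before[name] for name in before}


before = snapshot()
time.sleep(0.05)  # a voluntary sleep: the thread leaves the CPU
buf = bytearray(16 * 1024 * 1024)
for offset in range(0, len(buf), 4096):
    buf[offset] = 1  # writing one byte per 4 KiB step touches each page
after = snapshot()

changes = delta(before, after)
for name, value in changes.items():
    print(f"{name:22s} {value}")
```

Counters like these say how often a process left the CPU, not where in the code it happened, which is why stack-level off-CPU tracing is still needed.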

Seeing Off-CPU Time

The tooling for off-CPU analysis has improved substantially. On Linux, BCC’s offcputime script instruments the scheduler’s context-switch path (the finish_task_switch kernel function) with eBPF, recording how long each thread stays off the CPU and the stack it blocked on:

# Collect folded off-CPU stacks (-f) for a process over 30 seconds
offcputime-bpfcc -f -p $(pgrep myservice) 30 > offcpu.stacks

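# Or with bpftrace directly (on some kernels the symbol is finish_task_switch.isra.0)
bpftrace -e '
  #include <linux/sched.h>
  kprobe:finish_task_switch {
    // arg0 is the task that was just switched out: note when it left the CPU
    $prev = (struct task_struct *)arg0;
    @start[$prev->pid] = nsecs;
    // the current thread is back on-CPU: charge the time it was away to its stacks
    $last = @start[tid];
    if ($last != 0) {
      @[kstack, ustack, comm] = sum(nsecs - $last);
      delete(@start[tid]);
    }
  }'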

The resulting stacks can be fed into the same FlameGraph scripts used for CPU profiles, giving you an off-CPU flame graph where width represents time spent sleeping rather than time spent computing. Brendan Gregg’s off-CPU analysis methodology formalizes this into a systematic workflow and distinguishes between blocked I/O, lock contention, and voluntary sleeps.
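To make "width represents time" concrete, here is a minimal sketch of the aggregation a flame graph performs. It assumes the stacks are in the folded format the FlameGraph scripts consume (semicolon-separated frames, root first, followed by a value). The sample stacks and their microsecond values are invented for illustration.

```python
from collections import defaultdict


def inclusive_time(folded_lines):
    """Sum the value of every stack that contains each frame (flame graph width)."""
    totals = defaultdict(int)
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        stack, _, value = line.rpartition(" ")
        # set() keeps a recursive frame from being counted twice per stack
        for frame in set(stack.split(";")):
            totals[frame] += int(value)
    return dict(totals)


# Hypothetical folded off-CPU stacks; values are microseconds spent blocked.
sample = [
    "myservice;main;handle_request;query_db;read;io_schedule 800000",
    "myservice;main;handle_request;update_cache;pthread_mutex_lock;futex_wait 150000",
    "myservice;main;handle_request;query_db;read;io_schedule 50000",
]

totals = inclusive_time(sample)
grand_total = sum(int(line.rsplit(" ", 1)[1]) for line in sample)
for frame, us in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{frame:22s} {us:>9d} us  {us / grand_total:6.1%}")
```

Ranking frames this way is a quick text-only check on where blocked time concentrates before rendering the full graph.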

For JVM workloads, async-profiler handles both modes cleanly. With -e cpu, it samples on-CPU activity using perf events. With -e wall, it samples wall-clock time, including threads blocked in I/O or synchronization, giving a closer approximation of what your service is spending latency on:

# wall-clock mode captures blocked threads too (async-profiler 3.x ships the launcher as bin/asprof)
./profiler.sh -e wall -d 30 -f wall_profile.html $(pgrep java)

The wall-clock flame graph for a service with lock contention will often show a thick band of stack frames parked inside pthread_cond_wait or futex that simply does not exist in the CPU-only profile.
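The idea behind wall-clock sampling can be sketched in a few lines. This is a toy model, not how async-profiler is implemented: it polls CPython's `sys._current_frames()` (an interpreter-specific function) on a timer and counts whichever Python function each watched thread is in, whether or not that thread is running. `blocked_worker` is a hypothetical thread that parks on a lock.

```python
import sys
import threading
import time
from collections import Counter


def blocked_worker(lock):
    """Hypothetical worker that parks until the main thread releases the lock."""
    with lock:
        pass


def sample_wall_clock(thread_ids, samples, interval_s):
    """Record the innermost Python function of each thread at fixed intervals."""
    counts = Counter()
    for _ in range(samples):
        frames = sys._current_frames()  # snapshot of every thread's current frame
        for tid in thread_ids:
            frame = frames.get(tid)
            if frame is not None:
                counts[frame.f_code.co_name] += 1
        time.sleep(interval_s)
    return counts


lock = threading.Lock()
lock.acquire()  # hold the lock so the worker has to wait
worker = threading.Thread(target=blocked_worker, args=(lock,))
worker.start()

counts = sample_wall_clock([worker.ident], samples=20, interval_s=0.005)

lock.release()
worker.join()
print(counts.most_common())
```

The worker consumes almost no CPU while it waits, so an on-CPU sampler would have little to record for it, yet nearly every wall-clock sample lands in `blocked_worker`.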

Hardware Counters: A Different Axis

Beyond the on-CPU vs off-CPU dimension, CPU profilers have a second blind spot: they do not naturally surface micro-architectural events. Your code can be running on-CPU the entire time and still be slow for reasons that call-stack sampling cannot express.

Cache misses are the common case. If your hot loop is traversing a large, pointer-heavy data structure, it will appear prominently in the flame graph, but the flame graph alone will not tell you that the loop is stalling at the CPU’s memory interface. perf stat can reveal this:

perf stat -e cycles,instructions,cache-misses,cache-references,branch-misses \
  ./myprogram

# Output suggesting a cache-bound workload:
#   cycles:             12,450,000,000
#   instructions:        4,100,000,000   # IPC of 0.33 - very low
#   cache-misses:            8,200,000
#   cache-references:       41,000,000   # 20% miss rate

An IPC (instructions per cycle) well below 1.0 on modern superscalar hardware is often a sign of memory latency. The code is on-CPU and the profiler can see it, but the profiler’s data model, which counts call-stack presence, does not distinguish between a function that is retiring instructions at full speed and one that occupies the CPU while stalled waiting for a cache line to arrive from RAM.
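The two derived figures in the annotated output are simple ratios. As a worked check, the snippet below recomputes them from the example counter values shown above; the helper names are just for illustration.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions retired per CPU cycle."""
    return instructions / cycles


def miss_rate(misses: int, references: int) -> float:
    """Fraction of cache references that missed."""
    return misses / references


# Counter values copied from the example perf stat output above.
cycles = 12_450_000_000
instructions = 4_100_000_000
cache_misses = 8_200_000
cache_references = 41_000_000

example_ipc = ipc(instructions, cycles)
example_miss_rate = miss_rate(cache_misses, cache_references)
print(f"IPC: {example_ipc:.2f}")
print(f"cache miss rate: {example_miss_rate:.0%}")
```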

perf record with PEBS (Precise Event Based Sampling) on Intel hardware can go further, attributing cache-miss samples to specific instructions rather than just functions:

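perf record -e MEM_LOAD_RETIRED.L3_MISS:pp -g ./myprogram
perf report --sort=sym,dso
# Per-instruction view of the same samples (event names vary by CPU; see perf list)
perf annotate --stdio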

This pins the blame not just to a function but to the specific load instruction responsible for the misses, which is the information you need to decide whether to change data layout, add prefetch hints, or restructure access patterns.

Branch mispredictions follow the same pattern. The branch-misses hardware event counts mis-speculated branches, but identifying which branches are responsible requires combining perf record -e branch-misses with annotated disassembly. The profiler tells you the function; the hardware counter tells you the cost; the annotation tells you the instruction.

Building a Complete Diagnostic Picture

The practical implication is that performance engineering requires choosing the right tool for each hypothesis rather than applying one tool universally.

Start with a CPU profile to find on-CPU hot paths. If the profile explains the observed latency, fix the hot path and re-measure. If the profile shows modest CPU consumption but high end-to-end latency, the bottleneck is likely off-CPU. Switch to wall-clock sampling or off-CPU tracing to find where threads are sleeping. If the CPU profile shows a hot path with unexpectedly low throughput, check hardware counters for cache or branch miss rates.
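The sequence above can be written down as a small triage function. This is only a sketch of the reasoning: the function name, the inputs, and both thresholds are assumptions chosen for illustration, not recommended values, and real investigations rarely reduce to two numbers.

```python
def next_step(on_cpu_fraction: float, ipc: float,
              on_cpu_threshold: float = 0.5, ipc_threshold: float = 1.0) -> str:
    """Suggest which tool to reach for next.

    on_cpu_fraction: CPU time divided by wall-clock time for the slow path.
    ipc: instructions per cycle measured for the workload.
    Both thresholds are illustrative assumptions.
    """
    if on_cpu_fraction < on_cpu_threshold:
        # Most of the latency is spent off the CPU; a CPU profile cannot see it.
        return "off-CPU tracing or wall-clock sampling"
    if ipc < ipc_threshold:
        # The code is on-CPU but retiring few instructions per cycle.
        return "hardware counters (cache and branch misses)"
    return "CPU profile: fix the hot path and re-measure"


# Three hypothetical workloads.
waiting_on_db = next_step(on_cpu_fraction=0.2, ipc=1.8)
pointer_chasing = next_step(on_cpu_fraction=0.95, ipc=0.33)
compute_bound = next_step(on_cpu_fraction=0.95, ipc=2.1)
print(waiting_on_db)
print(pointer_chasing)
print(compute_bound)
```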

Each of these tools answers a different question. A CPU profiler answers: where is the CPU executing my code? An off-CPU profiler answers: where is my code sleeping and why? Hardware counters answer: how efficiently is the CPU executing the instructions that are running? None of them answer all three questions simultaneously, and treating a CPU flame graph as a complete performance picture is how investigations stall at a plausible but incomplete explanation.

The Red Hat performance engineering post is a useful case study precisely because it does not reach for a dramatic conclusion. The profiler was not wrong. The engineers who read it were not incompetent. The tool did what it was built to do and nothing more. The gap between that and a complete analysis is a gap in methodology, not a gap in tool quality.

Profilers as Models

Every profiler is a model, and every model has scope. Sampling profilers model on-CPU execution. Off-CPU tools model scheduler sleep states. Hardware counters model micro-architectural resource contention. A complete performance investigation draws on all three and stays clear about which model produced which observation.

The flame graph is the right starting point. On CPU-bound workloads it may be the only point. On everything else, its confident visual language can create the impression that you have answered a question you have only partially asked. Knowing what your profiler cannot see is as important as knowing how to read what it shows.