
The On-CPU Illusion: What Your Profiler Is Actually Measuring

Source: lobsters

Performance work tends to start with the same move: attach a profiler, stare at a flame graph, find the widest frame, optimize it. The assumption baked into that workflow is that the profiler shows you where time is going. For a specific and limited class of problems, it does. For a surprisingly wide class of real-world slowness, it shows you almost nothing.

The Red Hat performance engineering team’s post on profiler methodology frames this clearly: profilers are hypothesis-generation tools, not oracles. That framing deserves unpacking, because the gap between what sampling profilers show and what most developers think they show is wide enough to send you chasing the wrong thing for hours.

What a Sampling Profiler Actually Measures

Most profilers in everyday use (perf record, async-profiler, py-spy, rbspy, the V8 profiler) are sampling profilers. At some configurable frequency (typically 99 to 999 Hz), they interrupt the running thread, capture the call stack, and record it. After enough samples, the distribution of stacks tells you which code the CPU was executing during the measurement window.

The key phrase is “the CPU was executing.” Sampling profilers only see on-CPU time. When a thread is blocked waiting on a network response, asleep waiting for a timer, queued waiting to acquire a mutex, or stalled waiting for a page fault to resolve, the profiler captures nothing for that thread. It is not running; there is nothing to sample.
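The gap between wall-clock time and on-CPU time can be observed from inside a process using only the standard library. The sketch below illustrates the distinction and is not how any profiler is implemented: it compares `time.perf_counter` (wall clock) with `time.process_time` (CPU time charged to the process) around one function that sleeps and one that computes. The names `measure`, `blocked`, and `busy` are invented for the example.

```python
import time

def measure(fn):
    """Return (wall_seconds, cpu_seconds) spent inside fn()."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - wall0, time.process_time() - cpu0

def blocked():
    # Off-CPU: the thread is descheduled while it sleeps.
    time.sleep(0.2)

def busy():
    # On-CPU: the interpreter is executing instructions the whole time.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total

blocked_wall, blocked_cpu = measure(blocked)
busy_wall, busy_cpu = measure(busy)

print(f"blocked: wall={blocked_wall:.3f}s cpu={blocked_cpu:.3f}s")
print(f"busy:    wall={busy_wall:.3f}s cpu={busy_cpu:.3f}s")
```

The sleeping function accumulates wall-clock time but almost no CPU time, which is exactly the portion a CPU sampling profiler has no samples for.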

This means a service that spends 80% of its wall-clock time blocked on database queries will produce a sparse flame graph that accounts for only the remaining 20%. That 20% of CPU work will be accurately profiled, but even optimizing it down to zero cuts wall-clock time by at most 20%. The real bottleneck, the query latency, is invisible.
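The ceiling on that improvement is plain Amdahl's-law arithmetic. As a minimal sketch, assuming the off-CPU share stays constant while only the on-CPU share gets faster (the function name `wall_time_after` is made up here):

```python
def wall_time_after(on_cpu_fraction, cpu_speedup):
    """Relative wall-clock time after speeding up only the on-CPU part.

    1.0 means unchanged; the off-CPU share is assumed to stay constant.
    """
    off_cpu_fraction = 1.0 - on_cpu_fraction
    return off_cpu_fraction + on_cpu_fraction / cpu_speedup

# The example from the text: 20% of wall-clock time is on-CPU.
for speedup in (2, 10, float("inf")):
    remaining = wall_time_after(0.20, speedup)
    print(f"CPU work {speedup}x faster -> {remaining:.0%} of original wall time")
```

Even an infinitely fast CPU path leaves 80% of the original latency in place, because the blocked time was never part of what got optimized.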

I/O wait, lock contention, and scheduler latency are arguably the three most common sources of production latency in networked services, and all three are outside the window that a CPU profiler can see.

Characterize Before You Hunt

The Linux perf toolset makes a useful distinction between perf stat and perf record that reflects sound methodology. perf stat reads hardware performance counters: cycles, instructions, cache misses, branch mispredictions, context switches. It produces aggregate numbers for the entire run, with near-zero overhead, because modern CPUs count these in hardware registers without any sampling interruption.

The output of perf stat tells you what kind of bottleneck you have before you invest in reading a flame graph:

$ perf stat -e cycles,instructions,cache-misses,branch-misses ./myapp

     12,345,678,901  cycles
      9,876,543,210  instructions       #  0.80  insns per cycle
        456,789,012  cache-misses
         23,456,789  branch-misses

Instructions per cycle (IPC) below 1.0 typically indicates a memory-bound workload: the CPU is spending much of its time stalled waiting for data from RAM rather than executing instructions. IPC above 3.0 suggests a compute-bound workload with good cache behavior. These thresholds are rules of thumb, since the achievable peak depends on the microarchitecture. A high cache-miss rate with low IPC points to memory access patterns as the target. A high branch-miss rate with moderate IPC points to branch prediction as the target.
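The characterization step is simple enough to script. The sketch below takes the counter values from the sample output above and applies the rough IPC cutoffs from the text. The `classify` helper and the per-thousand-instruction normalization are illustrative choices, not something perf itself reports in this form.

```python
# Counter values copied from the sample perf stat output above.
counters = {
    "cycles": 12_345_678_901,
    "instructions": 9_876_543_210,
    "cache-misses": 456_789_012,
    "branch-misses": 23_456_789,
}

def classify(ipc, low=1.0, high=3.0):
    """Map IPC onto the rough categories described in the text."""
    if ipc < low:
        return "likely memory-bound (stalled waiting on data)"
    if ipc > high:
        return "likely compute-bound with good cache behavior"
    return "mixed; look at miss rates before deciding"

ipc = counters["instructions"] / counters["cycles"]
# Misses per thousand instructions keeps runs of different length comparable.
cache_mpki = 1000 * counters["cache-misses"] / counters["instructions"]
branch_mpki = 1000 * counters["branch-misses"] / counters["instructions"]

print(f"IPC {ipc:.2f}: {classify(ipc)}")
print(f"cache misses per 1k instructions:  {cache_mpki:.1f}")
print(f"branch misses per 1k instructions: {branch_mpki:.1f}")
```

For the sample run this reports an IPC of 0.80, which falls into the memory-bound bucket and argues for looking at cache behavior before reading any flame graph.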

Run perf stat first. Then, if the characterization points to on-CPU behavior, run perf record. If the characterization points to memory, locks, or I/O, reach for different tools.

Off-CPU Profiling: The Other Half

Off-CPU profiling captures exactly what CPU profiling misses: time spent with threads blocked, waiting, or sleeping. The BCC toolkit, built on Linux eBPF, provides offcputime, which hooks into the kernel scheduler’s context-switch event and records stack traces alongside the duration of each off-CPU period:

offcputime-bpfcc -f -p <pid> 30 > out.stacks
flamegraph.pl out.stacks > off-cpu.svg

The resulting flame graph shows where threads are spending time waiting, not running. A wide pthread_mutex_lock frame here means lock contention. A wide epoll_wait means the thread is correctly idle waiting for I/O. A wide futex_wait inside malloc means allocator contention under multithreaded load.
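Before rendering an SVG, it can help to total the raw numbers. Flame graph tooling consumes a folded format: one line per unique stack, frames separated by semicolons from root to leaf, followed by a value. The sketch below sums that value per frame. The three input lines are hypothetical data written for this example, not real tool output.

```python
from collections import Counter

# Hypothetical folded stacks: "root;...;leaf <microseconds off-CPU>".
FOLDED = """\
myapp;main;handle_request;pthread_mutex_lock;futex_wait 420000
myapp;main;event_loop;epoll_wait 1500000
myapp;main;handle_request;malloc;futex_wait 80000
"""

def inclusive_time(folded_text):
    """Sum off-CPU microseconds for every frame that appears in a stack."""
    totals = Counter()
    for line in folded_text.splitlines():
        if not line.strip():
            continue
        stack, value = line.rsplit(" ", 1)
        # set() so a recursive frame is counted once per stack.
        for frame in set(stack.split(";")):
            totals[frame] += int(value)
    return totals

totals = inclusive_time(FOLDED)
for frame, usec in totals.most_common(5):
    print(f"{usec / 1e6:6.2f}s  {frame}")
```

In this made-up data the `epoll_wait` time is benign idle waiting, while the 0.5 s under `futex_wait` splits into mutex contention and allocator contention once you look at the parent frames.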

Brendan Gregg’s methodology for complete latency analysis combines both: an on-CPU flame graph from perf record for CPU time, and an off-CPU flame graph from offcputime for wait time. Neither alone gives a complete picture of where wall-clock time is going.

For Java applications, async-profiler handles both modes cleanly. In wall-clock mode (-e wall), it samples all threads regardless of whether they are on-CPU or blocked, giving a combined view. In lock mode (-e lock), it specifically profiles Java monitor contention. These modes often reveal problems that the default CPU mode completely misses.

The JVM Safe-Point Bias Problem

Java has a specific profiling pathology that is worth understanding explicitly. Traditional JVMTI-based profilers, a group that covers most profilers other than async-profiler, can only safely capture stack traces at JVM safe points. A safe point is a location in the running code where the JVM can pause all threads safely for garbage collection or other operations.

The problem is that safe points are not uniformly distributed in compiled code. The JIT compiler tends to place safe-point polls at method exits and at the back-edges of loops it cannot bound, and it commonly removes them from counted loops. This means that a tight loop with no safe-point poll will appear to take zero time in a safe-point-biased profiler, because the profiler never gets to sample inside it. The sample is recorded at the next poll instead, so the parent caller looks expensive and the actual hot loop is invisible.
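The skew is easy to reproduce in a toy model. The simulation below is a deliberately simplified sketch and not a model of any real JVM: it lays out a hypothetical timeline in which a 90 ms loop without a poll alternates with 10 ms of caller code that has one, then compares attributing each timer tick to the code running at that instant against deferring it to the next segment that contains a poll.

```python
from collections import Counter

# Hypothetical timeline: (function, duration_ms, has_safepoint_poll).
TIMELINE = [("hot_loop", 90, False), ("caller", 10, True)] * 10

def sample(timeline, period_ms, safepoint_biased):
    """Count samples per function for a fixed-period sampling timer."""
    spans, t = [], 0
    for name, duration, has_poll in timeline:
        spans.append((t, t + duration, name, has_poll))
        t += duration
    total = t

    counts = Counter()
    tick = period_ms / 2
    while tick < total:
        for start, end, name, has_poll in spans:
            if end <= tick:
                continue  # segment finished before this tick
            # First segment reached here is the one running at the tick.
            if not safepoint_biased or has_poll:
                counts[name] += 1
                break
            # Biased and no poll here: defer to the next segment with one.
        tick += period_ms
    return counts

unbiased = sample(TIMELINE, period_ms=7, safepoint_biased=False)
biased = sample(TIMELINE, period_ms=7, safepoint_biased=True)
print("unbiased:", dict(unbiased))
print("biased:  ", dict(biased))
```

In the unbiased run most samples land in `hot_loop`; in the biased run every one of them is charged to `caller`, which is the pattern described above.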

async-profiler was specifically built to fix this by using AsyncGetCallTrace, a non-safe-point API that allows stack capture at any point. The difference in profiles between a safe-point-biased profiler and async-profiler on the same workload can be dramatic, with entirely different functions appearing as bottlenecks.

For anyone profiling Java on the JVM, async-profiler is the correct default choice. The older approach produces structurally biased data.

Frame Pointers and Broken Call Graphs

On Linux, perf record captures call stacks by walking the stack frame chain. This requires that each function’s stack frame contain a pointer back to the caller’s frame, the frame pointer. For decades, the default optimization setting in GCC and Clang has been -fomit-frame-pointer, which reclaims a register (RBP on x86-64) at the cost of breaking stack unwinding for profilers.

The result is that on a typical Linux distribution, perf record -g produces call graphs full of [unknown] frames, because the frame pointer chain is broken at any function compiled without it.
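To see why one function without a frame pointer breaks everything above it, consider a toy model of the unwinder. This is a sketch under simplifying assumptions, not perf's actual implementation: stack memory is a dictionary, the saved caller frame pointer sits at address `fp`, and the return address sits at `fp + 8`, as in the usual x86-64 layout. All addresses and symbol names are invented.

```python
# Invented return addresses mapped to the functions that contain them.
SYMBOLS = {
    0x400000: "__libc_start_main",
    0x401000: "main",
    0x402000: "handle_request",
    0x403000: "parse",
}

def walk(memory, fp, max_depth=64):
    """Follow the saved-frame-pointer chain and return the caller names."""
    frames = []
    for _ in range(max_depth):
        if fp == 0:                # conventional end of the chain
            break
        if fp not in memory:       # RBP held arbitrary data, not a frame
            frames.append("[unknown]")
            break
        return_addr = memory.get(fp + 8)
        frames.append(SYMBOLS.get(return_addr, "[unknown]"))
        fp = memory[fp]            # step to the caller's frame
    return frames

intact = {
    0x7000: 0x7100, 0x7008: 0x403000,   # leaf frame, called from parse
    0x7100: 0x7200, 0x7108: 0x402000,   # parse, called from handle_request
    0x7200: 0x7300, 0x7208: 0x401000,   # handle_request, called from main
    0x7300: 0x0,    0x7308: 0x400000,   # main, called from libc startup
}

# Same stack, but parse was built without frame pointers and used RBP
# as a scratch register, so the leaf saved a meaningless value.
broken = dict(intact)
broken[0x7000] = 0x2A

print(walk(intact, 0x7000))
print(walk(broken, 0x7000))
```

With the chain intact the walk recovers every caller up to the libc entry point. With a single clobbered link it stops after one frame, and everything above that point collapses into `[unknown]`.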

The alternative, DWARF-based unwinding (--call-graph dwarf), can reconstruct call stacks from debug information, but it copies large amounts of stack memory on every sample, adding significant overhead and making it unsuitable for production use.

Red Hat has been pushing to re-enable frame pointers by default across the Linux ecosystem. Fedora 38 enabled -fno-omit-frame-pointer by default for all packages, and RHEL followed. The argument is that the performance cost of keeping the frame pointer (roughly 1-3% in most workloads) is worth the gain in profiling fidelity across the entire system stack. When frame pointers are present, perf record can walk the full call chain through application code, libc, and kernel code with low overhead and high accuracy.

If you are profiling on a system that still uses -fomit-frame-pointer, either use DWARF unwinding and accept the overhead, or rebuild the relevant binaries with frame pointers. Without accurate call stacks, flame graphs show only shallow, context-free attribution.

The Observer Effect

Profiling always perturbs the system being measured. The practical question is how much and in what direction.

For perf record at 999 Hz with frame pointers, the overhead is typically 1-5%. The sampling interrupts are raised by the CPU's performance monitoring hardware, the stack walk is fast with frame pointers, and the data is written to a ring buffer. This is low enough for short bursts in production.

SIGPROF-based profilers (gprof, older Python profilers) deliver a Unix signal to the profiled thread on each sample. In multithreaded applications, signal delivery adds contention and can measurably change scheduling behavior.

Valgrind’s Callgrind tool runs every instruction under instrumentation on a synthetic CPU, producing exact call counts with 10-100x slowdown. The resulting profile is precise for CPU instruction counts but unrepresentative of real memory access behavior under the actual cache hierarchy, since Valgrind simulates cache behavior rather than using the real one.

eBPF-based tools like offcputime run BPF programs in kernel context, which is efficient, but attaching probes to extremely hot kernel functions (like kmalloc or copy_to_user) can add 10-30% overhead because the probe fires on every call. Scope probes carefully.

Putting It Together

A practical profiling methodology looks like this: start with perf stat to characterize the bottleneck type. If IPC is low and cache misses are high, the problem is in memory access patterns. If context switches are high, the problem may be in locking or excessive thread creation. If the counters look healthy, proceed to perf record to find the hot functions.

If latency is the complaint and perf record shows nothing obvious, the bottleneck is almost certainly off-CPU. Reach for offcputime or async-profiler wall mode, and look at where threads are spending time waiting rather than executing.
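The decision order described here can be written down as a small triage function. This is only a sketch of the workflow: the function name `next_step`, its parameters, and every numeric threshold are placeholders chosen for illustration, not recommended values.

```python
def next_step(ipc, cache_miss_ratio, ctx_switches_per_sec, hot_frames_found):
    """Suggest where to look next, following the order described above.

    All thresholds are placeholders; tune them for the workload and CPU.
    hot_frames_found: whether perf record showed an obvious hot function.
    """
    if ipc < 1.0 and cache_miss_ratio > 0.05:
        return "inspect memory access patterns"
    if ctx_switches_per_sec > 10_000:
        return "check locking and thread creation"
    if hot_frames_found:
        return "optimize the hot functions from perf record"
    return "profile off-CPU time (offcputime or async-profiler wall mode)"

# Healthy counters, nothing obvious in the CPU profile, latency still bad.
print(next_step(ipc=2.1, cache_miss_ratio=0.01,
                ctx_switches_per_sec=500, hot_frames_found=False))
```

The point of the ordering is that the off-CPU branch is the fallthrough: a CPU profile with no obvious hot spot is a signal to switch tools, not a result.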

The profiler is a tool for narrowing the hypothesis space. It does not show you the answer; it shows you which part of the system to look at next. Treating a clean flame graph as evidence that performance is fine is the most common mistake in this space. A clean CPU flame graph means the CPU is not the bottleneck, which is useful information, but not a clean bill of health for the whole system.
