
What CPU Profilers Don't See

Source: lobsters

When I started taking performance seriously in my own projects, I ran perf record or py-spy on everything and treated the output as ground truth. A flamegraph told me where time was going, I optimized the fat parts, and things got faster. That worked fine for compute-bound problems, but it stopped working entirely when the bottleneck was I/O.

Red Hat’s performance engineering team published a post that gets at exactly this gap: profilers are instruments with specific measurement domains, and the validity of any profile depends on understanding what the instrument can actually measure. The central issue is the distinction between on-CPU time and off-CPU time, and it is easy to overlook because the profiler output itself gives no indication of what it is not capturing.

What a Sampling Profiler Actually Measures

A sampling profiler works by interrupting a running process at a fixed interval, capturing the current call stack, and repeating this tens to hundreds of times per second. At the end of a run, you have a statistical distribution of where the program spent its time, usually visualized as a flamegraph. Tools like perf, async-profiler, py-spy, and Go’s pprof all work this way.

The constraint is fundamental: in CPU mode, a sample is recorded only while the thread is scheduled on a CPU. If the thread is blocked waiting for a lock, sleeping in a read() syscall, or waiting for a major page fault to be serviced from disk, no samples are collected. That time disappears from the profile entirely.

For a pure compute workload, this is fine. For anything that does network I/O, filesystem access, database queries, or lock-protected shared state, the CPU profile is incomplete by construction. You might look at a clean flamegraph, conclude that no obvious hotspot exists, and miss that 80% of request latency is spent waiting in the kernel.
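The sampling constraint described above can be reproduced in a few lines. The following is a minimal sketch, not how perf or py-spy are implemented: it assumes a Unix system and uses Python's `signal.setitimer` with `ITIMER_PROF`, a timer that only counts down while the process is consuming CPU time. The function names `busy`, `blocked`, and `sample_on_cpu` are made up for illustration.

```python
import collections
import signal
import time


def busy(cpu_seconds=0.3):
    """Burn CPU until this process has consumed cpu_seconds of CPU time."""
    deadline = time.process_time() + cpu_seconds
    n = 0
    while time.process_time() < deadline:
        n += 1
    return n


def blocked(wall_seconds=0.3):
    """Stand-in for a blocking read(): the thread sleeps and uses no CPU."""
    time.sleep(wall_seconds)


def sample_on_cpu(workloads, interval_s=0.005):
    """Run each workload under a SIGPROF sampler; return samples per function."""
    tracked = {fn.__name__ for fn in workloads}
    samples = collections.Counter()

    def on_sample(signum, frame):
        # Walk the interrupted stack and credit the tracked function, if any.
        while frame is not None:
            if frame.f_code.co_name in tracked:
                samples[frame.f_code.co_name] += 1
                break
            frame = frame.f_back

    previous = signal.signal(signal.SIGPROF, on_sample)
    # ITIMER_PROF advances only while the process is consuming CPU time,
    # so no signal is delivered while the process is asleep.
    signal.setitimer(signal.ITIMER_PROF, interval_s, interval_s)
    try:
        for fn in workloads:
            fn()
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0, 0)
        signal.signal(
            signal.SIGPROF,
            previous if previous is not None else signal.SIG_DFL,
        )
    return samples


if __name__ == "__main__":
    counts = sample_on_cpu([busy, blocked])
    print(dict(counts))
```

Both workloads take roughly the same wall-clock time, but `blocked` should collect close to zero samples. A profile built from these samples would suggest that `busy` is the whole program.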

The Off-CPU Gap in Practice

Consider a typical request handler in a service or a bot command: it receives input, queries a database or external API, processes the result, and returns a response. Under a CPU profiler, the database call appears as a narrow column because the thread hands off to the kernel and blocks almost immediately. The 50ms spent waiting for the query to come back contributes zero samples to the profile.

This creates a reliable failure mode: engineers see a flat CPU flamegraph, conclude the service is efficient, and miss the actual bottleneck. The profile is accurate about where CPU cycles went; it says nothing about wall-clock latency.
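The size of the gap is easy to measure for a single call by comparing wall-clock time with CPU time. This is a sketch under stated assumptions: `handle_request` is a hypothetical handler, and `time.sleep` stands in for the 50 ms database wait.

```python
import time


def handle_request(query_wait_s=0.05):
    """Hypothetical handler: block on a 'query', then do a little compute."""
    time.sleep(query_wait_s)  # stand-in for waiting on a database or API
    return sum(i * i for i in range(20_000))  # the only on-CPU work


def measure(fn):
    """Return (wall_seconds, cpu_seconds) for one call of fn."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


if __name__ == "__main__":
    wall, cpu = measure(handle_request)
    print(f"wall-clock: {wall * 1000:.1f} ms, on-CPU: {cpu * 1000:.1f} ms")
    print(f"latency invisible to a CPU sampler: {1 - cpu / wall:.0%}")
```

`time.process_time` excludes time spent sleeping, so the difference between the two numbers approximates the off-CPU time that a CPU flamegraph cannot show.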

Off-CPU profiling fills this gap. The BCC offcputime utility traces scheduler context switches and sums the time each stack spent blocked, while async-profiler in wall-clock mode samples threads regardless of whether they are on-CPU. Either way, the resulting flamegraph shows where threads were waiting and for how long.

# BCC off-CPU profiler using eBPF (-f emits folded stacks for flamegraph.pl)
sudo offcputime-bpfcc -f -p $(pgrep my-service) 30 > out.stacks
flamegraph.pl --color=io --countname=us < out.stacks > offcpu.svg

# async-profiler wall-clock mode for JVM applications
./asprof -e wall -d 30 -f wall-profile.html $(pgrep java)

For latency-sensitive services, the wall-clock flamegraph from async-profiler is often more informative than the CPU flamegraph, because it includes blocking overhead alongside compute time. The two profiles together give a more complete picture: the CPU flamegraph shows where compute is going; the wall-clock flamegraph shows where latency is accumulating.

Observer Effect and Safe-Point Bias

Sampling profilers carry roughly 1-5% overhead at typical rates of 99-999 Hz, which is acceptable for production use. Instrumentation profilers, which inject code at every function entry and exit, carry 2-10x overhead and change program behavior enough that the profile stops representing the real workload. Cache effects shift, branch prediction patterns change, and JIT compilation decisions get influenced by the additional bookkeeping. The thing being measured is no longer the thing that runs in production.

JVM applications have a specific version of this problem even with sampling profilers. Traditional JVM profilers like VisualVM and older JProfiler versions sample only at JVM safe points: specific positions in compiled code where the JVM can pause a thread in a known state, for example so that the garbage collector can run. Code that runs for a long time without reaching a safe point is under-sampled, and its time is attributed to wherever the next safe point happens to be. Code with many safe points gets over-represented. This safe-point bias is severe enough that profiles can point to entirely wrong hotspots, and developers optimizing based on those profiles improve the wrong code.

async-profiler avoids this by using the AsyncGetCallTrace API, which can interrupt the JVM at arbitrary code positions rather than only at safe points. The difference in output between a safe-point-biased profiler and async-profiler on the same workload can be significant; hotspots invisible in the biased profile become dominant once the sampling constraint is removed.

# async-profiler CPU mode, without safe-point bias
./asprof -e cpu -d 60 -f cpu-profile.html $(pgrep java)

The same principle applies in Python. py-spy is a sampling profiler for Python processes written in Rust, and it attaches to a running process without code modification and with overhead low enough for production use. Because it reads the Python interpreter state directly rather than instrumenting function calls, it avoids the measurement perturbation that traditional Python profilers like cProfile introduce.

What perf Captures That Application Profilers Miss

Linux perf crosses the user/kernel boundary in ways that most application-level profilers cannot. With full call graph capture enabled, you get user-space and kernel call stacks unified in the same profile, making syscall overhead, page fault handlers, and interrupt processing visible alongside application code.

# Capture CPU cycles with full call graphs including kernel frames
sudo perf record -F 99 -g -p $(pgrep my-service) -- sleep 30
sudo perf report --stdio

Beyond call graphs, perf stat provides hardware performance counter data: cache miss rates, branch mispredictions, TLB misses. A program might look CPU-bound in a stack-based profile while actually spending most of its cycles stalled waiting for memory to arrive from RAM. Stack traces cannot show this; hardware counters can.

sudo perf stat -e cycles,cache-misses,cache-references,branch-misses -p $(pgrep my-service)

A high cache miss rate suggests the bottleneck is memory access latency rather than compute throughput. Optimization strategies for the two cases differ substantially: compute-bound workloads benefit from algorithmic improvements, better use of SIMD, and reducing instruction count; memory-bound workloads benefit from improved data locality, smaller working sets, and prefetch hints. Applying compute-bound strategies to a memory-bound program produces little measurable improvement, and the CPU flamegraph alone will not tell you which situation you are in.
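Raw counter totals are easier to compare across runs once they are turned into ratios. The arithmetic below is a small sketch. The counts are made-up numbers typed in by hand, not output from a real run, and the helper names are hypothetical. What the generic `cache-misses` and `cache-references` events count is CPU-specific, so treat the ratio as a relative signal between runs on the same machine.

```python
def cache_miss_ratio(cache_misses, cache_references):
    """Fraction of counted cache references that missed."""
    if cache_references == 0:
        return 0.0
    return cache_misses / cache_references


def per_thousand_cycles(event_count, cycles):
    """Events per 1000 CPU cycles, to normalize across runs of different length."""
    if cycles == 0:
        return 0.0
    return 1000 * event_count / cycles


# Hypothetical totals, standing in for the numbers perf stat would report.
counts = {
    "cycles": 12_000_000_000,
    "cache-references": 400_000_000,
    "cache-misses": 120_000_000,
    "branch-misses": 30_000_000,
}

miss_ratio = cache_miss_ratio(counts["cache-misses"], counts["cache-references"])
cache_mpkc = per_thousand_cycles(counts["cache-misses"], counts["cycles"])
branch_mpkc = per_thousand_cycles(counts["branch-misses"], counts["cycles"])

if __name__ == "__main__":
    print(f"cache miss ratio:             {miss_ratio:.1%}")
    print(f"cache misses per 1000 cycles: {cache_mpkc:.1f}")
    print(f"branch misses per 1000 cycles: {branch_mpkc:.1f}")
```

Tracking these ratios before and after a change gives a quick check on whether a data-layout optimization actually reduced misses, independent of how long each run lasted.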

Intel VTune and AMD uProf go further still, providing microarchitecture-level analysis that can distinguish between L1, L2, and L3 cache misses, identify NUMA effects from cross-socket memory access, and break down retirement stalls by cause. For systems programming where hardware efficiency matters, this level of visibility is sometimes the only way to understand what the CPU is actually doing.

Differential Flamegraphs for Before/After Comparisons

Comparing two separate flamegraphs by visual inspection is unreliable. Small changes in call stack layout, minor shifts in timing, and natural run-to-run variation make it easy to misattribute changes. Differential flamegraphs subtract one profile from another and show changes directly, making regressions and improvements immediately visible.

The technique uses Brendan Gregg’s FlameGraph tools, which have become the standard for this kind of analysis:

# Convert the recorded baseline and new profiles to collapsed (folded) format
perf script --input=baseline.data | stackcollapse-perf.pl > baseline.collapsed
perf script --input=new.data | stackcollapse-perf.pl > new.collapsed

# Generate differential flamegraph (red = regression, blue = improvement)
difffolded.pl baseline.collapsed new.collapsed | flamegraph.pl > diff.svg

Frames that consumed more time in the new profile appear red; frames that shrank appear blue. A change that shifts time from one function to another shows up as simultaneous red and blue columns, which is far easier to interpret than two separate profiles placed side by side.
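The subtraction behind a differential flamegraph is simple enough to sketch directly. The code below is an illustration of the idea, not a reimplementation of difffolded.pl: it parses folded stacks of the form `frame;frame;frame count` and reports the per-stack change. The stack names and counts are invented sample data, and the comparison assumes both profiles were captured for the same duration at the same rate.

```python
import collections


def parse_folded(text):
    """Parse 'frame;frame;frame count' lines into a Counter keyed by stack."""
    counts = collections.Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        counts[stack] += int(count)
    return counts


def diff_folded(before, after):
    """Return {stack: after - before} for every stack seen in either profile."""
    return {stack: after[stack] - before[stack] for stack in set(before) | set(after)}


# Invented sample data standing in for two collapsed profiles.
baseline = parse_folded("main;handle;parse 40\nmain;handle;render 60\n")
new = parse_folded("main;handle;parse 70\nmain;handle;render 30\n")
deltas = diff_folded(baseline, new)

if __name__ == "__main__":
    # Largest regressions first; negative deltas are improvements.
    for stack, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{delta:+d} {stack}")
```

With this sample data, `parse` grows by 30 samples and `render` shrinks by 30, which is the simultaneous red-and-blue pattern described above.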

Matching the Tool to the Question

No single profiler answers all performance questions, and the most common mistake is using only a CPU sampler and concluding that a clean profile means no problem exists.

A layered approach works better. Start with perf stat to characterize the workload: hardware counters help separate compute-bound from memory-bound behavior, and low CPU utilization relative to wall-clock time points to threads blocked on I/O or locks. This gives a fast answer before investing time in flamegraph analysis. Then use a low-overhead sampling profiler to identify hot functions when the workload is compute-bound. Add wall-clock profiling or off-CPU profiling when latency does not match what the CPU profile predicts. Use differential flamegraphs when comparing before and after a change rather than trying to compare two profiles by eye.

The Red Hat post reinforces something that takes time to internalize through real performance work: a profiler is an instrument with a measurement domain, and the validity of the output depends on matching the instrument to the question being asked. A CPU profile answers where CPU cycles went. A wall-clock profile answers where time went. Hardware counters answer what the CPU was doing during those cycles. Off-CPU profiling answers where threads were blocked and for how long.

Each answer is partial, and the performance picture only becomes complete when you know which partial view you are looking at.
