Performance engineering has a persistent trap: the profiler says your program is fast, but your users say it’s slow. Both are telling the truth. The profiler is showing you what happened on the CPU; it has nothing to say about the time your program spent waiting for a lock, blocking on a disk read, or sitting in a kernel scheduler queue. That gap between what a profiler measures and what a user experiences is the subject of a recent post from Red Hat’s performance engineering team, and it’s worth exploring the full shape of that gap rather than just acknowledging that it exists.
The On-CPU Illusion
A sampling profiler works by interrupting the program at a fixed frequency, capturing the current call stack, and repeating this tens to thousands of times per second. After the run, you aggregate those samples and produce a flame graph. The width of each frame represents how often that function appeared anywhere in the sampled stacks, whether it was executing itself or had called into something else, which is a reasonable proxy for the CPU time consumed by that function and everything beneath it.
The operative word is “CPU time.” A thread that is sleeping in epoll_wait, blocked on pthread_mutex_lock, or waiting for a page fault to resolve is not executing on any CPU core. It generates zero samples. From the profiler’s perspective, it does not exist. If your program spends 80% of its wall-clock time blocked on a contended mutex, your flame graph will not show an 80% spike at the mutex call site. That time is simply missing, and the remaining 20% is stretched to fill the whole graph.
Brendan Gregg formalized this distinction in his work on off-CPU flame graphs, framing it as: wall-clock latency equals on-CPU time plus off-CPU time. Profilers measure the first term only. To measure the second, you need a different instrument entirely.
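The identity is easy to observe in miniature. The following is a minimal sketch, not a profiler: it compares a wall-clock timer with the process CPU-time counter around two hypothetical workloads, `blocked` (a sleep standing in for a lock wait or blocking read) and `busy` (a spin loop). The function names and the 0.2-second durations are illustrative assumptions.

```python
import time


def measure(work):
    """Run work() and return (wall_seconds, cpu_seconds) for this process."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # user + system CPU time of this process
    work()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


def blocked():
    # Stands in for a lock wait or a blocking read: the thread is descheduled.
    time.sleep(0.2)


def busy():
    # Spin on the CPU for roughly 0.2 s of wall-clock time.
    deadline = time.perf_counter() + 0.2
    while time.perf_counter() < deadline:
        pass


if __name__ == "__main__":
    for name, work in (("blocked", blocked), ("busy", busy)):
        wall, cpu = measure(work)
        print(f"{name}: wall={wall:.3f}s on-CPU={cpu:.3f}s off-CPU={wall - cpu:.3f}s")
```

Both workloads take about the same wall-clock time, but only `busy` accumulates meaningful CPU time. An on-CPU sampler would see `busy` and almost nothing of `blocked`.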
The offcputime tool from BCC uses eBPF to attach a kprobe to the kernel’s finish_task_switch function, recording the call stack of each thread as it is descheduled and the time that passes before it runs again. The resulting off-CPU flame graphs show the same visual structure as standard flame graphs, but the x-axis represents time spent off-CPU rather than CPU samples. A thread that blocks on read() for 200ms will show a wide frame at read in the off-CPU view and nothing at all in the on-CPU view.
Sampling Bias Is Not Random
Sampling profilers are described as having “statistical” accuracy, which sounds reassuring, as if errors will average out over enough samples. In practice, sampling bias can be systematic and misleading in specific ways.
The most discussed form is safe-point bias in JVM profilers. The JVM defines “safe points,” locations in the code where a thread’s state is fully described and where the JVM can pause threads for garbage collection and other bookkeeping. Many Java profilers take stack samples only at these safe points because that is the only moment the JVM guarantees a consistent, walkable stack. The problem is that JIT-compiled tight loops, the hottest code in many applications, may run for a long time between safe points. Those loops are systematically undersampled, and their cost tends to be charged to whatever code reaches the next safe point. The profile reflects where safe points happen to fall, not where the program actually spends its time.
async-profiler addresses this by using the AsyncGetCallTrace API, an internal JVM mechanism that can capture a stack at any time, not just at safe points. The difference in results between a safe-point-biased profiler and async-profiler on the same workload can be dramatic. Functions that appear cold in one show up as dominant hotspots in the other.
A second form of bias appears with periodic workloads and fixed-frequency samplers. If a program has a loop whose period lines up with the sampling interval (equal to it, or a simple multiple or fraction of it), the timer fires at the same few phases of the loop every time, consistently over-representing the instructions that happen to align with the interrupt and under-representing the rest. The usual mitigation with Linux perf is to sample at an off-round frequency such as 99 Hz rather than 100 Hz, so that the sampler drifts relative to periodic activity instead of running in lockstep with it; some profilers also add random jitter to the interval. Strictly fixed-interval tools (including gprof) are vulnerable to this bias.
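A small simulation makes the lockstep effect concrete. This is a sketch under invented assumptions: a hypothetical loop with a 10 ms period that spends its first 2 ms in a function called `parse` and the remaining 8 ms in `compute`, so the true split is 20/80. It compares a sampler whose interval equals the loop period, a sampler with a slightly off-round interval, and a sampler with random jitter.

```python
import random
from collections import Counter

PERIOD_US = 10_000   # hypothetical loop period: 10 ms
PARSE_US = 2_000     # first 2 ms of each iteration is "parse", the rest "compute"


def function_at(t_us):
    """Which (hypothetical) function is running at time t_us."""
    return "parse" if t_us % PERIOD_US < PARSE_US else "compute"


def sample(interval_us, n_samples, jitter_us=0, seed=0):
    """Take n_samples and return the fraction of samples seen in each function.

    Each interval is optionally perturbed by a uniform value in
    [-jitter_us, +jitter_us].
    """
    rng = random.Random(seed)
    t = 1_000  # the first sample lands 1 ms into the run
    hits = Counter()
    for _ in range(n_samples):
        hits[function_at(t)] += 1
        step = interval_us
        if jitter_us:
            step += rng.randint(-jitter_us, jitter_us)
        t += step
    return {name: count / n_samples for name, count in hits.items()}


if __name__ == "__main__":
    print("lockstep :", sample(10_000, 10_000))
    print("off-round:", sample(10_100, 10_000))
    print("jittered :", sample(10_000, 10_000, jitter_us=5_000))
```

The lockstep sampler reports `parse` as 100% of the profile even though it is 20% of the real cost. The off-round and jittered samplers both land close to the true 20/80 split.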
Inlining Makes Hot Code Disappear
Modern compilers inline aggressively. When gcc or clang inlines a small helper function into its caller, the helper’s call frame vanishes from the binary entirely. There is no function prologue, no stack frame, just the inlined instructions woven into the parent function’s body.
For correctness this is fine; the behavior is identical. For profiling it creates a distortion: all samples that would have been attributed to the inlined function are now attributed to the caller. If a tight sorting comparator gets inlined into a sort routine, the sort routine looks hot when the comparator is the real culprit. DWARF debug information records the original source locations even for inlined code, so tools that consult it when symbolizing (perf on a binary built with debug info, for example after perf record --call-graph dwarf, or Valgrind’s Callgrind with its source-line attribution) can recover the inlined frames. Profilers that resolve addresses using only the symbol table cannot, because an inlined function has no frame and no symbol of its own at that address.
You can verify this directly. Run perf record --call-graph dwarf on a binary compiled with optimizations and compare the output to the same binary compiled with -fno-inline. The flame graphs will look different not because the program changed behavior, but because inlining was hiding the structure of the call tree.
The Hardware Layer Profilers Miss
Even when a thread is fully on-CPU and generating samples, the profiler may misattribute the cost. Modern CPUs execute instructions out of order and speculatively; they stall when waiting for data from memory. A load that misses all the way to main memory means the CPU is nominally “running” but making no forward progress for something on the order of 100 to 300 cycles. Sampling profilers charge this time to the load or, because of sampling skid, to an instruction shortly after it. Either way, the root cause is not that instruction; it is the memory access pattern that made the miss likely in the first place.
Hardware performance counters expose this layer. Running perf stat -e cache-misses,cache-references,LLC-load-misses,cycles,instructions gives you IPC (instructions per cycle) and cache miss rates alongside cycle counts. A program with an IPC of 0.3 on a core whose theoretical peak is 4.0 is stalled most of the time, most often on memory, and CPU profiling alone will never explain why.
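The arithmetic behind those two numbers is simple enough to script. The sketch below assumes the comma-separated output that perf stat produces with -x, (counter value first, event name third) and computes IPC and the cache miss rate from it. The sample counter values are made up for illustration, and event names can differ across CPUs and perf versions, so treat the parsing as an assumption to check against your own output.

```python
# Made-up counter values in the shape of `perf stat -x,` output:
# value, unit, event name, then further fields that this sketch ignores.
SAMPLE = """\
3000000000,,cycles,1000000000,100.00,,
900000000,,instructions,1000000000,100.00,0.30,insn per cycle
50000000,,cache-references,1000000000,100.00,,
20000000,,cache-misses,1000000000,100.00,40.00,of all cache refs
"""


def parse_perf_stat_csv(text):
    """Parse `perf stat -x,` style lines into {event_name: count}."""
    counts = {}
    for line in text.splitlines():
        fields = line.strip().split(",")
        if len(fields) < 3 or line.startswith("#"):
            continue
        value, event = fields[0], fields[2]
        if value.startswith("<"):  # e.g. "<not supported>" or "<not counted>"
            continue
        counts[event] = float(value)
    return counts


def summarize(counts):
    """Return (instructions per cycle, cache miss rate)."""
    ipc = counts["instructions"] / counts["cycles"]
    miss_rate = counts["cache-misses"] / counts["cache-references"]
    return ipc, miss_rate


if __name__ == "__main__":
    ipc, miss_rate = summarize(parse_perf_stat_csv(SAMPLE))
    print(f"IPC={ipc:.2f} cache miss rate={miss_rate:.0%}")
```

With these invented numbers the program retires 0.3 instructions per cycle and misses on 40% of its cache references, the kind of profile where function-level hotspots say little about the real cost.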
perf mem goes further, recording the latency of individual memory accesses and the cache level at which they were satisfied. perf c2c specializes in detecting false sharing in multi-threaded programs, where two threads modify different variables that happen to live on the same cache line, causing constant invalidation traffic between cores without any actual data sharing.
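Whether two fields can falsely share a line is a question of layout, which can be checked without running any threads. The sketch below uses ctypes to compute field offsets for two hypothetical structs, `Packed` and `Padded`, and tests whether two counters fall on the same cache line. It assumes a 64-byte line and a struct that starts on a line boundary; both the struct names and the field names are invented for illustration.

```python
import ctypes

CACHE_LINE = 64  # assumed line size; common on x86-64, but check your CPU


class Packed(ctypes.Structure):
    # Two counters updated by different threads, laid out back to back.
    _fields_ = [("hits", ctypes.c_uint64), ("misses", ctypes.c_uint64)]


class Padded(ctypes.Structure):
    # Same counters, with padding that pushes the second onto the next line.
    _fields_ = [
        ("hits", ctypes.c_uint64),
        ("_pad", ctypes.c_char * (CACHE_LINE - 8)),
        ("misses", ctypes.c_uint64),
    ]


def same_line(struct, field_a, field_b):
    """True if both fields start on the same cache line.

    Assumes the struct itself starts on a cache-line boundary.
    """
    off_a = getattr(struct, field_a).offset
    off_b = getattr(struct, field_b).offset
    return off_a // CACHE_LINE == off_b // CACHE_LINE


if __name__ == "__main__":
    print("Packed shares a line:", same_line(Packed, "hits", "misses"))
    print("Padded shares a line:", same_line(Padded, "hits", "misses"))
```

In the packed layout the two counters sit 8 bytes apart on one line, so writes from different cores would invalidate each other. The padded layout separates them at the cost of 56 bytes per struct.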
Tooling has also grown on the production side. Parca and Pyroscope provide continuous, always-on profiling in production environments at low overhead (typically cited as 1 to 2% CPU), using eBPF for system-wide sampling without per-process instrumentation. The tradeoff is that unwinding user-space stacks from eBPF requires either frame pointers or unwind tables that the profiler precomputes from each binary’s DWARF or .eh_frame data, and many production binaries are compiled without frame pointers by default.
Flame Graph Misreadings
Flame graphs have become the default output format for profiler results, and they carry a set of visual conventions that invite misreading.
The x-axis is alphabetically sorted by function name, not ordered by time. The left half of the graph does not happen before the right half. Two wide towers side by side say nothing about whether those code paths execute concurrently, alternately, or at completely different times in the program’s lifetime. This is a frequent source of confusion when showing flame graphs to engineers who encounter them for the first time.
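The loss of ordering happens before any drawing takes place, when samples are collapsed into the "folded" format that flamegraph.pl consumes (frames joined by semicolons, followed by a count). A minimal sketch of that folding step, with invented function names and with plain string sorting standing in for the per-level alphabetical sort:

```python
from collections import Counter


def fold(samples):
    """Collapse an ordered list of stacks into sorted 'a;b;c count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


# Hypothetical samples in the order they were captured.
timeline = [
    ("main", "write_log"),
    ("main", "parse"),
    ("main", "write_log"),
    ("main", "parse", "decode"),
]

if __name__ == "__main__":
    for line in fold(timeline):
        print(line)
    # A completely different execution order folds to the same input.
    print(fold(timeline) == fold(list(reversed(timeline))))
```

Any reordering of `timeline` produces identical folded lines, so nothing in the rendered graph can tell you which code path ran first.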
Color in the standard flamegraph.pl tool is arbitrary. Red frames are not hot; the default palette assigns warm hues more or less at random, purely so that adjacent frames can be told apart. Differential flame graphs, which color frames by the change in sample counts between two runs (red for regression, blue for improvement), are far more useful for before-and-after comparisons but require two profiles to generate.
Recursion stacks vertically. A deeply recursive function appears as a tall tower in a flame graph because each recursive call adds another frame on top of the last. The width of the tower’s base still represents the total CPU time consumed by all recursive invocations, but the shape differs from iterative code with the same cost, which would appear as a short, wide block.
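The data behind a differential flame graph is a per-stack subtraction. The sketch below is not the FlameGraph project's own tooling; it is a minimal illustration that parses two folded profiles and ranks stacks by their change in sample count. The stack names and counts are invented, and a real comparison should normalize first if the two runs collected different numbers of samples.

```python
def parse_folded(lines):
    """Turn 'frame;frame;frame count' lines into {stack: count}."""
    profile = {}
    for line in lines:
        stack, _, count = line.strip().rpartition(" ")
        if stack:
            profile[stack] = profile.get(stack, 0) + int(count)
    return profile


def diff_profiles(before, after):
    """Per-stack change in sample count, largest regression first."""
    stacks = set(before) | set(after)
    deltas = {s: after.get(s, 0) - before.get(s, 0) for s in stacks}
    return sorted(deltas.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical folded profiles from two runs with equal sample totals.
BEFORE = ["main;parse 40", "main;render 60"]
AFTER = ["main;parse 70", "main;render 55", "main;gc 5"]

if __name__ == "__main__":
    for stack, delta in diff_profiles(parse_folded(BEFORE), parse_folded(AFTER)):
        print(f"{delta:+d} {stack}")
```

Positive deltas are the stacks a differential graph would paint red, negative deltas the ones it would paint blue.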
What Good Profiling Practice Looks Like
The practical lesson from understanding profiler limitations is that performance analysis requires layered instrumentation, not a single tool.
Start with wall-clock time. If the program is slow, wall-clock time is the ground truth. On-CPU profiling is the second step, not the first. If on-CPU time accounts for most of wall-clock time, a flame graph will be informative. If it accounts for a small fraction, the problem is off-CPU: blocked I/O, lock contention, or scheduler latency.
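That first triage step can be sketched as a ratio: CPU time consumed by a command divided by the wall-clock time it took. The helper below, a hypothetical `cpu_fraction`, runs a child process and reads its accumulated user and system time from the operating system. It relies on Python's resource module, so it is Unix-only, and for multi-threaded programs the ratio can exceed 1.0 because CPU time is summed across threads.

```python
import resource
import subprocess
import sys
import time


def cpu_fraction(argv):
    """Run argv to completion and return (child CPU time) / (wall-clock time)."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return cpu / wall


if __name__ == "__main__":
    # A stand-in for an I/O-bound program: it mostly sleeps.
    sleeper = [sys.executable, "-c", "import time; time.sleep(1.0)"]
    print(f"on-CPU fraction: {cpu_fraction(sleeper):.2f}")
```

A fraction near 1.0 (or near the thread count) says an on-CPU flame graph will cover most of the story. A fraction well below that says to reach for off-CPU analysis first.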
For Java workloads, use async-profiler in wall-clock mode rather than CPU mode as the default starting point. Wall-clock mode samples all threads including blocked ones, giving a combined view that mirrors what the user experiences.
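For C and C++ workloads, rebuild with -fno-omit-frame-pointer if you are using perf with frame-pointer unwinding. The runtime overhead is usually small (commonly quoted at 1 to 3%), and the improvement in unwinding reliability is significant. Alternatively, use --call-graph dwarf, which works without frame pointers but copies a chunk of the stack with every sample, adding noticeably more profiling overhead (roughly 5 to 15%) and producing much larger perf.data files.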
For any workload where the on-CPU profile looks suspiciously cheap, run perf stat first and check IPC. An IPC below 1.0 on code that should be compute-bound suggests memory latency is the real bottleneck. Follow up with perf mem or hardware counter-based profiling before concluding anything about function-level hotspots.
The profiler is a precise instrument for a specific measurement. Knowing exactly what it measures, and what it does not, is the prerequisite for interpreting its output correctly.