The Profiler Lies Politely: What CPU Sampling Won't Tell You

A profiler is not a measuring instrument in the strict sense. It is a sampler, and samplers have a relationship with the truth that is statistical, approximate, and subject to a set of failure modes that are worth understanding in detail before you optimize anything based on what the flame graph shows you.

The Red Hat Performance Engineering team’s writeup on profiler performance engineering is a good entry point into this territory. The core observation the article makes is that profilers show you a particular slice of program behavior, and the slice has edges you can fall off. That framing is useful, and it’s worth going further into exactly where those edges are and what lies beyond them.

How Sampling Profilers Actually Work

When you run perf record -F 99 -g ./myapp, you are telling the kernel to interrupt the process 99 times per second and capture the current instruction pointer and call stack at each interruption. The result is a histogram of where the CPU was when the kernel happened to look. Functions that consume more CPU time appear more frequently in the sample set, and the flame graph is just a visualization of that histogram, stacked by call chain.
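To make the "histogram of where the CPU was" idea concrete, here is a minimal sketch in Python. The sample data is made up for illustration, and the functions `fold_samples` and `self_time_share` are hypothetical helpers, not part of perf. The output uses the "stack count" folded format that flame graph tooling consumes.

```python
from collections import Counter

def fold_samples(samples):
    """Count identical call stacks; each sample is a tuple ordered root -> leaf."""
    return Counter(";".join(stack) for stack in samples)

def self_time_share(samples):
    """Fraction of samples whose leaf (on-CPU) frame is each function."""
    leaves = Counter(stack[-1] for stack in samples)
    total = sum(leaves.values())
    return {fn: n / total for fn, n in leaves.items()}

# Hypothetical samples, as if captured at four successive sampling interrupts.
samples = [
    ("main", "parse", "memcpy"),
    ("main", "parse", "memcpy"),
    ("main", "parse"),
    ("main", "render"),
]

folded = fold_samples(samples)
for stack, count in sorted(folded.items()):
    print(stack, count)          # one "stack count" line per unique stack

shares = self_time_share(samples)
```

Nothing here measures duration directly. A function looks expensive only because more samples happened to land in it, which is why the sample count and the sampling rate both limit how far the result can be trusted.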

The choice of 99 Hz is not arbitrary, though it is a convention rather than perf's default (perf record samples at a much higher rate unless you pass -F). A rate of exactly 100 Hz risks sampling in lockstep with periodic activity such as the 100 Hz kernel timer tick found on many Linux configurations, so the profiler would keep landing at nearly the same phase of that activity. Sampling at 99 Hz is deliberately slightly off that frequency to avoid the resonance. The aliasing problem is real: if your loop runs at a frequency that happens to beat against the sampling rate, you will consistently over- or under-sample certain code paths. Brendan Gregg documented this extensively in his flame graph work, and it remains relevant on any system where you are not varying the sampling frequency.
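The lockstep effect can be reproduced with a toy model. The sketch below assumes a hypothetical workload that repeats every 10 ms (100 Hz) and spends the first 10% of each period in one "spike" function; `spike_fraction` is an illustrative helper, and exact fractions are used so the arithmetic has no rounding noise.

```python
from fractions import Fraction

def spike_fraction(sample_hz, work_hz=100, spike_share=Fraction(1, 10), seconds=1):
    """Share of samples that land in the spike at a given sampling rate.

    Assumes the workload repeats at work_hz and spends the first
    spike_share of every period in the spike (a made-up model).
    """
    period = Fraction(1, work_hz)
    total = sample_hz * seconds
    hits = 0
    for k in range(total):
        t = Fraction(k, sample_hz)        # time of the k-th sample
        phase = (t % period) / period     # position within the period, 0..1
        if phase < spike_share:
            hits += 1
    return Fraction(hits, total)

print(float(spike_fraction(100)))   # sampling in lockstep with the workload
print(float(spike_fraction(99)))    # sampling slightly off-frequency
```

In this model the 100 Hz sampler always lands at phase zero and reports the spike as all of the CPU time, while the 99 Hz sampler drifts through every phase and reports roughly the true 10%. Real workloads are rarely this regular, but the direction of the error is the same.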

The choice of event also matters. perf record defaults to cycles events, which are driven by hardware performance monitoring unit (PMU) interrupts rather than timer signals. This is more precise but introduces a different artifact: CPUs skid. When the PMU raises an interrupt on a cycle event, the processor does not stop at the exact instruction that caused the overflow. It continues executing for some number of instructions before the interrupt is actually serviced. The result is that hot paths can appear shifted by a few instructions in the profile, which matters when you are trying to identify the specific line responsible for cache misses or branch mispredictions. Precise event modifiers (for example -e cycles:pp, which raises the event's precise_ip level), backed by PEBS (Precise Event-Based Sampling) on Intel processors, reduce this skid, but they are not universally available and add their own overhead.

The On-CPU/Off-CPU Divide

The most significant thing a CPU flame graph does not show you is time the program spent not on the CPU. A thread that is blocked waiting for a lock, sleeping in nanosleep, or stalled on a disk read is invisible to a sampling profiler that only captures on-CPU activity. If your application spends 70% of its wall-clock time blocked on I/O or lock contention and 30% executing, the flame graph represents 100% of the 30% slice and tells you nothing about the 70%.

This distinction between on-CPU time and wall-clock time is fundamental. You can have a profile showing that memcpy consumes 40% of CPU time, optimize it down to 20%, and see no improvement in end-to-end latency because the real bottleneck is a mutex held across a network call.
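A back-of-the-envelope calculation shows how little an on-CPU win can move wall-clock time. This is an Amdahl-style sketch under loud assumptions: a single thread, blocked time that never overlaps with computation, and an optimization that halves the cost of memcpy. The function name `wall_clock_speedup` is made up for illustration.

```python
def wall_clock_speedup(on_cpu_fraction, hotspot_share, hotspot_speedup):
    """Estimate end-to-end speedup when only one on-CPU hotspot gets faster.

    on_cpu_fraction: share of wall-clock time spent executing (not blocked)
    hotspot_share:   share of on-CPU time spent in the hotspot
    hotspot_speedup: factor by which the hotspot itself gets faster
    Assumes blocked time is unchanged by the optimization.
    """
    hotspot_time = on_cpu_fraction * hotspot_share
    saved = hotspot_time * (1 - 1 / hotspot_speedup)
    return 1 / (1 - saved)

# 30% of wall-clock on CPU, memcpy is 40% of that, and we make it 2x faster.
speedup = wall_clock_speedup(0.30, 0.40, 2.0)
print(f"{speedup:.3f}x")   # prints 1.064x
```

Halving the biggest bar in the flame graph buys about 6% here, and that is the optimistic case. If the blocked time sits on the critical path of every request, the latency users see may not move at all.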

Off-CPU profiling fills this gap. The technique, described in detail in Brendan Gregg’s off-CPU analysis methodology, captures stack traces at the moment a thread leaves the CPU, along with the duration of the off-CPU period. The Linux kernel scheduler knows when threads block and unblock; the challenge is attaching probes to those events efficiently.

The classic approach uses perf sched or tracepoints:

perf record -e sched:sched_switch -a -g -- sleep 10
perf script | stackcollapse-perf.pl | flamegraph.pl > offcpu.svg

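This works but carries significant overhead because sched:sched_switch fires on every context switch system-wide. A busy server can generate tens of thousands of context switches per second, and capturing a full stack trace at each one is expensive. The overhead can itself perturb scheduling behavior, which brings up the observer effect. Note also that folding these samples directly weights each stack by the number of context switches, not by how long the thread stayed blocked; recovering durations requires pairing each switch-out with the later switch-in of the same thread.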

eBPF-based tools handle this more cleanly. bcc’s offcputime and the equivalent bpftrace one-liners attach kprobes to scheduler functions and aggregate stack traces in kernel space, sending only the summarized data to userspace:

offcputime-bpfcc -df -p $(pgrep myapp) 30 | flamegraph.pl > offcpu.svg

The in-kernel aggregation reduces the data volume substantially and keeps the tool overhead manageable. You end up with two complementary flame graphs: one showing where CPU time is spent, and one showing where threads spend time waiting. Together they account for all of wall-clock time.
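The bookkeeping behind an off-CPU profile is simple enough to sketch. The Python below is not how offcputime is implemented (that logic runs as eBPF inside the kernel); it is a minimal model using a hypothetical event format of (timestamp in microseconds, thread id, "out" or "in", folded stack) to show how blocked durations get attributed to the stack where the thread left the CPU.

```python
from collections import defaultdict

def off_cpu_by_stack(events):
    """Sum blocked time per stack.

    events: iterable of (timestamp_us, tid, kind, stack) sorted by time.
    kind is "out" when a thread leaves the CPU (stack = where it blocked)
    or "in" when it is scheduled again (stack is ignored).
    """
    blocked_at = {}               # tid -> (timestamp, stack) while off-CPU
    totals = defaultdict(int)     # stack -> total microseconds off-CPU
    for ts, tid, kind, stack in events:
        if kind == "out":
            blocked_at[tid] = (ts, stack)
        elif kind == "in" and tid in blocked_at:
            start, blocked_stack = blocked_at.pop(tid)
            totals[blocked_stack] += ts - start
    return dict(totals)

# Hypothetical trace: thread 1 blocks on a read, thread 2 blocks on a mutex twice.
events = [
    (0,    1, "out", "main;read_config;read"),
    (50,   2, "out", "main;worker;pthread_mutex_lock"),
    (9000, 1, "in",  None),
    (9100, 2, "in",  None),
    (9200, 2, "out", "main;worker;pthread_mutex_lock"),
    (9300, 2, "in",  None),
]
totals = off_cpu_by_stack(events)
for stack, micros in sorted(totals.items()):
    print(stack, micros)
```

The values are microseconds rather than sample counts, which is why off-CPU flame graphs are usually labeled in time units. Keeping only the per-stack totals, as this sketch does, is the same reduction that makes in-kernel aggregation cheap.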

Stack Unwinding Failures

A profiler is only as good as its ability to reconstruct the call stack at the moment of sampling. Stack unwinding is the process of walking back up the call frames to reconstruct who called whom, and it fails in ways that silently corrupt your profile.

The traditional mechanism on x86-64 uses frame pointers. Each function pushes the caller’s frame pointer (the value in rbp) onto the stack and then points rbp at its own frame, creating a linked list that can be walked from the current frame back to the program entry point. The problem is that modern compilers omit frame pointers by default when optimizing, because rbp is a general-purpose register that can otherwise be used to avoid register spills. When frame pointers are missing, the profiler cannot walk the stack, and the result is truncated stack traces that show only the top few frames.
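The linked-list walk can be modeled in a few lines. This sketch uses a Python dict as a stand-in for process memory and assumes the conventional x86-64 prologue layout, where the saved rbp sits at [rbp] and the return address at [rbp + 8]. All addresses are invented, and `walk_frame_pointers` is an illustrative helper rather than any profiler's actual code.

```python
def walk_frame_pointers(memory, rbp, rip, max_depth=64):
    """Return code addresses leaf-first by following saved-rbp links.

    memory maps an address to the 64-bit value stored there. Assumed layout:
      [rbp]     = caller's saved rbp
      [rbp + 8] = return address into the caller
    """
    frames = [rip]
    while rbp != 0 and len(frames) < max_depth:
        if rbp not in memory or rbp + 8 not in memory:
            break  # chain is broken, e.g. rbp was reused as a general register
        frames.append(memory[rbp + 8])
        rbp = memory[rbp]
    return frames

# Hypothetical stack for main -> parse -> leaf, all built with frame pointers.
good = {
    0x7000: 0x7020, 0x7008: 0x401200,   # leaf frame: link to parse, return into parse
    0x7020: 0x7040, 0x7028: 0x401100,   # parse frame: link to main, return into main
    0x7040: 0x0,    0x7048: 0x401000,   # main frame: end of chain, return into startup code
}
full = walk_frame_pointers(good, rbp=0x7000, rip=0x401300)

# Same stack, but the saved rbp in the leaf frame is garbage because the
# caller used rbp for ordinary data.
broken = dict(good)
broken[0x7000] = 0xDEADBEEF
truncated = walk_frame_pointers(broken, rbp=0x7000, rip=0x401300)

print([hex(a) for a in full])
print([hex(a) for a in truncated])
```

One bad link is enough to lose every frame beyond it, which is why a single library built without frame pointers can truncate stacks for the whole process.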

You can force frame pointer retention:

gcc -O2 -fno-omit-frame-pointer -o myapp myapp.c

Alternatively, ask perf record for DWARF unwinding, which reads the .eh_frame or .debug_frame sections to reconstruct frames even without rbp:

perf record -F 99 --call-graph dwarf ./myapp

DWARF unwinding is accurate but expensive. Capturing enough stack data for DWARF unwinding requires copying several kilobytes of stack per sample, and the unwinding itself happens in userspace during post-processing. For high-frequency sampling or short-lived processes, this overhead is visible.

Linux 4.14 introduced ORC (Oops Rewind Capability) unwinding for kernel stacks, which pre-computes unwinding information at build time into a compact table format. The kernel uses ORC for its own stack traces; userspace applications still rely on frame pointers or DWARF. The JVM has historically been problematic here because JIT-compiled code does not emit standard DWARF information. Tools like async-profiler work around this by using AsyncGetCallTrace, an unofficial HotSpot-internal API (not part of the JVMTI specification) that can walk Java frames even through JIT-compiled code.

Missed or broken stacks appear in flame graphs as wide bars labeled [unknown], typically close to the base of the graph where the missing callers belong, or as suspiciously shallow profiles. If most of the samples in a profile have stacks only two or three frames deep, unwinding is almost certainly failing.
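That rule of thumb is easy to turn into a quick check over folded stacks (the "frame;frame;frame count" lines that the stack-collapsing scripts emit). The sketch below is a heuristic, not a standard tool: the function name `stack_health` and the depth threshold of three frames are assumptions chosen to match the description above.

```python
def stack_health(folded_lines, shallow_depth=3):
    """Report the share of samples with shallow or [unknown] stacks.

    folded_lines: strings of the form "frame;frame;...;frame count".
    shallow_depth: stacks with at most this many frames count as shallow
                   (an arbitrary threshold; tune it for your application).
    """
    total = shallow = unknown = 0
    for line in folded_lines:
        stack, _, count = line.rpartition(" ")
        n = int(count)
        frames = stack.split(";")
        total += n
        if len(frames) <= shallow_depth:
            shallow += n
        if "[unknown]" in frames:
            unknown += n
    if total == 0:
        return {"shallow": 0.0, "unknown": 0.0}
    return {"shallow": shallow / total, "unknown": unknown / total}

# Hypothetical folded output from a profile with mostly broken stacks.
lines = [
    "myapp;main;handle;parse;memcpy 10",
    "myapp;[unknown];memcpy 60",
    "myapp;memcpy 30",
]
health = stack_health(lines)
print(health)
```

A high shallow share is only a warning sign. Some programs really do have flat call graphs, so compare against what you know about the code before concluding that unwinding is broken.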

The Observer Effect

Measuring a program’s performance takes CPU time, memory bandwidth, and cache space. These are not free, and the resources consumed by the profiler come out of the same budget as the workload.

For most production profiling, sampling overhead is small enough to ignore. At 99 Hz with minimal stack capture, perf record adds perhaps 1-5% overhead in typical workloads. But the overhead is not uniform. Programs with deep call chains see higher overhead because each sample requires more stack walking. Programs with large stacks require more memory copies per sample when the stack itself is captured. Programs sensitive to cache pressure can see their hot data evicted by the profiler’s own working set.

The DWARF unwinding mode is where this becomes significant. Each sample may copy 8-16 KB of stack data, and at 999 Hz (a common higher-frequency setting) on a multi-threaded application, you can easily push tens of MB/s of data through the sampling buffer. On a system with memory bandwidth pressure, this is not invisible.
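The arithmetic behind "tens of MB/s" is worth writing out. The sketch below takes 8 KB per sample (the low end of the range above) at 999 Hz and assumes eight busy CPUs, which is an invented figure. It ignores the registers and sample headers that accompany each stack copy, so treat the result as a floor.

```python
def dwarf_sample_bandwidth(sample_hz, stack_bytes, busy_cpus):
    """Bytes per second of raw stack data copied for DWARF unwinding.

    Assumes every busy CPU is sampled at sample_hz and every sample
    copies the full stack_bytes; headers and registers are ignored.
    """
    return sample_hz * stack_bytes * busy_cpus

rate = dwarf_sample_bandwidth(sample_hz=999, stack_bytes=8192, busy_cpus=8)
print(f"{rate / 1e6:.1f} MB/s")   # prints 65.5 MB/s
```

Doubling the per-sample stack size or the number of busy CPUs doubles the figure, so a large machine can reach hundreds of MB/s before the profile has been written to disk.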

There is a more subtle version of this problem. Some performance bugs are timing-sensitive. A race condition that causes a slowdown only when two threads execute a particular sequence within a narrow window may disappear entirely when the profiler adds enough overhead to change relative timing. This is the profiler equivalent of a heisenbug, and the only real mitigation is to cross-reference profiler data with metrics collected by lower-overhead mechanisms like hardware counters.

perf stat is the right tool for this cross-referencing:

perf stat -e cycles,instructions,cache-misses,cache-references,branch-misses ./myapp

This uses PMU counters in counting mode rather than sampling mode. The overhead is minimal because the counters accumulate in hardware and the kernel only reads them out, with no interrupt and no stack capture per sample. You get aggregate counts without stack traces, which is not enough to identify hot paths but is enough to confirm whether optimization attempts are moving the hardware-level metrics in the expected direction.
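Raw counts are easier to compare once they are reduced to a few ratios. The sketch below assumes you have already copied the totals from two perf stat runs into dictionaries keyed by event name. Every number is invented, and `derived_metrics` is a hypothetical helper rather than anything perf provides.

```python
def derived_metrics(counts):
    """Turn raw event totals into ratios that are comparable across runs."""
    return {
        "ipc": counts["instructions"] / counts["cycles"],
        "cache_miss_ratio": counts["cache-misses"] / counts["cache-references"],
        "branch_misses_per_kinst": 1000 * counts["branch-misses"] / counts["instructions"],
    }

# Hypothetical totals before and after optimizing a hot function.
before = {
    "cycles": 4_000_000_000, "instructions": 6_000_000_000,
    "cache-misses": 50_000_000, "cache-references": 200_000_000,
    "branch-misses": 30_000_000,
}
after = {
    "cycles": 3_600_000_000, "instructions": 4_500_000_000,
    "cache-misses": 50_000_000, "cache-references": 200_000_000,
    "branch-misses": 27_000_000,
}

m_before = derived_metrics(before)
m_after = derived_metrics(after)
for name in m_before:
    print(f"{name}: {m_before[name]:.3f} -> {m_after[name]:.3f}")
```

In this made-up case the instruction count fell by a quarter but cycles fell by only a tenth, and the cache miss ratio did not move. That pattern suggests the remaining time is dominated by memory stalls rather than by the code that was just optimized.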

Inlining and the Disappearing Function Problem

Compiler optimization can make functions disappear from profiles entirely. When the compiler inlines a function, the inlined code is attributed to the caller, not to the original function. A tight inner loop that gets inlined into its call site will show zero samples in the profile under its own name, even if it represents 30% of CPU time.

This cuts both ways. A function that appears as a major hotspot in a profile might be large because it is the destination of many inlined callees, not because its own code is expensive. The profiler does not distinguish between instructions that were originally in the function and instructions that were inlined from elsewhere.
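The two views can be put side by side with a toy model. The sketch below assumes a hypothetical table mapping each sampled address to its inline chain, which is the kind of information DWARF inline records carry. The addresses, function names, and the `attribute` helper are all invented for illustration.

```python
from collections import Counter

# Hypothetical inline table: sampled address -> chain of functions, with the
# outermost function (the symbol the profiler sees) first and the innermost
# inlined callee last.
inline_chains = {
    0x1000: ("process_batch",),
    0x1010: ("process_batch", "checksum"),
    0x1020: ("process_batch", "checksum", "rotl"),
}

def attribute(sample_addresses, chains):
    """Count samples per symbol and per innermost inlined function."""
    by_symbol = Counter()   # what a symbol-level profile reports
    by_origin = Counter()   # what inline-aware attribution reports
    for addr in sample_addresses:
        chain = chains[addr]
        by_symbol[chain[0]] += 1
        by_origin[chain[-1]] += 1
    return by_symbol, by_origin

samples = [0x1000] * 2 + [0x1010] * 5 + [0x1020] * 3
by_symbol, by_origin = attribute(samples, inline_chains)
print(dict(by_symbol))
print(dict(by_origin))
```

The symbol-level view blames process_batch for all ten samples, while the inline-aware view shows that only two of them belong to code written in that function. Neither view is wrong, but they lead to different optimization decisions.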

GCC and Clang both emit inline information in DWARF, and perf report can use this to show inline attribution when debug information is available:

perf record --call-graph dwarf ./myapp
perf report --inline

Without debug information, you are optimizing the compiled output, not the source code, and the mapping between the two requires either keeping debug builds handy or using tools like eu-addr2line to resolve addresses back to source locations.

What Modern eBPF Tools Add

The combination of Linux eBPF and tools like bpftrace, bcc, and the newer Pyroscope continuous profiling platform changes what is observable at production scale without unacceptable overhead.

eBPF programs run in the kernel in a sandboxed JIT-compiled environment and can attach to virtually any kernel or userspace probe point. The key advantage over traditional instrumentation is that aggregation happens in kernel space. A bpftrace program that tracks lock contention, for example, increments a kernel-side hash map keyed by stack trace and does not need to copy every event to userspace:

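kprobe:__mutex_lock_slowpath
{
    @stacks[ustack()] = count();
}

This produces a frequency count of the userspace stacks that were running when the kernel took the slow path of mutex acquisition, which is a direct signal of contention on kernel mutexes. The symbol is __mutex_lock_slowpath in mainline kernels, and whether it can be probed depends on how the kernel was built; contention on userspace pthread mutexes reaches the kernel through the futex system call instead. The map counts events rather than blocked time, but even so it is something a standard CPU flame graph cannot show at all.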

The Continuous Profiling for Linux survey has documented how production-grade continuous profiling with eBPF can stay below 1% overhead while providing always-on flame graphs, which represents a genuine change in what kind of profiling is practical outside a lab environment.

The practical upshot of all of this is a checklist of questions to apply before acting on a profile.

First, check whether the profile represents on-CPU time or wall-clock time. If your application is latency-sensitive and you have not looked at off-CPU data, you may be optimizing the wrong thing entirely.

Second, verify that stack unwinding is working. Shallow, truncated stacks or large [unknown] sections indicate that frame pointers are missing or DWARF information is unavailable. Profiles with broken stacks are not reliable.

Third, consider whether the workload under profiling matches production. A profiler changes timing slightly, and CPU affinity, NUMA layout, and sampling frequency all affect what patterns the profiler catches. Results from a development machine rarely transfer without recalibration.

Fourth, cross-validate with perf stat hardware counters. If you optimize a hot function and the instruction count drops but cache-misses stays flat or increases, you have changed the compute-bound part of the program while leaving the memory-bound part untouched.

Fifth, be skeptical of functions that appear uniformly across many call paths. malloc, memcpy, and pthread_mutex_lock frequently appear as major consumers in profiles not because they are inefficient but because everything calls them. Attribution at the call site, not the callee, is usually more actionable.
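The fifth check, attributing a ubiquitous callee to its callers, can be done directly on folded stacks. This is a minimal sketch with invented data; `callers_of` is a hypothetical helper that charges each sample containing the callee to the frame immediately above its innermost occurrence.

```python
from collections import Counter

def callers_of(folded_lines, callee):
    """Group samples that include callee by its immediate caller.

    folded_lines: strings of the form "frame;frame;...;frame count",
    ordered root first and leaf last.
    """
    callers = Counter()
    for line in folded_lines:
        stack, _, count = line.rpartition(" ")
        frames = stack.split(";")
        if callee in frames:
            # Index of the innermost (closest to the leaf) occurrence.
            i = len(frames) - 1 - frames[::-1].index(callee)
            caller = frames[i - 1] if i > 0 else "<root>"
            callers[caller] += int(count)
    return callers

# Hypothetical folded stacks from an on-CPU profile.
lines = [
    "main;parse;memcpy 30",
    "main;render;blit;memcpy 50",
    "main;render;draw 20",
    "main;log;format;memcpy 5",
]
memcpy_callers = callers_of(lines, "memcpy")
for caller, n in memcpy_callers.most_common():
    print(caller, n)
```

Here memcpy accounts for 85 of 105 samples, but more than half of that comes from a single caller. Reducing how much data blit copies is likely a better lead than trying to make memcpy itself faster.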

Profiling is genuinely useful. The tools the Linux ecosystem provides, from perf to bpftrace to async-profiler for the JVM, cover a wide range of what can go wrong in a running system. But using them well requires understanding what each tool is actually measuring and, more specifically, what it is not measuring. The flame graph shows you a shadow of your program’s behavior. Understanding the geometry of that shadow is the first step toward interpreting it correctly.