What Profilers Show and the Performance Problems That Stay Hidden

Performance engineering often begins and ends with perf stat or a flame graph, and that is where a lot of diagnostic work quietly goes wrong. The profiler runs, the hot path lights up red, you optimize it, and latency barely improves. The problem was never where the profiler said it was.

The Red Hat performance engineering team recently wrote about this, describing how profiler output can lead investigations in the wrong direction. Their experience reflects a structural limitation of every CPU-sampling profiler: it only measures time spent executing on CPU.

The Sampling Model and Its Consequences

Statistical profilers like Linux perf work by interrupting the CPU periodically, driven by a timer or a hardware counter overflow, and recording the current instruction pointer plus a stack trace at each sample. A sampling rate of 99 Hz is a common choice (perf record's own default is considerably higher); the odd number avoids sampling in lockstep with periodic activity that runs at 100 Hz. After collection, the tool counts how often each function appears at the top of the stack, or anywhere in it for inclusive totals, and reports those counts as a proxy for CPU time consumed.

This is a fundamentally sound approach for measuring CPU utilization. The math is simple: if a function appears in 300 out of 1000 samples, it consumed roughly 30% of CPU time during the measurement window. The issue is everything the model excludes.
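The arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not how perf itself is implemented; the function names in the sample stacks (parse_json, render, gc_collect) are hypothetical, and each stack is written root first, leaf last.

```python
from collections import Counter

def self_time_share(samples):
    """Return {function: fraction of samples in which it was the leaf frame}."""
    leaf_counts = Counter(stack[-1] for stack in samples if stack)
    total = sum(leaf_counts.values())
    return {fn: n / total for fn, n in leaf_counts.items()}

# Hypothetical profile: 1000 samples, each stack listed root-first, leaf-last.
samples = (
    [("main", "handle_request", "parse_json")] * 300
    + [("main", "handle_request", "render")] * 500
    + [("main", "gc_collect")] * 200
)

shares = self_time_share(samples)
for fn, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{fn}: {share:.0%}")
```

A function that spends its time blocked never shows up as a leaf frame here, no matter how much wall-clock time it costs, because no sample is taken while its thread is off CPU.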

When a process calls read() on a socket and blocks waiting for data, it is descheduled. The kernel moves it off the CPU. While it waits, perf collects zero samples for it. The function that triggered the blocking call is invisible in the output because it never held the CPU during a sample. If that function runs millions of times and each call blocks for 50 microseconds, it can dominate wall-clock latency while appearing nowhere in a CPU flame graph.

This is called off-CPU time, and it covers everything from I/O waits and lock contention to page faults, scheduler delays, and explicit sleeps. For many production systems, off-CPU time is the primary performance problem.
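A rough way to see the gap between CPU time and wall-clock time from inside a process is to compare the two clocks directly. The sketch below uses Python's time.process_time() (CPU time consumed by the process) and time.perf_counter() (wall-clock time); blocking_call is a hypothetical stand-in for a read() that waits on a socket.

```python
import time

def blocking_call(seconds):
    # Stand-in for a blocking read(): the thread is off CPU while it sleeps.
    time.sleep(seconds)

wall_start = time.perf_counter()
cpu_start = time.process_time()

for _ in range(20):
    blocking_call(0.01)

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start
off_cpu = wall_elapsed - cpu_elapsed

print(f"wall={wall_elapsed:.3f}s cpu={cpu_elapsed:.3f}s off-cpu={off_cpu:.3f}s")
```

Nearly all of the elapsed time lands in the off-CPU bucket, which is exactly the portion a CPU-sampling profiler cannot attribute to anything.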

Frame Pointers, Stack Unwinding, and the Gaps They Create

Before you can even trust what a CPU profiler shows, you need to contend with stack unwinding. When perf captures a sample, it records the current instruction pointer and attempts to walk the call stack to produce a full trace. There are three common methods: frame pointer walking, DWARF unwinding, and Last Branch Records.

Frame pointer walking is fast. By convention, one register (rbp on x86-64) points to the current stack frame, and each frame contains a pointer to the previous one. Walking the chain yields the full call stack at the cost of a few memory reads per frame. The complication: GCC and Clang both omit the frame pointer by default in optimized builds such as -O2, freeing the register for general use. The result is broken stacks, where perf reports [unknown] frames or gives a truncated trace that stops partway up the call chain.
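The walk itself is simple enough to model in a few lines. The sketch below is a toy simulation, not perf's unwinder: memory is a dictionary of made-up addresses, each frame stores the caller's saved rbp at [rbp] and the return address at [rbp + 8] (the usual x86-64 layout), and the symbol table maps exact return addresses to names, whereas real symbolization resolves address ranges.

```python
def walk_frame_pointers(memory, rbp, symbols, max_depth=64):
    """Follow the saved-rbp chain and name the function each frame returns into."""
    trace = []
    while rbp != 0 and len(trace) < max_depth:
        return_address = memory[rbp + 8]   # return address sits above the saved rbp
        trace.append(symbols.get(return_address, "[unknown]"))
        rbp = memory[rbp]                  # saved rbp points at the caller's frame
    return trace

# Hypothetical stack contents for leaf() <- middle() <- main() <- _start.
memory = {
    0x7000: 0x7100, 0x7008: 0x401200,  # leaf's frame: returns into middle
    0x7100: 0x7200, 0x7108: 0x401100,  # middle's frame: returns into main
    0x7200: 0x0,    0x7208: 0x401000,  # main's frame: returns into _start
}
symbols = {0x401200: "middle", 0x401100: "main", 0x401000: "_start"}

trace = walk_frame_pointers(memory, 0x7000, symbols)
print(trace)  # ['middle', 'main', '_start']
```

The leaf function comes from the sampled instruction pointer rather than from the walk. If any function along the chain was compiled to use rbp as a general-purpose register, the value read as "saved rbp" is arbitrary data and the walk ends early or wanders into garbage, which is where truncated and [unknown] stacks come from.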

The fix is -fno-omit-frame-pointer at compile time. Red Hat has been pushing this as a default in Fedora and RHEL packages for exactly this reason, and the performance cost is small for most workloads, typically under 1%. Java’s async-profiler solves the same problem differently by using a signal handler that walks the JVM’s internal frame structures rather than relying on native frame pointers.

DWARF unwinding uses the call frame information that compilers emit in the .eh_frame section for exception handling, so it can reconstruct call stacks correctly even in optimized binaries without frame pointers. It is significantly more expensive than frame pointer walking: perf copies a chunk of the stack with every sample and unwinds it afterwards, which makes it impractical for high-frequency sampling. Last Branch Records capture recent branch history in CPU hardware and work well for shallow stacks, but the depth is capped by the hardware at 16 or 32 entries depending on the Intel processor generation.

Flame Graphs and What They Actually Represent

Brendan Gregg’s flame graph visualization aggregates stack traces into a width-proportional display in which wider frames account for more samples. Flame graphs are genuinely useful for navigating complex CPU profiles and identifying hot paths that would be invisible in flat function lists.
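The aggregation behind the picture can be sketched as follows. The folded format shown, one line per unique stack with frames joined by semicolons and followed by a count, is the input that flamegraph.pl consumes; the sample stacks and the helper names fold_stacks and inclusive_width are hypothetical.

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse identical stacks into 'frame;frame;frame count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

def inclusive_width(samples, function):
    """Fraction of samples with the function anywhere on the stack (its frame width)."""
    hits = sum(1 for stack in samples if function in stack)
    return hits / len(samples)

# Hypothetical samples, root-first.
samples = (
    [("main", "serve", "parse")] * 3
    + [("main", "serve", "render")] * 5
    + [("main", "idle")] * 2
)

folded = fold_stacks(samples)
print("\n".join(folded))
print(f"serve width: {inclusive_width(samples, 'serve'):.0%}")
```

The width of a frame is its inclusive share of samples, so serve spans 80% of the graph here even though it is never the leaf.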

But the label matters: a CPU flame graph represents CPU time, not wall-clock time, not latency, and not user-perceived response time. An application spending 95% of its time waiting on a database query will produce a thin, sparse flame graph that looks like it has almost nothing to optimize, because the process is barely on CPU at all.

Wall-clock profiling addresses this partially. Tools like async-profiler’s -e wall mode or py-spy’s wall-clock sampling capture samples regardless of whether the process is on CPU. This shows you what the thread is doing when it is slow, including what stack led to a blocking call. The trade-off is that wall-clock samples from a highly concurrent system can be harder to interpret because multiple threads accumulate time simultaneously and the samples are not weighted by CPU impact.
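The idea behind wall-clock sampling can be illustrated with a small in-process sampler. This is a simplified sketch, not how async-profiler or py-spy work internally: it polls sys._current_frames() from a second thread at a fixed interval, and wait_for_io is a hypothetical function that stands in for a blocking call.

```python
import sys
import threading
import time
from collections import Counter

def wait_for_io():
    # Stand-in for a blocking call; the thread is off CPU for the whole sleep.
    time.sleep(0.5)

def sample_wall_clock(thread, interval=0.01, duration=0.2):
    """Record the thread's innermost Python frame at a fixed wall-clock interval."""
    counts = Counter()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        frame = sys._current_frames().get(thread.ident)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)
    return counts

worker = threading.Thread(target=wait_for_io)
worker.start()
time.sleep(0.05)                 # give the worker time to reach its blocking call
counts = sample_wall_clock(worker)
worker.join()

print(counts.most_common(1))
```

Because samples are taken on a wall-clock schedule rather than only while the thread runs, the blocked function dominates the result, whereas a CPU profile of the same thread would be nearly empty.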

Seeing Off-CPU Time with eBPF

The most complete approach to off-CPU analysis uses eBPF, specifically tools built on the BPF Compiler Collection (BCC). The offcputime tool hooks the scheduler's context-switch path to record when a thread leaves the CPU and when it resumes, then attributes the elapsed off-CPU time to the stack trace on which the thread blocked. With its folded output option (-f), the output feeds directly into Gregg’s flame graph scripts to produce an off-CPU flame graph.
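The accounting can be sketched over a simulated event stream. This is a conceptual model only, not the eBPF program that offcputime loads: the event tuples, thread IDs, and stack strings are hypothetical, and timestamps are in microseconds.

```python
from collections import defaultdict

def off_cpu_by_stack(events):
    """Sum off-CPU time per blocking stack.

    events: iterable of (timestamp_us, kind, tid, stack), where kind is
    'switch_out' (thread leaves the CPU) or 'switch_in' (thread resumes).
    """
    blocked = {}                 # tid -> (time it left the CPU, stack it blocked on)
    totals = defaultdict(int)    # stack -> accumulated off-CPU microseconds
    for ts, kind, tid, stack in events:
        if kind == "switch_out":
            blocked[tid] = (ts, stack)
        elif kind == "switch_in" and tid in blocked:
            left_at, blocked_stack = blocked.pop(tid)
            totals[blocked_stack] += ts - left_at
    return dict(totals)

# Hypothetical trace: thread 101 blocks twice in read(), thread 102 once on a lock.
events = [
    (0,   "switch_out", 101, "main;handle;read"),
    (50,  "switch_in",  101, None),
    (60,  "switch_out", 101, "main;handle;read"),
    (110, "switch_in",  101, None),
    (120, "switch_out", 102, "main;log;futex_wait"),
    (420, "switch_in",  102, None),
]

totals = off_cpu_by_stack(events)
for stack, micros in totals.items():
    print(f"{stack} {micros}")
```

The printed lines are already in folded form, with microseconds in place of sample counts, which is why off-CPU output can be rendered by the same flame graph tooling.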

# Collect folded off-CPU stacks (-f) with a user/kernel delimiter (-d) for one PID for 30 seconds
sudo /usr/share/bcc/tools/offcputime -df -p $(pgrep -nx myservice) 30 > offcpu.txt
./flamegraph.pl --color=io --title='Off-CPU Time' --countname=us < offcpu.txt > offcpu.svg

The combination of a CPU flame graph and an off-CPU flame graph gives you a much more complete picture: you can see both where CPU cycles go and where time is spent waiting. A lighter-weight alternative is bpftrace, which compiles a short script to eBPF bytecode and avoids the Python front end of the older BCC tools. A one-liner that captures off-CPU stacks looks like this:

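# Assumes a BTF-enabled kernel that exposes finish_task_switch as a kprobe target (the symbol can be inlined or suffixed on some builds)
bpftrace -e 'kprobe:finish_task_switch { $prev = (struct task_struct *)arg0; @start[$prev->pid] = nsecs; $last = @start[tid]; if ($last != 0) { @[kstack, ustack, comm] = sum(nsecs - $last); delete(@start[tid]); } }'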

For JVM applications, async-profiler supports mixed-mode wall-clock profiling that sees Java frames, including JIT-compiled methods, alongside native frames. This matters because JVM garbage collection pauses stop application threads, which is another category of off-CPU time that a CPU profile of those threads misses entirely, and GC-induced latency spikes can easily masquerade as general application slowness.

The Observer Effect

Every profiling tool perturbs the system it measures. The question is by how much and whether the perturbation changes the conclusion.

Instrumentation-based profilers such as gprof, which requires building with -pg, add a hook at every function entry at compile time. The overhead is proportional to call frequency: a function called ten million times per second can see overhead of 20 to 30% from instrumentation alone, which distorts lock contention windows and shifts relative timing between components. You are no longer measuring the original system.
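The same effect is easy to reproduce in miniature. The sketch below is an analogy rather than a model of gprof: it uses Python's sys.setprofile hook, which fires on every function call and return, to instrument a hypothetical tiny function and compares the run time with and without the hook. The absolute numbers depend on the machine and interpreter, so none are quoted here.

```python
import sys
import time

def tiny(x):
    return x + 1

def run(n):
    """Call tiny() n times and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    for i in range(n):
        tiny(i)
    return time.perf_counter() - start

call_counts = {"tiny": 0}

def hook(frame, event, arg):
    # Invoked by the interpreter for every call and return while installed.
    if event == "call" and frame.f_code.co_name == "tiny":
        call_counts["tiny"] += 1

N = 200_000
baseline = run(N)

sys.setprofile(hook)
try:
    instrumented = run(N)
finally:
    sys.setprofile(None)

print(f"baseline={baseline:.4f}s instrumented={instrumented:.4f}s "
      f"calls seen={call_counts['tiny']}")
```

The hook's cost is paid once per call, so the shorter the function and the more often it runs, the larger the share of the measured time that belongs to the instrumentation rather than to the program.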

Statistical sampling profilers have much lower overhead, but not zero. Linux perf at 99 Hz is cheap; at 10,000 Hz, the interrupt overhead starts to matter for latency-sensitive workloads. eBPF programs run in kernel space with JIT compilation and bounded execution, making them considerably cheaper than older approaches like strace, which stops the traced process and switches to the tracer at every syscall entry and exit and can slow a busy process by 10 to 100 times. But even eBPF adds measurable overhead when tracing high-frequency events like scheduler switches on a heavily loaded system.
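A back-of-the-envelope estimate makes the rate dependence concrete. The per-sample cost of 5 microseconds used below is an assumed figure chosen for illustration, not a measurement; the real cost depends on the unwinding method, stack depth, and hardware.

```python
def sampling_overhead(rate_hz, cost_per_sample_us):
    """Fraction of one CPU's time spent taking samples at the given rate."""
    return rate_hz * cost_per_sample_us / 1_000_000

ASSUMED_COST_US = 5  # hypothetical cost of one sample, in microseconds

low_rate = sampling_overhead(99, ASSUMED_COST_US)
high_rate = sampling_overhead(10_000, ASSUMED_COST_US)

print(f"99 Hz:     {low_rate:.2%} of a CPU")
print(f"10,000 Hz: {high_rate:.2%} of a CPU")
```

Under that assumption the overhead grows from roughly 0.05% of a CPU to 5%, a linear hundredfold increase that is negligible at the low rate and visible in tail latency at the high one.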

The practical implication is straightforward: profile at rates and with tools whose overhead is small relative to the effects you are trying to observe. If you are debugging a 100-microsecond latency spike, running strace can leave you with a process that runs 10x slower and in which the spike no longer appears at all. A targeted eBPF script or perf probe with a narrow scope is the right instrument instead.

Building a Diagnostic Workflow

The approach described in the Red Hat post starts with the symptom rather than the profiler. Is the problem high CPU utilization, high latency, or low throughput? Each symptom points to different tools.

High CPU utilization is a CPU profiler problem: start with perf stat to get hardware counters (instructions per cycle, cache miss rate, branch mispredictions), then use perf record with flame graphs to identify hot functions. A low instructions-per-cycle (IPC) figure combined with a high last-level cache (LLC) miss rate points to memory access patterns and cache locality issues rather than algorithmic complexity.
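The two ratios are simple to derive from raw counter values. In the sketch below the counter numbers are invented for illustration, and the thresholds used to flag a memory-bound profile are arbitrary assumptions rather than established cutoffs; what counts as low IPC varies by microarchitecture and workload.

```python
def ipc(instructions, cycles):
    """Instructions retired per CPU cycle."""
    return instructions / cycles

def miss_rate(misses, references):
    """Fraction of cache references that missed."""
    return misses / references

# Hypothetical counter values, as might be read from a perf stat run.
counters = {
    "cycles": 10_000_000_000,
    "instructions": 4_000_000_000,
    "llc_references": 200_000_000,
    "llc_misses": 80_000_000,
}

workload_ipc = ipc(counters["instructions"], counters["cycles"])
llc_miss_rate = miss_rate(counters["llc_misses"], counters["llc_references"])

# Illustrative thresholds only (assumptions, not standard values).
looks_memory_bound = workload_ipc < 1.0 and llc_miss_rate > 0.2

print(f"IPC={workload_ipc:.2f} LLC miss rate={llc_miss_rate:.0%} "
      f"memory-bound suspect={looks_memory_bound}")
```

A result like this one suggests looking at data layout and access order before rewriting the algorithm.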

High latency with normal CPU utilization is almost always an off-CPU problem: lock contention, I/O, memory allocation stalls, or scheduler delays. Start with offcputime or wall-clock profiling, then narrow down using perf lock for kernel lock contention, biolatency from BCC for block I/O latency distribution, or tcplife for per-connection TCP lifetimes and byte counts.

Once you identify the category, the Linux performance observability ecosystem provides purpose-built tools for almost every subsystem. perf mem profiles memory access patterns with hardware-assisted sampling. perf sched latency measures scheduling delays. llcstat from BCC shows LLC hit and miss rates per process. Each gives a cleaner signal than a general-purpose profiler applied to the wrong problem.

The deeper lesson from Red Hat’s work is that profiling is a method, not a single tool. Starting with the wrong profiler produces confident but incorrect conclusions. Understanding what each tool measures, what it excludes, and what perturbation it introduces is what separates a complete performance investigation from one that optimizes the code the profiler highlighted while leaving the actual bottleneck entirely untouched.