The Profiler's Blind Spots: Skid, Off-CPU Time, and Memory-Bound Workloads

Source: lobsters

A profiler is one of the first tools a developer reaches for when something is slow. The feedback loop feels tight: run your workload, sample call stacks at some frequency, visualize the result as a flame graph, find the wide bars, fix those. Red Hat’s performance engineering blog recently explored what this process actually shows and, more importantly, what it does not. That second part deserves more attention than it usually gets.

How Sampling Profilers Work

Most CPU profilers used in production, including Linux perf, the BCC/eBPF tool collection, and language-level profilers like async-profiler for the JVM, work by sampling. At a configured interval they interrupt the CPU, capture the current instruction pointer and a call stack, then let execution continue. Aggregate enough samples and patterns emerge: functions that appear frequently in the stack are consuming CPU time.
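
The aggregation step can be sketched in a few lines. This is an illustration only: the stacks are made-up sample data and fold_stacks is a hypothetical helper, not part of perf or any profiler. It collapses sampled stacks into the semicolon-joined "folded" form that flame graph tooling commonly consumes, where each line's count determines the width of a bar.

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse sampled call stacks (outermost frame first) into 'a;b;c count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    # Most frequent stacks first; ties broken alphabetically for stable output.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{stack} {n}" for stack, n in ordered]

# Made-up samples: each tuple is one captured stack, outermost frame first.
samples = [
    ("main", "parse", "read_token"),
    ("main", "parse", "read_token"),
    ("main", "render"),
    ("main", "parse", "read_token"),
]

for line in fold_stacks(samples):
    print(line)
```

A stack that shows up in three of four samples gets three quarters of the width, regardless of how long the run took in wall-clock terms.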

The statistical premise is sound. If a function accounts for 30% of your program’s CPU time, it should appear in roughly 30% of your samples. With enough samples you get a reasonable picture of where compute goes.
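
How many samples count as "enough" can be estimated with the standard error of a proportion. The sketch below is a back-of-the-envelope calculation that assumes samples are independent, which real profiles only approximate; sampling_error is a hypothetical helper name.

```python
import math

def sampling_error(p, n):
    """Standard error of an estimated fraction p from n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)

# A function that truly accounts for 30% of CPU time:
for n in (100, 1_000, 10_000):
    se = sampling_error(0.30, n)
    print(f"{n:>6} samples: 30% +/- {2 * se * 100:.1f} points (roughly a 95% interval)")
```

Under these assumptions, 100 samples pin the estimate down only to about nine percentage points either way, so short profiling runs can make small functions appear or vanish by chance.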

The problem is that “enough samples” and “reasonable picture” both conceal significant complexity.

Instruction Skid

Hardware performance counters count hardware events: cache misses, branch mispredictions, retired instructions. When a counter reaches its overflow threshold, the CPU generates an interrupt and the profiler records the instruction pointer.

The CPU does not stop at the exact instruction that triggered the event. Modern out-of-order processors execute instructions speculatively, in parallel, across multiple pipeline stages at once. By the time the interrupt is delivered, the instruction pointer may have advanced several to dozens of instructions past the one that caused the count to overflow. This displacement is called instruction skid.
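
A toy model makes the effect concrete. The sketch below is not how any real CPU behaves: it assumes every event is caused by one instruction and that the recorded instruction pointer lands a uniformly random 0 to max_skid instructions later. The function name and the parameters are hypothetical.

```python
import random
from collections import Counter

def sample_with_skid(true_ip, n_events, max_skid, seed=0):
    """Toy model: each event at true_ip is recorded 0..max_skid instructions late."""
    rng = random.Random(seed)
    return Counter(true_ip + rng.randint(0, max_skid) for _ in range(n_events))

# In this toy model, every event is really caused by instruction 5.
recorded = sample_with_skid(true_ip=5, n_events=1000, max_skid=3)
for ip in sorted(recorded):
    print(f"instruction {ip}: {recorded[ip]} samples")
```

The samples smear across the instructions that follow the real culprit, which is why an annotated listing can show a cheap instruction as hot when the expensive one sits just before it.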

In practice, this means the instruction perf marks as hot may not be the actual bottleneck. Intel’s Precise Event Based Sampling (PEBS) and AMD’s Instruction Based Sampling (IBS) were designed to address this. With PEBS, the CPU records architectural state at the exact instruction that caused the counter overflow into a dedicated hardware buffer; the profiler reads from that buffer rather than the current instruction pointer. The result is much more precise event attribution.

To use PEBS with perf, select an event with the :p or :pp modifier:

# Standard sampling, subject to skid
perf record -e cache-misses -g ./myprogram

# PEBS-precise sampling, reduced skid
perf record -e cache-misses:pp -g ./myprogram

The :pp suffix requests zero skid; :ppp requires it, and :P asks perf to use the highest precise level it detects on the hardware. Not all events support PEBS, and capability varies by CPU generation, but for L1/L2/L3 cache load misses and retired instructions it is usually available on recent Intel processors. AMD’s IBS covers a different event set and operates at the instruction dispatch level, which provides its own useful precision characteristics.

The Observer Effect

Any profiler adds overhead, and overhead changes the behavior being measured. At the default perf record sampling rate, which is 4000 Hz in current perf versions, the overhead is usually low, often a few percent or less for CPU-bound workloads. Increase the rate to catch short-lived hotspots, or sample a high-frequency hardware event like all memory loads, and the overhead compounds quickly.

A subtler form of observer effect comes from the profiler competing for the same hardware resources as the code under test. The profiler’s sampling handler (an interrupt handler for perf, a signal handler for many language-level profilers) shares the instruction cache with the workload. PEBS output buffers consume TLB entries. On a multi-socket system, profiling infrastructure running on one NUMA node can generate cross-node memory traffic that inflates latency measurements elsewhere.

For kernel profiling or high-throughput workloads, perf record can produce multi-gigabyte trace files rapidly on a busy system, and the I/O involved in writing those files perturbs I/O-bound measurements. The -m flag controls the in-kernel ring buffer size per CPU:

# Limit ring buffer to 4 MB per CPU (a bare number would mean pages, not megabytes)
perf record -m 4M -e cycles:pp -g ./myprogram

Smaller buffers mean more frequent flushes, which adds overhead of a different kind. There is no configuration that eliminates the trade-off; you pick the overhead profile that least contaminates your specific measurement.

Off-CPU Time Is Invisible

The most fundamental limitation of CPU sampling profilers is that they only record where the CPU is executing your code. They cannot record where your threads are blocked.

If a thread is waiting on a mutex, sleeping in a read() call, or blocked on a network receive, it does not appear in CPU samples. A workload that spends 80% of its time blocked on I/O yields few samples, yet its flame graph still fills the full width, because bar widths are relative to the samples collected rather than to wall-clock time. The profiler shows the remaining 20% of CPU work in detail and gives no signal about the actual bottleneck.
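
The gap between CPU time and wall-clock time is easy to observe from inside a process. The sketch below uses only Python's standard library; time.sleep stands in for any blocking call (a read, a lock, a network receive), the 0.2-second duration is arbitrary, and measure is a hypothetical helper.

```python
import time

def measure(fn):
    """Return (wall_seconds, cpu_seconds) spent running fn()."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - wall0, time.process_time() - cpu0

# Blocking stand-in: the thread is off-CPU for almost the whole call.
wall, cpu = measure(lambda: time.sleep(0.2))
print(f"wall={wall:.3f}s cpu={cpu:.3f}s off-cpu~{wall - cpu:.3f}s")
```

A sampling CPU profiler sees only the cpu portion of that measurement; the remainder is what off-CPU analysis has to account for.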

Off-CPU analysis requires different tooling. Linux’s eBPF subsystem makes this tractable without modifying the kernel or the application. By attaching probes to the scheduler’s context-switch events, you can measure how long threads spend off-CPU and capture the call stack at the point where the thread went to sleep. Brendan Gregg documented the off-CPU flame graph methodology in detail. A simplified version with bpftrace looks like this; it times each thread from a blocking context switch to its next wakeup, so time spent runnable but waiting for a CPU is not counted:

bpftrace -e '
tracepoint:sched:sched_switch {
  if (args->prev_state) {
    @start[args->prev_pid] = nsecs;
    @stacks[args->prev_pid] = ustack;
  }
}
tracepoint:sched:sched_wakeup {
  $start = @start[args->pid];
  if ($start) {
    @offcpu_us[args->pid, @stacks[args->pid]] =
      hist((nsecs - $start) / 1000);
    delete(@start[args->pid]);
    delete(@stacks[args->pid]);
  }
}'

This pairs with CPU profiling to give a complete picture: the CPU flame graph shows where time is spent computing; the off-CPU flame graph shows where time is spent waiting. Running both together is often the fastest way to determine whether a latency problem is compute-bound or contention-bound.

What Flame Graphs Hide About Memory-Bound Code

Flame graphs are an excellent tool for identifying hot code paths. They are also effective at producing misleading conclusions about memory-bound workloads.

Consider a tight loop that processes a large array. The loop’s instructions are cheap in isolation, but if the array does not fit in cache, each access can cause a last-level cache miss that costs roughly 100 to 300 cycles to service from DRAM. The CPU stalls waiting for memory, and sampling profilers attribute those stall cycles to the load or to a nearby instruction that depends on it. The flame graph shows the loop as the hot path, which is accurate so far as it goes, but rewriting the loop logic will not help. The fix involves restructuring data access patterns, reducing working set size, or prefetching, none of which the flame graph suggests.

Hardware event profiling helps distinguish this case. Sample on cache miss events rather than on time:

# Sample on L3 load-miss events
perf record -e LLC-load-misses:pp -g ./myprogram
perf report

If the count of L3 misses correlates with the performance problem, you are probably dealing with a memory-bound workload. Instructions per cycle (IPC) is another diagnostic: a workload with low IPC despite high CPU utilization is often memory-bound, although branch mispredictions and front-end stalls also lower IPC. You can get IPC directly from perf stat:

perf stat -e cycles,instructions ./myprogram
# IPC = instructions / cycles
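
If you want the ratio computed for you, perf stat can emit comma-separated output with -x, which is simple to parse. The sketch below assumes each line begins with the counter value, the unit, and the event name, and that the events are reported under the plain names cycles and instructions; on some systems the names carry suffixes or PMU prefixes, so treat the lookup as an assumption. The numbers in sample are invented for illustration, and ipc_from_perf_csv is a hypothetical helper.

```python
def ipc_from_perf_csv(text):
    """Compute IPC from `perf stat -x, -e cycles,instructions` style output.

    Assumes each data line starts with: value, unit, event name.
    """
    counts = {}
    for line in text.splitlines():
        fields = line.split(",")
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip blank lines and rows such as "<not counted>"
        counts[fields[2]] = int(fields[0])
    return counts["instructions"] / counts["cycles"]

# Invented numbers in the assumed CSV layout, for illustration only.
sample = """\
4000000000,,cycles,1000000000,100.00,,
1600000000,,instructions,1000000000,100.00,0.40,insn per cycle
"""
print(f"IPC = {ipc_from_perf_csv(sample):.2f}")
```

What counts as low depends on the core's width, so comparing against a known-good run of the same workload is usually more informative than an absolute threshold.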

Intel’s Top-Down Microarchitecture Analysis (TMA) methodology formalizes this categorization. By counting a specific set of PMU events simultaneously, TMA classifies CPU time into Front End Bound, Back End Bound (which covers memory stalls as well as contention for execution resources), Bad Speculation, and Retiring (useful work) categories. On supported Intel CPUs with a recent perf, a reasonable approximation runs directly on Linux with perf stat:

perf stat -M TopdownL1 ./myprogram

The output gives you a coarse breakdown of where cycles are consumed at the microarchitectural level, before you spend time chasing flame graph hotspots that point at the wrong layer of the stack.

Profiling Is One Input in a Larger Diagnostic Process

A profiler shows where the CPU is and is not spending cycles. It does not tell you whether the workload’s design is sound, whether the algorithm has the right complexity class, or whether system configuration is contributing to the problem.

A workload bottlenecked on NUMA memory topology will not obviously reveal that in a CPU flame graph; you need numastat and memory locality analysis. A workload bottlenecked on filesystem metadata operations needs opensnoop or funccount from the BCC toolkit, not perf record. A latency problem caused by scheduler preemption needs kernel scheduling event analysis with trace-cmd or a tracepoint-based eBPF script. These are entirely different tools targeting entirely different parts of the system.

The performance engineering workflow that works treats profiling as a way to validate a hypothesis, not to generate one. You correlate profiler data against system-level metrics: CPU utilization, memory bandwidth from uncore memory-controller counters (event names vary by platform; perf list shows what is available), network throughput, scheduler run queue latency histograms. When the profiler points at a function, the useful question is whether that function is slow because of computation, memory access patterns, lock contention, or something that does not appear in the profiler at all.

The framing from Red Hat’s performance team is useful here. The profiler is a tool with a specific aperture: it sees CPU time and call stacks, and nothing else unless you explicitly instrument for it. Understanding those limits precisely is what lets you know when to set the profiler down and reach for a different instrument, and that judgment is most of what separates effective performance engineering from staring at flame graphs until something looks suspicious.
