What Your Profiler Isn't Telling You

Most developers treat a flamegraph like a complete picture. You run a profiler, scroll through the output, find the widest stack frame, fix it, and declare a win. That workflow is not wrong, but it is dangerously incomplete. The Red Hat Performance Engineering blog has a sharp breakdown of this exact gap: profilers tell a precise but partial story, and the part they omit is often where the real problem lives.

The key insight, easy to miss if you’re just learning to profile, is that a CPU profiler only sees your program while it is running on a CPU. Everything your program does while waiting, sleeping, blocked on a lock, or stuck in a system call is invisible. If your service handles a request in 200ms and spends 180ms waiting for a database response, a CPU-only profile will show you a blazing-fast application. The profiler is not lying; it is just answering a different question than the one you’re asking.
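The gap between the two questions is easy to see in miniature. The sketch below is not a profiler; it just times one call with a wall clock and with a CPU-time clock. The `fake_request` handler and its 180 ms sleep are hypothetical stand-ins for a request that waits on a database.

```python
import time

def measure(fn):
    """Return (wall_seconds, cpu_seconds) for one call of fn."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()   # process CPU time; excludes time spent sleeping
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu

def fake_request():
    # Hypothetical handler: a little computation, then a blocking wait.
    sum(i * i for i in range(50_000))   # on-CPU work
    time.sleep(0.18)                    # stands in for waiting on a database

wall, cpu = measure(fake_request)
print(f"wall={wall * 1000:.0f} ms  cpu={cpu * 1000:.0f} ms  "
      f"off-CPU={(wall - cpu) * 1000:.0f} ms")
```

The exact numbers vary by machine, but the CPU figure should be a small fraction of the wall figure. A CPU-only profile of this handler would only ever see that small fraction.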

Sampling Isn’t Uniform

The most common profiler design is a sampling profiler: at a fixed interval, the profiler interrupts the running process, captures the current call stack, and continues. At 99 Hz (a rate commonly used with perf, and close to async-profiler’s default 10 ms interval), that’s roughly one sample every 10 milliseconds. A function that completes faster than that interval can run many times without ever appearing in a profile, as long as its total CPU time stays small.

This is not a bug in the tools; it’s a fundamental property of statistical sampling. The interpretation is probabilistic: a function appearing in 30% of samples consumed roughly 30% of CPU time during the profiling window. The sampling error on a single measurement can be significant, especially for functions with few samples. The sampling error when you’re comparing two implementations of a tight inner loop is severe. For microbenchmarking, you need a dedicated tool like JMH or criterion, not a sampling profiler.
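To put rough numbers on that uncertainty, here is a minimal sketch that treats every sample as an independent coin flip. That independence is an assumption made for simplicity; real samples are correlated in time, so treat the result as an order-of-magnitude estimate. The function name `sampling_error` is made up for this example.

```python
import math

def sampling_error(fraction, rate_hz, duration_s):
    """Approximate standard error of a sampled CPU fraction.

    Assumes independent samples of a single busy thread, which is a
    simplification of how real profiles behave.
    """
    n = rate_hz * duration_s
    return math.sqrt(fraction * (1.0 - fraction) / n)

# 99 Hz for 30 seconds on one busy thread gives 2970 samples.
se = sampling_error(0.30, 99, 30)
print(f"30% +/- {1.96 * se * 100:.1f} points (approximate 95% interval)")

# A function using 0.1% of CPU over the same window: expected sample count.
expected = 0.001 * 99 * 30
print(f"expected samples for a 0.1% function: {expected:.2f}")
```

Under these assumptions a 30% frame is good to within a couple of percentage points, while a function at 0.1% of CPU is expected to show up in about three samples. That is too few to compare two versions of it.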

The JVM Safepoint Problem

For Java specifically, sampling profilers had a deep structural problem for years: the JVM’s safety model. HotSpot needs occasional “safepoints”, moments where all threads are paused so the JVM can do housekeeping like garbage collection, deoptimization, and code patching. Traditional JVM profilers could only safely inspect thread stacks at these safepoints.

This creates systematic bias. Safepoint polls sit at predictable places in generated code, such as method returns and the back-edges of uncounted loops. Code that runs in a tight, safepoint-free loop (certain intrinsics, counted loops with JIT optimizations applied) appears less frequently in profiles than it actually runs. You can stare at a JVM profile showing nothing suspicious while 40% of your CPU time burns in an optimized loop that never happens to be at a safepoint when the profiler checks.
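A toy model makes the bias concrete. It is a deliberately simplified simulation, not a description of HotSpot internals: a thread repeats a 10 ms cycle in which a hypothetical `hot_loop` runs for 8 ms with no safepoint poll, then a hypothetical `small_call` runs for 2 ms. The biased profiler defers every sample to the next poll, which this model places at the start of `small_call`.

```python
from collections import Counter

CYCLE_US = 10_000            # one iteration of the toy workload
HOT_END_US = 8_000           # hot_loop occupies [0, 8000) with no safepoint poll
SAMPLE_INTERVAL_US = 10_101  # roughly 99 Hz, so samples drift across the cycle

def running_at(phase_us):
    """Which function is actually on the CPU at this phase of the cycle."""
    return "hot_loop" if phase_us < HOT_END_US else "small_call"

def profile(n_samples, safepoint_biased):
    counts = Counter()
    for k in range(n_samples):
        phase = (k * SAMPLE_INTERVAL_US) % CYCLE_US
        if safepoint_biased and phase < HOT_END_US:
            # The stack is only captured once the thread reaches its next
            # safepoint poll, modeled here as the start of small_call.
            phase = HOT_END_US
        counts[running_at(phase)] += 1
    return {name: counts[name] / n_samples for name in ("hot_loop", "small_call")}

unbiased = profile(1000, safepoint_biased=False)
biased = profile(1000, safepoint_biased=True)
print("unbiased:", unbiased)
print("biased:  ", biased)
```

In this model the unbiased profile attributes about 80% of samples to `hot_loop`, while the biased one attributes none. Real workloads are less extreme, but the skew runs in the same direction.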

async-profiler addressed this by using AsyncGetCallTrace (AGCT), an unofficial HotSpot-internal API that is not part of JVMTI and that can sample outside safepoints. This was a significant advance, and it’s why async-profiler produces noticeably different (and more accurate) profiles than older JVMTI-based profilers like YourKit or JProfiler operating in sampling mode. The difference is visible on workloads with heavily optimized inner loops.

Nitsan Wakart’s deep analysis of safepoint bias is still the most thorough treatment of this problem. It’s from 2016, but the underlying mechanics haven’t changed.

Off-CPU Time: The Invisible Majority

For latency-sensitive work, the biggest profiler blind spot is off-CPU time. When a thread is blocked waiting for a lock, reading from a socket, flushing to disk, or just sleeping, a CPU profiler records zero samples. From the profiler’s perspective, that time did not happen.

Brendan Gregg’s off-CPU analysis addresses this directly. The approach uses kernel tracing to capture thread scheduling events: when a thread is descheduled and why, and when it resumes. The resulting off-CPU flamegraph looks similar to a CPU flamegraph but represents time spent waiting rather than executing. Combining both gives you a wall-clock picture.
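The bookkeeping behind an off-CPU profile can be sketched in a few lines. The event format below is invented for illustration and is not the output of perf or any other tracer: each tuple is (time in microseconds, thread id, "out" or "in", stack at the moment the thread left the CPU).

```python
from collections import defaultdict

def off_cpu_by_stack(events):
    """Sum blocked time per stack from (time_us, tid, kind, stack) tuples.

    kind is "out" when the thread leaves the CPU (the stack is recorded)
    and "in" when it is scheduled again (the stack field is ignored).
    """
    switched_out = {}            # tid -> (time_us, stack)
    totals = defaultdict(int)    # stack -> blocked microseconds
    for time_us, tid, kind, stack in sorted(events, key=lambda e: e[0]):
        if kind == "out":
            switched_out[tid] = (time_us, stack)
        elif kind == "in" and tid in switched_out:
            start, blocked_stack = switched_out.pop(tid)
            totals[blocked_stack] += time_us - start
    return dict(totals)

# Hypothetical trace: thread 1 waits on a socket, then on fsync;
# thread 2 waits on a lock.
events = [
    (0, 1, "out", "handle;query;socket_read"),
    (180_000, 1, "in", ""),
    (181_000, 1, "out", "handle;log;fsync"),
    (183_000, 1, "in", ""),
    (50_000, 2, "out", "worker;lock_acquire"),
    (90_000, 2, "in", ""),
]
blocked = off_cpu_by_stack(events)
for stack, micros in sorted(blocked.items(), key=lambda kv: -kv[1]):
    print(f"{micros / 1000:7.1f} ms  {stack}")
```

Each stack's total becomes the width of a frame in the off-CPU flamegraph. A real tracer also has to handle threads that are already blocked when tracing starts, which this sketch ignores.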

perf on Linux can capture this with perf sched or by tracing sched:sched_switch events, though the data volume is enormous on busy systems. For JVM workloads, async-profiler’s wall clock mode is a practical alternative: instead of sampling only on-CPU threads, it samples all threads at regular intervals regardless of CPU state. The wall clock profile shows you where threads are spending time end-to-end.

The tradeoff is overhead. Sampling blocked threads requires waking them briefly to capture their stack, which increases profiler cost. In practice, wall mode at a low sampling rate (50-100 Hz) is the right default for diagnosing latency problems; CPU mode at higher rates is better for throughput tuning.

Flamegraph Reading Pitfalls

Flamegraphs are the standard visualization for sampling profiler output, and they are genuinely useful. But a few properties are worth keeping in mind.

The width of a frame represents samples, not time per invocation. A wide frame at the top of many stacks means that function was on the CPU frequently; it does not tell you whether it was called once with long duration or a million times with short duration. Call counts require instrumentation, not sampling.

The ordering of frames at the same depth is alphabetical, not temporal. Left-to-right is not chronological. A flamegraph does not show you the sequence in which functions were called; it shows you where time was aggregated across the entire profiling window. If your code alternates between two equally expensive phases, the flamegraph shows them side by side with no indication of their interleaving.
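Both properties fall out of how samples are aggregated before rendering. The sketch below collapses sampled stacks into a "folded" mapping of stack string to sample count, which is the general shape of input flamegraph tools consume. The function names and sample data are made up for the example.

```python
from collections import Counter

def fold(samples):
    """Collapse sampled stacks into 'frame;frame;frame' -> sample count."""
    return Counter(";".join(stack) for stack in samples)

# Two phases that strictly alternate over time...
interleaved = [("main", "render"), ("main", "parse")] * 50
# ...versus the same two phases running one after the other.
grouped = [("main", "render")] * 50 + [("main", "parse")] * 50

folded_interleaved = fold(interleaved)
folded_grouped = fold(grouped)

# Once folded, the two timelines are indistinguishable.
print(folded_interleaved == folded_grouped)

# Rendering sorts siblings by name, so parse lands left of render
# even though render was sampled first.
for stack, count in sorted(folded_interleaved.items()):
    print(stack, count)
```

The count per stack is all that survives. Whether those 50 samples came from one long call or many short ones is also gone by this point.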

Differential flamegraphs are a technique for comparing before-and-after profiles. Red frames grew, blue frames shrank. This sidesteps the interpretation problem of “is this frame wide because of a regression or because the workload changed” by making the comparison explicit. They’re underused.
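The underlying comparison is simple arithmetic. The sketch below assumes each profile is a dict mapping a stack string to a sample count. Normalizing to each profile's share of total samples is a choice made here so that profiles of different lengths stay comparable; it is not necessarily what any particular tool does by default. The stack names and counts are hypothetical.

```python
def diff_profiles(before, after):
    """Per-stack change in share of samples, after minus before."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    deltas = {}
    for stack in set(before) | set(after):
        share_before = before.get(stack, 0) / total_before
        share_after = after.get(stack, 0) / total_after
        deltas[stack] = share_after - share_before
    return deltas

# Hypothetical profiles: the second run is longer and parse has grown.
before = {"main;parse": 300, "main;render": 700}
after = {"main;parse": 600, "main;render": 600}

deltas = diff_profiles(before, after)
for stack, delta in sorted(deltas.items()):
    color = "red" if delta > 0 else "blue"
    print(f"{stack}: {delta * 100:+.0f} points ({color})")
```

Note that `main;render` lost raw samples and share here, while `main;parse` gained both. Comparing raw counts across runs of different length would blur that picture.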

The Observer Effect

Every profiler imposes overhead, and that overhead changes what you’re measuring. This is not hypothetical. At 99 Hz with a JVM profiler, the overhead is typically low enough to be acceptable. At 1000 Hz, or with instrumentation mode enabled, or with memory allocation tracking turned on, the overhead can be 10-30%. A program spending most of its time allocating objects will look completely different under an allocation-heavy profiling mode: the profiler’s own allocation tracking becomes a major consumer.

The more subtle version of this problem appears with cache effects. Profiler code, running in the same process, competes for CPU caches with your application. If your application’s working set fits in L2 cache, and the profiler repeatedly evicts that working set, the profiled version may run substantially slower than the unprofiled version, and the hot spots may shift to reflect cache misses that don’t exist in production.

This is why production profiling with low overhead (perf at 49 Hz, async-profiler at 99 Hz, continuous JFR with event thresholds) is a different discipline from development profiling with higher fidelity. The former tells you about real production behavior; the latter helps you understand a specific code path in isolation.

What Good Profiling Practice Looks Like

The practical takeaway from understanding these gaps is to use multiple tools at different layers and triangulate.

Start with system-level tools: vmstat, iostat, and top tell you whether you have a CPU problem, an I/O problem, a memory pressure problem, or a scheduling problem. A CPU utilization of 5% with high latency is not a CPU profiling problem.

For JVM work specifically: async-profiler in CPU mode shows on-CPU hotspots without safepoint bias; async-profiler in wall mode shows latency distribution including blocked time; Java Flight Recorder with its low-overhead continuous mode captures GC events, lock contention histograms, and class loading alongside CPU samples. These three together cover most of what you need.

For native code: perf stat gives you hardware counters (cache misses, branch mispredictions, instructions per cycle) that sampling alone doesn’t surface. A function appearing wide in a flamegraph with a high cache miss rate is a fundamentally different problem from the same function with good cache behavior but high instruction count. Intel VTune and AMD uProf give you microarchitectural detail that perf sampling doesn’t.
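Raw counter totals become useful once you turn them into ratios. The sketch below assumes you have already collected totals keyed by common perf event names (cycles, instructions, cache-references, cache-misses, branches, branch-misses). The numbers are invented, and the `derived_metrics` helper is hypothetical.

```python
def derived_metrics(counters):
    """Compute IPC and miss ratios from raw hardware counter totals."""
    return {
        "ipc": counters["instructions"] / counters["cycles"],
        "cache_miss_ratio": counters["cache-misses"] / counters["cache-references"],
        "branch_miss_ratio": counters["branch-misses"] / counters["branches"],
    }

# Invented totals for a function that looks wide in a flamegraph.
counters = {
    "cycles": 4_000_000_000,
    "instructions": 2_000_000_000,
    "cache-references": 100_000_000,
    "cache-misses": 40_000_000,
    "branches": 500_000_000,
    "branch-misses": 5_000_000,
}

metrics = derived_metrics(counters)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

A low IPC paired with a high cache miss ratio, as in these made-up numbers, points toward memory access patterns rather than instruction count. What counts as "low" or "high" depends on the hardware and the workload.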

The piece from the Red Hat Performance team makes one point that’s easy to underweight: profiler output is evidence, not diagnosis. A wide frame in a flamegraph tells you where time went; it does not tell you why it went there or whether reducing it is feasible. The performance engineer’s job starts after the profiler output, not at the flamegraph.

Understanding what each tool measures, and at least as importantly what it cannot see, is the prerequisite for using them correctly. A flamegraph of only on-CPU time, interpreted as a complete picture of your program’s behavior, can send you optimizing the wrong thing entirely while the actual bottleneck never shows up in any profile at all.