
What Profilers Can't See: The Systematic Blind Spots in Performance Analysis

Source: lobsters

Sampling profilers work by interrupting execution at regular intervals and recording the current call stack. Do this enough times, and you have a statistical picture of where CPU time is spent. Flame graphs visualize exactly this. They are honest tools, but their honesty is narrow: they show you what the CPU was running, not what the program was waiting for.

That distinction matters most when your performance problem is a latency problem rather than a throughput problem. A web service with a 200ms p95 response time might have a flame graph showing JSON serialization as the hot path. You optimize it, cutting its CPU time in half. The p95 barely moves. The reason is that the 200ms included time the request spent blocked on a database query, sitting in a lock queue, waiting for a GC pause to finish, or waiting in the kernel scheduler's run queue. None of that appears in a CPU flame graph of the request path, because the thread handling the request is not running on a CPU during those periods.
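A back-of-the-envelope sketch makes the arithmetic concrete. The breakdown below is hypothetical: the component names and millisecond figures are invented for illustration, not measured. It shows why halving a visible CPU hotspot can leave the overall latency almost unchanged.

```python
# Hypothetical breakdown of one 200 ms request; every figure here is an
# illustrative assumption, not a measurement.
request_ms = {
    "db_query_wait": 120.0,      # off-CPU: blocked on the database
    "lock_queue_wait": 30.0,     # off-CPU: waiting for a contended lock
    "gc_pause": 20.0,            # stopped by the JVM, not running request code
    "scheduler_delay": 10.0,     # runnable but not yet scheduled
    "json_serialization": 20.0,  # on-CPU: the part a CPU flame graph shows
}


def total_latency(parts):
    """Sum the per-component latencies in milliseconds."""
    return sum(parts.values())


def speed_up(parts, name, factor):
    """Return a copy of parts with one component made `factor` times faster."""
    faster = dict(parts)
    faster[name] = parts[name] / factor
    return faster


before = total_latency(request_ms)
after = total_latency(speed_up(request_ms, "json_serialization", 2.0))
improvement_pct = 100.0 * (before - after) / before

print(f"before={before:.0f} ms  after={after:.0f} ms  gain={improvement_pct:.1f}%")
```

With these assumed numbers, a 2x speedup of the only component the flame graph highlights buys a 5% reduction in end-to-end latency.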

This is the core problem that Red Hat’s performance engineering team explores in their writeup. The article is worth reading on its own terms, but the underlying issue extends well beyond any specific example.

Safepoint Bias in the JVM

Java workloads carry an additional limitation that most developers encounter before they understand why. Many traditional JVM profilers, including early versions of JProfiler and YourKit and anything built on the JVMTI GetAllStackTraces API, sample at safepoints. Safepoints are points in the running code where the JVM can safely stop a thread, inspect the heap, and walk call stacks. They exist for GC, deoptimization, and similar operations.

The problem is that safepoints are not uniformly distributed. Tight computational loops may run many iterations without hitting one, especially counted loops, where the JIT compiler can omit the safepoint poll from the loop body. A method that spends the majority of its time in such a loop is nearly invisible to a safepoint-based profiler, because the profiler can only observe a thread once it reaches a safepoint, and the sample is attributed to that location rather than to the code that was running when the sample was requested.
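The effect can be illustrated with a toy simulation. This is a sketch under simplifying assumptions, not a model of any real JVM: a thread alternates between a long loop with no safepoint poll and a short method that has one, and a biased sampler records each sample at the next poll the thread reaches. The method names and durations are invented.

```python
from collections import Counter

# One cycle of a toy program: (method name, duration in microseconds,
# whether the method contains a safepoint poll). All values are illustrative.
CYCLE = [
    ("hot_loop", 9000, False),   # tight loop, no safepoint poll inside
    ("small_call", 1000, True),  # brief method that does reach a poll
]
CYCLE_US = sum(duration for _, duration, _ in CYCLE)


def index_at(t_us):
    """Index of the CYCLE entry that is executing at time t_us."""
    offset = t_us % CYCLE_US
    for i, (_, duration, _) in enumerate(CYCLE):
        if offset < duration:
            return i
        offset -= duration
    raise AssertionError("unreachable")


def sample(period_us, n_samples, safepoint_biased):
    """Count samples per method for an ideal or a safepoint-biased sampler."""
    counts = Counter()
    for k in range(n_samples):
        i = index_at(k * period_us)
        if safepoint_biased:
            # The thread keeps running until it reaches the next poll, and
            # the sample is recorded there, not where it was requested.
            while not CYCLE[i][2]:
                i = (i + 1) % len(CYCLE)
        counts[CYCLE[i][0]] += 1
    return counts


unbiased = sample(period_us=730, n_samples=1000, safepoint_biased=False)
biased = sample(period_us=730, n_samples=1000, safepoint_biased=True)
print("unbiased:", dict(unbiased))
print("biased:  ", dict(biased))
```

In this toy setup the unbiased sampler attributes 90% of samples to hot_loop, matching its share of the timeline, while the biased sampler attributes every sample to small_call.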

This is called the safepoint bias problem, and Nitsan Wakart documented it in detail years ago with concrete examples showing methods that consumed most of actual execution time but appeared as near-zero in profiler output. The fix arrived in the form of AsyncGetCallTrace (ASGCT), a non-standard API originally introduced by Sun, which allows profilers to sample outside safepoints by delivering a SIGPROF signal to a running thread and walking the stack in the signal handler.

async-profiler by Andrei Pangin is the most capable tool built on ASGCT. It also handles the failure modes around ASGCT itself, which can return error codes like ticks_unknown_Java or ticks_unknown_not_Java in certain JVM states. async-profiler logs these rather than silently dropping them. When a profile shows a meaningful fraction of unknown samples, that is information about where the JVM’s introspection breaks down, not noise to be dismissed.

The tool exposes the difference between CPU time and wall-clock time directly through its event flag:

# CPU profiling with ASGCT, no safepoint bias
# (profiler.sh is the async-profiler 2.x launcher; 3.x ships asprof, which takes the same options)
./profiler.sh -e cpu -d 30 -f cpu.html <pid>

# Wall-clock profiling: samples all threads, including blocked ones
./profiler.sh -e wall -d 30 -f wall.html <pid>

The difference between these two outputs is often substantial. The CPU profile shows what the processor was running. The wall-clock profile shows where time went, including all the blocking, sleeping, and waiting that the CPU profile elides entirely.
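The same split between CPU time and wall-clock time can be seen at the scale of a single function. The sketch below uses Python's standard timers on a toy handle_request function (a hypothetical name) that does a little computation and then sleeps to stand in for a blocking call.

```python
import time


def handle_request():
    """Toy request: a little computation plus a simulated blocking wait."""
    checksum = sum(i * i for i in range(50_000))  # on-CPU work
    time.sleep(0.05)  # off-CPU: stands in for I/O or a lock wait
    return checksum


wall_start = time.perf_counter()  # wall-clock timer
cpu_start = time.process_time()   # CPU time consumed by this process
handle_request()
cpu_s = time.process_time() - cpu_start
wall_s = time.perf_counter() - wall_start

print(f"wall={wall_s * 1000:.1f} ms  cpu={cpu_s * 1000:.1f} ms  "
      f"off-CPU~{(wall_s - cpu_s) * 1000:.1f} ms")
```

The CPU timer barely notices the sleep, which is the same reason a CPU profile of this function would show only the checksum loop.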

Off-CPU Analysis

For native code and as a complement to JVM profiling, off-CPU analysis fills the structural gap that CPU sampling leaves. Brendan Gregg formalized the framework: a thread is either on-CPU and executing, or off-CPU and blocked for some reason. A complete performance picture requires both views.

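Linux’s BCC toolkit includes offcputime, which uses eBPF to instrument scheduler context switches (the BCC Python tool attaches a kprobe to finish_task_switch; other implementations hook the sched_switch tracepoint) and records how long each thread spent off-CPU, with full userspace and kernel call stacks attached: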

# Record off-CPU stacks for 30 seconds for a specific PID
# (-f emits folded stacks for flamegraph.pl, -d separates kernel and user frames)
/usr/share/bcc/tools/offcputime -df -p <pid> 30 > offcpu.stacks

# Visualize as a flame graph using the standard flamegraph.pl
./flamegraph.pl --color=io --title="Off-CPU Time" --countname=us < offcpu.stacks > offcpu.svg

The resulting flame graph shows what blocked threads were waiting for. A tower that passes through futex_wait_queue_me (futex_wait_queue on newer kernels) points to lock contention. One that passes through ep_poll or do_sys_poll is I/O waiting. schedule_hrtimeout_range appears for timed waits. These paths are completely invisible to any profiler that only samples while the CPU is executing your code.
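Before opening the SVG, the same reading can be done numerically. The sketch below assumes the stacks are in the folded format that flamegraph.pl consumes (one "frame;frame;... value" line per stack, root first, value in microseconds). The sample input is invented, the application frame names are hypothetical, and the marker table covers only the kernel functions named above.

```python
from collections import defaultdict

# Kernel frames named in the text, mapped to a coarse wait category.
# The table is deliberately short; extend it for the kernel you run.
WAIT_MARKERS = {
    "futex_wait_queue_me": "lock contention",
    "ep_poll": "I/O wait",
    "do_sys_poll": "I/O wait",
    "schedule_hrtimeout_range": "sleep/timer",
}


def classify(frames):
    """Return the category of the first known marker, scanning root to leaf.

    Poll and epoll paths can themselves sleep via schedule_hrtimeout_range,
    so the outermost marker is treated as the more specific one.
    """
    for frame in frames:
        if frame in WAIT_MARKERS:
            return WAIT_MARKERS[frame]
    return "other"


def summarize(folded_lines):
    """Sum off-CPU microseconds per category from folded stack lines."""
    totals = defaultdict(int)
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        stack, value = line.rsplit(" ", 1)
        totals[classify(stack.split(";"))] += int(value)
    return dict(totals)


# Invented sample input; process and application frame names are hypothetical.
SAMPLE = [
    "myapp;worker_loop;pthread_mutex_lock;do_futex;futex_wait;futex_wait_queue_me;schedule 48000",
    "myapp;event_loop;epoll_wait;do_epoll_wait;ep_poll;schedule_hrtimeout_range;schedule 30000",
    "myapp;heartbeat;wait_for_tick;schedule_hrtimeout_range;schedule 2000",
]

for category, micros in sorted(summarize(SAMPLE).items(), key=lambda kv: -kv[1]):
    print(f"{category:16s} {micros / 1000:8.1f} ms")
```

A per-category total like this is a quick way to decide whether the flame graph is worth reading for lock contention or for I/O first.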

For Java specifically, GC pauses sit in a category of their own. The JVM stops all application threads during stop-the-world phases, and those pauses do not produce meaningful off-CPU stacks because every thread is in the same state simultaneously. The right tools here are JVM-level: jstat -gcutil <pid> 1000 gives live GC utilization, and -Xlog:gc*:file=gc.log:time,uptime,level,tags produces detailed pause records with timestamps that can be correlated with observed latency spikes. When a p95 anomaly aligns precisely with a GC log entry, that diagnosis is stronger than anything a profiler could have shown.
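Correlating GC logs with latency spikes is easy to script. The sketch below assumes log lines shaped like G1 output with the time and uptime decorators enabled; the sample lines are invented, the exact wording varies by collector and JDK version, and the spike timestamp is a hypothetical value already converted to JVM uptime.

```python
import re

# Captures the uptime decorator and the trailing pause duration, e.g.
# "[12.345s][info][gc] GC(7) Pause Young (Normal) ... 24.310ms".
PAUSE_RE = re.compile(
    r"\[(?P<uptime>\d+\.\d+)s\].*\bPause\b.*?(?P<ms>\d+\.\d+)ms\s*$"
)


def parse_pauses(lines):
    """Return (uptime_seconds, pause_ms) for each pause line found."""
    pauses = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            pauses.append((float(match.group("uptime")), float(match.group("ms"))))
    return pauses


def pauses_near(pauses, spike_uptime_s, window_s=0.5):
    """Pauses logged within window_s seconds of a latency spike."""
    return [(t, ms) for t, ms in pauses if abs(t - spike_uptime_s) <= window_s]


# Invented sample lines imitating unified-logging output.
SAMPLE_LOG = [
    "[2024-05-01T10:00:12.340+0000][12.340s][info][gc,heap] GC(7) Eden regions: 10->0(12)",
    "[2024-05-01T10:00:12.345+0000][12.345s][info][gc] GC(7) Pause Young (Normal) "
    "(G1 Evacuation Pause) 512M->128M(1024M) 24.310ms",
    "[2024-05-01T10:00:19.870+0000][19.870s][info][gc] GC(8) Pause Young (Normal) "
    "(G1 Evacuation Pause) 600M->130M(1024M) 181.902ms",
]

pauses = parse_pauses(SAMPLE_LOG)
suspects = pauses_near(pauses, spike_uptime_s=19.9)  # hypothetical spike time
print("pauses:", pauses)
print("near spike:", suspects)
```

The half-second matching window is an arbitrary choice for the sketch; in practice it should reflect how precisely the request timestamps and the JVM uptime can be aligned.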

Sampling Frequency and What It Hides

The frequency at which a profiler samples determines which events are statistically likely to appear. At 100Hz, a function that runs for 3ms and is called once per second accounts for 0.3% of CPU time, so you expect roughly 0.3 samples per second, or about 9 of the 3,000 samples in a 30-second profile. Spread across several call paths, that is easy to dismiss as noise, and a stall that happens only once may not be sampled at all. Increasing to 1000Hz brings such events into view, but introduces its own costs.
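The arithmetic generalizes to any combination of rate and duration. The sketch below computes the expected number of samples that land in a function on a single busy thread, plus the chance of getting none at all under the simplifying assumption that sample hits behave like a Poisson process. The scenarios are illustrative.

```python
import math


def expected_samples(sample_hz, call_duration_s, calls_per_s, profile_s):
    """Expected number of samples landing in a function on one busy thread."""
    cpu_fraction = call_duration_s * calls_per_s
    return sample_hz * cpu_fraction * profile_s


def probability_missed(expected):
    """Chance of zero samples, treating sample hits as a Poisson process."""
    return math.exp(-expected)


# Illustrative scenarios: (label, call duration in seconds, calls per second).
scenarios = [
    ("3 ms call, once per second", 0.003, 1.0),
    ("3 ms stall, once per minute", 0.003, 1.0 / 60.0),
]
for label, duration_s, rate in scenarios:
    for hz in (100, 1000):
        n = expected_samples(hz, duration_s, rate, profile_s=30)
        print(f"{label:28s} {hz:5d} Hz  expected={n:6.2f}  "
              f"P(no samples)={probability_missed(n):.3f}")
```

The once-per-minute case is the one that matters for tail latency: at low sample rates the most likely outcome is that the profile contains no trace of the stall.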

Sampling is not free. ASGCT-based profilers pay for signal delivery and a stack walk in the handler, and perf record pays for an interrupt and stack capture on every sample. At high rates, the profiler’s own memory accesses pollute the CPU cache, altering the behavior of cache-sensitive code in ways that make the profile misleading. A workload that has high L2 cache contention under normal conditions may appear to have different characteristics when a profiler is actively walking call stacks at 5000Hz.

perf stat sidesteps the sampling problem by reading hardware performance counters without taking periodic stack snapshots:

perf stat -e cache-misses,cache-references,instructions,cycles,\
branch-misses,L1-dcache-load-misses \
    -p <pid> -- sleep 30

These are aggregate counts for the measurement window, not per-function attributions, but they serve a critical validation role. If a profiler shows a suspected memory allocation hotspot and perf stat reports high L3 miss rates for the same period, the two measurements reinforce each other. If the cache numbers are clean despite the apparent hotspot, that is reason to question the profiler’s attribution rather than accept it.
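Raw counts are easier to compare once they are turned into ratios. The sketch below derives a few common ones from the events in the command above. The two sets of counts are invented to show the comparison; they are not output from a real run.

```python
def derived_metrics(counts):
    """Turn raw perf stat event counts into ratios that are easier to compare."""
    return {
        "ipc": counts["instructions"] / counts["cycles"],
        "cache_miss_ratio": counts["cache-misses"] / counts["cache-references"],
        "branch_misses_per_kilo_instr":
            1000.0 * counts["branch-misses"] / counts["instructions"],
    }


# Invented counts for two 30-second windows of the same service.
baseline = {
    "instructions": 90_000_000_000,
    "cycles": 60_000_000_000,
    "cache-misses": 50_000_000,
    "cache-references": 1_000_000_000,
    "branch-misses": 180_000_000,
}
suspect = {
    "instructions": 60_000_000_000,
    "cycles": 60_000_000_000,
    "cache-misses": 400_000_000,
    "cache-references": 1_000_000_000,
    "branch-misses": 120_000_000,
}

for name, counts in (("baseline", baseline), ("suspect", suspect)):
    metrics = derived_metrics(counts)
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

In this made-up comparison the suspect window retires fewer instructions per cycle and misses the cache far more often, which would support a memory-bound theory; clean ratios would argue against it.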

The Observer Effect

Instrumentation-based profilers, which insert probes at function entry and exit rather than sampling, face a more severe version of the observer effect. Adding probes to a tight inner loop can slow it down by 2x to 10x, which changes branch predictor state, disrupts cache line behavior, and in the worst case alters the fundamental timing relationships between threads. The profiler is not measuring the system; it is measuring a modified version of the system that only approximates the original.
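The cost of entry and exit probes is easy to reproduce in miniature. The sketch below uses Python's sys.setprofile hook as a stand-in for an instrumenting profiler and times a loop of tiny function calls with and without it. The function names are hypothetical, and the slowdown it prints depends on the interpreter and machine.

```python
import sys
import time


def tiny(x):
    """A function small enough that probe cost dominates its own work."""
    return x + 1


def workload(n):
    total = 0
    for _ in range(n):
        total = tiny(total)
    return total


def timed(fn, *args):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


probe_hits = {"tiny_calls": 0}


def probe(frame, event, arg):
    """Entry/exit hook, standing in for an inserted instrumentation probe."""
    if event == "call" and frame.f_code is tiny.__code__:
        probe_hits["tiny_calls"] += 1


N = 200_000
plain_result, plain_s = timed(workload, N)

sys.setprofile(probe)
try:
    probed_result, probed_s = timed(workload, N)
finally:
    sys.setprofile(None)  # always remove the hook again

print(f"plain={plain_s * 1000:.1f} ms  probed={probed_s * 1000:.1f} ms  "
      f"slowdown={probed_s / plain_s:.1f}x  calls seen={probe_hits['tiny_calls']}")
```

The result of the computation is identical in both runs; only its timing changes, which is exactly what makes the distortion easy to overlook.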

Sampling profilers have lower overhead but are not neutral. They also tend to be used differently in production versus development. A profiler attached to a production service is typically configured conservatively: low sample rate, limited stack depth, perhaps only sampling on request. The configuration that avoids perturbing production is often precisely the configuration that misses the short-lived spike causing the p99 latency issue.

JIT inlining adds another layer of attribution complexity. When the JVM inlines a method into its caller, the inlined code disappears as a separate stack frame. Depending on how the profiler reconstructs frames from JIT metadata, the inlined method may be attributed to its parent, to a synthetic frame, or simply lost. The most obvious optimization candidates, small frequently called methods, are also the ones the JIT inlines most aggressively. async-profiler can reconstruct inlined frames from JIT compilation metadata when the JVM records debug info beyond safepoints (for example with -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints), but the reconstruction is incomplete in some compiler states.

Building a Complete Picture

The consistent finding from systematic performance work, including the kind Red Hat’s team does on OpenJDK and RHEL workloads, is that hard problems require correlating data across multiple tools. No single profiler sees the whole system.

A practical workflow for a Java service on Linux:

  1. Start with async-profiler in wall-clock mode to see where time goes including blocked threads
  2. Switch to CPU mode to identify on-CPU hotspots in isolation from off-CPU noise
  3. Run offcputime concurrently to see what blocked threads are waiting for at the OS level
  4. Check jstat and GC logs to account for pause time that profilers cannot attribute to application code
  5. Use perf stat for hardware counter validation on specific theories before investing in optimization

For native workloads, perf record --call-graph dwarf provides accurate stack traces including inlined frames, and perf c2c detects false sharing in multi-threaded code. bpftrace allows writing targeted one-liners that trace specific kernel events with userspace call stacks attached, without the overhead of recording everything.

# False sharing detection for multi-threaded native code
perf c2c record -g -- <binary>
perf c2c report --stdio

# bpftrace: count futex syscalls by userspace stack, a rough proxy for lock contention
bpftrace -e 'tracepoint:syscalls:sys_enter_futex
    /comm == "myapp"/ { @[ustack()] = count(); }'

A profiler result showing nothing interesting is not evidence that the code is fast. It may mean the bottleneck lives outside what that tool can measure. Understanding a profiler’s scope, what it samples, at what frequency, and what it cannot see, matters as much as knowing how to read the output it produces.
