Sampling Profilers, Safepoint Bias, and the Case for Off-CPU Analysis
Source: lobsters
When performance is bad, the first tool most engineers reach for is a profiler. Run it, generate a flame graph, find the widest frame, and fix the hot path. The workflow works often enough that many teams stop there, treating a clean-looking profile as evidence that things are fine.
That inference is where the trouble starts. The Red Hat performance engineering team documents this gap in detail: a profiler is accurate within its measurement domain, but entirely silent outside it. Knowing where that boundary sits is a prerequisite to interpreting any profile correctly.
What On-CPU Profilers Actually Measure
Most profilers, whether you’re using perf, async-profiler, or a commercial JVM tool, work by periodically sampling a thread’s call stack. The sampling can be timer-driven (every N milliseconds via SIGPROF) or event-driven (every N CPU cycles via a hardware performance monitoring unit counter). Either way, the profiler only records a sample when a thread is running on a processor.
This means the profiler captures on-CPU time: the time a thread spends actively computing. It discards everything else. If a thread is waiting for a network response, blocked on a mutex, sleeping, or sitting in the kernel’s run queue waiting to be scheduled, it produces zero samples. The flame graph has nothing to show for that time.
The consequences are concrete. A service under moderate load with a mix of I/O operations and shared state might spend 40% of its wall-clock latency in blocked states. The CPU flame graph for that service will look either normal or diffuse, with no obvious hot path, while every request takes roughly 1.7 times as long as its compute alone would require. The profiler gave accurate data about what it measured; it just measured the wrong thing for the question being asked.
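The gap between on-CPU time and wall-clock time is easy to see from inside a process. The following is a minimal Python sketch, not a profiler: `handle_request` is a hypothetical handler, and `time.sleep` stands in for a network call or lock wait. It compares the process CPU clock with the wall clock around the same call.

```python
import time

def measure(fn):
    """Return (wall_seconds, cpu_seconds) consumed by this process while running fn."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - wall0, time.process_time() - cpu0

def handle_request():
    """Hypothetical request handler: a little compute, then a blocking wait."""
    sum(i * i for i in range(200_000))  # on-CPU work: visible to a CPU profiler
    time.sleep(0.2)                     # blocked: produces no on-CPU samples

wall, cpu = measure(handle_request)
off_cpu = wall - cpu
print(f"wall={wall:.3f}s cpu={cpu:.3f}s off-CPU share={off_cpu / wall:.0%}")
```

A CPU-only sampler attached to this process would attribute essentially all of its samples to the generator expression, even though most of the elapsed time is spent in the sleep.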
Safepoint Bias in the JVM
JVM applications have a second layer of profiling complexity on top of the on-CPU/off-CPU problem: safepoint bias. It’s less well understood, and it can make a profiler’s output actively misleading rather than merely incomplete.
The JVM needs to pause all threads periodically for garbage collection and other internal operations. It can only do this at safepoints, locations in the bytecode where the runtime has enough state to safely suspend a thread. Safepoints are not uniformly distributed. A method that calls other JVM methods hits safepoints frequently. A tight numerical loop with no method calls can execute thousands of iterations between safepoints.
Traditional JVM profilers collect stack traces by requesting that threads stop at their next safepoint. The profiler then samples whichever thread reached a safepoint first. This introduces systematic bias: code that executes in safepoint-dense regions gets over-sampled, and code in safepoint-sparse regions gets under-sampled or missed entirely. The distortion can be large enough to make a genuinely hot function nearly invisible in the output.
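The mechanism can be illustrated with a toy simulation. This is a sketch under simplifying assumptions, not a model of HotSpot: the trace, the method names `hot_loop` and `helper`, and the tick granularity are all hypothetical. One sampler records whatever is running when the timer fires; the other defers each sample to the next tick that has a safepoint poll.

```python
from collections import Counter

# Hypothetical execution trace: (method, duration_in_ticks, has_safepoint_polls).
# hot_loop is a tight loop with no polls; helper polls on every tick.
TRACE = [("hot_loop", 90, False), ("helper", 10, True)] * 100

def expand(trace):
    """Flatten the trace into one (method, has_safepoint) entry per tick."""
    return [(method, sp) for method, ticks, sp in trace for _ in range(ticks)]

def sample_unbiased(timeline, interval):
    """Record whichever method is running at each timer tick."""
    return Counter(timeline[t][0] for t in range(0, len(timeline), interval))

def sample_at_safepoints(timeline, interval):
    """Defer each timer tick to the next tick that has a safepoint poll."""
    counts = Counter()
    for t in range(0, len(timeline), interval):
        while t < len(timeline) and not timeline[t][1]:
            t += 1  # thread keeps running until it reaches a safepoint
        if t < len(timeline):
            counts[timeline[t][0]] += 1
    return counts

timeline = expand(TRACE)
unbiased = sample_unbiased(timeline, interval=7)
biased = sample_at_safepoints(timeline, interval=7)
print("unbiased:        ", dict(unbiased))
print("safepoint-biased:", dict(biased))
```

In this toy trace the unbiased sampler attributes about 90% of samples to `hot_loop`, matching its share of execution time, while the safepoint-deferred sampler never records it at all: every deferred sample lands in `helper`.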
Evaluating the Accuracy of Java Profilers, a 2010 PLDI paper by Mytkowicz, Diwan, Hauswirth, and Sweeney, quantified this effect: widely used Java profilers frequently disagreed with one another about which methods were hot, and the authors traced the disagreement to sampling only at safepoints. The problem was well-documented and largely unresolved for years in the Java ecosystem.
async-profiler addresses it by combining the Linux perf_events API with HotSpot’s AsyncGetCallTrace interface. perf_events interrupts threads at arbitrary execution points, not at safepoint boundaries, and AsyncGetCallTrace lets the profiler walk the Java stack from inside the signal handler. The difference in output between a safepoint-biased profile and a perf_events profile can be substantial; functions that were barely visible in a traditional profile sometimes dominate the async-profiler output because the traditional profiler was sampling around them.
Using async-profiler in CPU mode against a running JVM process:
./asprof -d 30 -e cpu -f /tmp/cpu_profile.html <pid>
And in wall-clock mode, which captures both on-CPU execution and blocked time:
./asprof -d 30 -e wall -f /tmp/wall_profile.html <pid>
Wall-clock mode is the more useful setting when latency is the concern rather than throughput, because it shows the full distribution of where time goes, including threads blocked on I/O and lock acquisition. The flame graph in wall-clock mode will include frames for blocking operations, which is precisely the data that on-CPU profiling discards.
For production JVM services, JDK Flight Recorder (JFR) provides a built-in alternative that captures CPU samples alongside explicit events for lock acquisition, file I/O, socket I/O, and garbage collection pauses in a single recording. JFR’s lock profiling in particular is worth enabling; it records the stack traces of threads waiting to acquire monitors, which gives a direct view of contention that an on-CPU sampling profiler cannot provide.
Off-CPU Analysis for Native Code
For native applications and kernel-level analysis, the equivalent of wall-clock profiling is off-CPU tracing with eBPF tools. The offcputime tool from BCC instruments the kernel scheduler’s context-switch path (the BCC version uses a kprobe on finish_task_switch rather than the sched_switch tracepoint) and records how long each thread spends not running, grouped by the call stack at the moment the thread went off-CPU.
offcputime-bpfcc -p <pid> 30
The output is a set of stack traces annotated with accumulated off-CPU time. Run with the -f flag, offcputime emits folded stacks that Brendan Gregg’s flamegraph.pl can render as an off-CPU flame graph, where width represents blocking time rather than compute time. Where a CPU flame graph shows which functions consume cycles, an off-CPU flame graph shows which call paths lead to blocking, and for how long.
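Folded stacks are also easy to post-process without rendering a flame graph. The sketch below assumes the folded format is one line per stack, with semicolon-separated frames followed by a space and a microsecond total. The sample data and the function name `blocked_time_by_leaf_caller` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample in folded format: "frame;frame;...;frame <microseconds>"
SAMPLE = """\
myapp;main;handle_request;read_config;read 1200
myapp;main;handle_request;db_query;recv 48000
myapp;main;handle_request;cache_get;pthread_mutex_lock 9500
myapp;main;handle_request;db_query;recv 52000
"""

def blocked_time_by_leaf_caller(folded, depth=2):
    """Sum off-CPU microseconds by the last `depth` frames of each stack."""
    totals = defaultdict(int)
    for line in folded.splitlines():
        stack, _, value = line.rpartition(" ")
        if not stack:
            continue  # skip blank or malformed lines
        key = ";".join(stack.split(";")[-depth:])
        totals[key] += int(value)
    # Largest blocking contributors first
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

for path, micros in blocked_time_by_leaf_caller(SAMPLE):
    print(f"{micros / 1000:8.1f} ms  {path}")
```

Grouping by the last two frames keeps the blocking primitive together with its immediate caller, which is usually enough to tell a slow database read from a contended lock.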
bpftrace provides a lower-level way to capture the same data with more flexibility over what gets recorded:
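bpftrace -e '
// Thread leaves the CPU in a non-runnable state: record when and where.
// prev_pid is a thread ID, so $1 must be the TID of the thread of interest.
tracepoint:sched:sched_switch
/args->prev_state != 0 && args->prev_pid == $1/
{
  @start[args->prev_pid] = nsecs;
  @stack[args->prev_pid] = ustack();
}
// Thread is woken: histogram the blocked time, keyed by the stack that blocked.
// Run-queue wait after the wakeup is not included.
tracepoint:sched:sched_wakeup
/args->pid == $1 && @start[args->pid]/
{
  @offcpu_ns[@stack[args->pid]] = hist(nsecs - @start[args->pid]);
  delete(@start[args->pid]);
  delete(@stack[args->pid]);
}' <pid>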
The constraint with eBPF-based tracing is that it requires kernel-level access, typically CAP_SYS_ADMIN or the CAP_BPF and CAP_PERFMON capabilities introduced in Linux 5.8. In container environments or managed Kubernetes clusters, this access is often unavailable, which means many teams never reach this layer of analysis even when the bottleneck is exactly here.
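Whether a process holds these capabilities can be checked before trying to attach a tracer. The sketch below decodes the `CapEff` mask from `/proc/self/status` on Linux. The bit numbers come from `linux/capability.h`; the function names are hypothetical. Holding the capabilities is necessary but may not be sufficient, since seccomp profiles or kernel lockdown can still block eBPF.

```python
# Capability bit numbers from <linux/capability.h>
CAP_SYS_ADMIN, CAP_PERFMON, CAP_BPF = 21, 38, 39

def can_run_ebpf_tracing(cap_eff: int) -> bool:
    """True if the mask has CAP_SYS_ADMIN, or both CAP_BPF and CAP_PERFMON."""
    def has(bit):
        return bool((cap_eff >> bit) & 1)
    return has(CAP_SYS_ADMIN) or (has(CAP_BPF) and has(CAP_PERFMON))

def effective_caps(status_path="/proc/self/status") -> int:
    """Read the effective capability mask of the current process (Linux only)."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("CapEff:"):
                return int(line.split()[1], 16)
    raise RuntimeError("CapEff not found in " + status_path)

# Example masks; the second is an assumption about a typical restricted
# container capability set and contains none of the three capabilities.
print(can_run_ebpf_tracing((1 << CAP_BPF) | (1 << CAP_PERFMON)))
print(can_run_ebpf_tracing(0x00000000A80425FB))
```

On a Linux host, `can_run_ebpf_tracing(effective_caps())` gives a quick answer for the current process.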
Combining the Views
The practical workflow for performance investigation looks roughly like this:
- Collect a CPU flame graph. Check whether CPU utilization is near 100% during the slow period. If it is, the flame graph is probably showing the real problem.
- If CPU is moderate or low while latency is high, the bottleneck is likely off-CPU. Check vmstat 1 for context switch rate (cs) and I/O wait percentage (wa). High cs often indicates lock contention; high wa indicates I/O blocking.
- Switch to wall-clock profiling for JVM applications or off-CPU tracing for native code. The new profile will show the blocking structure that the CPU profile omitted.
- For JVM applications, add JFR lock profiling to identify the specific monitors driving contention.
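The first two steps of the checklist above can be sketched as a small triage helper that reads one row of vmstat output. The thresholds (`busy`, `wa_high`, `cs_high`) are hypothetical placeholders and need calibrating against a baseline for the host in question. The sample row is invented for illustration.

```python
def parse_vmstat(header: str, row: str) -> dict:
    """Map vmstat column names to integer values using the header row."""
    return dict(zip(header.split(), map(int, row.split())))

def next_step(stats, busy=90, wa_high=10, cs_high=50_000):
    """Suggest the next profiling step. All thresholds are assumptions."""
    cpu_busy = stats["us"] + stats["sy"]
    if cpu_busy >= busy:
        return "CPU-bound: the CPU flame graph is probably showing the real problem"
    if stats["wa"] >= wa_high:
        return "off-CPU or wall-clock profile: suspect I/O blocking"
    if stats["cs"] >= cs_high:
        return "off-CPU or wall-clock profile: suspect lock contention"
    return "off-CPU or wall-clock profile: cause not obvious from vmstat"

# Hypothetical vmstat sample: moderate CPU, low I/O wait, very high context switching
HEADER = "r b swpd free buff cache si so bi bo in cs us sy id wa st"
ROW = "2 0 0 812345 10240 204800 0 0 5 12 9000 84000 22 9 67 2 0"

print(next_step(parse_vmstat(HEADER, ROW)))
```

Parsing by header name rather than column position keeps the helper tolerant of vmstat versions that add or reorder columns.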
Brendan Gregg’s Linux Performance page maps the full observability tool stack against the layers of the system, which is a useful reference for identifying which tool addresses which measurement domain.
The Interpretation Problem
A flame graph with no obvious hot path is not evidence that a system is well-optimized. It may mean that CPU time is diffuse, or it may mean that the profiler’s measurement domain doesn’t cover where the time is going. A flame graph with a clear hot path is useful, but the percentage of CPU samples that a function holds does not directly translate to the percentage of wall-clock latency that fixing it would recover. If 60% of request time is spent off-CPU, even eliminating the hottest on-CPU function will only improve total latency by a fraction of what the flame graph implied.
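The arithmetic behind that last point is a form of Amdahl’s law. The sketch below assumes a single-threaded request whose CPU samples map proportionally onto its on-CPU time. The function name and the example numbers are hypothetical.

```python
def latency_saved(on_cpu_fraction, sample_share, speedup=float("inf")):
    """Fraction of wall-clock latency removed by speeding up one function.

    on_cpu_fraction: share of request wall time spent on-CPU (0..1)
    sample_share:    share of CPU samples attributed to the function (0..1)
    speedup:         how much faster the function becomes (inf = eliminated)
    """
    share_of_wall = on_cpu_fraction * sample_share
    return share_of_wall * (1 - 1 / speedup)

# 60% of request time is off-CPU, and the hottest frame holds half the CPU samples.
eliminated = latency_saved(on_cpu_fraction=0.4, sample_share=0.5)
doubled = latency_saved(on_cpu_fraction=0.4, sample_share=0.5, speedup=2)
print(f"eliminate the function: {eliminated:.0%} of latency recovered")
print(f"make it 2x faster:      {doubled:.0%} of latency recovered")
```

A frame that spans half the flame graph accounts for only a fifth of request latency in this scenario, and a realistic 2x optimization recovers a tenth.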
Profilers are precise tools with specific scopes. Using them well means knowing what each mode measures, what it excludes by design, and when the excluded data is where the problem lives. The on-CPU profile is the starting point; for latency-sensitive systems, it is rarely the complete answer.