Performance work lives and dies by measurement. The profiler is the first tool engineers reach for when something is slow, and most of the time it gives useful signal. But a recent Red Hat performance engineering post makes a point worth sitting with: profilers have structural limitations that are easy to overlook, and those limitations can make slow code look fast, make hot paths look cold, and lead you to optimize things that don’t matter.
This is not a critique of profilers. It’s a reminder that every measurement tool encodes assumptions, and understanding those assumptions is what separates a productive profiling session from a frustrating one.
How Sampling Profilers Work
Most profilers in common use are sampling profilers. At fixed intervals, they interrupt the running process, capture the current call stack, and record it. After enough samples, the aggregate shows which functions appear most often, which is treated as a proxy for where the program spends its time.
The interval matters. Linux perf defaults to around 4000 samples per second when using hardware performance counters, or roughly once every 250 microseconds. async-profiler defaults to 10ms intervals. These intervals are short enough to catch functions that run for milliseconds, but any single 50-microsecond call is far more likely to fall between two samples than on one. Such a function shows up only in aggregate, in proportion to its total CPU time. If it is called rarely, or the profiling run is short, it may appear in only a handful of samples, or none at all.
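The arithmetic behind this is worth making explicit. The sketch below assumes samples land uniformly over on-CPU time, so the expected sample count is just time spent in the function multiplied by the sampling rate. The function name and the call counts are made up for illustration.

```python
def expected_samples(call_duration_s: float, calls: int, sample_hz: float) -> float:
    """Expected number of samples that land inside a function.

    Assumes samples are spread uniformly over on-CPU time, so the expected
    count is total time in the function multiplied by the sampling rate.
    """
    return call_duration_s * calls * sample_hz


PERF_HZ = 4000.0           # roughly one sample every 250 microseconds
TEN_MS_INTERVAL_HZ = 100.0  # one sample every 10 ms

# A 50-microsecond function called only 20 times during the run.
rare_fine = expected_samples(50e-6, 20, PERF_HZ)
rare_coarse = expected_samples(50e-6, 20, TEN_MS_INTERVAL_HZ)

# The same function called 20,000 times: one full second of CPU time.
busy_coarse = expected_samples(50e-6, 20_000, TEN_MS_INTERVAL_HZ)

print(f"rare @ 4000 Hz: {rare_fine:.2f} expected samples")
print(f"rare @ 100 Hz:  {rare_coarse:.2f} expected samples")
print(f"busy @ 100 Hz:  {busy_coarse:.2f} expected samples")
```

An expected count well below one means most runs will never see the function at all, which is a sampling limit and not evidence that the function is free.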
The sampling mechanism also matters. On Linux, perf record can use the perf_event_open syscall with hardware counters like CPU cycles or instructions retired. This ties sampling to actual hardware events rather than wall time, which has its own implications for what you see.
Safepoint Bias in the JVM
The JVM’s built-in profiling infrastructure, exposed through JVMTI, has a well-documented problem: it can only capture stack traces at safepoints. A safepoint is a point in execution where the JVM can pause all threads for garbage collection or other internal operations. Not every bytecode instruction is a safepoint candidate, and JIT-compiled code may execute long stretches between safepoints.
The practical consequence is that JVMTI-based profilers, including older versions of JVisualVM and some commercial profilers, systematically undersample code that runs between safepoints. Tight loops over primitive arrays, for instance, may have very few safepoints and will appear to take less CPU time than they actually do. The profiler isn’t exactly lying; it can only observe a biased subset of execution states.
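A toy model makes the bias concrete. This is not how any JVM is implemented; it is a deliberately extreme simulation under made-up assumptions: a thread alternates between a 900-tick loop with no safepoint polls and a 100-tick method that can be stopped at any tick, and a biased profiler attributes each sample to whatever is running at the next safepoint.

```python
from collections import Counter

# Hypothetical workload: each 1000-tick cycle spends 900 ticks in a tight
# loop with no safepoint polls, then 100 ticks in a method that can be
# stopped at any tick.
CYCLE = 1000
LOOP_TICKS = 900


def frame_at(tick: int) -> str:
    """Which frame is executing at a given tick."""
    return "hot_loop" if tick % CYCLE < LOOP_TICKS else "small_call"


def next_safepoint(tick: int) -> int:
    """First tick at or after `tick` where the thread can be stopped."""
    offset = tick % CYCLE
    return tick if offset >= LOOP_TICKS else tick - offset + LOOP_TICKS


def profile(sample_ticks, biased: bool) -> Counter:
    """Count samples per frame, optionally delaying each to a safepoint."""
    counts = Counter()
    for t in sample_ticks:
        observed = next_safepoint(t) if biased else t
        counts[frame_at(observed)] += 1
    return counts


# Sample every 37 ticks so sample points drift across the cycle.
samples = range(0, 100 * CYCLE, 37)
unbiased = profile(samples, biased=False)
biased = profile(samples, biased=True)

print("unbiased:", dict(unbiased))
print("biased:  ", dict(biased))
```

In this model the loop that consumes 90% of the time vanishes from the biased profile entirely, and the small method after it absorbs every sample. Real profiles are less extreme, but the direction of the error is the same.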
This was described in depth by Nitsan Wakart in his safepoint bias analysis, and it explains why async-profiler was built. Andrei Pangin’s async-profiler uses AsyncGetCallTrace, an undocumented but stable JVM API, to capture stack traces at arbitrary points independent of safepoints. The resulting profiles are significantly more accurate for CPU-bound JVM code.
Java Flight Recorder (JFR), available in OpenJDK since Java 11, uses a similar approach and adds very low overhead: typically under 1% for most workloads. It also records a broader range of events beyond CPU sampling, including lock contention, GC pauses, and I/O. For JVM profiling, JFR is the baseline to start from.
The Off-CPU Blind Spot
CPU sampling tells you where the program spends time on-CPU. It tells you almost nothing about where it spends time off-CPU.
Off-CPU time is any time a thread is not running: waiting on I/O, blocked on a mutex or semaphore, sleeping, waiting for a network response, waiting for a page fault to resolve. In many real-world services, off-CPU time dominates. A web service that processes a request in 5ms of CPU work but waits 95ms for a database query will show a completely wrong picture in a CPU-only profile. The profiler will point at your serialization code or your business logic, and you will optimize those things while the actual latency is sitting in a blocking database call.
Brendan Gregg coined the term off-CPU analysis and built tooling around it, primarily using Linux tracing infrastructure. The key insight is that to see off-CPU time you need to trace scheduler events, not sample the CPU. When a thread is descheduled, record the stack; when it’s rescheduled, record the duration it was off-CPU. The resulting flame graph shows not where CPU time is spent but where wall-clock time is spent, including all blocking.
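The bookkeeping itself is simple. Here is a minimal sketch of that accounting, assuming a made-up event format of (timestamp, thread id, kind, stack) tuples; real tools read these from kernel scheduler tracepoints and do the aggregation in the kernel.

```python
from collections import defaultdict


def off_cpu_by_stack(events):
    """Sum off-CPU time per blocking stack.

    `events` is an iterable of (timestamp_ms, thread_id, kind, stack) tuples,
    a hypothetical format for illustration. kind is "switch_out" (the thread
    leaves the CPU; stack is where it blocked) or "switch_in" (it resumes).
    """
    blocked_since = {}           # thread_id -> (timestamp, stack)
    totals = defaultdict(float)  # stack -> total ms spent off-CPU
    for ts, tid, kind, stack in events:
        if kind == "switch_out":
            blocked_since[tid] = (ts, stack)
        elif kind == "switch_in" and tid in blocked_since:
            start, blocked_stack = blocked_since.pop(tid)
            totals[blocked_stack] += ts - start
    return dict(totals)


# One request: 3 ms of CPU, 95 ms blocked on the database, 2 ms of CPU.
events = [
    (0.0, 1, "switch_in", None),
    (3.0, 1, "switch_out", "handle_request;query_db;socket_read"),
    (98.0, 1, "switch_in", None),
    (100.0, 1, "switch_out", "handle_request;write_response"),
]
totals = off_cpu_by_stack(events)
print(totals)
```

The final switch-out never resumes within the trace, so it contributes nothing here. A real tool has to decide how to treat threads that are still blocked when tracing stops.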
With eBPF, this kind of tracing is now practical in production. Tools like bpftrace and the scripts in BCC can attach to scheduler tracepoints and collect off-CPU stacks with low overhead. The offcputime BCC script is a direct implementation of this:
# Sample off-CPU time for PID 1234 for 30 seconds
sudo offcputime-bpfcc -p 1234 30
For Java, async-profiler supports wall-clock profiling mode (-e wall) which samples all threads, sleeping or not, giving a picture closer to actual latency than CPU-only sampling.
Stack Unwinding and Native Code
On x86-64 Linux, stack unwinding relies on frame pointers or DWARF debug information. For a long time, many distributions compiled system libraries without frame pointers (the -fomit-frame-pointer optimization frees up a general-purpose register), which means perf cannot always reconstruct full call stacks.
The practical symptom is seeing [unknown] frames in flame graphs, or call stacks that appear truncated. If you profile a Java application that calls into native code, or a C++ application that uses shared libraries compiled without frame pointers, you may see significant portions of hot paths attributed to unknown frames.
perf record --call-graph dwarf uses DWARF unwinding, which works even without frame pointers but adds overhead and produces larger data files. Distributions have been moving back toward enabling frame pointers by default: Fedora 38 and Ubuntu 24.04 both ship system packages compiled with frame pointers enabled. This is a meaningful improvement for profiling system-level code.
For JVM applications, async-profiler handles stack unwinding differently: it uses JVM internals to walk Java frames and combines them with native frames, resolving symbols itself. Plain perf, by contrast, needs a symbol map for JIT-compiled code, for example one produced by perf-map-agent. Getting accurate mixed-mode profiles that show both Java and native frames in the same flame graph is possible but requires some setup.
The Observer Effect
Every profiler perturbs the system it observes. The perturbation ranges from negligible to significant depending on the mechanism.
Instrumentation-based profilers, which insert code at every function entry and exit, can add 10-30% overhead or more. This changes the relative timing between functions, which changes what the profiler shows. Functions that are small enough to be inlined at runtime may not be inlined under profiler instrumentation. Lock contention patterns change when threads run slower. The profile you see reflects a system that doesn’t behave like your production system.
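A back-of-the-envelope calculation shows how a fixed per-call cost skews relative timing. The workload and the 0.2-microsecond probe cost below are assumptions chosen for illustration, not measurements of any particular profiler.

```python
def measured_share(calls_a, cost_a_us, calls_b, cost_b_us, probe_us):
    """Share of measured time attributed to A when every call pays probe_us."""
    time_a = calls_a * (cost_a_us + probe_us)
    time_b = calls_b * (cost_b_us + probe_us)
    return time_a / (time_a + time_b)


# Hypothetical workload: A is tiny and called a million times (0.1 us each),
# B is large and called 100 times (1000 us each). Both really take 100 ms.
true_share = measured_share(1_000_000, 0.1, 100, 1000.0, probe_us=0.0)

# Assumed instrumentation cost of 0.2 us per call on entry and exit combined.
instrumented_share = measured_share(1_000_000, 0.1, 100, 1000.0, probe_us=0.2)

print(f"A's true share of time:         {true_share:.1%}")
print(f"A's share under instrumentation: {instrumented_share:.1%}")
```

The per-call cost is the same for both functions, but it is paid a million times by one and a hundred times by the other, so the small, frequently called function looks far more expensive than it is.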
Sampling profilers are generally lower overhead, but not zero. The signal delivery mechanism on Linux (SIGPROF for some profilers) can interfere with certain kinds of I/O and adds latency variability. JFR is designed to be low overhead specifically because it’s meant for continuous profiling in production, and Red Hat has contributed work to keep its overhead under control across JDK releases.
The more important observer effect is subtler: profiling changes what code the JIT compiles and how it compiles it. The JVM makes optimization decisions based on runtime feedback. Running under a profiler for a warmup period before measuring may give the JIT enough feedback to produce optimized code. Running straight into measurement may not. This affects which version of a function you’re actually profiling.
Reading Flame Graphs Correctly
Flame graphs, introduced by Brendan Gregg and now the standard visualization for profiling data, show call stacks with width proportional to sample count. The width of a frame represents how often it appeared on-CPU (or off-CPU, for off-CPU flame graphs). A wide frame at the top of a stack, with no children, is where time is actually being spent.
Two common misreadings are worth calling out. First, a function that appears wide but has children that together account for its full width is not itself slow; it’s just a common caller of slow things. The useful target is the wide leaf frames. Second, the height of the stack (how many frames are stacked) tells you about call depth, not about how much time is spent.
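The distinction between a frame’s total width and its own time can be computed directly from folded stacks, the semicolon-separated text format that flame graph tooling consumes. This is a small sketch with invented frame names and sample counts, not a replacement for a real flame graph tool.

```python
from collections import Counter


def self_and_total(folded_lines):
    """Return (self_counts, total_counts) per frame from folded stacks.

    Each line looks like "main;handle;parse 42": frames from root to leaf
    separated by ';', then a space and the sample count.
    """
    self_counts, total_counts = Counter(), Counter()
    for line in folded_lines:
        stack, count = line.rsplit(" ", 1)
        frames = stack.split(";")
        n = int(count)
        self_counts[frames[-1]] += n   # the leaf frame was actually executing
        for frame in set(frames):      # count a recursive frame once per stack
            total_counts[frame] += n
    return self_counts, total_counts


# Invented sample data for illustration.
folded = [
    "main;handle;parse 60",
    "main;handle;serialize 30",
    "main;handle 5",
    "main;gc 5",
]
self_counts, total_counts = self_and_total(folded)

for frame in ("main", "handle", "parse", "serialize", "gc"):
    print(f"{frame:10s} total={total_counts[frame]:3d} self={self_counts[frame]:3d}")
```

Here handle spans 95 of 100 samples but executes in only 5 of them. The time is in parse and serialize, which is exactly what the wide leaf frames in the graph would show.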
Flame graphs also don’t show time ordering. A function that runs in two separate hot phases, once early and once late, looks the same as one that runs continuously. For latency investigations this matters: a function that runs quickly most of the time but occasionally takes 10x longer will look fine in a flame graph, because the graph aggregates samples across all invocations and the rare slow calls add little to the total.
For latency percentile analysis, you need something different: histograms, or tools like HdrHistogram that capture the full distribution. Gil Tene’s work on coordinated omission in benchmarking is directly relevant here: when you measure only the cases you happened to observe, you miss the long tail, and the long tail is often where your latency problem lives.
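To see why an aggregate hides the tail, compare the mean with the percentiles on a small synthetic data set. The latencies below are invented, and the percentile function is a plain nearest-rank calculation, not HdrHistogram’s API.

```python
def percentile(values, p: int):
    """Nearest-rank percentile for an integer p in 1..100.

    Returns the smallest value such that at least p% of the data is at or
    below it.
    """
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]


# Hypothetical latencies in ms: 980 fast requests, 20 that take 10x longer.
latencies = [2.0] * 980 + [20.0] * 20

mean = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)

print(f"mean={mean:.2f} ms  p50={p50:.1f} ms  p99={p99:.1f} ms")
```

The mean sits close to the fast case and the median is identical to it, so only the high percentile reveals that one request in fifty is ten times slower.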
What to Actually Do
None of this means profilers are unreliable. It means you need to be deliberate about what question you’re asking and whether your profiling tool can answer it.
If you’re investigating CPU throughput on the JVM, use async-profiler or JFR in CPU sampling mode. Avoid JVMTI-based profilers for anything where tight loops matter.
If you’re investigating request latency, start with wall-clock profiling, not CPU profiling. Async-profiler’s -e wall mode samples all threads and gives a view of where time is actually going, including blocking.
If you suspect lock contention or I/O as the bottleneck, profile off-CPU time using eBPF tools or async-profiler’s lock profiling events. JFR’s lock event recording is also useful here and is available without any external tooling.
For native code on Linux, make sure you’re collecting full stack traces. Check whether your libraries have frame pointers or DWARF info available. Consider using perf record --call-graph fp if frame pointers are present, or --call-graph dwarf if not.
And when a profiler shows you a surprising result, or no result at all in a place you expected to find one, consider whether the tool’s measurement model can see what you’re looking for. The profiler is not always wrong. But understanding when it might be is what makes the difference between performance engineering and performance guessing.