Profiling native code has always involved an awkward trade-off between power and accessibility. The powerful tools, perf on Linux and Instruments on macOS, require significant setup, platform-specific knowledge, and in the case of Instruments, an entire Xcode installation. The accessible tools, like cargo-flamegraph, produce static SVG output that is useful but limited. samply closes that gap by pairing serious kernel-level sampling with the Firefox Profiler’s web UI, and the result is worth understanding both as a tool and as an architectural decision.
What samply actually does
The workflow is intentionally minimal. You install samply via cargo install samply, then prefix any command with samply record:
samply record ./my-binary --args
For Rust projects, the cargo integration goes further:
cargo samply run --release
cargo samply bench my_benchmark
cargo samply test expensive_test
When the program finishes (or you interrupt it), samply starts a local HTTP server, opens profiler.firefox.com in your browser pointed at that server, and you immediately have an interactive flame graph, call tree, timeline with thread swimlanes, and marker track. No post-processing step. No format conversion. No manual symbol loading.
The key design decision underlying all of this is that samply emits the Firefox Profiler JSON format (also called the Gecko Profile Format) and delegates the entire visualization layer to profiler.firefox.com, which is a mature, open-source React application Mozilla has been developing for years to profile Firefox itself.
The Firefox Profiler as a general-purpose UI
Markus Stange, samply’s author, spent years at Mozilla working on the Gecko Profiler and the Firefox Profiler web application. His insight was that profiler.firefox.com is genuinely the best interactive profiling UI available for native code, and that nothing in its design is actually Firefox-specific. The format supports multiple threads, multiple processes, user-defined markers, hardware counters, and inlined frame expansion. It compresses well. It has a real symbol resolution protocol. Making it available for arbitrary native binaries just required building the sampling and symbolication backend.
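The Firefox Profiler JSON format uses a columnar (struct-of-arrays) layout that is efficient to parse in JavaScript and compresses very well. Each thread’s data is stored as parallel arrays for stack indices, timestamps, frame locations, and string references, so row i of a table is the i-th element of each of its columns. A simplified sketch of the shape (the real format carries more tables and fields):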
{
  "threads": [{
    "name": "Main Thread",
    "samples": {
      "stack": [0, 1, 2, 1, 3],
      "time": [0.0, 1.0, 2.0, 3.0, 4.0]
    },
    "stackTable": {
      "prefix": [null, 0, 1, 0],
      "frame": [0, 1, 2, 3]
    },
    "stringTable": ["my_func", "inner_func", "hot_path", "other_func"]
  }],
  "libs": [{ "name": "my-binary", "start": "0x...", "debugPath": "..." }]
}
Because samply speaks this format natively, profiles can be uploaded to profiler.firefox.com and shared via URL, or saved to disk with --save-only for CI environments. Any tool that writes this format gets the full Firefox Profiler UI for free.
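To make the columnar layout concrete, here is a minimal Rust sketch of how a viewer could turn one sample's stack index back into a call stack by following the prefix links. It mirrors the simplified JSON above, so it assumes that each frame entry indexes directly into the string table; the real format goes through additional frame and function tables. The names ThreadTables, resolve_stack, and example_tables are illustrative and not part of any samply crate.

```rust
/// Simplified per-thread tables, mirroring the trimmed-down JSON example.
/// Assumption: `frame` indexes straight into `strings`; the real format
/// routes through additional frame and function tables.
pub struct ThreadTables {
    /// For each stack node, the index of its parent node (None = root).
    pub prefix: Vec<Option<usize>>,
    /// For each stack node, the frame executing at that depth.
    pub frame: Vec<usize>,
    /// Function names, referenced by frame index in this sketch.
    pub strings: Vec<&'static str>,
}

impl ThreadTables {
    /// Walks the prefix links from a sample's stack node up to the root
    /// and returns the call stack ordered root first, leaf last.
    pub fn resolve_stack(&self, stack_index: usize) -> Vec<&'static str> {
        let mut names = Vec::new();
        let mut current = Some(stack_index);
        while let Some(node) = current {
            names.push(self.strings[self.frame[node]]);
            current = self.prefix[node];
        }
        names.reverse();
        names
    }
}

/// Builds the tables from the example JSON in this section.
pub fn example_tables() -> ThreadTables {
    ThreadTables {
        prefix: vec![None, Some(0), Some(1), Some(0)],
        frame: vec![0, 1, 2, 3],
        strings: vec!["my_func", "inner_func", "hot_path", "other_func"],
    }
}
```

Each distinct stack is stored once as a node in the prefix tree, and a sample is just an index into it. That is why two samples landing in the same place (stack index 1 appears twice in the example) cost one integer each rather than a repeated list of frames.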
Stange also authored the Rust crates that form samply’s internal stack: framehop for cross-platform stack unwinding, wholesym for symbol resolution across Mach-O, ELF, and PE/PDB formats, and fxprof-processed-profile for constructing Firefox Profiler JSON programmatically. Each is published independently, which means other tools can use them without taking samply as a dependency.
How sampling works on each platform
The simplicity of samply record hides a significant amount of platform-specific work.
On macOS, samply uses Mach kernel APIs. It calls task_for_pid() to obtain the task port for the target process, task_threads() to enumerate threads, and thread_suspend() / thread_get_state() to capture register state at each sample point. Frame pointer unwinding gives fast, reliable stacks when the binary is compiled with frame pointers preserved. For code without frame pointers, framehop falls back to DWARF CFI unwinding and also handles macOS’s Compact Unwind format, stored in the __unwind_info Mach-O section. This is important because Apple system frameworks use Compact Unwind extensively, and any tool that skips it will produce broken stacks whenever a sample lands inside a call into system framework code.
For profiling without sudo, samply injects the com.apple.security.get-task-allow entitlement via codesign when it launches a binary itself. This is what cargo samply does automatically, and it is why the workflow does not require elevated privileges for most development use cases.
On Linux, samply uses perf_event_open() with PERF_TYPE_SOFTWARE / PERF_COUNT_SW_CPU_CLOCK and PERF_SAMPLE_CALLCHAIN to collect kernel-provided stack traces via a memory-mapped ring buffer. This gives access to kernel frames as well as userspace frames in the same sample, which means you can see the full path from your Rust code through a syscall into the kernel. The usual perf_event_paranoid constraint applies: unprivileged users can sample kernel stacks only when the value is 1 or lower, so on systems with a stricter default you either lower it or run as root. Setting it to -1 removes the restrictions entirely:
echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
Stack quality on Linux depends heavily on whether your binary was compiled with frame pointers. For Rust, cargo samply injects -C force-frame-pointers=yes into RUSTFLAGS automatically. For C and C++, you may need to add -fno-omit-frame-pointer to your build. Without this, framehop falls back to DWARF CFI, which works but is slower to unwind.
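To show why frame pointers make unwinding cheap, here is a minimal Rust sketch of a frame-pointer walk over simulated stack memory. It assumes the common x86-64 / AArch64 layout in which the caller's saved frame pointer sits at [fp] and the return address at [fp + 8]. The Memory type and walk_frame_pointers function are hypothetical illustrations, not framehop's API, and a real unwinder reads these words from the sampled thread's stack rather than from a map.

```rust
use std::collections::HashMap;

/// Simulated stack memory: address -> 64-bit word stored there.
/// A real unwinder reads these words from the sampled thread's stack.
pub type Memory = HashMap<u64, u64>;

/// Walks the frame-pointer chain starting at `fp`.
/// Assumed layout (x86-64 / AArch64 with frame pointers preserved):
///   [fp]     = caller's saved frame pointer
///   [fp + 8] = return address into the caller
/// Returns the program counter followed by each return address found,
/// capped at `max_frames` entries.
pub fn walk_frame_pointers(mem: &Memory, pc: u64, mut fp: u64, max_frames: usize) -> Vec<u64> {
    let mut addresses = vec![pc];
    while addresses.len() < max_frames {
        let saved_fp = mem.get(&fp);
        let return_slot = mem.get(&fp.wrapping_add(8));
        let (Some(&next_fp), Some(&return_address)) = (saved_fp, return_slot) else {
            break; // unreadable memory: stop rather than guess
        };
        addresses.push(return_address);
        // The stack grows down, so each caller frame must sit at a higher
        // address. A zero or non-increasing value marks the end of the chain.
        if next_fp <= fp {
            break;
        }
        fp = next_fp;
    }
    addresses
}
```

Each step costs two memory reads and a comparison, with no debug information consulted. DWARF CFI unwinding, by contrast, has to look up and evaluate unwind rules for every frame, which is where the extra cost comes from.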
On Windows, samply uses ETW (Event Tracing for Windows) with the kernel profiling provider, which requires Administrator privileges. For symbol resolution it can use PDB files and Microsoft’s public symbol server at msdl.microsoft.com, giving full stacks through Windows API calls. Windows support is the newest of the three platforms and the most constrained by the privilege requirements of ETW.
Comparing samply to the existing alternatives
cargo-flamegraph is the tool samply most directly replaces in the Rust ecosystem. It wraps perf or dtrace and feeds the output through the inferno crate to produce an SVG. The SVG is self-contained and easy to share, which is an advantage. But it is a single static image: no timeline, no thread separation, no marker track, no way to filter to a specific time range. Inlined frames are collapsed unless your version of perf has specific DWARF support. samply’s output addresses all of these limitations.
perf on Linux is more powerful than samply in some respects, particularly for hardware counter profiling, eBPF integration, and system-wide profiling across all processes. It is also significantly more complex to operate. perf record, perf report, and flame graph generation from perf.data involve multiple commands and external scripts. samply imports perf.data files directly if you already have them, which covers the case where you want perf’s collection capabilities with Firefox Profiler’s visualization.
Instruments on macOS produces excellent profiles and its UI is genuinely good, but it requires Xcode, is awkward to script (recording from the command line is possible through xctrace, but the resulting traces are meant to be opened in the Instruments app), and cannot profile Linux or Windows builds. samply’s macOS sampling uses the same underlying Mach APIs and, in practice, produces comparable quality results for native binaries.
The inlining detail that matters
One specific technical advantage of samply over frame-pointer-only profilers is that it parses DWARF DW_TAG_inlined_subroutine entries to reconstruct inlined call frames. When the compiler inlines a function, frame-pointer walking sees only the outer frame; DWARF inlining information tells you which logical function was executing at each instruction address. The Firefox Profiler displays these as separate frames in the call tree and flame graph with an annotation indicating they were inlined.
For Rust code compiled in release mode with heavy inlining, this makes a substantial difference in profile readability. Without inlining reconstruction, you see a flat profile dominated by the outermost function that was not inlined, with none of the inner structure visible. With it, you see the full call hierarchy as it existed in the source, even for code that was entirely inlined by the optimizer.
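The expansion step can be sketched in a few lines of Rust. This is a simplified model, not wholesym's API: it assumes the debug info has already been flattened into non-overlapping address ranges, each carrying the chain of functions logically executing there (outermost first). Real DWARF stores inlined subroutines as nested entries that a symbolicator has to walk. InlineRange, expand_inlined, and the function names used with them are hypothetical.

```rust
/// One address range covered by debug info, with the chain of functions
/// that were logically executing there. The first entry is the outermost
/// (physical) function; later entries were inlined into it.
pub struct InlineRange {
    pub start: u64,
    pub end: u64, // exclusive
    pub chain: Vec<&'static str>,
}

/// Expands raw code addresses (outermost caller first) into logical frames.
/// Addresses with no matching debug info become a single "<unknown>" frame.
pub fn expand_inlined(ranges: &[InlineRange], addresses: &[u64]) -> Vec<String> {
    let mut frames = Vec::new();
    for &address in addresses {
        match ranges.iter().find(|r| r.start <= address && address < r.end) {
            // One physical frame may expand into several logical frames.
            Some(range) => frames.extend(range.chain.iter().map(|name| (*name).to_string())),
            None => frames.push("<unknown>".to_string()),
        }
    }
    frames
}
```

With a hypothetical range whose chain is run, parse, next_token, a single sampled address inside it yields three frames in the call tree, where a frame-pointer-only profiler would attribute all of the time to run.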
To get inlining information in a release profile, add this to Cargo.toml:
[profile.release]
debug = 1
Level 1 debug info includes line numbers and inlining records without the full local variable data, keeping the binary size reasonable.
Where samply fits in practice
For Rust development specifically, samply has become the most ergonomic path from “I need to find a hot path” to “I can see exactly what the program is doing.” The cargo samply subcommand handles build flags, entitlements on macOS, and browser launch in one command. The Firefox Profiler’s UI is familiar to anyone who has profiled a web application in Firefox, and its features (timeline scrubbing, per-thread flame graphs, marker annotations, profile diffing) are genuinely useful for understanding real performance problems.
The broader value is that Stange decomposed the profiling problem cleanly. The sampling backends, the unwinder, the symbol resolver, and the visualization layer are all separate components with defined interfaces. Other tools in the ecosystem, including custom profilers and performance testing frameworks, can use framehop or wholesym or fxprof-processed-profile independently. That modularity is the kind of infrastructure investment that tends to compound over time.
samply’s GitHub repository has installation instructions for all three platforms and covers the configuration options in detail.