Profiling has a platform problem. On macOS you reach for Instruments. On Linux you reach for perf and then spend time fighting flamegraph scripts. On Windows you capture an ETW trace or open VTune. Each of these is a different tool, a different UI, a different mental model for the same underlying question: where is my program spending its time?
samply, built by Mozilla engineer Markus Stange, takes a different approach. It is a command-line sampling profiler that works on macOS, Linux, and Windows, and it delegates the visualization and analysis entirely to the Firefox Profiler. The result is a consistent workflow across all three platforms: run one command, get one UI, analyze with one set of tools.
How Sampling Profilers Work
Before getting into samply specifically, it helps to understand what a sampling profiler does versus an instrumentation profiler. Instrumentation-based tools require the compiler to insert measurement code into every function; gprof’s -pg mode, for example, adds a counting hook at each function entry. This gives precise call counts but introduces overhead that can distort the program’s behavior, especially in tight loops.
Sampling profilers instead interrupt the running program at a fixed frequency, typically somewhere between 100 and 1000 Hz, and record the current call stack. Over thousands of samples, patterns emerge. Functions where the program spends most of its time show up frequently in the sample set. The overhead is low because you are only capturing a snapshot periodically, and the statistical picture is accurate enough for most optimization work.
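To make the statistical idea concrete, here is a minimal sketch of how a pile of captured stacks turns into self-time and total-time figures. The sample data and function names are made up for illustration, and this is not samply's own code.

```python
from collections import Counter

# Hypothetical samples: each entry is one captured call stack, outermost frame first.
samples = [
    ("main", "parse", "read_token"),
    ("main", "parse", "read_token"),
    ("main", "parse"),
    ("main", "render", "draw"),
    ("main", "parse", "read_token"),
]

def self_and_total(samples):
    """Count samples per function: self = leaf frame, total = anywhere on the stack."""
    self_counts, total_counts = Counter(), Counter()
    for stack in samples:
        self_counts[stack[-1]] += 1
        # set() so a recursive function is counted once per sample
        for func in set(stack):
            total_counts[func] += 1
    return self_counts, total_counts

self_counts, total_counts = self_and_total(samples)
for func, n in total_counts.most_common():
    share_total = n / len(samples)
    share_self = self_counts[func] / len(samples)
    print(f"{func:12} total={share_total:4.0%} self={share_self:4.0%}")
```

With only five samples the percentages are coarse. The same counting over thousands of samples is what makes the picture trustworthy.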
The OS mechanism for doing this varies by platform, and this is where samply does interesting engineering work underneath a simple interface.
Platform Internals
On macOS, samply uses the Mach APIs, specifically task_threads to enumerate threads and thread_get_state to read register state including the instruction pointer and frame pointer. Sampling therefore happens from outside the target process, with no recompilation of the program being profiled. macOS also provides the KDEBUG tracing infrastructure and task_info APIs that samply taps into for additional context like thread names and process lifecycle events.
On Linux, samply uses perf_event_open, the kernel interface that backs the perf command-line tool. This syscall sets up a ring buffer that the kernel fills with sample records at the requested frequency. samply reads those records, resolves symbols using DWARF debug information, and builds the profile. One useful feature here: samply can also load existing perf.data files, so if you already have a capture from a production system, you can feed it into the Firefox Profiler UI without recapturing.
On Windows, samply uses ETW (Event Tracing for Windows) under the hood, the same infrastructure that Windows Performance Analyzer relies on. ETW is a kernel-level tracing system that has been part of Windows since Windows 2000; it is low overhead and supports sampling at high frequencies.
The Firefox Profiler as a Backend
The front-end story is where samply diverges most sharply from conventional Unix tooling. Rather than generating a flamegraph SVG or dumping a text report, samply opens the Firefox Profiler in your browser and loads the captured data directly.
The Firefox Profiler (profiler.firefox.com) was originally built to analyze Firefox’s own performance, but its data format and UI are general enough to accept profiles from other sources. The format is a well-documented JSON schema that includes thread information, samples, call stacks, markers (for annotating specific events), and symbol tables. samply serializes its captured data into this format, starts a local web server to serve it, and opens the profiler UI at profiler.firefox.com pointed at that server. The profile stays on your machine unless you explicitly use the profiler’s upload button to share it.
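One idea from the format is worth a quick illustration: call stacks are not stored as full lists per sample. Each stack entry records a frame plus a link to its prefix (the parent stack), so shared prefixes are stored once. The sketch below is a simplified model of that idea, not the real schema. The class and method names are hypothetical, and it stores function names directly where a real profile would store frame indices.

```python
class StackTable:
    """Interns call stacks as (prefix, frame) pairs so shared prefixes are stored once."""

    def __init__(self):
        self.prefix = []   # index of the parent stack entry, or None for a root
        self.frame = []    # function name for this entry (simplification)
        self._index = {}   # (prefix, frame) -> stack index

    def intern(self, stack):
        """Return the index of the entry representing `stack` (outermost frame first)."""
        current = None
        for func in stack:
            key = (current, func)
            if key not in self._index:
                self._index[key] = len(self.frame)
                self.prefix.append(current)
                self.frame.append(func)
            current = self._index[key]
        return current

    def expand(self, index):
        """Rebuild the full stack for an entry by walking prefix links."""
        frames = []
        while index is not None:
            frames.append(self.frame[index])
            index = self.prefix[index]
        return frames[::-1]

table = StackTable()
a = table.intern(["main", "parse", "read_token"])
b = table.intern(["main", "parse", "skip_whitespace"])
c = table.intern(["main", "parse", "read_token"])  # same stack as `a`
print(a == c, len(table.frame), table.expand(b))
```

Each sample then only needs to carry a single stack index, which keeps profiles with many thousands of samples compact.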
The profiler UI is genuinely good. It shows a timeline with per-thread activity, a flame graph, a call tree sorted by self time or total time, and an inverted call tree for finding which callers are responsible for hot functions. You can filter by thread, zoom into time ranges, and view the source of functions with debug info available. For Rust programs in particular, samply handles demangling well and can display inlined function information when the binary is compiled with sufficient debug symbols.
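The caller-oriented view is easy to underestimate, so here is a rough sketch of what it computes: for every sample whose innermost frame is the hot function, group by the chain of callers that led there. The sample data and the callers_of helper are hypothetical, and the real UI builds a full tree rather than a flat count.

```python
from collections import Counter

# Hypothetical samples, outermost frame first.
samples = [
    ("main", "load", "memcpy"),
    ("main", "load", "memcpy"),
    ("main", "save", "encode", "memcpy"),
    ("main", "save", "encode"),
]

def callers_of(samples, hot):
    """Count caller chains (innermost caller first) for samples whose leaf frame is `hot`."""
    chains = Counter()
    for stack in samples:
        if stack[-1] == hot:
            chains[tuple(reversed(stack[:-1]))] += 1
    return chains

for chain, count in callers_of(samples, "memcpy").most_common():
    print(count, " <- ".join(("memcpy",) + chain))
```

Here the load path accounts for two of the three memcpy samples, which is the kind of answer a top-down call tree makes harder to see.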
The key thing this buys you is not features you could not get elsewhere. It is consistency. The same UI, keyboard shortcuts, and analysis workflow work regardless of whether you are profiling a C++ server on Linux, a Rust CLI tool on macOS, or a Windows application. If you work across multiple platforms or hand profiles to teammates on different OSes, this matters.
Basic Usage
The interface is straightforward. To profile a program:
samply record ./my-program --some-arg
samply launches the program, samples it while it runs, and opens the Firefox Profiler once the process exits. For programs that run until interrupted:
samply record ./server
# Ctrl+C when done
You can attach to an already-running process by PID:
samply record --pid 12345
And on Linux, load an existing perf capture (recent releases name this subcommand import, so check samply --help if load rejects the file):
samply load perf.data
The sampling rate defaults to 1000 Hz but can be tuned. The --rate flag takes a frequency in Hz. Higher rates give more resolution for short-lived hotspots but increase overhead and file size.
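The trade-off is simple arithmetic, sketched below. The sampling_budget helper is hypothetical, and the model ignores real-world details such as threads that are asleep and therefore produce no CPU samples.

```python
def sampling_budget(rate_hz, duration_s, hotspot_ms):
    """Return (ms between samples, total samples, samples expected inside the hotspot)."""
    interval_ms = 1000.0 / rate_hz
    total_samples = rate_hz * duration_s
    hotspot_samples = hotspot_ms / interval_ms
    return interval_ms, total_samples, hotspot_samples

for rate in (100, 1000, 4000):
    interval, total, hot = sampling_budget(rate, duration_s=10, hotspot_ms=5)
    print(f"{rate:>5} Hz: one sample every {interval:.2f} ms, "
          f"{total} samples in 10 s, ~{hot:.1f} samples in a 5 ms hotspot")
```

At 100 Hz a 5 ms burst of work is more likely to be missed than seen. Raising the rate makes it visible, and the total sample count (and with it the profile size) grows in proportion.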
For Rust programs, the best results come from building with debug symbols included even in release mode:
[profile.release]
debug = true
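This keeps symbol and inlining information that samply can use when resolving stacks. The optimization level is unchanged, so runtime performance is essentially the same; the cost is a larger binary and somewhat longer builds.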
Comparison to the Native Alternatives
On macOS, the main alternative is Instruments. Instruments is powerful, supports a wide range of instrument types beyond CPU profiling (memory, file I/O, GPU), and integrates tightly with the OS. But it has a steep learning curve, a complex UI, and it is macOS-only. For pure CPU profiling on a command-line workflow, samply is considerably faster to get into. samply record ./thing and you have results in seconds; with Instruments you are navigating a multi-pane Xcode-adjacent UI before you have even started a capture.
On Linux, perf itself is the standard tool. It is flexible and powerful, but analyzing perf output without reaching for Brendan Gregg’s flamegraph scripts or a GUI like Hotspot requires comfort with its text-based output. Hotspot is probably the closest Linux-native equivalent to samply’s Firefox Profiler integration, but it is a Qt application that you install separately, not part of the profiling tool itself. samply bundles the analysis step into the profiling command.
On Windows, ETW-based profiling via Windows Performance Analyzer is capable but carries significant setup friction and a UI that reflects decades of enterprise tooling decisions. For developers doing cross-platform work who are comfortable with a browser-based analysis UI, samply is a meaningful improvement in time-to-insight.
Symbol Resolution and Debug Info
One of the more technically interesting parts of samply is its symbol resolution pipeline. Raw samples contain instruction addresses. Turning those into readable function names requires resolving symbols, which means reading the binary’s symbol table or DWARF debug information.
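At its simplest, the lookup is a binary search over symbols sorted by start address. The sketch below assumes a tiny, made-up symbol table. Real symbol tables and DWARF line programs carry far more detail, and this is not how samply's own resolver is structured.

```python
import bisect

# Hypothetical symbol table: (start_address, size, name), sorted by start address.
SYMBOLS = [
    (0x1000, 0x40, "main"),
    (0x1040, 0x100, "parse"),
    (0x1200, 0x80, "read_token"),
]
STARTS = [start for start, _, _ in SYMBOLS]

def symbolicate(address):
    """Map an instruction address to 'name+offset', or None if no symbol covers it."""
    i = bisect.bisect_right(STARTS, address) - 1
    if i < 0:
        return None  # address is below the first symbol
    start, size, name = SYMBOLS[i]
    if address >= start + size:
        return None  # address falls in a gap between symbols
    return f"{name}+{address - start:#x}"

for addr in (0x1044, 0x1200, 0x1180, 0x0FFF):
    print(hex(addr), symbolicate(addr))
```

A profiler runs this kind of lookup for every frame of every sample, which is why resolvers cache aggressively and work per library rather than per address.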
samply handles this on-the-fly using the object and addr2line crates, which parse ELF, Mach-O, and PE formats. For system libraries on macOS, it can pull symbols from the system dyld shared cache. For binaries with separate .dSYM bundles or split DWARF, samply knows to look in the conventional locations.
For inlined functions, which are common in optimized Rust and C++ code, samply surfaces inline frames as distinct entries in the call tree. A call to a small utility function that got inlined three levels deep will still show up attributed correctly, provided the debug information was retained. This is not universally available in competing tools’ default configurations, and it meaningfully improves the accuracy of profiling results for languages that inline aggressively.
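Inline expansion can be pictured as a second lookup: once an address is attributed to an outer function, address ranges from the debug info say which inlined callees were logically active there. In DWARF this information lives in DW_TAG_inlined_subroutine entries. The table and the expand_inlines helper below are hypothetical stand-ins for that data.

```python
# Hypothetical inline records per outer function: (start, end, call chain from the
# outer function down to the innermost inlined callee covering that address range).
INLINE_RANGES = {
    "parse": [
        (0x1040, 0x1060, ["parse"]),
        (0x1060, 0x1080, ["parse", "next_char"]),
        (0x1080, 0x10A0, ["parse", "next_char", "decode_utf8"]),
    ],
}

def expand_inlines(outer, address):
    """Return the logical frames for one physical frame, outermost first."""
    for start, end, chain in INLINE_RANGES.get(outer, []):
        if start <= address < end:
            return chain
    return [outer]  # no inline info: report the outer function alone

print(expand_inlines("parse", 0x1088))
print(expand_inlines("main", 0x1000))
```

Without the debug info, every sample in those ranges would be charged to parse alone, hiding the cost of the inlined helpers.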
Where It Fits
samply works best for profiling the startup-to-exit lifetime of a command-line program, or for capturing a window of a long-running server by attaching and then detaching. It is not designed for continuous production monitoring or for profiling inside a container without host access to configure perf_event_open.
The tool is written in Rust, which means it ships as a single self-contained executable with no language runtime to install. Installation via cargo install samply gets you running quickly. The GitHub repository at github.com/mstange/samply is actively maintained, with Markus Stange continuing to improve symbol resolution, add platform-specific features, and keep pace with changes in the Firefox Profiler format.
For anyone doing performance work across multiple platforms, or for Rust developers in particular who want a low-friction profiling workflow without learning Instruments or stitching together perf and flamegraph scripts, samply is worth keeping in the toolbox.