Two Runtimes, One Problem: How dial9 and Go's FlightRecorder Approach Production Async Debugging
Source: lobsters
Go 1.25 shipped runtime/trace.FlightRecorder in August 2025, capping a three-release engineering effort that started in Go 1.21. Tokio’s dial9 was announced within months of it. Both tools do the same thing: capture async runtime events into a bounded ring buffer that you can dump when something goes wrong. The timing is coincidental, but the parallel development is not. The same gap existed in both ecosystems, and both teams arrived at the same conceptual solution. What is interesting is how different the implementation paths are, and what those differences reveal about the design constraints each runtime imposed.
The Same Gap, Two Contexts
Production async services fail in ways that are structurally invisible. A Tokio service starts showing tail latency spikes at 3am and recovers before anyone attaches a debugger. A Go HTTP server processes a request that normally completes in 5ms but occasionally takes 500ms with no obvious explanation. In both cases, the standard observability stack tells you something went wrong, not what the runtime was doing in the moments before.
For Go, the existing tools were runtime/trace (too expensive to run continuously), pprof (statistical aggregates, no causal information), and expvar-style counters (current state, no history). For Tokio, the equivalent inventory is tokio-console (live gRPC stream, requires a connected consumer), RuntimeMetrics (aggregate counters, no event sequence), and tracing subscribers (unbounded output rate at production event volumes). Both ecosystems had development-time tools and aggregate production metrics, but nothing that captured event-level runtime history in a way safe enough to leave running indefinitely.
A flight recorder solves this with three properties: it runs continuously without being asked, it uses a circular buffer so memory usage stays constant, and it can be dumped at any point without having been connected beforehand.
Why Go Needed Three Releases
Go could not just add a ring buffer to runtime/trace and ship it. Two fundamental problems had to be solved first.
The first was overhead. Before Go 1.21, enabling the execution tracer cost 10 to 20 percent CPU, primarily due to expensive stack unwinding at every trace event. Go 1.21 introduced frame-pointer unwinding, which follows a linked list of frame pointers rather than scanning stack metadata. Overhead dropped to approximately 1 to 2 percent. A continuously running flight recorder at 15% overhead is not a tool anyone deploys in production. At 1 to 2%, it sits alongside heap profiling and block profiling as an acceptable steady-state cost.
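To make the cost difference concrete, here is a toy model of frame-pointer unwinding. It is not Go's runtime code: the stack is simulated as a slice of words, and the layout (saved frame pointer at `fp`, return address at `fp + 1`, with 0 ending the chain) is an assumption chosen purely for illustration.

```rust
/// Walks a simulated stack by following saved frame pointers.
/// Assumed layout: stack[fp] holds the caller's frame pointer
/// (0 terminates the chain) and stack[fp + 1] holds the return address.
pub fn unwind(stack: &[usize], mut fp: usize, max_depth: usize) -> Vec<usize> {
    let mut return_addrs = Vec::new();
    // max_depth guards against a corrupted (cyclic) chain.
    while fp != 0 && fp + 1 < stack.len() && return_addrs.len() < max_depth {
        return_addrs.push(stack[fp + 1]);
        fp = stack[fp]; // one load per frame, no metadata lookup
    }
    return_addrs
}
```

Each frame costs two loads, which is why this style of unwinding is cheap enough to run at every trace event.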
The second problem was structural. Go’s execution trace format contains cross-references: a goroutine ID is assigned at creation and referenced throughout the trace. A ring buffer that discards old data would discard the event that assigned meaning to an ID, making every subsequent reference to that goroutine unreadable. You cannot simply truncate the beginning of a trace stream and expect the remainder to be interpretable.
Go 1.22 rewrote the trace format to support periodic splitting at checkpoints. Each checkpoint produces a self-contained segment that includes all ID-to-name mappings and state snapshots needed to interpret it in isolation. The flight recorder maintains a deque of these segments, discarding old ones from the head and appending new ones at the tail. Because each segment is self-contained, discarding old data does not corrupt newer data. The ring buffer semantics only work because of this property.
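The segment deque can be sketched in a few lines. This is a simplified model written in Rust, not Go's implementation: the `Segment` and `SegmentWindow` types are hypothetical, and the eviction policy (drop whole segments from the head once they fall outside the age window or the byte budget is exceeded) is an assumption based on the description above.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// One self-contained trace segment (hypothetical shape).
pub struct Segment {
    pub end: Duration, // time of the segment's last event, since recorder start
    pub bytes: usize,
}

pub struct SegmentWindow {
    segments: VecDeque<Segment>,
    total_bytes: usize,
    min_age: Duration,
    max_bytes: usize,
}

impl SegmentWindow {
    pub fn new(min_age: Duration, max_bytes: usize) -> Self {
        SegmentWindow { segments: VecDeque::new(), total_bytes: 0, min_age, max_bytes }
    }

    /// Appends a finished segment at the tail, then evicts from the head.
    pub fn push(&mut self, seg: Segment) {
        let now = seg.end;
        self.total_bytes += seg.bytes;
        self.segments.push_back(seg);
        // Always keep the newest segment; evict only whole segments.
        while self.segments.len() > 1 {
            let oldest = &self.segments[0];
            let over_budget = self.total_bytes > self.max_bytes;
            let outside_window = now.saturating_sub(oldest.end) > self.min_age;
            if !(over_budget || outside_window) {
                break;
            }
            let removed = self.segments.pop_front().expect("len > 1");
            self.total_bytes -= removed.bytes;
        }
    }

    pub fn len(&self) -> usize {
        self.segments.len()
    }

    pub fn total_bytes(&self) -> usize {
        self.total_bytes
    }
}
```

Because eviction only ever removes a whole segment, whatever remains in the deque is still readable on its own.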
The result is runtime/trace.FlightRecorder in Go 1.25:
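    // Assumes imports: "log", "os", "runtime/trace", "time".
    fr := trace.NewFlightRecorder(trace.FlightRecorderConfig{
        MinAge:   5 * time.Second, // keep roughly the last 5 seconds of events
        MaxBytes: 10 << 20,        // cap the buffered trace data at about 10 MiB
    })
    if err := fr.Start(); err != nil {
        log.Fatal(err)
    }
    defer fr.Stop()
    // later, when an anomaly is detected:
    f, err := os.Create("snapshot.trace")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    if _, err := fr.WriteTo(f); err != nil {
        log.Print(err)
    }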
The total engineering investment was Go 1.21 (overhead reduction), Go 1.22 (trace format rewrite), and Go 1.25 (the actual flight recorder API built on those foundations).
How Tokio’s Path Was Different
Tokio did not need to rearchitect the runtime to ship dial9. The tracing crate provides a composable subscriber model, and Tokio already emits structured runtime-level events through it when compiled with the tokio_unstable flag. A flight recorder for Tokio is, architecturally, a tracing layer that routes those events into a ring buffer rather than forwarding them to a socket or writing them to disk.
This is a meaningful structural advantage. dial9 ships as an ordinary crate, requires no forked or patched runtime (only the tokio_unstable cfg flag noted above), and composes with the existing subscriber stack:
    use dial9::FlightRecorder;
    use tracing_subscriber::prelude::*;

    fn main() {
        // Ring buffer of 65,536 events, dumped to disk automatically on panic.
        let recorder = FlightRecorder::builder()
            .capacity(65_536)
            .install_panic_hook(true)
            .dump_path("/var/log/myservice/tokio-flight.bin")
            .build();

        // The recorder is one layer among others in the subscriber stack.
        tracing_subscriber::registry()
            .with(recorder.layer())
            .with(tracing_subscriber::fmt::layer())
            .init();

        // async_main is the application's async entry point (not shown).
        tokio::runtime::Builder::new_multi_thread()
            .enable_all()
            .build()
            .unwrap()
            .block_on(async_main());
    }
The install_panic_hook option registers a hook that calls recorder.dump() before the default panic handler, which means every panic automatically produces a runtime trace covering the preceding seconds. This is the most immediately useful production configuration.
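The source does not show how dial9 installs its hook, but the usual way to run code before the default panic handler in Rust is to chain hooks with `std::panic::take_hook` and `std::panic::set_hook`. A minimal sketch, where `install_dump_hook` and the `dump` closure are hypothetical names standing in for the recorder's dump call:

```rust
use std::panic;

/// Installs a panic hook that runs `dump` first, then the previously
/// installed hook (by default, the one that prints the panic message).
/// `dump` is a placeholder for whatever writes the recording out.
pub fn install_dump_hook<F>(dump: F)
where
    F: Fn() + Send + Sync + 'static,
{
    let previous = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        dump(); // flush the recording while the process is still alive
        previous(info); // then let the normal panic reporting run
    }));
}
```

Chaining rather than replacing the previous hook keeps the normal panic message and backtrace intact.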
dial9 registers as a standard tracing layer and consumes only events bearing Tokio’s own internal targets, so your application’s own spans pass through unaffected. You can stack it alongside tracing-opentelemetry, tracing_subscriber::fmt, or any other layer without interference.
The Design Decisions That Keep Overhead Under 1%
The practical question with any always-on production tool is what it costs per event. For dial9, the hot path is a ring buffer write: load an atomic write index, write an event struct to a pre-allocated slice, increment the index with wrapping arithmetic. This is a handful of nanoseconds per event on modern hardware.
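The write path described above can be sketched as follows. This is an illustration, not dial9's code: the `Event` fields and the `Ring` type are assumptions, and the sketch takes `&mut self` so that it stays in safe Rust. A recorder that lets another thread read the buffer during writes would need interior mutability and more careful synchronization.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical event payload; the real record layout is not shown in the source.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct Event {
    pub timestamp_ns: u64,
    pub task_id: u64,
    pub kind: u32,
}

pub struct Ring {
    slots: Box<[Event]>,  // pre-allocated once; length is a power of two
    written: AtomicUsize, // total number of events ever written
}

impl Ring {
    pub fn with_capacity(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two());
        Ring {
            slots: vec![Event::default(); capacity].into_boxed_slice(),
            written: AtomicUsize::new(0),
        }
    }

    /// Hot path: load the index, overwrite one slot, bump the index.
    pub fn record(&mut self, ev: Event) {
        let n = self.written.load(Ordering::Relaxed);
        let mask = self.slots.len() - 1;
        self.slots[n & mask] = ev; // masking wraps the index around the buffer
        self.written.store(n.wrapping_add(1), Ordering::Release);
    }

    /// Cold path: copy out the surviving events, oldest first.
    /// (Overflow of the usize counter is ignored in this sketch.)
    pub fn snapshot(&self) -> Vec<Event> {
        let n = self.written.load(Ordering::Acquire);
        let cap = self.slots.len();
        let len = n.min(cap);
        ((n - len)..n).map(|i| self.slots[i & (cap - 1)]).collect()
    }
}
```

Nothing on the `record` path allocates or blocks; the only cost that grows with buffer size is the copy in `snapshot`, which runs at dump time.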
Three design choices keep the per-event cost bounded.
Per-worker-thread ring buffers. Tokio’s multi-thread runtime runs multiple worker threads. If all threads wrote to a single shared buffer, every event write would require cross-thread synchronization. Per-worker-thread buffers eliminate contention on the write path entirely. Reads at dump time require a brief stop to snapshot each buffer, but writes, which happen at event rate, are contention-free.
Fixed-size 32-byte event records. Variable-length records require either allocation on the write path or a copying scheme that adds complexity. Fixed-size records mean a write is a bounds check and a struct copy into pre-allocated memory. No heap allocation and no lock: the only state touched is the per-thread write index.
Sampling for high-frequency events. Individual task polls below a configurable threshold (default: 1ms) are counted but not individually recorded. Polls above the threshold get full records with source location and duration. This is the same trade-off Java Flight Recorder makes with method profiling: statistical sampling is sufficient for the common case, and exact records are reserved for events that are individually significant. The result is that high-throughput services generating thousands of sub-millisecond polls per second do not saturate the buffer with low-value records, preserving capacity for the longer polls that are diagnostically relevant.
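The record size and the sampling rule can both be made concrete. The field layout below is a guess at what fits in 32 bytes, not dial9's actual format, and `PollSampler` is a hypothetical name. The sketch collects long polls in a `Vec` for simplicity; a real recorder would write them into the pre-allocated ring buffer instead.

```rust
use std::mem::size_of;

/// A hypothetical 32-byte poll record: 8 + 8 + 8 + 4 + 2 + 2 bytes, no padding.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct PollRecord {
    pub start_ns: u64,
    pub duration_ns: u64,
    pub task_id: u64,
    pub location_id: u32, // index into a side table of source locations
    pub worker: u16,
    pub kind: u16,
}

// Compile-time check that the layout really is 32 bytes.
const _: () = assert!(size_of::<PollRecord>() == 32);

pub struct PollSampler {
    threshold_ns: u64,
    pub short_polls: u64,         // polls below the threshold: counted only
    pub records: Vec<PollRecord>, // polls at or above it: kept in full
}

impl PollSampler {
    pub fn new(threshold_ns: u64, capacity: usize) -> Self {
        PollSampler { threshold_ns, short_polls: 0, records: Vec::with_capacity(capacity) }
    }

    pub fn observe(&mut self, rec: PollRecord) {
        if rec.duration_ns < self.threshold_ns {
            self.short_polls += 1;
        } else {
            self.records.push(rec);
        }
    }
}
```

With a 1ms threshold, a worker doing thousands of sub-millisecond polls per second adds to one counter, while a single 5ms poll still gets its own record.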
The reported overhead is under 1%, putting it on par with JFR’s default configuration (also under 1%) and at or below Go’s FlightRecorder (1 to 2%). The difference from Go’s starting point is notable: Go needed the frame-pointer work in Go 1.21 to bring tracing overhead from 10 to 20% down to 1 to 2%, and the format rewrite in Go 1.22 before a ring buffer was possible at all. Tokio’s tracing subscriber model meant dial9 could achieve similar overhead without runtime modifications, because the instrumentation was already designed to be composable and opt-in.
What Async Runtimes Uniquely Expose
Both tools share the flight recorder concept, but they record different things. Go’s execution tracer is causal and comprehensive: it records every goroutine state transition, every scheduler event, every GC phase change. The output tells you exactly what every goroutine was doing at every moment during the recording window.
dial9 captures Tokio’s specific event vocabulary: task lifecycle transitions (spawned, first polled, polled, suspended, woken, dropped), poll durations, waker interactions, scheduler activity such as task steals across worker threads, and I/O readiness notifications from the reactor.
The metric that matters most for async services is wakeup-to-poll latency: the time between when a task’s waker fires and when the task is actually polled. A healthy Tokio service has wakeup-to-poll latency measured in microseconds. When it climbs into the tens or hundreds of milliseconds, tasks are becoming runnable faster than the worker threads can poll them. This shows up clearly in the dial9 event sequence, where a task’s wake event at time T is followed by a poll event at T+180ms. That pattern is scheduler starvation, and it is invisible to RuntimeMetrics (which gives aggregate counters and histograms, not per-task event sequences) and to application-level tracing (which never sees the gap because the application code is not running during it).
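Given an ordered event stream, wakeup-to-poll latency is straightforward to compute. The sketch below assumes a simplified event shape (`RuntimeEvent` with a microsecond timestamp, a task ID, and a `Wake` or `PollStart` kind); these names are illustrative and do not reflect dial9's dump format.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Kind {
    Wake,
    PollStart,
}

#[derive(Clone, Copy, Debug)]
pub struct RuntimeEvent {
    pub at_us: u64,
    pub task_id: u64,
    pub kind: Kind,
}

/// Returns (task_id, latency_us) for each wake-then-poll pair whose
/// latency is at least `threshold_us`. Events must be in time order.
pub fn slow_wakeups(events: &[RuntimeEvent], threshold_us: u64) -> Vec<(u64, u64)> {
    let mut pending: HashMap<u64, u64> = HashMap::new(); // task_id -> earliest unserviced wake
    let mut slow = Vec::new();
    for ev in events {
        match ev.kind {
            Kind::Wake => {
                // Repeated wakes before a poll do not reset the clock.
                pending.entry(ev.task_id).or_insert(ev.at_us);
            }
            Kind::PollStart => {
                if let Some(woken_at) = pending.remove(&ev.task_id) {
                    let latency = ev.at_us.saturating_sub(woken_at);
                    if latency >= threshold_us {
                        slow.push((ev.task_id, latency));
                    }
                }
            }
        }
    }
    slow
}
```

A wake that never finds a matching poll stays in `pending`, which is also how a permanently suspended task would show up in this kind of analysis.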
Scheduling-level information of this kind is what makes a flight recorder specifically valuable for async runtimes, rather than just for runtimes generally. The scheduling gaps between futures, the dropped wakers that leave tasks permanently suspended, the specific tasks whose long polls block worker threads: these are events that exist only at the runtime layer, and they only become visible when you have a continuous record of runtime activity to examine after the fact.
The Async-Signal-Safe Constraint
One engineering detail worth noting is that the dump path has strict requirements. When dial9 writes a recording on panic, it is operating in a context where it is safest to assume the normal rules do not apply: it should not allocate, should not take mutexes that the panicking thread might already hold, and must complete the dump reliably even if the heap is in an inconsistent state.
Strictly speaking, async-signal-safety is a requirement on Unix signal handlers: such code may call only functions on the POSIX async-signal-safe list, which includes write(2) but not malloc or buffered stdio. A dump path held to that standard has to work from pre-allocated output buffers and avoid heap allocation. This constrains how the dump serialization can work: the binary format must be designed around a pre-allocated output buffer that can be flushed without dynamic allocation. It is a genuine engineering constraint that the fixed-size 32-byte record format serves directly: a contiguous pre-allocated slice of fixed-size records can be written to a file descriptor as a single buffer without any transformation.
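An allocation-free dump of fixed-size records might look like the sketch below. The `Record` fields, the little-endian encoding, and the `dump` function are assumptions for illustration. The scratch buffer is allocated by the caller ahead of time, and the records are encoded field by field to stay in safe Rust rather than reinterpreting memory.

```rust
use std::io::{self, Write};

pub const RECORD_SIZE: usize = 32;

/// Hypothetical fixed-size record: 8 + 8 + 8 + 4 + 4 bytes.
#[derive(Clone, Copy)]
pub struct Record {
    pub timestamp_ns: u64,
    pub duration_ns: u64,
    pub task_id: u64,
    pub kind: u32,
    pub worker: u32,
}

impl Record {
    /// Encodes the record as 32 little-endian bytes into `out`.
    fn encode(&self, out: &mut [u8]) {
        out[0..8].copy_from_slice(&self.timestamp_ns.to_le_bytes());
        out[8..16].copy_from_slice(&self.duration_ns.to_le_bytes());
        out[16..24].copy_from_slice(&self.task_id.to_le_bytes());
        out[24..28].copy_from_slice(&self.kind.to_le_bytes());
        out[28..32].copy_from_slice(&self.worker.to_le_bytes());
    }
}

/// Encodes records into `scratch` (allocated at startup, not here) and hands
/// the sink one contiguous buffer. Records that do not fit are left out.
/// Returns the number of bytes written.
pub fn dump<W: Write>(records: &[Record], scratch: &mut [u8], sink: &mut W) -> io::Result<usize> {
    let n = records.len().min(scratch.len() / RECORD_SIZE);
    for (rec, chunk) in records[..n].iter().zip(scratch.chunks_exact_mut(RECORD_SIZE)) {
        rec.encode(chunk);
    }
    sink.write_all(&scratch[..n * RECORD_SIZE])?;
    Ok(n * RECORD_SIZE)
}
```

Note that `write_all` may still issue more than one underlying write if the sink accepts only part of the buffer; what the fixed-size format guarantees is that no allocation or per-record framing is needed on the way out.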
Go’s FlightRecorder.WriteTo does not face the same constraint in the same way because it is designed to be called from user code in a goroutine, not from a signal handler. The panic hook use case in dial9, where the dump fires from a panic hook before process exit, is closer to the signal handler model and imposes requirements that Go’s design does not.
What Each Approach Trades Away
Go’s flight recorder captures the full execution trace, including GC events, scheduler preemptions, and all goroutine transitions. The output loads into go tool trace and shows a complete picture. The cost is that Go had to do substantial runtime engineering to make the format segmentable, and the data rate is 2 to 10 MB per second, which means a 5-second window with a 50 MiB cap is a typical configuration.
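The sizing arithmetic behind that configuration is simple enough to write down. The helper below is hypothetical and assumes decimal megabytes for the data rate:

```rust
/// Bytes needed to hold `window_secs` of trace data at `rate_mb_per_s`
/// (decimal megabytes per second).
pub fn window_bytes(rate_mb_per_s: u64, window_secs: u64) -> u64 {
    rate_mb_per_s * 1_000_000 * window_secs
}

/// 50 MiB expressed in bytes.
pub const CAP_50_MIB: u64 = 50 * 1024 * 1024;
```

At 2 MB per second a 5-second window needs about 10 MB, and at 10 MB per second it needs about 50 MB, so a 50 MiB cap covers the upper end of the stated range.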
dial9 captures Tokio’s runtime event vocabulary and nothing outside it. It does not know about your application’s own spans unless those spans happen to be emitted under Tokio’s internal targets. The companion CLI reconstructs task timelines and wakeup graphs from the binary dump, but it operates on a different level of abstraction than Go’s full execution trace viewer.
The trade-offs reflect the architectural difference. Go’s flight recorder is a runtime feature built into the standard library after three releases of foundational work. dial9 is a crate that composes with the existing tracing ecosystem, which means it ships faster and integrates more naturally with the Rust library model, but it is bounded by what the tracing subscriber interface exposes.
For the specific class of problems these tools target, both approaches work. The problems they make visible (scheduler starvation, dropped wakers, async deadlocks, skewed per-task poll times) are the ones that survive restarts, resist reproduction, and cost the most engineering time to diagnose from first principles. A tool that eliminates that cost at sub-1% overhead is worth adding to the production stack regardless of which runtime you are running.