Post-mortem Debugging for Async Rust: dial9 and the Flight Recorder Approach

Source: lobsters

Production async Rust has a specific failure mode that is genuinely hard to handle: something goes wrong, the process crashes or hangs, and your tracing output tells you what your application code was doing, but not what the runtime was doing. You can see that your HTTP handler panicked; you cannot easily tell whether it was poll-starved for 200ms before it got a chance to run, whether a waker was dropped on the floor, or what sequence of I/O events preceded the failure. That gap is what dial9 is designed to close.

The existing observability story

Tokio’s observability ecosystem centers on the tracing crate, which provides structured, context-aware instrumentation through a composable subscriber model. Applications emit spans and events at any granularity, and Tokio itself emits runtime-level events under the tokio and runtime targets when it is built with --cfg tokio_unstable and its tracing feature enabled. tokio-console builds on this: it connects to a running process over a gRPC stream and shows a live task-level view, including poll durations, waker counts, and which tasks are currently blocked.

tokio-console is excellent for interactive debugging during development. In production it carries real trade-offs: the subscriber accumulates per-task data, serializes it continuously over a network connection, and assumes a consumer is attached. Running it with no consumer wastes resources, and running it under production load has measurable overhead, especially with high task counts. More fundamentally, it is designed for live inspection, not post-mortem analysis. If your service hangs at 3am and then recovers, tokio-console cannot tell you what happened.

Tokio also exposes a RuntimeMetrics API that gives you aggregate counters: tasks spawned, tasks currently alive, mean poll duration, and steal counts across worker threads (several of these are only available under tokio_unstable). These are useful for dashboards and alerting. They are not useful for reconstructing a specific sequence of events.

What a flight recorder does differently

The term comes from aviation. A flight data recorder does not transmit telemetry continuously; it writes to a fixed-size ring buffer, overwriting the oldest entries as new ones arrive. When something goes wrong, you recover the buffer and get the last N seconds of data. Overhead is bounded and constant regardless of whether anyone is watching.
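
To make the overwrite-the-oldest behaviour concrete, here is a minimal std-only sketch. RingBuffer and its methods are hypothetical names chosen for illustration, not dial9 types, and a production recorder would likely use a lock-free or per-thread design rather than this single-owner version.

```rust
/// Fixed-capacity ring buffer: once full, each push overwrites the oldest entry.
/// Hypothetical type for illustration only; not part of dial9.
pub struct RingBuffer<T> {
    slots: Vec<Option<T>>,
    next: usize, // index the next push will write to
    len: usize,  // number of live entries, capped at capacity
}

impl<T: Clone> RingBuffer<T> {
    /// Allocates every slot up front so pushes never allocate.
    pub fn with_capacity(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        RingBuffer {
            slots: vec![None; capacity],
            next: 0,
            len: 0,
        }
    }

    /// Stores an item, silently replacing the oldest one when the buffer is full.
    pub fn push(&mut self, item: T) {
        let cap = self.slots.len();
        self.slots[self.next] = Some(item);
        self.next = (self.next + 1) % cap;
        self.len = (self.len + 1).min(cap);
    }

    /// Returns the retained entries, oldest first.
    pub fn snapshot(&self) -> Vec<T> {
        let cap = self.slots.len();
        let start = (self.next + cap - self.len) % cap;
        (0..self.len)
            .filter_map(|i| self.slots[(start + i) % cap].clone())
            .collect()
    }
}
```

With a capacity of 3, pushing the values 1 through 5 leaves 3, 4 and 5 in the buffer: the two oldest entries have been overwritten, and memory use never grew past three slots.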

Applied to a runtime, this means capturing task lifecycle events (spawn, first poll, wake, idle, drop), I/O readiness notifications from the reactor, timer firings, and thread park and unpark transitions, all in a circular in-process buffer. Under normal operation the buffer rotates in memory without any external I/O. On a panic, on a Unix signal, or on a user-triggered dump, you flush it and get a structured timeline of what the runtime did in the moments before the event.

This pattern is well-established elsewhere. Java Flight Recorder, which became open source as part of JEP 328 in JDK 11, takes a similar always-on, buffered approach and is now standard tooling for JVM production diagnostics. The Linux kernel’s ftrace ring buffer operates on the same principle at the OS level. The concept maps cleanly to async runtimes because those runtimes already have well-defined event boundaries: every poll, every wake, every I/O callback is a discrete, timestamped occurrence.

How dial9 integrates with Tokio

dial9 registers as a tracing layer in the subscriber stack and routes Tokio’s runtime-level events into a ring buffer rather than forwarding them over a socket or serializing them to disk continuously. The buffer is in-process, allocated up front, and written to with minimal locking. Event records are compact: a task ID, an event type, a timestamp, and a small amount of metadata.
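
As a rough illustration of how small such a record can be, the sketch below packs a timestamp, a task ID, an event type, and a metadata word into 21 bytes. The field layout, the EventKind variants, and the RECORD_LEN constant are all assumptions made for this example; the source does not describe dial9’s actual record format.

```rust
/// Hypothetical event types, mirroring the task lifecycle events named above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
pub enum EventKind {
    Spawn = 0,
    Poll = 1,
    Wake = 2,
    Idle = 3,
    Dropped = 4,
}

/// Hypothetical compact record; not dial9's real layout.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Event {
    pub timestamp_ns: u64,
    pub task_id: u64,
    pub kind: EventKind,
    pub meta: u32,
}

/// 8 (timestamp) + 8 (task id) + 1 (kind) + 4 (metadata) bytes.
pub const RECORD_LEN: usize = 21;

impl Event {
    /// Serializes the event into a fixed-size little-endian record.
    pub fn encode(&self) -> [u8; RECORD_LEN] {
        let mut out = [0u8; RECORD_LEN];
        out[0..8].copy_from_slice(&self.timestamp_ns.to_le_bytes());
        out[8..16].copy_from_slice(&self.task_id.to_le_bytes());
        out[16] = self.kind as u8;
        out[17..21].copy_from_slice(&self.meta.to_le_bytes());
        out
    }

    /// Parses a record, returning None when the kind byte is unknown.
    pub fn decode(buf: &[u8; RECORD_LEN]) -> Option<Event> {
        let mut ts = [0u8; 8];
        let mut id = [0u8; 8];
        let mut meta = [0u8; 4];
        ts.copy_from_slice(&buf[0..8]);
        id.copy_from_slice(&buf[8..16]);
        meta.copy_from_slice(&buf[17..21]);
        let kind = match buf[16] {
            0 => EventKind::Spawn,
            1 => EventKind::Poll,
            2 => EventKind::Wake,
            3 => EventKind::Idle,
            4 => EventKind::Dropped,
            _ => return None,
        };
        Some(Event {
            timestamp_ns: u64::from_le_bytes(ts),
            task_id: u64::from_le_bytes(id),
            kind,
            meta: u32::from_le_bytes(meta),
        })
    }
}
```

Fixed-size records keep the write path simple: every event occupies the same number of bytes, so a slot in a preallocated buffer can be overwritten in place.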

A setup might look something like this:

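use dial9::FlightRecorder;
use tracing_subscriber::prelude::*;

// Illustrative only: the builder method names below sketch the idea and are
// not a verified dial9 API.
fn main() {
    let recorder = FlightRecorder::builder()
        .capacity(65_536) // ring buffer capacity
        .install_panic_hook(true) // dump the buffer when a panic fires
        .dump_path("/var/log/myservice/tokio-flight.bin")
        .build();

    // The recorder layer sits alongside the usual fmt layer.
    tracing_subscriber::registry()
        .with(recorder.layer())
        .with(tracing_subscriber::fmt::layer())
        .init();

    tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .expect("failed to build Tokio runtime")
        .block_on(async_main());
}

// Stand-in for the application's async entry point.
async fn async_main() {}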

The install_panic_hook option is the most immediately useful production feature: it registers a panic hook that calls recorder.dump() before the default handler runs, writing buffered events to the configured path. Combined with your existing process supervision and log aggregation, this means every panic automatically produces a runtime trace covering the preceding seconds of activity.
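
The hook chaining described here can be done with the standard library alone. The sketch below is not dial9’s implementation: install_dump_hook is a hypothetical helper, and the dump closure stands in for whatever flushes the buffer. It takes the currently installed hook, runs the dump first, then delegates to the previous hook so the usual panic message still appears.

```rust
use std::panic;

/// Installs a panic hook that runs `dump` and then the previously installed hook.
/// Hypothetical helper for illustration; `dump` stands in for a recorder flush.
pub fn install_dump_hook<F>(dump: F)
where
    F: Fn() + Send + Sync + 'static,
{
    // Take ownership of whatever hook is installed now (the default one,
    // unless something else replaced it earlier).
    let previous = panic::take_hook();

    panic::set_hook(Box::new(move |info| {
        // Flush diagnostics first, while the process is still alive.
        dump();
        // Then preserve the normal behaviour: message, location, backtrace.
        previous(info);
    }));
}
```

One design point to keep in mind: the dump closure runs on the panicking thread, so it should avoid anything that could itself panic or wait on a lock that thread may already hold.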

Because dial9 integrates as a standard tracing layer, it composes with everything else in the ecosystem. You can stack it alongside the fmt layer from tracing-subscriber, tracing-opentelemetry, or any other layer without conflict. It only consumes events bearing Tokio’s own internal targets, so it adds little overhead to your application’s own spans.

What it catches that other tools miss

The category of bugs dial9 targets is worth illustrating concretely. Consider a service processing work from a channel. Under sustained load, worker tasks start taking longer per item. Your application tracing shows slow processing times. What you cannot see from application-level spans alone is whether the task is spending that time inside your code or waiting to be scheduled because all Tokio worker threads are occupied.

With a flight recorder, the timeline is explicit: the task’s wake event occurred at T, but its next poll event did not occur until T+180ms. That is scheduler starvation, and it shows up clearly once you have the event sequence. Without the recorder, distinguishing starvation from genuine processing slowness requires either a reproducer or instrumenting every await point manually.
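
Given a dumped event log, finding starvation is a matter of pairing each wake with the next poll of the same task. The sketch below assumes a simplified, hypothetical event shape (Kind, Event, millisecond timestamps) rather than dial9’s real format, and reports every wake-to-poll gap at or above a threshold.

```rust
use std::collections::HashMap;

/// Hypothetical event kinds; only the two needed for this analysis.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Kind {
    Wake,
    PollStart,
}

/// Hypothetical log entry, assumed to be sorted by time within the log.
#[derive(Clone, Copy, Debug)]
pub struct Event {
    pub task_id: u64,
    pub kind: Kind,
    pub at_ms: u64,
}

/// Returns (task_id, delay_ms) for each wake -> next poll gap >= threshold_ms.
pub fn scheduling_delays(events: &[Event], threshold_ms: u64) -> Vec<(u64, u64)> {
    // Earliest unanswered wake per task.
    let mut pending_wake: HashMap<u64, u64> = HashMap::new();
    let mut delays = Vec::new();

    for ev in events {
        match ev.kind {
            Kind::Wake => {
                // Keep the first wake if several arrive before the next poll:
                // the task has been runnable since then.
                pending_wake.entry(ev.task_id).or_insert(ev.at_ms);
            }
            Kind::PollStart => {
                if let Some(woken_at) = pending_wake.remove(&ev.task_id) {
                    let delay = ev.at_ms.saturating_sub(woken_at);
                    if delay >= threshold_ms {
                        delays.push((ev.task_id, delay));
                    }
                }
            }
        }
    }
    delays
}
```

For a log in which task 7 is woken at 1000 ms and next polled at 1180 ms, calling scheduling_delays with a 100 ms threshold reports (7, 180), while a task polled 1 ms after its wake is left out.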

A second category is dropped wakers. When a future is cancelled, its waker may be dropped without ever firing. In certain patterns involving manual Future implementations or complex select! branches, this leaves a task permanently suspended: from the outside it looks like it simply stopped doing work, and no amount of application-level logging captures the event because the application code never ran. A flight recorder that captures waker drops alongside task events makes this pattern visible as a gap in the task’s lifecycle.
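
The failure mode itself is easy to reproduce with the standard library. In the sketch below (all names hypothetical, none taken from dial9), ForgetsWaker returns Pending without keeping the waker, while KeepsWaker stores a clone. Counting the references to the waker after one poll shows whether anything is left that could ever wake the task.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker backing that does nothing; we only care about its reference count.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Buggy future: reports Pending but never stores or fires the waker.
pub struct ForgetsWaker;

impl Future for ForgetsWaker {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        Poll::Pending
    }
}

/// Well-behaved future: keeps a waker clone so something can wake it later.
pub struct KeepsWaker {
    pub slot: Option<Waker>,
}

impl Future for KeepsWaker {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        self.slot = Some(cx.waker().clone());
        Poll::Pending
    }
}

/// Polls once and returns true if the future is Pending yet holds no waker clone,
/// meaning nothing can ever schedule it again.
pub fn orphaned_after_poll<F: Future + Unpin>(mut fut: F) -> bool {
    let state = Arc::new(NoopWake);
    let waker = Waker::from(Arc::clone(&state));
    let pending = {
        let mut cx = Context::from_waker(&waker);
        Pin::new(&mut fut).poll(&mut cx).is_pending()
    };
    // Drop the executor's own handle; any remaining count belongs to the future.
    drop(waker);
    pending && Arc::strong_count(&state) == 1
}
```

A runtime-level recorder sees the same thing from the other side: a poll that ends in Pending, followed by a waker drop and no later wake for that task.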

Poll-time distribution is a third area. Tokio’s RuntimeMetrics can give you a mean poll time per worker and, with the poll-time histogram enabled, a bucketed distribution across all tasks, but it cannot tell you which specific task polled for 400ms and blocked the thread. A per-task event log can.
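
A per-task log makes the slow poller easy to single out. As a minimal sketch, assuming hypothetical PollEdge and PollEvent records with millisecond timestamps (not dial9’s format), the function below pairs each poll start with its end and keeps the longest poll seen for each task.

```rust
use std::collections::BTreeMap;

/// Hypothetical markers for the start and end of a single poll.
#[derive(Clone, Copy, Debug)]
pub enum PollEdge {
    Start,
    End,
}

/// Hypothetical log entry, assumed to be in time order.
#[derive(Clone, Copy, Debug)]
pub struct PollEvent {
    pub task_id: u64,
    pub edge: PollEdge,
    pub at_ms: u64,
}

/// Maps each task ID to the duration of its longest completed poll, in ms.
pub fn longest_poll_per_task(events: &[PollEvent]) -> BTreeMap<u64, u64> {
    let mut open: BTreeMap<u64, u64> = BTreeMap::new(); // task -> poll start time
    let mut worst: BTreeMap<u64, u64> = BTreeMap::new(); // task -> longest poll

    for ev in events {
        match ev.edge {
            PollEdge::Start => {
                open.insert(ev.task_id, ev.at_ms);
            }
            PollEdge::End => {
                // Ignore an End with no matching Start (it may have been
                // overwritten in the ring buffer before the dump).
                if let Some(start) = open.remove(&ev.task_id) {
                    let took = ev.at_ms.saturating_sub(start);
                    let slot = worst.entry(ev.task_id).or_insert(0);
                    if took > *slot {
                        *slot = took;
                    }
                }
            }
        }
    }
    worst
}
```

Sorting the resulting map by value puts the task that held a worker thread the longest at the top, which is exactly the answer an aggregate histogram cannot give.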

Comparison to other approaches

async-backtrace captures task call trees on demand but requires instrumentation at every await point and is oriented toward understanding task structure rather than recording temporal sequences. tracing-forest produces beautifully structured output for nested span trees, which is useful for request tracing but operates entirely at the application layer.

perf and bpftrace can observe a Tokio process from outside the process boundary, but they require kernel tracing infrastructure, elevated privileges, and considerable effort to correlate kernel-level events with Tokio’s task model. dial9 sits inside the process, uses Tokio’s own event vocabulary, and requires no system-level access to deploy.

The closest prior art in the Rust ecosystem is probably tracing-appender combined with a rolling file writer, but that approach writes every event to disk continuously, which both changes the performance profile and produces large volumes of data you need to filter after the fact. A ring buffer with a bounded capacity and on-demand flush is a fundamentally different trade-off.

The production case

The argument for any always-on, bounded-overhead diagnostic tool is the same: the bugs hardest to diagnose are the ones that cannot be reproduced. Scheduler starvation, dropped wakers, and spurious task hangs tend to manifest under load, briefly, and not in ways that survive a restart. Having the recorder running costs a fixed amount of heap memory and a small per-event write cost. Recovering from a production incident without it costs substantially more in engineering time.

For services already on Tokio and already using tracing, adding dial9 is largely additive. It layers onto the existing subscriber stack, operates without external dependencies, and produces output only when you need it. dial9 fills a specific and long-standing gap in Tokio’s production story. tokio-console covers live inspection, RuntimeMetrics covers aggregate metrics, and tracing covers structured logs; dial9 adds a post-mortem record of the runtime events that none of those tools persist across a failure.
