
dial9: Filling the Production Observability Gap in Tokio

Source: lobsters

The Tokio project announced dial9 on March 18, describing it as a “flight recorder” for the Tokio async runtime. The name is a nod to Plan 9 from Bell Labs, where dial() is the primitive for connecting to network services. The metaphor runs deeper than the name: dial9 is about connecting to the state of your runtime after the fact, when the live view is already gone.

This is a gap that has been visible in the Tokio observability story for a while. The tooling that exists is good, but it is oriented toward development-time introspection rather than production incident diagnosis.

What Tokio Observability Looks Like Today

The Tokio ecosystem has several observability tools, each doing a different job.

tokio-console is the developer-facing debugger. It works by installing a console-subscriber layer on top of the tracing crate, which intercepts runtime events emitted by Tokio (task spawns, poll completions, waker activity) and streams them over a live gRPC connection to a terminal UI. You get a real-time view of running tasks, their poll durations, and waker counts. For development, it is genuinely useful. For production, it has two problems: the live gRPC connection requirement means you need to open a port to a running process, and the overhead when enabled is non-trivial enough that you would not want it always on. Most importantly, it shows you current state. By the time you connect to a misbehaving production service, the interesting state is often already gone.

tokio-metrics occupies a different position. It provides aggregate runtime metrics: poll counts, poll durations, queue depths. These are pull-based statistics, well-suited for feeding into a dashboard or alerting on SLO breaches. What they cannot give you is a timeline of events leading up to a problem. A metric that tells you average poll duration spiked for thirty seconds does not tell you which task caused it or what it was waiting on.

The tracing crate itself is foundational. Tokio emits instrumentation events when built with its tracing feature and the tokio_unstable cfg flag, and a subscriber can do whatever it wants with them. But building a ring-buffer recorder on top of raw tracing events is not a trivial undertaking, and the overhead of generic tracing infrastructure under high event rates can be significant.

So the situation before dial9 was: good development tooling, decent aggregate metrics for dashboards, and a conspicuous gap around retrospective production diagnosis.

Why Async Runtimes Make This Harder

The observability gap in async Rust is structurally different from the one you face with synchronous multithreaded code, and it is worth being precise about why.

In a conventional multithreaded application, each thread of execution maps to an OS thread. Profilers, tracers, and debuggers understand OS threads natively. perf, gdb, strace, and everything built on ptrace can give you stack traces per-thread, scheduling events, and system call timing, all without any application-level instrumentation.

In an async runtime like Tokio, tasks are multiplexed over a small thread pool. A task that is waiting on a future is not consuming CPU, but it is also not represented as a blocked OS thread that the kernel knows about. The concept of “blocked” means something different: the task’s future returned Poll::Pending, the waker was registered somewhere, and now the task sits in a scheduler data structure waiting for the waker to be called. If the waker is never called, the task is effectively leaked. If a single task polls for too long without yielding, it monopolizes a worker thread and starves every other task on that thread.
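To make the Poll::Pending and waker handshake concrete, here is a minimal sketch using only the standard library. The names WaitForFlag, Shared, CountingWaker, and complete are hypothetical and exist only for illustration; a real runtime such as Tokio does considerably more bookkeeping.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

/// State shared between the future and whatever is supposed to complete it.
#[derive(Default)]
pub struct Shared {
    pub ready: bool,
    pub waker: Option<Waker>,
}

/// A future that stays pending until someone else sets `ready`.
pub struct WaitForFlag(pub Arc<Mutex<Shared>>);

impl Future for WaitForFlag {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut shared = self.0.lock().unwrap();
        if shared.ready {
            Poll::Ready(())
        } else {
            // Register the waker. If nothing ever calls it, this task is
            // never polled again: the "leaked task" case.
            shared.waker = Some(cx.waker().clone());
            Poll::Pending
        }
    }
}

/// A stand-in for the scheduler's waker that only counts invocations.
#[derive(Default)]
pub struct CountingWaker(pub AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Satisfies the future's condition and notifies the registered waker.
pub fn complete(shared: &Arc<Mutex<Shared>>) {
    let waker = {
        let mut s = shared.lock().unwrap();
        s.ready = true;
        s.waker.take()
    };
    if let Some(w) = waker {
        w.wake();
    }
}
```

Polling WaitForFlag once returns Poll::Pending and stores the waker. Nothing else happens until complete is called. From the kernel's point of view there is no blocked thread at any point in that sequence, which is why OS-level tools cannot see the wait.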

These failure modes (task starvation, slow polls, waker leaks, and deadlocks between tasks) are among the most common causes of production incidents in Tokio services. They are also the ones that OS-level tooling is worst equipped to diagnose. When your service starts accumulating latency and you attach perf, you see a thread pool doing work, but the task-level structure is invisible.

The Flight Recorder Pattern

The concept dial9 borrows is well established elsewhere. Java Flight Recorder (JFR), which became open source in OpenJDK 11, is the clearest reference point. JFR maintains a ring buffer in the JVM process, writing events continuously: method compilations, GC phases, thread park and unpark, lock contention, memory allocation. The buffer is sized to hold some configurable window of recent history, perhaps a few minutes. When the buffer is full, new events overwrite the oldest ones. Overhead is typically cited as under one percent of CPU for most workloads. When something goes wrong, you trigger a dump, which flushes the ring buffer to a file that the jfr tool or JDK Mission Control can analyze. You get a detailed timeline of what happened before the incident.
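The core data structure is simple enough to sketch. The following is a single-threaded illustration of a fixed-capacity ring buffer that overwrites its oldest entries. FlightRecorder and its methods are hypothetical names, not JFR's or dial9's actual implementation.

```rust
/// A fixed-capacity ring buffer that keeps only the most recent entries.
pub struct FlightRecorder<T> {
    slots: Vec<Option<T>>,
    next: usize,  // index of the slot the next write goes to
    written: u64, // total number of events ever recorded
}

impl<T: Clone> FlightRecorder<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { slots: vec![None; capacity], next: 0, written: 0 }
    }

    /// Records an event, overwriting the oldest one once the buffer is full.
    pub fn record(&mut self, event: T) {
        self.slots[self.next] = Some(event);
        self.next = (self.next + 1) % self.slots.len();
        self.written += 1;
    }

    /// Returns the retained events, oldest first, as a dump would.
    pub fn dump(&self) -> Vec<T> {
        let cap = self.slots.len();
        (0..cap)
            .filter_map(|i| self.slots[(self.next + i) % cap].clone())
            .collect()
    }

    /// Number of events that were overwritten and are no longer available.
    pub fn dropped(&self) -> u64 {
        self.written.saturating_sub(self.slots.len() as u64)
    }
}
```

With a capacity of 3, recording the values 1 through 5 leaves 3, 4, and 5 in the buffer and reports two dropped events. Memory use is fixed up front, and the cost of recording does not depend on how long the process has been running.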

Go’s runtime/trace follows the same idea from a different angle. The net/http/pprof package exposes an HTTP endpoint that starts a trace capture, records goroutine scheduling events, GC phases, blocking calls, and heap events for a configurable duration, then returns a binary file analyzable with go tool trace. It is more of a manual capture than an always-on recorder, but the underlying event model covers exactly the runtime structure that matters for diagnosing goroutine starvation or scheduler imbalance.

The key property in both cases is retrospective capture. You do not need to know a problem is happening before the tool starts recording. It is already recording. When you discover there was a problem, you retrieve the history.

What dial9 Records

dial9 applies this pattern to Tokio’s task model. It maintains a lock-free ring buffer in the process, recording events with minimal per-event cost. The events it captures correspond directly to the failure modes that matter in async Rust:

  • Task spawn and drop, with task identifiers that persist across the recording
  • Poll start and poll end times, which let you reconstruct which tasks were polling and for how long
  • Waker creation, cloning, and invocation, so you can detect wakers that were created but never called
  • I/O registrations, correlating tasks with the file descriptors or network connections they are waiting on
  • Scheduler decisions, including which worker thread picked up a task and when
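One way to picture the event categories above is as a compact, fixed-size record type. This is an illustrative model only: the type names, field names, and field widths are assumptions, and dial9's actual event format may look quite different.

```rust
/// Hypothetical task identifier, stable for the lifetime of a recording.
pub type TaskId = u64;

/// Hypothetical event kinds mirroring the categories listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EventKind {
    TaskSpawn,
    TaskDrop,
    PollStart { worker: u16 }, // worker thread that picked the task up
    PollEnd { worker: u16 },
    WakerCreate,
    WakerClone,
    WakerWake,
    IoRegister { fd: i32 }, // file descriptor the task is waiting on
}

impl EventKind {
    /// True for the events that bracket a single poll of a task.
    pub fn is_poll(&self) -> bool {
        matches!(self, EventKind::PollStart { .. } | EventKind::PollEnd { .. })
    }
}

/// A fixed-size, `Copy` record, suitable for a pre-allocated buffer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Event {
    pub timestamp_nanos: u64,
    pub task: TaskId,
    pub kind: EventKind,
}
```

Keeping every event a small Copy value with no heap pointers is what allows a recorder to write into pre-allocated slots without allocating.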

The ring buffer is sized to hold a configurable amount of recent history. When a trigger fires, dial9 serializes the buffer to a file. Triggers include: a panic in any task, a poll exceeding a configurable threshold, a manual signal, or a programmatic call from application code. The output format is designed for offline analysis rather than live streaming, which is what makes always-on operation practical.
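A slow-poll trigger reduces to comparing a measured poll duration against a threshold. The sketch below shows that check in isolation. Trigger, TriggerConfig, and timed are hypothetical names for illustration and are not dial9's API.

```rust
use std::time::{Duration, Instant};

/// Hypothetical reasons for dumping the buffer, mirroring the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Trigger {
    TaskPanic,
    SlowPoll { observed: Duration },
    Signal,
    Manual,
}

/// Hypothetical configuration holding the slow-poll threshold.
pub struct TriggerConfig {
    pub slow_poll_threshold: Duration,
}

impl TriggerConfig {
    /// Returns a trigger if a completed poll ran longer than the threshold.
    pub fn check_poll(&self, poll_duration: Duration) -> Option<Trigger> {
        if poll_duration > self.slow_poll_threshold {
            Some(Trigger::SlowPoll { observed: poll_duration })
        } else {
            None
        }
    }
}

/// Runs `f` and reports how long it took, as a poll wrapper might.
pub fn timed<R>(f: impl FnOnce() -> R) -> (R, Duration) {
    let start = Instant::now();
    let result = f();
    (result, start.elapsed())
}
```

With a 10 ms threshold, a 3 ms poll produces no trigger and a 25 ms poll produces SlowPoll with the observed duration attached. The comparison runs once per poll, so it adds little to the recording path.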

What This Costs

The overhead question is the one that determines whether a tool like this is production-viable. tokio-console is not always-on in production precisely because the overhead of the tracing subscriber model at high event rates is too high.

dial9’s approach differs in two ways. First, the ring buffer write path is designed to be as cheap as possible: a sequence number increment, a timestamp read, and a bounded write into pre-allocated memory. There are no allocations in the hot path and no contention on a global lock. Second, not every event is recorded at full fidelity. Waker clones, which can be extremely frequent in some patterns, are counted rather than individually recorded unless a slow-poll trigger has fired. The tradeoff is that granularity degrades gracefully under load instead of the recorder imposing a cost that grows in proportion to event volume.
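The write path described above can be approximated with atomics from the standard library. This is a deliberately simplified sketch, assuming each event fits in one 64-bit word (an 8-bit kind plus a 56-bit microsecond timestamp). Recorder, pack, and unpack are hypothetical names, and a real recorder would need wider events and more careful handling of writers that lap one another.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

/// Packs an event kind and a timestamp into a single word.
pub fn pack(kind: u8, micros: u64) -> u64 {
    ((kind as u64) << 56) | (micros & 0x00FF_FFFF_FFFF_FFFF)
}

/// Inverse of `pack`.
pub fn unpack(word: u64) -> (u8, u64) {
    ((word >> 56) as u8, word & 0x00FF_FFFF_FFFF_FFFF)
}

pub struct Recorder {
    start: Instant,
    seq: AtomicU64,          // next sequence number to hand out
    slots: Vec<AtomicU64>,   // pre-allocated, one packed event per slot
    waker_clones: AtomicU64, // counted, not individually recorded
}

impl Recorder {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self {
            start: Instant::now(),
            seq: AtomicU64::new(0),
            slots: (0..capacity).map(|_| AtomicU64::new(0)).collect(),
            waker_clones: AtomicU64::new(0),
        }
    }

    /// Hot path: one fetch_add, one clock read, one store.
    /// No allocation and no lock. Returns the event's sequence number.
    pub fn record(&self, kind: u8) -> u64 {
        let seq = self.seq.fetch_add(1, Ordering::Relaxed);
        let micros = self.start.elapsed().as_micros() as u64;
        let idx = (seq % self.slots.len() as u64) as usize;
        self.slots[idx].store(pack(kind, micros), Ordering::Release);
        seq
    }

    /// High-frequency events are only counted.
    pub fn count_waker_clone(&self) {
        self.waker_clones.fetch_add(1, Ordering::Relaxed);
    }

    pub fn recorded(&self) -> u64 {
        self.seq.load(Ordering::Acquire)
    }

    pub fn waker_clones(&self) -> u64 {
        self.waker_clones.load(Ordering::Acquire)
    }

    /// Reads back one slot as (kind, micros); (0, 0) means never written.
    pub fn slot(&self, idx: usize) -> (u8, u64) {
        unpack(self.slots[idx].load(Ordering::Acquire))
    }
}
```

Multiple threads can call record concurrently through a shared reference. The only shared write they contend on is the sequence counter, and interpreting the packed words is deferred until a dump is read.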

The comparison to JFR is instructive here. JFR’s overhead is low enough that Oracle recommends running it continuously in production. The design decisions that make that possible are the same ones dial9 is borrowing: preallocated buffers, timestamped writes without serialization in the hot path, and deferred analysis at dump time rather than at record time.

The Gap This Fills

The mental model for where dial9 sits is straightforward. tokio-console is your development debugger. tokio-metrics feeds your dashboards and alerts. dial9 is what you turn to when an alert fires and you need to understand what led up to it.

The scenario dial9 is designed for is common enough that most Tokio developers have hit it: a service starts accumulating latency or drops requests, the on-call engineer gets paged, and by the time they can investigate, the service has either recovered on its own or been restarted. What was happening in the thirty seconds before the restart? tokio-console cannot tell you because you did not have it connected. The aggregate metrics tell you something was wrong but not what. The log files tell you which requests timed out but not why the tasks serving them were delayed.

With dial9 running, that thirty-second window is in the ring buffer. The dump gives you a task-level timeline: which tasks were polling, for how long, what they were waiting on, and whether their wakers were called. That is enough to distinguish between a slow database query blocking the task, a timer that never fired, and a task that was spawned and then starved because another task held a worker thread for too long.
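As a rough illustration of the offline analysis such a dump enables, the sketch below scans a timeline for the longest poll per task and for tasks that went pending with no wake recorded afterwards. The Record layout is a hypothetical stand-in for whatever dial9 actually emits, and PollEnd is assumed to mean the poll returned Pending.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    PollStart,
    PollEnd, // assumed here to mean the poll returned Pending
    Wake,
}

#[derive(Debug, Clone, Copy)]
pub struct Record {
    pub at_micros: u64,
    pub task: u64,
    pub kind: Kind,
}

/// Longest single poll per task, in microseconds.
pub fn longest_polls(dump: &[Record]) -> HashMap<u64, u64> {
    let mut started: HashMap<u64, u64> = HashMap::new();
    let mut longest: HashMap<u64, u64> = HashMap::new();
    for r in dump {
        match r.kind {
            Kind::PollStart => {
                started.insert(r.task, r.at_micros);
            }
            Kind::PollEnd => {
                if let Some(t0) = started.remove(&r.task) {
                    let d = r.at_micros.saturating_sub(t0);
                    let entry = longest.entry(r.task).or_insert(0);
                    *entry = (*entry).max(d);
                }
            }
            Kind::Wake => {}
        }
    }
    longest
}

/// Tasks whose most recent event is a PollEnd, with no later wake or poll.
pub fn never_woken(dump: &[Record]) -> Vec<u64> {
    let mut waiting: HashMap<u64, bool> = HashMap::new();
    for r in dump {
        let is_waiting = r.kind == Kind::PollEnd;
        waiting.insert(r.task, is_waiting);
    }
    let mut out: Vec<u64> = waiting
        .into_iter()
        .filter(|(_, w)| *w)
        .map(|(task, _)| task)
        .collect();
    out.sort_unstable();
    out
}
```

A task with a 50 ms poll shows up in longest_polls as a likely cause of starvation, while a task listed by never_woken points toward a timer or I/O event that never fired.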

Broader Context

dial9 is not the first attempt at runtime-level flight recording for an async Rust environment, but it is the first with Tokio team involvement, which matters for integration quality and long-term maintenance. The design is informed by how TaskHooks, added in Tokio 1.x, provide the low-level lifecycle callbacks that make efficient instrumentation possible without patching the runtime itself.

The comparison with Java and Go is worth sitting with. Both of those runtimes have had production-grade retrospective tracing built into the standard distribution for years. Go’s goroutine scheduler model is conceptually similar enough to Tokio’s task model that go tool trace’s output serves as a useful reference for what dial9’s analysis tooling needs to surface. JFR’s always-on design philosophy and its tight integration with the JVM’s allocation and GC events are the architectural template.

Rust, and async Rust in particular, is increasingly used in production infrastructure, but the observability tooling has lagged behind the maturity of the runtime itself. I build Discord bots that run on Tokio, and the debugging experience when something quietly stops responding has always involved guesswork: restart it, add more logging, hope the problem recurs. dial9 is a meaningful step toward closing that gap. For anyone running Tokio services in production and dealing with intermittent latency spikes or task starvation that disappears before you can investigate, the announcement post is worth reading carefully.
