C++'s Long Road to Structured Async: What Senders Get Right

The history of async programming in C++ is a series of good ideas that each solved part of the problem and handed the rest back to the programmer. Callbacks gave you performance but not safety. std::future gave you results but not laziness. Coroutines gave you readable code but not a standard execution model. Senders, arriving in P2300 and voted into the C++26 working draft at the Kona 2023 WG21 meeting, are the attempt to close what those prior approaches left open.

Eric Niebler’s February 2024 post “What are Senders Good For, Anyway?” is the clearest defense of the sender model against the reasonable objection that C++ already has coroutines. The argument is subtle and worth working through carefully, because the two mechanisms operate at different layers.

The Problem Each Previous Approach Left Open

Raw callbacks are fast and composable badly. When you pass a callback to an async operation, you lose structured lifetime: the callback might fire after the object that captured a reference is already destroyed. There is no standard cancellation, no error propagation, no compile-time knowledge of what the callback will receive. Callback hell is not a code style problem; it is a direct consequence of the abstraction’s structural limits.

std::future and std::promise, introduced in C++11, solved result-passing but created new problems. A future always involves a heap allocation for its shared state, and always involves synchronization, even on a single thread. You cannot combine two futures without blocking. .get() blocks the calling thread. There is no cancellation. The whole model is eager: work starts when you create the promise, not when you are ready to consume the result. std::async has implementation-defined behavior and is widely regarded as broken in practice.

Coroutines in C++20 are genuinely good. The co_await mechanism transforms sequential-looking code into efficient state machines. The compiler rewrites your coroutine into a struct with a resume() method, and suspension points become switch cases in that struct. But coroutines are a transformation mechanism, not a complete async model. The language defines how suspension and resumption work; the library defines everything else.

This matters more than it initially appears. Every async library that uses coroutines, Asio, cppcoro, folly, various liburing-based frameworks, invents its own Task<T>, its own awaitable types, its own concept of where execution resumes. A function returning folly::coro::Task<int> cannot be awaited in an Asio coroutine without adapter glue. The language gave everyone the same hammer, and everyone built incompatible nails.

What Senders Actually Are

The core insight in P2300 is the separation between describing async work and executing it. A sender is an object that represents work that has not started yet. It carries a description of what to do but does nothing until connected to a receiver.

The protocol has three pieces:

// connect() links a sender to a receiver, producing an operation state
auto op = connect(my_sender, my_receiver);
// start() begins execution
start(op);
// On completion, the operation state calls exactly one of:
// set_value(receiver, values...)  -- success
// set_error(receiver, error)      -- failure
// set_stopped(receiver)           -- cancellation

Nothing runs until start(). This laziness is what makes composition work. You can build an entire pipeline as a tree of sender objects, inspect its types at compile time, pass it around, and only when you are ready do you commit to running it:

auto pipeline =
    schedule(thread_pool)
  | then([] { return load_data(); })
  | transfer(io_scheduler)
  | then([](auto data) { return write_to_disk(data); });

// pipeline is a description. No threads were started.
// The completion types are known at compile time.
auto [result] = sync_wait(std::move(pipeline)).value();

Completion signatures are part of the sender’s type, encoded via get_completion_signatures(). Before any work runs, the compiler knows what types can emerge from a pipeline, enabling type-safe composition and static reasoning by algorithms like bulk. The reference implementation is stdexec, maintained by the P2300 authors. It is header-only, requires C++20, and is used in production at NVIDIA.

The Zero-Overhead Angle

One of the most practically significant properties of senders is that they are allocation-free by default. The operation state produced by connect() is a concrete type whose size is known at compile time. It lives on the stack or embedded in a parent object:

// Entirely stack-allocated:
auto op = connect(
    just(42) | then([](int i) { return i * i; }),
    my_receiver{}
);
start(op); // no malloc, no synchronization

Coroutine frames are heap-allocated. The HALO optimization can elide this in favorable conditions, but it is a best-effort optimization, not a guarantee. In embedded systems or real-time code where allocation is forbidden or latency is tightly bounded, this distinction is decisive. With -O2 and sufficient inlining, a chain of sender adaptors typically compiles down to code equivalent to writing the function calls inline.

What Senders Give You That Coroutines Cannot

Coroutines need something to co_await. The co_await expression calls operator co_await or await_transform on the coroutine’s promise type, returning an awaitable with await_ready(), await_suspend(), and await_resume(). Every async library defines what this machinery does in terms of its own executor model. They all solve the same problem differently because there is no standard vocabulary type.

Senders provide that vocabulary type. A sender can be co_await-ed in any coroutine framework that implements as_awaitable(). A coroutine can expose itself as a sender via as_sender(). The two mechanisms compose cleanly when senders form the base layer:

Task<Result> process(scheduler sched) {
    // when_all runs both concurrently; co_await awaits the sender
    auto [a, b] = co_await when_all(
        schedule(sched) | then(fetch_a),
        schedule(sched) | then(fetch_b)
    );
    co_return combine(a, b);
}

The coroutine provides readable sequential code. The sender machinery handles concurrency, scheduler context, and cancellation propagation. Neither mechanism alone could do this cleanly.

There are also things senders can express that coroutines cannot. bulk parallelism is the clearest example:

// Run N independent units of work in parallel
auto work = bulk(schedule(pool), 1000, [](std::size_t i, auto& ctx) {
    process_item(i, ctx);
});

With coroutines, you would manually spawn N coroutines and track them. bulk as a sender algorithm can be specialized per scheduler: a thread pool scheduler might use std::for_each internally; a GPU scheduler maps directly to a CUDA kernel launch. The abstraction carries meaningful semantics that manually spawned coroutines do not.

Structured Concurrency as the Core Safety Property

Structured concurrency is the property that async work has a strict parent-child lifetime relationship. Child work cannot outlive its parent scope. The term was coined by Nathaniel J. Smith and popularized by the Python trio library, but the idea is a direct parallel to structured programming’s elimination of arbitrary goto: you always know where execution can go and it always comes back.

With raw callbacks and fire-and-forget futures, async work can outlive the objects it references. A callback fires after a local variable is destroyed. A background task reads memory that has been freed. These bugs are timing-dependent and resist reproduction.

Senders enforce structure because you hold the operation state, and the operation state owns the running computation. when_all does not complete until all child senders complete, and if one fails, it cancels the others before returning the error. Cancellation propagates via stop tokens threaded through the environment, a context object that flows alongside the receiver, without manual parameter threading at every call site.

The GPU Case Makes Everything Click

The most compelling argument for senders as a distinct layer is heterogeneous computing, and it is not a coincidence that NVIDIA maintains stdexec.

Consider a pipeline that starts on the CPU, hands data to a GPU for parallel processing, then returns to the CPU for I/O:

cudaStream_t stream;
auto gpu_sched = cuda::stream_scheduler(stream);

auto pipeline =
    schedule(cpu_pool)
  | then([] { return load_from_disk(); })   // CPU thread
  | transfer(gpu_sched)                      // move to CUDA stream
  | bulk(N, [](int i, auto& buf) {          // CUDA kernel launch
        gpu_process(i, buf);
    })
  | transfer(cpu_pool)                       // back to CPU
  | then([](auto result) { return save(result); });

The scheduler is an explicit, composable part of the pipeline. transfer() expresses where work moves. bulk() on a GPU scheduler means a kernel launch. The types flow through statically. No heap allocation. No synchronization overhead beyond what the CUDA stream already requires.

Expressing this with coroutines requires library-specific CUDA integration in each coroutine framework. Expressing it with futures requires heap allocations at each stage and synchronization at each transfer point. Senders compose naturally because the scheduler is first-class in the model, not an implicit ambient context.

Where Things Stand

P2300R10 was voted into the C++26 working draft at Kona in November 2023. The standard library will include the core protocol, environment machinery (get_scheduler, get_stop_token), and a full algorithm set: just, schedule, then, upon_error, upon_stopped, let_value, let_error, transfer, when_all, bulk, split, ensure_started, on, and sync_wait. There will be any_sender_of<Ts...> for explicit type erasure, and coroutine interop via as_awaitable.

No major standard library ships std::execution yet. stdexec is the production reference until compiler vendors catch up, and it works on GCC, Clang, and MSVC with C++20. Asio has been adding P2300 compatibility and is expected to integrate more deeply as the standard solidifies.

The Networking TS executor model that competed with P2300 for years has effectively been superseded. After a decade of debate about executors, the committee made a decision. If you work in C++ and care about async correctness, cross-library scheduler interoperability, or performance without allocation, P2300 is the thing to understand before it ships.