C++ Senders: The Async Abstraction That Earns Its Complexity

C++ has had async since std::async landed in C++11. Fifteen years later, the committee is still arguing about the right model. That looks like dysfunction from the outside, but the debate has been productive: every round of proposals exposed a real flaw in the previous one. C++ Senders, codified in P2300 (std::execution), are the result of that process. Eric Niebler’s 2024 article makes the case for why they matter. The short answer is that they solve a specific cluster of problems that nothing else in the standard addresses together.

The Problem With Every Previous Approach

Callbacks are the original async primitive. They are fast, require no allocations, and compile to nothing you didn’t write yourself. They are also uncompilable into anything maintainable once you need more than two or three sequential async steps. There is no standard way to cancel, no standard way to propagate errors, and composition requires manual nesting.

std::future fixed composability at the cost of performance. Every .then() continuation requires a heap-allocated shared state, an atomic reference count, and a virtual dispatch. Benchmarks consistently show 10x to 100x more overhead per continuation compared to callbacks. The Concurrency TS tried to formalize .then() chaining but stalled for years over the executor model debate.

C++20 coroutines looked like the answer. They handle sequential async logic elegantly, support co_await on any awaitable type, and can avoid heap allocation via HALO (Heap Allocation eLision Optimization). But they have a structural gap: a coroutine invocation creates a frame and starts running immediately. You cannot pass a description of a coroutine call around as a value without executing it. Composition of concurrent operations requires library support. There is no built-in way to say “run this on that scheduler” or “cancel everything if any child fails.”

Senders are designed to fill that gap without giving up what coroutines do well.

What a Sender Is

A sender is a lazy description of async work. It does nothing until you explicitly connect it to a receiver and start it. The connect() call returns an operation state, which is a stack-allocated object holding all the state for the pending operation. start() begins execution. Eventually, the operation completes by calling exactly one of three functions on the receiver:

set_value(receiver, values...) on success
set_error(receiver, error) on failure
set_stopped(receiver) on cancellation

Three channels is the key design decision. Most async models have two: success and failure. Rust’s Future folds everything into Poll<Result<T, E>>. JavaScript Promise has .then and .catch. Cancellation in both is bolted on after the fact. In the sender model, cancellation is a first-class completion channel with its own type signature, propagated structurally through stop tokens that receivers carry automatically.

The practical difference: when you cancel a tree of concurrent senders, each branch receives a stop request through its receiver, can clean up cooperatively, and calls set_stopped. The parent cannot complete until all children finish. There is no race between “did the child finish or was it cancelled?” This is what structured concurrency means at the library level, without requiring language support.

The full completion signature of a sender is statically known at compile time:

// A sender that completes with int or fails with std::exception_ptr,
// and can be stopped
using MySender = some_sender<
    stdexec::completion_signatures<
        stdexec::set_value_t(int),
        stdexec::set_error_t(std::exception_ptr),
        stdexec::set_stopped_t()
    >
>;

The compiler knows every way the operation can complete. Algorithms that compose senders can compute the completion signatures of the composed result at compile time. This enables zero-cost dispatch and static verification that you have handled every completion channel.

Composition Without Overhead

The standard sender algorithms in P2300 compose via pipe syntax, the same way C++20 ranges work:

namespace ex = stdexec;

auto work =
    ex::schedule(pool.get_scheduler())
    | ex::then([]{ return read_file("data.txt"); })
    | ex::transfer(cpu_scheduler)
    | ex::then([](std::string data){ return process(data); });

auto [result] = stdexec::sync_wait(std::move(work)).value();

The type of work is a deeply nested template instantiation: something like then_sender<transfer_sender<then_sender<schedule_sender<pool_scheduler>, F>, cpu_scheduler>, G>. When you call sync_wait, it calls connect(), which inlines the entire chain. The resulting operation state is a single stack-allocated struct. The assembly is equivalent to hand-written callbacks.

Parallel fan-out uses when_all:

auto parallel = ex::when_all(
    ex::schedule(pool) | ex::then([]{ return compute_a(); }),
    ex::schedule(pool) | ex::then([]{ return compute_b(); }),
    ex::schedule(pool) | ex::then([]{ return compute_c(); })
) | ex::then([](auto a, auto b, auto c) { return a + b + c; });

If compute_b throws, when_all sends a stop request to the remaining children, waits for them to finish, and delivers the error to the parent. The programmer does not write this logic; it is part of the when_all contract.

let_value handles the case where the continuation itself returns a sender (equivalent to flatMap):

auto work = ex::schedule(sched)
    | ex::then([]{ return get_user_id(); })
    | ex::let_value([](int id){
        return fetch_user_from_db(id); // returns a sender
    });

bulk expresses parallel loops that can run on any backend:

auto work = ex::schedule(pool)
    | ex::bulk(1024, [&](int i){ data[i] *= 2; });

The same bulk call on NVIDIA’s nvexec scheduler compiles the loop body into a CUDA kernel and submits a CUDA graph. This is the most compelling practical use case: a single algorithmic description that runs on CPU thread pools or GPU streams depending on the scheduler you hand it, with identical performance to hand-written CUDA for the GPU path.

How Senders Relate to Coroutines

Senders and coroutines are not competing. P2300 includes as_awaitable(), which lets any sender be co_await-ed inside a coroutine, and as_sender(), which wraps a coroutine task as a sender. The intended combination is: write sequential logic as coroutines, express concurrent structure as senders.

A coroutine task written on top of stdexec inherits structured cancellation for free: the coroutine’s stop token flows from the receiver, and co_await on a child sender propagates stop requests down. This is the pattern that libraries like Asio’s experimental coroutine support and proposed designs for std::task target.

The cases where senders are preferable to bare coroutines are: anywhere you need a value-semantic description of work before execution begins, anywhere you need execution-context portability via schedulers, anywhere you need structured concurrency without writing the synchronization yourself, and anywhere you are targeting GPU or HPC backends that need to inspect the operation graph before starting it.

Comparison With Rust Futures

Rust’s async model is also lazy and zero-allocation. The differences are instructive.

Rust futures have one output type (Poll<Result<T, E>>). Cancellation is by dropping the future. There is no stopped channel separate from errors. When you drop a Rust future that is running in a task, the runtime drops the task, and destructors run. This works, but it means “the operation was cancelled” and “the operation failed” go through the same code path. Libraries like tokio_util::sync::CancellationToken address this ergonomically but it is not structural.

C++ senders know all completion channels at compile time, including the stopped channel. The compiler verifies that you handle all three. Cancellation and error handling are distinct branches with distinct types. For systems where clean shutdown matters, this static distinction is worth having.

Rust’s poll-based model also means that a future must be externally driven: a runtime calls .poll() repeatedly. C++ senders push completion to the receiver via callbacks, which means the sender controls when it calls back. This is a fundamental model difference: pull (Rust) vs. push (C++ senders). For GPU scheduling, the push model is essential; you cannot poll a CUDA kernel.

Current State

P2300 did not make C++26 at the Wrocław 2024 meeting. The reasons were reasonable: the environment and query system was still stabilizing, and the committee wanted a cleaner coroutine integration story before standardizing. C++29 or a standalone Technical Specification is the current target.

This does not affect practical use. stdexec, the reference implementation maintained by NVIDIA and Sandia National Laboratories, is production-quality today. NVIDIA ships it for GPU workloads. Meta uses libunifex, the precursor that heavily influenced P2300. Asio has sender interop in recent versions.

The main friction in daily use is compile-time cost and error messages. Deeply nested template types produce diagnostics that require practice to read. Concepts help, but a chain of ten sender adaptors still generates substantial error output when something goes wrong. This is a real cost, not an aesthetic complaint. The stdexec team has invested in reducing it, and it has improved substantially since the early drafts.

Why This Matters

Senders are not a silver bullet for all async code. For simple async I/O in a web server, a well-tuned coroutine-based library is probably easier to work with and close enough in performance. But for the specific problems that senders target, nothing else in the C++ ecosystem handles them together: zero-allocation composition, structured cancellation, execution-context portability, and static verification of all completion channels.

The GPU use case is the most concrete proof. The same bulk algorithm, the same when_all, the same transfer to switch between schedulers, all working through a CPU thread pool or a CUDA graph submission. That portability is only possible because senders describe work as a value before starting it. The scheduler sees the entire graph, can optimize it, and submits it in one shot.

For anyone working in systems programming, HPC, or anything touching GPU pipelines, senders are worth understanding now even without a standard. The API in stdexec is stable, the reference implementation is solid, and the mental model will transfer directly to whatever lands in C++29.