
Three Generations of C++ Async and the Structural Problems Senders Fix

Source: lobsters

C++ has had three serious attempts at async programming, and each one solved something while leaving structural problems unresolved. Understanding what the earlier generations left broken explains why P2300, the Senders/Receivers proposal targeting C++26, looks the way it does. Eric Niebler’s 2024 defense of senders focuses on why senders are needed alongside C++20 coroutines. The fuller case requires tracing further back, to futures, and further forward, to the stdexec reference implementation, to understand why this particular design settled where it did.

What Futures Got Wrong

C++11 added std::future and std::promise as the standard library’s first async primitives. The model was simple: a producer sets a value into a promise, and a consumer retrieves it from the future.

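std::promise<int> p;
std::future<int> f = p.get_future();
// Move the promise into the detached thread so it cannot be destroyed mid-call
std::thread([p = std::move(p)]() mutable { p.set_value(42); }).detach();
int result = f.get(); // blocks the calling thread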

For producer-consumer synchronization between threads, this works. For composable async pipelines, three structural problems made it impractical.

Shared state requires heap allocation. Every std::future carries a reference-counted shared state object on the heap. When the Concurrency TS proposed .then() continuations, each step in the chain required another heap allocation. In a system doing thousands of async operations per second, allocator pressure compounds quickly.

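Futures are eager. Calling std::async with the std::launch::async policy starts work immediately, and the only lazy alternative, std::launch::deferred, runs the work on whichever thread eventually calls get(). Either way, there is no way to describe a pipeline of operations and then choose where to run it. This makes generic algorithms over schedulers impossible: the decision about where the work runs has already been made before you have a chance to redirect it to a thread pool, a GPU stream, or an embedded RTOS scheduler.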
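A small standard-library sketch makes the point concrete. The two helper names below are made up for illustration. The first shows that a task launched with std::launch::async is observed running even though nobody ever asks for its result; the second shows that the deferred policy is lazy but offers no choice of execution context, since the task simply runs inside get() on the calling thread.

```cpp
#include <atomic>
#include <future>
#include <thread>

// Hypothetical helper: returns true once the task is observed running,
// even though get() and wait() are never called on the future.
inline bool async_runs_without_get() {
    std::atomic<bool> started{false};
    std::future<void> f = std::async(std::launch::async,
                                     [&started] { started.store(true); });
    // This loop terminates because the task is already live on another thread.
    while (!started.load()) std::this_thread::yield();
    return started.load();
    // f is destroyed before 'started' and waits for the task to finish.
}

// Hypothetical helper: returns whether a deferred task ran before get().
inline bool deferred_runs_before_get() {
    bool started = false;  // no atomic needed: a deferred task runs on the calling thread
    std::future<int> f = std::async(std::launch::deferred, [&started] {
        started = true;
        return 42;
    });
    const bool before = started;  // still false: nothing has executed yet
    f.get();                      // the task runs here, on this thread
    return before;
}
```

Neither variant leaves room for a third option such as "describe now, pick a scheduler later".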

Cancellation was not designed in. std::future has no mechanism for cooperative cancellation. Stopping an in-flight operation requires a shared flag and hand-written synchronization, and the standard library offered nothing for this until std::stop_token arrived in C++20. Even then, std::future has no way to report that an operation was stopped rather than completed or failed.
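Here is roughly what the hand-rolled version looks like, as a sketch using only the standard library. The function names and the shared atomic flag are illustrative choices, not an established pattern from any particular codebase.

```cpp
#include <atomic>
#include <future>
#include <memory>

// Hypothetical worker: spins until the caller flips a flag that both sides
// have to share by hand. The future itself knows nothing about the flag.
inline std::future<long> start_counting(
        std::shared_ptr<std::atomic<bool>> stop_requested) {
    return std::async(std::launch::async, [stop_requested] {
        long iterations = 0;
        while (!stop_requested->load(std::memory_order_acquire)) {
            ++iterations;
        }
        return iterations;  // an ordinary value, even though we were cancelled
    });
}

// Hypothetical driver: starts the worker, requests a stop, collects the result.
inline long count_until_cancelled() {
    auto stop = std::make_shared<std::atomic<bool>>(false);
    std::future<long> f = start_counting(stop);
    stop->store(true, std::memory_order_release);
    return f.get();  // "stopped early" and "finished" look identical here
}
```

The caller gets a number back and has no structural way to tell whether the operation ran to completion or was cut short; that information has to travel through some side channel the two sides agree on.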

The Concurrency TS’s .then() proposal sat in limbo for nearly a decade before being superseded by P2300. The issues above were not fixable by extending the future model; they were built into its foundations.

What Coroutines Fixed and Left Broken

C++20 coroutines address the readability problem with async code. Sequences of async operations that previously required chains of callbacks or .then() calls become readable as ordinary sequential code.

// With a coroutine-aware task type
task<int> compute() {
    int a = co_await async_read_a();
    int b = co_await async_read_b();
    co_return a + b;
}

This is the primary benefit of coroutines: sequential async logic looks synchronous while remaining non-blocking. But coroutines are a language transform, not a library design, and they leave several structural problems unresolved.

Coroutine frames are heap-allocated by default. Compilers can elide this allocation in some cases via HALO (Heap Allocation eLision Optimization), but the standard does not guarantee it. For embedded targets where dynamic allocation is forbidden, or for high-throughput server code where allocation pressure matters, this is a real cost. A promise type can supply its own operator new, but that changes where the frame is allocated rather than removing the allocation.

Coroutines also have no built-in cancellation model. C++20 added std::stop_token (from P0660), but coroutines have no standard way to integrate with it. Libraries like Lewis Baker’s cppcoro invented their own mechanisms, but nothing was standardized.

Finally, coroutines are not generic over their execution context. A coroutine co_awaiting a result typically resumes on whichever thread completes the awaitable, unless the awaitable or task type reschedules it, and there is no standard mechanism for specifying which scheduler to resume on. Writing a generic algorithm that runs correctly on a thread pool, a GPU, and an embedded event loop requires reimplementing scheduling logic for each target.

The Operation State: The Structural Insight

The central insight in P2300 is the separation between describing work and starting it, mediated by an object called the operation state.

When you connect a sender (the description of work) to a receiver (the continuation that handles the result), you get an operation state. Calling start on the operation state is when execution begins.

auto op = stdexec::connect(
    stdexec::just(42) | stdexec::then([](int x){ return x * 2; }),
    my_receiver{}
);
// Nothing has run. 'op' owns all state needed to execute.

stdexec::start(op);
// Execution begins. 'op' must stay alive until completion.

The operation state has a stable address requirement: once started, it must not be moved. This constraint enables something consequential: the operation state can live on the stack, or as a member of a parent operation state, with no heap allocation. The entire chain of then() calls can be laid out as a compile-time-defined state machine with statically known size.

This contrasts with futures, where the shared state is necessarily heap-allocated because it may outlive the stack frame that created it. It contrasts with coroutines, where the frame is heap-allocated because its lifetime extends across suspension points. Senders, by requiring the caller to own the operation state and keep it alive, make allocation optional rather than mandatory.
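The mechanics are easier to see in a stripped-down model. What follows is a toy written for this article's argument, assuming a single value channel and synchronous completion; every name in the toy namespace is hypothetical, and it is not how stdexec implements just, then, or connect. It does show the three properties that matter: connect runs nothing, the operation state is immovable, and the whole pipeline is one stack object whose size the compiler knows.

```cpp
#include <type_traits>
#include <utility>

namespace toy {

// Sender: a description of work that yields one stored value.
template <class T>
struct just_sender {
    T value;

    // Operation state: owns the value and the receiver; immovable by design.
    template <class Receiver>
    struct operation {
        T value;
        Receiver receiver;

        operation(T v, Receiver r)
            : value(std::move(v)), receiver(std::move(r)) {}
        operation(operation&&) = delete;  // stable address once created

        void start() { receiver.set_value(std::move(value)); }
    };

    template <class Receiver>
    operation<Receiver> connect(Receiver r) && {
        return operation<Receiver>(std::move(value), std::move(r));
    }
};

// Sender adaptor: wraps the downstream receiver so fn runs on the way through.
template <class Sender, class Fn>
struct then_sender {
    Sender upstream;
    Fn fn;

    template <class Receiver>
    struct then_receiver {
        Fn fn;
        Receiver downstream;

        template <class T>
        void set_value(T&& v) { downstream.set_value(fn(std::forward<T>(v))); }
    };

    template <class Receiver>
    auto connect(Receiver r) && {
        // Returned as a prvalue, so the immovable state is built in place.
        return std::move(upstream).connect(
            then_receiver<Receiver>{std::move(fn), std::move(r)});
    }
};

template <class T>
just_sender<T> just(T v) { return {std::move(v)}; }

template <class Sender, class Fn>
then_sender<Sender, Fn> then(Sender s, Fn f) { return {std::move(s), std::move(f)}; }

// Receiver: writes the final value through a pointer.
struct store_result {
    int* out;
    void set_value(int v) noexcept { *out = v; }
};

// Returns {value observed after connect, value observed after start}.
inline std::pair<int, int> run_pipeline() {
    int result = 0;
    auto op = then(just(21), [](int x) { return x * 2; })
                  .connect(store_result{&result});
    static_assert(!std::is_move_constructible_v<decltype(op)>,
                  "the operation state cannot be relocated");
    const int before_start = result;  // still 0: connect() ran nothing
    op.start();                       // the whole chain executes here
    return {before_start, result};
}

}  // namespace toy
```

Calling toy::run_pipeline() performs no dynamic allocation anywhere: op is a local variable that contains the value, the lambda, and the receiver inline.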

Rust’s Future trait independently arrived at a similar design. Constructing a Rust Future does not start execution; the future is a value that an executor polls. The future itself can be stored on the stack or heap as the executor decides. The convergence reflects the same underlying logic: async operations that run without allocation must be representable as fixed-size values with a clear ownership model. Two major systems languages reached that conclusion from different starting points.

Three Exit Channels

P2300 formalizes something that prior async models treated informally: every async operation has exactly three possible completion paths, and all three deserve first-class treatment.

// A receiver must handle all three channels:
stdexec::set_value(receiver, result...);    // success
stdexec::set_error(receiver, error);        // failure (any error type)
stdexec::set_stopped(receiver);             // cancellation

The distinction between set_error and set_stopped matters in practice. In a future-based or exception-based model, cancellation is usually represented as a special error value or a thrown exception. This forces error-handling code to distinguish between a genuine failure and a cooperative stop request using a convention rather than a structural guarantee. The three-channel model makes this distinction compile-time-enforced rather than runtime-conventional.

The three channels integrate with std::stop_token naturally. An operation that receives a stop request calls set_stopped on its receiver; the parent operation propagates this upward. Structured concurrency falls out of this design: a when_all algorithm starts multiple child operations and waits for all of them to complete through any of the three channels. When one child fails or is cancelled, the others are signaled to stop, and the parent does not complete until all children have called exactly one completion function. Child operations cannot outlive their parent, which eliminates a whole category of lifetime bugs that fire-and-forget async patterns produce.

Generic Over Schedulers

The motivation for scheduler genericity extends beyond CPU thread pools. NVIDIA’s stdexec includes a CUDA scheduler that maps stdexec::bulk directly to GPU kernel launches.

// The same bulk algorithm targets CPU threads or CUDA threads
// depending on which scheduler is passed
auto work = stdexec::bulk(
    stdexec::schedule(cuda_scheduler),
    1024 * 1024,
    [data](size_t idx) {
        data[idx] = process(data[idx]);
    }
);

The same algorithm source, compiled against nvexec::stream_context, produces a CUDA kernel. The same source, compiled against exec::static_thread_pool, produces CPU-parallel work. The scheduling policy is a template parameter rather than baked into the algorithm, and because the sender is lazy, the work description can be redirected before any execution begins. This is the practical payoff of the value-based design. A std::async future runs wherever its launch policy put it, and a coroutine typically resumes on whichever thread completed the awaitable, so neither model lets the caller redirect a computation to a different execution context after it has been described.
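For readers without a CUDA toolchain, the shape of the idea can be reproduced with the standard library alone. In this sketch the two scheduler types and their bulk member function are hypothetical stand-ins invented for illustration; they are far simpler than stdexec's schedulers and share only the property that the algorithm is written once and parameterized on where it runs.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Toy scheduler: runs every index on the calling thread.
struct inline_scheduler {
    template <class Fn>
    void bulk(std::size_t n, Fn fn) const {
        for (std::size_t i = 0; i < n; ++i) fn(i);
    }
};

// Toy scheduler: strides the index space across a fixed number of threads.
struct thread_scheduler {
    std::size_t workers;

    template <class Fn>
    void bulk(std::size_t n, Fn fn) const {
        const std::size_t count = workers ? workers : 1;
        std::vector<std::thread> threads;
        for (std::size_t w = 0; w < count; ++w) {
            threads.emplace_back([=] {
                // Each index is visited by exactly one worker.
                for (std::size_t i = w; i < n; i += count) fn(i);
            });
        }
        for (auto& t : threads) t.join();
    }
};

// The algorithm is written once; the execution context is a template parameter.
template <class Scheduler>
std::vector<int> square_all(const Scheduler& sched, std::vector<int> data) {
    int* p = data.data();
    sched.bulk(data.size(), [p](std::size_t i) { p[i] *= p[i]; });
    return data;
}

// Hypothetical check: both contexts produce the same result from the same source.
inline bool schedulers_agree() {
    const std::vector<int> input{1, 2, 3, 4, 5, 6, 7, 8};
    return square_all(inline_scheduler{}, input) ==
           square_all(thread_scheduler{4}, input);
}
```

square_all contains no threading code of its own, so swapping the scheduler argument is the only change needed to move the work.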

Where Things Stand

P2300 is targeting C++26 under std::execution. The stdexec reference implementation is available now, maintained by NVIDIA, and tracks the proposal closely. Meta’s libunifex, which predates stdexec and influenced P2300’s design, offers production-tested implementations with io_uring and epoll integration on Linux.

The core sender/receiver model and fundamental algorithms (schedule, then, transfer, when_all, bulk, sync_wait) have reached design stability across multiple standardization cycles, although some names have shifted along the way: later revisions of the proposal rename transfer to continues_on. Open questions remain around I/O integration and type-erased sender APIs for runtime-dynamic pipelines, but the structural foundation has been settled by years of iteration across multiple implementations and extensive committee review.

Niebler’s article frames senders and coroutines as complementary: coroutines for sequential logic, senders for concurrent composition. The three generations of history show why that division of labor took so long to arrive at. Coroutines solved how async code reads; senders address how it allocates, cancels, and composes across execution contexts. The two features cover different ground, and the combination is the first C++ async story that addresses all of the structural problems at once.
