Most async abstractions share a common assumption: work starts when you describe it. A std::future launches when created. A JavaScript Promise runs its executor immediately on construction. A Go goroutine starts the moment you write go f(). Coroutines begin executing on the first co_await. Every one of these models conflates the description of work with the initiation of it, and that conflation is where most async composition problems originate.
Eric Niebler’s 2024 article on senders builds the case for a different model. The answer to the titular question is not “senders are good for async” in the general sense. It is more specific: senders are good for writing generic async algorithms, the same way std::sort is good for sorting. To understand why that distinction matters, you have to look at what breaks when you try to write such algorithms with any other abstraction.
The Algorithm Problem
Consider writing a generic function that takes some async work, fans it out in parallel, and aggregates the results. With futures, you cannot write this without knowing whether the work is CPU-bound or I/O-bound, because futures do not carry scheduler information. With callbacks, you get the work done but composing error handling and cancellation requires threading those concerns manually through every layer. With coroutines, you get ergonomic syntax but no way to describe where the work runs before running it.
The sender model, proposed under P2300 for C++26, sidesteps this by making a sender a value that represents future work without starting it. A sender is connected to a receiver (a struct with set_value, set_error, and set_stopped methods), producing an operation state. Only when you call start() on the operation state does anything happen.
// Nothing runs here. snd is just a description.
auto snd = ex::schedule(thread_pool.get_scheduler())
| ex::then([] { return compute(); })
| ex::transfer(io_scheduler)
| ex::then([](int v) { return write(v); });
// Work begins here.
ex::sync_wait(std::move(snd));
The pipe syntax deliberately echoes std::ranges. A range pipeline is a lazy description of a sequence transformation; a sender pipeline is a lazy description of async work. Neither does anything until consumed. Both allow library-level algorithms to operate on the description before it runs.
Three Completion Channels
One of the less-obvious design decisions in P2300 is that every sender has three completion paths: value, error, and stopped. The stopped channel represents cancellation, and it is structural rather than bolted on. Every algorithm in the sender library propagates stop signals automatically.
This matters in practice. With std::future, cancellation is essentially impossible once work has started. With callbacks, cancellation requires manually threading a flag or a token through every async operation and checking it at each step. With P2300, cancellation propagates through the operation tree the same way error propagation does through exceptions in synchronous code: automatically, with no boilerplate.
exec::static_thread_pool pool(4);
auto [result] = ex::sync_wait(
ex::when_all(
work_a(pool.get_scheduler()),
work_b(pool.get_scheduler())
)
).value();
// If work_a fails, work_b is automatically stopped.
// If the sync_wait is cancelled, both are stopped.
The error channel is also more general than std::exception_ptr. A sender can propagate any error type, including std::error_code, a custom enum, or whatever a GPU kernel uses to signal failure. This matters for embedded and GPU contexts where exceptions are disabled.
Zero Cost in the Default Case
The performance story for senders is unusual because the zero-overhead guarantee holds at the type system level, not just as an implementation quality. Every sender in a pipeline has a unique static type that encodes the full chain. The compiler sees the entire graph and can inline through it completely.
std::future has a shared state object on the heap with reference counting and a mutex for synchronization. A round-trip on a std::future-based async operation typically costs on the order of 2000ns. A comparable sender chain in stdexec, NVIDIA’s reference implementation of P2300, runs in the tens of nanoseconds, comparable to a function call.
The key is that operation states can be stack-allocated. Because structured concurrency guarantees that child operations never outlive their parent, the compiler can analyze the entire operation tree at compile time and place all state on the stack. No heap allocation, no reference counting, no indirection.
If you need type erasure (to store senders in a container or behind an interface), ex::any_sender<> is the explicit opt-in. The overhead is there when you ask for it, not by default.
Coroutines Are Not a Substitute
Niebler’s argument on coroutines is worth examining carefully, because the relationship is complementary rather than competitive. C++20 coroutines are a syntax mechanism. They do not specify where work runs, how to compose multiple concurrent operations before starting any, or how cancellation propagates structurally. They are excellent for writing sequential async code with synchronous-looking control flow.
Senders handle the library half: describing work graphs, routing execution across contexts, composing concurrent operations, propagating cancellation. Coroutines handle the syntax half: writing the sequential steps inside those graphs readably. P2300 supports co_await-ing senders directly:
exec::task<int> process(exec::static_thread_pool& pool) {
// co_await a full sender pipeline
int v = co_await ex::schedule(pool.get_scheduler())
| ex::then([] { return step1(); });
co_return v * 2;
}
The exec::task<T> type in stdexec is itself a sender, which means coroutines written this way are composable as values. You can when_all two coroutine tasks, pass one to let_value, or build a larger sender graph that includes coroutine steps. The abstraction levels stack cleanly.
Heterogeneous Computing Is the Killer App
The argument for senders is clearest when you consider GPU computing. NVIDIA’s nvexec extends stdexec to CUDA GPU streams. A sender pipeline that runs on a CPU thread pool can be redirected to a GPU stream by swapping the scheduler:
// CPU version
auto cpu_work = ex::transfer_just(cpu_pool.get_scheduler(), data)
| ex::bulk(n, [](std::size_t i, auto& d) {
d[i] = transform(d[i]);
});
// GPU version: same algorithm, different scheduler
nvexec::stream_context gpu;
auto gpu_work = ex::transfer_just(gpu.get_scheduler(), data)
| ex::bulk(n, [](std::size_t i, auto& d) {
d[i] = transform(d[i]);
});
The bulk algorithm is generic over the scheduler. On a CPU thread pool, it distributes iterations across threads. On a GPU stream context, it generates a CUDA kernel. The same algorithm code, zero duplication. This is not achievable with futures (which are CPU-only and carry heap overhead incompatible with GPU memory models), callbacks (which require manual adaptation to each execution context), or coroutines (which have no notion of scheduler-generic bulk parallelism).
Where P2300 Stands
As of early 2026, P2300 has been targeting C++26 through multiple revision cycles (R10 was circulating through LWG in late 2024/2025). The committee’s concerns have been primarily about the paper’s scope (around 300 pages) and ensuring sufficient implementation experience before standardization. stdexec and libunifex (Meta’s predecessor prototype, also authored by Lewis Baker who co-wrote P2300) provide substantial implementation evidence. The companion paper P3149 proposes async_scope for structured concurrency scopes as a separable component.
The design is settled. The remaining question is scheduling within the C++ standards process.
The Underlying Idea
Senders are a bet that the right abstraction for async programming is not a communication primitive (channel, future, promise) but a composable work description. The difference is significant. Communication primitives are designed to be consumed; work descriptions are designed to be composed. You can pass a sender to a library function, and that function can wrap it in additional work, route it to a different context, or fan it out without knowing anything about what the sender does internally.
This is precisely what Niebler’s article is arguing: the question is not just “how do we do async” but “how do we write generic async libraries.” Senders are the answer C++ is converging on, and it is one of the more coherent async designs across any mainstream language. The comparison to ranges is apt in a deeper way than just the pipe syntax: ranges taught C++ programmers to think of algorithms as generic functions over lazy sequences. Senders are asking for the same shift of perspective applied to concurrent execution.