The Coroutine Protocol: What C++20's Promise Machinery Is Actually Doing
Source: isocpp
C++20 coroutines require you to implement several methods before anything runs. get_return_object, initial_suspend, final_suspend, return_void or return_value, unhandled_exception. If you’ve reached for coroutines and hit this wall, the temptation is to conclude that the design is needlessly complex. That conclusion misses what the design is actually doing.
The machinery is not accidental boilerplate. It is a deliberate exposure of a transformation protocol. Understanding each piece in terms of what question it answers makes the whole thing coherent.
What the Compiler Does With a Coroutine
A function becomes a coroutine when its body contains co_await, co_yield, or co_return. The compiler treats these as signals to rewrite the function as a state machine and move its local variables, control flow position, and a promise object into a heap-allocated frame. That frame persists across suspension points.
The frame contains: resume and destroy function pointers (the state machine’s dispatch table), all local variables whose lifetimes span at least one suspension point, copies of the coroutine’s parameters, and the promise object, embedded directly, not separately allocated.
The promise_type is the customization surface for this transformation. It does not represent a promise in the futures-and-promises sense, though the name suggests that lineage. It is closer to a policy class: every lifecycle event of the coroutine passes through it.
The Six Methods and What They Answer
get_return_object answers: what does the caller receive when they call this coroutine function? The frame exists before the coroutine body starts running. get_return_object is called once, immediately after frame creation, to produce the value returned to the caller. The typical implementation uses coroutine_handle<promise_type>::from_promise(*this) to create a handle from the promise object. Because the promise is embedded in the frame at a known offset, this conversion is plain pointer arithmetic, not a lookup or a separate allocation.
Task get_return_object() {
    return Task{std::coroutine_handle<promise_type>::from_promise(*this)};
}
initial_suspend answers: should the coroutine run immediately when called, or wait until explicitly resumed? Return std::suspend_always for lazy behavior, where the body does not start until the caller resumes the handle. Return std::suspend_never for eager behavior, where the body starts executing immediately in the caller’s stack frame. Generators almost always use suspend_always. Async tasks split: lazy tasks compose more safely; eager tasks have lower latency. This choice propagates to every user of the coroutine type, so it belongs in the promise, not at each call site.
final_suspend answers: after the coroutine body finishes, should the frame be kept alive? If you return std::suspend_never here, the frame is destroyed immediately when the coroutine finishes. If the caller is in the middle of reading a result out of the promise at that moment, they are reading from freed memory. The safe pattern is almost always std::suspend_always, which keeps the frame alive until someone explicitly calls handle.destroy(). Scheduling frameworks that use symmetric transfer depart from this pattern, as we will see shortly. One hard requirement either way: final_suspend must be noexcept, because an exception thrown at the final suspend point has no coroutine body left to propagate through.
return_void / return_value answer: what does co_return do? These store the result in the promise object so that awaiting code can retrieve it through the handle. You implement exactly one, depending on whether the coroutine produces a value.
unhandled_exception answers: what happens when an exception escapes the coroutine body? It must not throw. The standard pattern is to capture std::current_exception() into a std::exception_ptr stored in the promise and rethrow it when the caller retrieves the result. Calling std::terminate() is also valid if you need strict no-exception guarantees.
The Frame vs. The Promise
These two concepts overlap in implementation but serve different purposes.
The frame is the mechanical container: the state machine, the local variables, the function pointers for resume and destroy. It is what coroutine_handle<> points to.
The promise is the communication channel: values flow out of the coroutine through it via yield_value and return_value, exceptions flow through it via unhandled_exception, and context from the caller’s scheduler may flow into it via await_transform. Because the promise is embedded in the frame at a fixed offset, coroutine_handle<promise_type>::from_promise and handle.promise() are pointer arithmetic with no overhead.
This embedding is the key insight behind the zero-cost design. There is one allocation for both the frame and the communication channel. Compare this to a std::future<T>, which requires a separate shared state allocation, reference counts, and synchronization primitives.
How co_await Works Under the Hood
The transformation the compiler applies to co_await expr is more layered than the surface syntax suggests. First, if the promise defines await_transform, the expression passes through it. This is the hook that async frameworks use to attach executor context. Asio’s coroutine promise uses await_transform to ensure that after every co_await, the coroutine resumes on the correct executor without any explicit annotation at each individual await site.
After await_transform, the compiler looks for an awaiter, either via operator co_await() on the result or by treating the result as the awaiter directly. The awaiter provides three methods: await_ready() to skip suspension if the result is already available, await_suspend(handle) called when the coroutine is about to suspend and receives a handle so the awaiter can schedule resumption, and await_resume() called on resumption, whose return value becomes the result of the co_await expression.
The await_suspend signature has three valid forms. Returning void suspends unconditionally. Returning bool lets the awaiter veto the suspension: returning false resumes the coroutine immediately, while true suspends it. Returning coroutine_handle<> performs symmetric transfer: the current coroutine suspends, and the returned handle is resumed via a tail-call. This third form is what Lewis Baker’s cppcoro popularized and what Quasar Chunawala’s deep dive on isocpp.org demonstrates in practice. Without symmetric transfer, a chain of nested co_await calls grows the call stack one frame per layer. With it, the stack depth stays constant regardless of chain depth, because each handoff is a tail-call rather than a nested call. This is not an optimization hint; the standard requires it for the coroutine_handle<> return case.
struct FinalAwaiter {
    bool await_ready() noexcept { return false; }
    std::coroutine_handle<> await_suspend(
            std::coroutine_handle<promise_type> h) noexcept {
        // Resume whoever was waiting on this task, or do nothing.
        auto cont = h.promise().continuation;
        return cont ? cont : std::noop_coroutine();
    }
    void await_resume() noexcept {}
};
std::noop_coroutine() is a standard library function (declared in <coroutine>) that returns a handle which, when resumed, does nothing. It is the base case for symmetric transfer chains when there is no awaiting coroutine.
How This Compares to Python and Rust
Python’s async/await and Rust’s async fn make coroutines easier to write because both languages absorb the protocol into fixed interfaces. Python fixes the awaitable protocol (__await__ returning an iterator) and ships asyncio as the standard event loop. Rust’s Future trait is a single interface; the executor (tokio, async-std) provides the reactor and scheduler.
C++ made the opposite choice. The language defines the transformation rules and customization hooks but provides no scheduler, no executor, no reactor. The coroutine body runs in whatever thread calls handle.resume(), without any runtime involvement. This is both the source of the boilerplate burden and the source of the performance ceiling. A Rust future has a fixed poll interface; a C++ coroutine frame can be scheduled, pooled, and resumed in ways that Rust’s trait boundary does not permit.
Go’s goroutines are a different design entirely: stackful green threads scheduled M:N by the runtime onto OS threads. No async/await keywords, no colored functions, no user-visible protocol. The runtime absorbs everything, at the cost of goroutine-per-operation memory overhead (the initial stack is ~2KB, growable) and a significant mandatory runtime.
The tradeoff is real. C++ coroutines with a carefully written promise type and symmetric transfer can match the throughput of hand-written state machines. With HALO (Heap Allocation Elision Optimization), the compiler can allocate the frame on the caller’s stack when the coroutine’s lifetime is strictly bounded, eliminating the heap allocation entirely. Python async always carries interpreter overhead; Rust async has fixed overhead per poll. C++ coroutines can reach zero overhead per suspension in favorable conditions.
What the Ecosystem Provides
Writing a production-quality promise_type from scratch is hard. The result of the committee’s choice to ship minimal machinery is that the ecosystem developed its own abstractions: cppcoro (Lewis Baker, the reference implementation, now archived but widely influential), folly::coro (Meta, production-scale, deeply integrated with Folly’s executor model), Asio’s awaitable<T> (Christopher Kohlhoff, part of Boost.Asio and standalone Asio), and eventually std::generator<T> in C++23.
std::generator<T> deserves specific attention. It is the first standard coroutine library type, and it arrived in C++23 rather than C++20 because the committee deliberately kept the C++20 coroutine machinery free of standard library types. std::generator implements the full promise_type boilerplate behind a clean interface and is range-compatible, which means it composes directly with std::views. For synchronous lazy sequences, the boilerplate problem is solved.
For async work the story is more fragmented. P2300 (std::execution, the sender/receiver model) is targeted at C++26, but implementations are not yet widely shipped. Until then, the choice is Asio, Folly, or rolling your own.
The Takeaway
The boilerplate the C++20 coroutine protocol requires is a complete accounting of every decision a coroutine type must make: when to start, when to finish, what to produce, how to handle errors, and how to schedule resumption. Languages that hide this complexity are making those decisions for you, with fixed policies baked into the runtime. C++ exposes the decisions because different coroutine types need genuinely different answers. Whether that tradeoff was worth it depends on your use case, but understanding why each method exists makes the protocol legible rather than arbitrary.