C++ Coroutines: Why the Boilerplate Is the Design

C++20 coroutines carry a reputation for boilerplate, and the reputation is fair enough: getting a minimal coroutine working requires implementing several methods with no equivalent in Python’s generators, Rust’s async, or Go’s goroutines. Understanding why those methods exist converts the apparent ceremony into a legible design.

What the Compiler Actually Does

When the compiler encounters a function containing co_await, co_yield, or co_return, it transforms that function into a heap-allocated state machine. Three objects come into existence: the coroutine frame, which holds all local variables and the current suspension point; a promise_type object, which lives inside that frame; and a std::coroutine_handle, a pointer-sized, non-owning reference to the frame.

The transformed function looks roughly like this:

ReturnType my_coroutine(Args...) {
    // 1. Allocate the frame (operator new, unless the compiler elides it)
    // 2. Construct promise_type inside the frame
    Promise promise;
    // 3. Build the return object before running anything
    ReturnType result = promise.get_return_object();
    // 4. Possibly suspend before the body runs
    co_await promise.initial_suspend();
    try {
        // 5. Original body, with co_await/co_yield/co_return transformed inline
    } catch (...) {
        promise.unhandled_exception();
    }
    // 6. Possibly suspend after the body finishes
    co_await promise.final_suspend();
    // 7. Frame destroyed here, or when handle.destroy() is called
}

The result object is returned to the caller before the coroutine body runs at all, assuming initial_suspend() suspends. This is what makes lazy coroutines work: the caller receives a Task<T> or Generator<T> immediately, while the coroutine frame sits suspended, waiting to be resumed. The full transformation is specified precisely in the standard and documented on cppreference.

The promise_type in Full

A February 2026 deep-dive by Quasar Chunawala on isocpp.org describes promise_type as the second important piece of the coroutine mechanism. It is more than that: it is the coroutine’s entire behavioral policy. A minimal implementation needs five things:

struct promise_type {
    ReturnType get_return_object();              // What the caller receives
    std::suspend_always initial_suspend() noexcept; // Lazy or eager start
    std::suspend_always final_suspend()  noexcept;  // Keep frame alive or destroy it
    void return_value(T value);                  // What co_return deposits (return_void() if there is no value)
    void unhandled_exception();                  // Exception escaping the body
};

For generator-style coroutines that produce a sequence of values, a sixth method handles co_yield:

// co_yield expr; desugars to co_await promise.yield_value(expr)
std::suspend_always yield_value(T value) noexcept;

The optional await_transform method intercepts every co_await expression in the coroutine body. This is how executor libraries inject scheduling logic without touching user code:

auto await_transform(SomeAwaitable expr) {
    // Wrap expr in a scheduling-aware awaitable that resumes
    // on the correct executor thread
    return ScheduledAwaitable{executor_, std::move(expr)};
}

Lewis Baker’s Understanding the Promise Type remains the clearest explanation of how await_transform enables transparent scheduling and is still required reading for anyone building a coroutine library layer.

The Awaitable Protocol

Every co_await expression reduces to three calls on an awaiter object. The compiler generates something equivalent to:

auto&& awaiter = get_awaiter(expr);  // calls operator co_await if present
if (!awaiter.await_ready()) {
    awaiter.await_suspend(our_coroutine_handle);
    // --- SUSPENSION POINT ---
}
auto value = awaiter.await_resume();

await_ready() is the fast path: if the result is already available, the coroutine never suspends. This matters for I/O operations where data is buffered in the kernel, or for futures that resolved before the co_await was reached.

await_suspend() has three valid return types, each with distinct semantics:

void await_suspend(std::coroutine_handle<> h);
// Suspend h; execution returns to whoever called h's resume().

bool await_suspend(std::coroutine_handle<> h);
// Return true to suspend as in the void form; return false to resume h immediately without suspending.

std::coroutine_handle<> await_suspend(std::coroutine_handle<> h);
// Symmetric transfer: tail-resume a different coroutine instead of returning.

The third form is the one that matters most for production use. Without it, chaining awaitable tasks causes unbounded stack growth: every nested resume() call adds a frame to the thread's call stack, and a deeply nested chain of tasks will eventually overflow. With symmetric transfer, the handle returned from await_suspend() is resumed as a tail call, keeping stack depth constant regardless of how many tasks are chained. Baker’s post on symmetric transfer explains the problem and solution with full code, and it clarifies why every serious task<T> implementation, including cppcoro, relies on this form.

Why Open Customization Rather Than a Fixed Protocol

Rust’s async (stable since 1.39) uses a fixed Future trait with a single method, poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>. Python’s generators use a fixed __iter__/__next__ protocol, and async/await uses __await__. Both languages chose uniformity: one interface, one execution model.

C++ chose an open customization surface, and the reason is performance. A fixed protocol means all coroutine types must conform to the same interface, which can introduce type erasure, adapters, and allocations at composition boundaries. C++’s open promise_type means the compiler knows the exact policy statically. Frame layout is precise, suspension behavior can be inlined, and when the coroutine’s lifetime is nested within the caller’s, the compiler is permitted to elide the heap allocation entirely. This is an optimization rather than a guarantee, commonly known as HALO (Heap Allocation eLision Optimization).

The performance difference matters in practice. A coroutine context switch costs roughly 30 nanoseconds on current hardware, compared to 1 to 10 microseconds for a thread context switch. A coroutine frame is typically 100 to 500 bytes. A thread stack is 1 to 8 megabytes. A million concurrent coroutines is a practical design target at a few hundred megabytes of memory; a million threads would consume terabytes of virtual address space and exceed what any operating system scheduler can manage.
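The memory comparison is simple multiplication, spelled out below as compile-time arithmetic using only the per-item figures quoted above. total_bytes and the constant names are hypothetical, and the thread-stack sizes are interpreted as MiB, which is an assumption.

```cpp
#include <cstdint>

// Total bytes for `count` items of `bytes_each` bytes.
constexpr std::uint64_t total_bytes(std::uint64_t count, std::uint64_t bytes_each) {
    return count * bytes_each;
}

constexpr std::uint64_t kMillion = 1'000'000;
constexpr std::uint64_t kMiB = 1024 * 1024;

// One million coroutine frames at 100 to 500 bytes each.
constexpr std::uint64_t frames_low = total_bytes(kMillion, 100);
constexpr std::uint64_t frames_high = total_bytes(kMillion, 500);

// One million thread stacks at 1 to 8 MiB each (reserved address space).
constexpr std::uint64_t stacks_low = total_bytes(kMillion, 1 * kMiB);
constexpr std::uint64_t stacks_high = total_bytes(kMillion, 8 * kMiB);

static_assert(frames_low == 100'000'000);          // 100 MB
static_assert(frames_high == 500'000'000);         // 500 MB
static_assert(stacks_low == 1'048'576'000'000);    // roughly 1 TB
static_assert(stacks_high == 8'388'608'000'000);   // roughly 8.4 TB
```

Even comparing the largest frame estimate with the smallest stack estimate, the stacks reserve about two thousand times more address space.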

The trade-off is that almost nothing ships by default. C++20 itself gives you std::coroutine_handle, std::coroutine_traits, std::noop_coroutine, and the trivial awaitables std::suspend_always and std::suspend_never. The rest belongs to user code or a library.

The Ecosystem That Fills the Gap

Lewis Baker’s cppcoro was the canonical reference implementation, providing task<T>, generator<T>, synchronization primitives, and I/O abstractions built directly on the raw mechanism. The design decisions in cppcoro shaped how the standard library’s own types were later specified.

Meta’s folly::coro is the production-scale version, adding cooperative cancellation via CancellationToken and deep integration with folly’s EventBase. Boost.Asio, shipping C++20 coroutine support since Boost 1.75, exposes asio::awaitable<T> as a return type that connects coroutines to 15 years of battle-tested async I/O infrastructure. Meta’s libunifex prototypes the P2300 std::execution model targeting C++26, which will provide a standard executor and scheduler interface that all coroutine libraries can target.

C++23 and the Completed Picture

std::generator<T>, standardized in C++23, is the first coroutine type that ships in the standard library rather than in a third-party library. It is a synchronous, lazy generator that models an input range, so it composes directly with std::views:

#include <generator>
#include <ranges>   // std::views::take

std::generator<int> fibonacci() {
    for (int a = 0, b = 1; ; ) {
        co_yield a;
        auto next = a + b;
        a = b;
        b = next;
    }
}

auto first_ten = fibonacci() | std::views::take(10);

It also supports co_yield std::ranges::elements_of(range) for recursive generators, using symmetric transfer internally to avoid stack overflow. std::generator shows that the model works: a general-purpose, allocation-efficient, low-overhead generator is expressible within the promise_type mechanism without special compiler magic beyond what the C++20 spec already provides.

The roadmap is now legible in retrospect: the coroutine mechanism in C++20, the first standard type in C++23, a standard scheduler via std::execution in C++26. The complexity that makes an empty promise_type look intimidating is the same complexity that keeps coroutine frames small, allocation costs measurable, and executor strategies substitutable at any layer. When you understand what each method controls, the promise_type stops looking like boilerplate and starts looking like a configuration interface for a state machine the compiler builds on your behalf.