Quasar Chunawala’s deep dive into C++ coroutines is worth reading for anyone who has stared at a promise_type definition and wondered why it requires so much ceremony. The post does a solid job explaining the structure. What I want to do here is go one layer beneath that — into what the compiler physically generates, why promise_type is shaped the way it is, and what this design choice looks like against the coroutine models in Rust and Go.
The Coroutine Frame
C++20 coroutines are stackless. When the compiler encounters a function containing co_await, co_yield, or co_return, it does not give that function its own stack. Instead, it transforms it into a heap-allocated state machine struct — the coroutine frame.
At a simplified level, for a coroutine like this:
```cpp
Task<int> compute(int x) {
    auto a = co_await fetch_data(x);
    co_return a * 2;
}
```
The compiler generates something along the lines of:
```cpp
struct __compute_frame {
    void (*resume_fn)(__compute_frame*);
    void (*destroy_fn)(__compute_frame*);
    Task<int>::promise_type __promise;
    int __resume_index;
    int x;                   // parameter, copied into the frame
    SomeType a;              // local that lives across the suspension
    __awaiter_t __awaiter;   // awaiter at the co_await site
};
```
The first two members — resume_fn and destroy_fn — sit at a fixed offset from the start of the frame. std::coroutine_handle<> is a thin wrapper around a pointer to this layout; calling .resume() calls resume_fn, and .destroy() calls destroy_fn. The standard does not mandate this exact layout, but all major compilers (Clang, GCC, MSVC) converge on it.
The resume function dispatches on __resume_index to jump to the right suspension point:
```cpp
void __compute_resume(__compute_frame* f) {
    switch (f->__resume_index) {
    case 0: goto entry;
    case 1: goto after_fetch;
    }

entry:
    // set up awaiter for fetch_data(x)
    f->__resume_index = 1;
    // call await_suspend; if it suspends, return
    return;

after_fetch:
    f->a = f->__awaiter.await_resume();
    f->__promise.return_value(f->a * 2);
    // fall through to final_suspend
}
```
The key thing to understand is that there is no setjmp/longjmp, no fiber stack switch, no OS involvement. Suspension is just a return from the resume function, with the current state index saved in the frame. Resumption is a function call with a dispatch on that index.
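The same idea can be written by hand. The sketch below (all names hypothetical, and not the real ABI) imitates the transformation for a tiny countdown generator: suspending is just returning, and resuming is a dispatch on the saved state index.

```cpp
// Hand-written analogue of the compiler's transformation. The "frame" holds
// the state index plus everything that must survive a suspension; the
// "resume function" dispatches on that index to re-enter where it left off.
struct CountdownFrame {
    int state = 0;      // plays the role of __resume_index
    int n = 0;          // local that lives across suspensions
    int current = 0;    // last "yielded" value
    bool done = false;
};

void countdown_resume(CountdownFrame* f) {
    switch (f->state) {
    case 0:
        while (f->n > 0) {
            f->current = f->n--;   // the "co_yield"
            f->state = 1;
            return;                // suspension is just a return
    case 1:;                       // resumption jumps back into the loop
        }
        f->done = true;            // fell off the end: the "co_return"
    }
}
```

The switch-into-a-loop trick (Duff's device) is exactly what the compiler's dispatch achieves, just expressed in portable C++.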
promise_type as the Customization Point
For a coroutine returning ReturnType, the compiler looks up ReturnType::promise_type (or a specialization of std::coroutine_traits). This is the single place where you control every aspect of the coroutine’s behavior.
The required interface:
```cpp
struct promise_type {
    ReturnType get_return_object();   // build the value returned to the caller
    auto initial_suspend() noexcept;  // suspend on entry?
    auto final_suspend() noexcept;    // suspend on exit?
    void return_value(T v);           // or return_void()
    void unhandled_exception();       // called inside catch(...)
};
```
get_return_object() is called before the coroutine body runs. This is why you can write auto t = coroutine_function() and get a Task back immediately, even though the coroutine has not done any work yet. The Task object is handed to the caller via this method; the frame is already allocated and the promise already constructed inside it.
initial_suspend() returns an awaitable. std::suspend_always means the coroutine starts lazily — it does nothing until someone calls .resume() on the handle. std::suspend_never means it runs eagerly until it hits the first co_await in the body. Most task implementations use lazy start because it lets you set up continuations before any work begins.
final_suspend() must be noexcept. If it returns std::suspend_always, the frame stays alive at the final suspension point so the awaiting code can retrieve the result. If it returns std::suspend_never, the frame is destroyed immediately. The standard says final_suspend must not throw because if an exception escaped from there, there would be nowhere to propagate it — the coroutine body’s try/catch wrapper has already exited.
The compiler wraps the entire coroutine body in a try/catch:
```cpp
try {
    // coroutine body
} catch (...) {
    promise.unhandled_exception();
}
```
You typically implement unhandled_exception by calling std::current_exception() to capture the exception and rethrow it later when the awaiting coroutine calls co_await on your task.
The Awaiter Protocol
Every co_await expr desugars through a three-method protocol:
```cpp
struct SomeAwaiter {
    bool await_ready() noexcept;   // if true, skip suspension entirely
    /* void, bool, or std::coroutine_handle<> */
    auto await_suspend(std::coroutine_handle<> h);  // called just before suspending
    T await_resume();              // called on resumption; its result is the co_await value
};
```
await_suspend has three valid return types, and the choice matters for performance. If it returns void, the coroutine always suspends and someone must call .resume() externally. If it returns bool, it can decide at runtime whether to suspend — useful when an operation completes synchronously. If it returns std::coroutine_handle<>, that handle is resumed as a tail call, which is how you chain coroutines without growing the call stack.
Returning a handle from await_suspend is called symmetric transfer. Without it, writing co_await task would cause the awaiter to call resume() on the inner task, which would call resume() on its continuation, and so on — each .resume() call adding a frame to the real call stack. With symmetric transfer, the runtime takes the returned handle and resumes it directly, keeping the stack depth at O(1) regardless of how many coroutines are chained together. Lewis Baker's original blog series on cppcoro explains this mechanism in detail; it is one of the more consequential design choices in the whole coroutine specification.
The promise type can also intercept every co_await in the coroutine body via await_transform:
```cpp
struct promise_type {
    template<typename T>
    auto await_transform(MyAwaitable<T>& a) { return a.get_awaiter(); }

    // Prevent awaiting on anything else:
    template<typename T>
    auto await_transform(T&&) = delete;
};
```
This lets a framework restrict what a coroutine may await, inject cancellation checks, or redirect awaitables through a scheduler — all without the coroutine body knowing.
The Boilerplate Problem and What Libraries Do
C++20 ships the coroutine machinery without any usable coroutine return types; the standard library provides only the trivial awaitables std::suspend_always and std::suspend_never. Writing a usable task<T> from scratch runs to roughly 50-100 lines before you have anything that handles errors, stores results, and resumes the correct continuation.
cppcoro by Lewis Baker established the reference designs: task<T> for lazy async computations, generator<T> for synchronous pull sequences, async_generator<T> for generators that themselves use co_await, and io_service wrapping Windows IOCP and Linux io_uring. Though cppcoro is largely unmaintained now, most libraries that followed — libcoro, Meta’s folly::coro, Asio’s coroutine support — trace their promise_type designs back to those implementations.
C++23 adds exactly one coroutine type to the standard library: std::generator<T> (P2168). It supports recursive generators via co_yield std::ranges::elements_of(subrange), which yields each element from a subrange without the O(depth) overhead that a naive recursive generator would incur. For async I/O, you still need a library. The P2300 std::execution proposal targets C++26 and includes a standard executor model with coroutine integration via co_await-able senders.
Comparison with Rust and Go
Rust’s async model is mechanically closest to C++. Both compile coroutines to state machines, both are stackless, and neither includes a standard executor. The differences are illuminating.
In Rust, a Future<Output=T> is the compiled state machine itself; there is no separate heap allocation by default. The future lives inline wherever you store it. This means Rust avoids the heap allocation that C++ incurs per coroutine, though C++ compilers can apply Heap Allocation eLision Optimization (HALO) when the coroutine’s lifetime is demonstrably nested within the caller’s. Rust also requires Pin<&mut Self> for self-referential futures — a consequence of the inline storage; moving the future would invalidate internal pointers. C++ sidesteps this because the heap-allocated frame has a stable address.
Go uses stackful goroutines. Each goroutine starts with a small growable stack (around 2-8 KB) and can call blocking functions directly; the Go runtime parks the goroutine and schedules another. A C++ coroutine cannot call a blocking read() without blocking the OS thread — you must use async APIs throughout. The tradeoff shows up in memory: a goroutine occupies at minimum a few kilobytes, while a C++ coroutine frame for a simple async function might be 100-200 bytes. At a million concurrent instances, that gap is significant.
The C++ design optimizes for flexibility and minimal overhead when not suspended. You pay for heap allocation (unless HALO fires), and you take on the responsibility of providing a scheduler. In exchange, the promise_type mechanism gives you precise control over every aspect of the coroutine’s lifecycle, and the language adds no runtime you did not ask for.
Writing a Minimal Generator
For something concrete, here is a synchronous generator using the raw coroutine primitives — no library required:
```cpp
#include <coroutine>
#include <exception>   // std::terminate
#include <iostream>    // for the usage example
#include <optional>

template<typename T>
struct generator {
    struct promise_type {
        std::optional<T> current_value;
        generator get_return_object() {
            return generator{std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() noexcept { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        std::suspend_always yield_value(T v) {
            current_value = std::move(v);
            return {};
        }
        void return_void() {}
        void unhandled_exception() { std::terminate(); }
    };

    struct iterator {
        std::coroutine_handle<promise_type> h;
        bool operator!=(std::default_sentinel_t) const { return !h.done(); }
        iterator& operator++() { h.resume(); return *this; }
        T operator*() const { return *h.promise().current_value; }
    };

    iterator begin() { handle_.resume(); return {handle_}; }
    std::default_sentinel_t end() { return {}; }
    ~generator() { if (handle_) handle_.destroy(); }

    std::coroutine_handle<promise_type> handle_;
};

generator<int> range(int n) {
    for (int i = 0; i < n; ++i)
        co_yield i;
}

// Usage:
for (int i : range(5))
    std::cout << i << '\n';
```
The co_yield i desugars to co_await promise.yield_value(i), which stores the value and returns std::suspend_always. Control goes back to the range-for loop, which calls ++it (resuming the coroutine) to get the next value. The frame persists across each suspension; the local i lives in the coroutine frame because it crosses a suspension point.
C++23’s std::generator<T> does this properly — with allocator support, recursive generator composition, and correct handling of move-only types. But the pattern above shows everything the mechanism requires.
The Design Philosophy
C++20 coroutines are a language-level tool, not a framework. The committee deliberately separated the transformation mechanism from any runtime support. That choice frustrates people who want something like Python’s asyncio or Kotlin’s structured concurrency out of the box. It makes sense if you consider the breadth of environments C++ targets: embedded systems with no heap, game engines with custom allocators, server software running on Linux io_uring, Windows applications using IOCP. A single bundled scheduler would serve none of them well.
The promise_type mechanism is a customization point in the classic C++ sense: you specify a type, the compiler calls into it at well-defined points, and you own the behavior. The boilerplate cost is real, and the lack of a standard async I/O layer remains a gap. But the fundamental machinery — state machine transformation, symmetric transfer, the awaiter protocol — is sound, and the ecosystem of libraries building on top of it has matured considerably since C++20 shipped.