What OS Threads Cost and Why C++20 Coroutines Don't Pay It

The C++20 standard added coroutines without removing threads, and that was not an oversight. The two mechanisms solve different problems, operate at different levels of the stack, and carry different costs. Understanding when to reach for each requires looking at what both models do at the OS and hardware level, which is the ground Conor Spilsbury covered in his CppCon 2025 talk. The talk is a good starting point; the implementation details deserve a closer look.

Threads: The OS Owns the Schedule

std::thread, introduced in C++11, wraps a kernel thread. When you call std::thread t(fn), the OS reserves a dedicated stack (8 MB of virtual address space by default on most Linux distributions), registers the thread with its scheduler, and places it in the run queue. From that point, the OS decides when your thread runs. Your code has no direct say in when it gets CPU time.

The classic producer/consumer pattern illustrates how this model handles waiting:

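#include <condition_variable>
#include <mutex>
#include <queue>

void process(int value);  // application-specific work, assumed to be defined elsewhere

std::mutex mtx;
std::condition_variable cv;
std::queue<int> work_queue;
bool done = false;  // guarded by mtx

void producer() {
    for (int i = 0; i < 1000; ++i) {
        {
            std::unique_lock lock(mtx);
            work_queue.push(i);
        }
        cv.notify_one();  // notify after unlocking so the consumer can take the lock at once
    }
    { std::unique_lock lock(mtx); done = true; }
    cv.notify_all();
}

void consumer() {
    while (true) {
        std::unique_lock lock(mtx);
        // Sleeps until there is work or the producer has finished.
        cv.wait(lock, [] { return !work_queue.empty() || done; });
        if (work_queue.empty()) break;  // only reachable once done is true
        int val = work_queue.front();
        work_queue.pop();
        lock.unlock();  // do the work outside the critical section
        process(val);
    }
}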

When the consumer calls cv.wait() and the predicate is false, it makes a system call (a futex wait on Linux) that asks the kernel to block the thread until it is notified. The kernel takes the thread off the run queue, saves its CPU state (general-purpose registers, stack pointer, instruction pointer, and the register that points at its thread-local storage), and schedules something else. When the producer calls cv.notify_one(), the kernel marks the consumer as runnable again, but the consumer does not run until the scheduler gives it a slot.

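That round-trip through the kernel is the cost. A context switch on Linux involves saving and restoring the thread’s register state (on x86-64, the 16 general-purpose registers, the instruction pointer and flags, plus floating-point and SIMD state that can run to a few kilobytes with AVX-512), updating kernel data structures, and refilling caches that other threads have evicted in the meantime. Switching between threads of different processes also changes the address space, which can cost TLB entries, and a thread that resumes on a different CPU core starts with cold caches. Measured end-to-end, a context switch typically costs somewhere between 1 and 10 microseconds. That range is harmless when threads do useful work for milliseconds at a time. It becomes a problem when you have ten thousand network connections, each parked on its own thread in cv.wait() or recv(): every message that arrives costs a wake-up and a context switch, and the scheduler spends its time shuffling threads that each run for only a few microseconds.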
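One rough way to get a feel for this cost is to bounce a turn flag between two threads through a mutex and condition variable and time the round trips. The sketch below is illustrative only: measure_ping_pong and its parameter are hypothetical names, round_trips is assumed to be greater than zero, and the figure it returns includes wake-up latency and lock traffic as well as the context switch itself, so it will vary with the machine and its load.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical helper: hands a turn flag back and forth between two threads
// and returns the average wall-clock time per round trip, in nanoseconds.
// Assumes round_trips > 0.
double measure_ping_pong(int round_trips) {
    std::mutex mtx;
    std::condition_variable cv;
    bool main_turn = true;  // true: main thread's turn, false: partner's turn

    std::thread partner([&] {
        for (int i = 0; i < round_trips; ++i) {
            std::unique_lock lock(mtx);
            cv.wait(lock, [&] { return !main_turn; });  // block until handed the turn
            main_turn = true;                           // hand it straight back
            cv.notify_one();
        }
    });

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < round_trips; ++i) {
        std::unique_lock lock(mtx);
        main_turn = false;                              // give the turn to the partner
        cv.notify_one();
        cv.wait(lock, [&] { return main_turn; });       // block until it comes back
    }
    auto stop = std::chrono::steady_clock::now();
    partner.join();

    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / round_trips;
}
```

Calling measure_ping_pong(100000) and dividing the result by two gives a loose upper estimate of one blocking hand-off between threads on the machine at hand. On Linux the program needs to be linked with thread support (for example with -pthread).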

The stack memory is a separate constraint. Eight megabytes per thread means a server sustaining ten thousand threads reserves roughly eighty gigabytes of virtual address space for stacks alone. Most of that is never backed by physical memory, because the kernel only commits the pages a thread actually touches, but the reservations still add page-table and memory-mapping overhead, and together with per-thread kernel state they put a practical ceiling on concurrency that scales poorly with connection count.
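The arithmetic behind that figure is short enough to check at compile time. This is a minimal sketch using the numbers from the text (an 8 MiB default stack and ten thousand threads); the constant and function names are made up for illustration.

```cpp
#include <cstdint>

// Figures taken from the text; both are assumptions about a typical setup.
constexpr std::uint64_t kStackBytes  = 8ull * 1024 * 1024;  // 8 MiB default stack
constexpr std::uint64_t kThreadCount = 10'000;

// Address space reserved for thread stacks alone, in bytes.
constexpr std::uint64_t reserved_stack_bytes(std::uint64_t threads,
                                             std::uint64_t stack_bytes) {
    return threads * stack_bytes;
}

// 10,000 * 8 MiB = 83,886,080,000 bytes: about 78 GiB, or roughly 84 GB.
static_assert(reserved_stack_bytes(kThreadCount, kStackBytes) == 83'886'080'000ull);
```

The result is address space, not resident memory; how much of it becomes resident depends on how deep each thread’s stack actually grows.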

Coroutines: The Programmer Owns the Schedule

C++20 coroutines approach the same waiting problem from a different direction. A coroutine is a function that can suspend execution at a co_await point without blocking the OS thread it runs on. Suspension is a function return: the coroutine saves its local state in a frame on the heap and hands control back to its caller, with no kernel involvement.

The introductory example that appears in most documentation is a lazy integer generator:

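#include <coroutine>
#include <exception>  // std::terminate
#include <utility>    // std::exchange

struct IntGenerator {
    struct promise_type {
        int current_value = 0;

        IntGenerator get_return_object() {
            return IntGenerator{
                std::coroutine_handle<promise_type>::from_promise(*this)
            };
        }
        // Lazy start: the body does not run until the first next().
        std::suspend_always initial_suspend() { return {}; }
        // Stay suspended at the end so done() can be queried before destroy().
        std::suspend_always final_suspend() noexcept { return {}; }
        std::suspend_always yield_value(int v) {
            current_value = v;
            return {};
        }
        void return_void() {}
        void unhandled_exception() { std::terminate(); }
    };

    std::coroutine_handle<promise_type> handle;

    explicit IntGenerator(std::coroutine_handle<promise_type> h) : handle(h) {}
    ~IntGenerator() { if (handle) handle.destroy(); }
    IntGenerator(const IntGenerator&) = delete;
    IntGenerator(IntGenerator&& other) noexcept
        : handle(std::exchange(other.handle, {})) {}

    // Resuming a coroutine that is already at its final suspend point is
    // undefined behavior, so check done() before calling resume().
    bool next() {
        if (!handle || handle.done()) return false;
        handle.resume();
        return !handle.done();
    }
    int value() const { return handle.promise().current_value; }
};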

IntGenerator range(int from, int to) {
    for (int i = from; i < to; ++i)
        co_yield i;
}

When the compiler encounters co_yield i, it transforms range() into a state machine. Parameters and local variables that must survive a suspension live in a coroutine frame, normally allocated on the heap, instead of on the stack. The co_yield point becomes a state index: the frame records where to resume, stores the yielded value in the promise, and returns control to whoever called handle.resume(). There is no syscall and no kernel data structure update. The cost is one heap allocation when the coroutine is created, typically in the 20 to 50 nanosecond range on modern hardware, and one indirect function call per resume.
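To make the transformation concrete, here is a hand-written approximation of the state machine for range(). It is a sketch of the idea, not what any particular compiler emits: RangeFrame, its State enum, and collect are hypothetical names, and the frame here is an ordinary object rather than a heap allocation managed through a coroutine handle.

```cpp
#include <vector>

// Hand-written stand-in for the frame a compiler might build for
// range(from, to). All names are hypothetical.
struct RangeFrame {
    enum class State { Start, AfterYield, Done };

    int from, to;           // parameters copied into the frame
    int i = 0;              // loop variable that must survive suspension
    int current_value = 0;  // plays the role of promise().current_value
    State state = State::Start;

    RangeFrame(int from_, int to_) : from(from_), to(to_) {}

    // Plays the role of handle.resume(): run until the next yield or the end.
    // Returns true if a new value was produced.
    bool resume() {
        switch (state) {
        case State::Start:
            i = from;  // loop initializer
            break;
        case State::AfterYield:
            ++i;       // continue the loop just after co_yield
            break;
        case State::Done:
            return false;
        }
        if (i < to) {
            current_value = i;          // the effect of co_yield i
            state = State::AfterYield;  // remember where to resume
            return true;
        }
        state = State::Done;            // fell off the end of the body
        return false;
    }
};

// Drives the frame to completion and gathers the yielded values.
std::vector<int> collect(int from, int to) {
    RangeFrame frame(from, to);
    std::vector<int> out;
    while (frame.resume()) out.push_back(frame.current_value);
    return out;
}
```

Calling collect(0, 3) produces 0, 1, 2. Each resume() is an ordinary function call that reads and writes a few fields, which is the whole of the per-resume cost described above.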

The Heap Allocation eLision Optimization (HALO) can eliminate that allocation entirely when the compiler can prove the coroutine’s lifetime is bounded within the caller’s scope, but this requires conditions that do not always hold in practice, particularly when coroutines cross translation unit boundaries.

The Awaitable Protocol

The boilerplate in the generator example exposes the machinery that makes coroutines composable. Every coroutine return type must have an associated promise_type, either nested in the return type or supplied through a std::coroutine_traits specialization. The compiler-generated code calls promise.get_return_object() when the coroutine is first called, awaits promise.initial_suspend() before the body runs, calls promise.yield_value() at each co_yield and promise.return_value() or promise.return_void() at co_return, and awaits promise.final_suspend() after the body finishes, just before the frame is destroyed.

More important is the awaitable protocol. When the compiler sees co_await expr, it obtains an awaiter from expr and makes up to three calls on it:

expr.await_ready(): if this returns true, skip suspension entirely and proceed without saving state.
expr.await_suspend(handle): called only if await_ready() returned false, once the coroutine’s state has been saved. The argument is the handle of the coroutine being suspended.
expr.await_resume(): called when the coroutine is resumed, or immediately if it never suspended; its return value becomes the result of the co_await expression.

await_suspend() is where scheduling logic lives. It receives a std::coroutine_handle<> that it can store anywhere: post it to a thread pool, register it with an I/O completion port, hand it to a timer queue, or queue it on an executor. When the awaited event completes, whoever holds the handle calls handle.resume(), and the coroutine continues from exactly where it left off. The language primitives provide suspension and resumption; library code decides when and on which thread resumption happens.

This stands in contrast to Go’s goroutines, which use a work-stealing scheduler built into the runtime. Goroutines have their own dynamically-growing stacks, and the runtime manages a pool of OS threads to multiplex goroutines onto them. Since Go 1.14, goroutines support asynchronous preemption at safe points, which prevents runaway goroutines from starving the scheduler. The tradeoff is that the scheduler is a fixed design; you cannot replace it or integrate your own execution policy.

Rust’s async/await is structurally similar to C++ coroutines: stackless, compiled to a state machine, no built-in scheduler. The mechanics differ in one respect: a Rust future is polled by its executor, while a C++ coroutine is resumed through its handle. Rust does not ship an executor in its standard library; production code uses Tokio or async-std. C++ is in a similar position. The coroutine language primitives landed in C++20, and std::execution (P2300), the sender/receiver framework for asynchronous execution, was voted into the C++26 working draft in 2024, but at the time of CppCon 2025 it was not yet part of a published standard. Until standard library implementations ship it, most production C++ async code runs on top of Asio, folly::coro, or libcoro.

The Combined Pattern

The two models are most productive together. The common production architecture runs a fixed-size thread pool, typically one OS thread per CPU core, with each thread executing an event loop. Coroutines multiplex on top of each thread: when a coroutine suspends at a co_await point, the thread picks up the next runnable coroutine rather than blocking:

Thread 0: [coro A runs] -> [coro A awaits network recv] -> [coro B runs] -> [coro C runs] -> ...
Thread 1: [coro D runs] -> [coro D awaits disk read]   -> [coro E runs] -> [coro F runs] -> ...

OS threads provide true parallelism across CPU cores. Coroutines provide lightweight concurrency within each thread, handling thousands of concurrent I/O operations without a corresponding number of blocked OS threads. The resource footprint for a coroutine is the frame on the heap, on the order of a few hundred bytes for typical async I/O code, versus up to 8 MB of reserved stack address space per OS thread. This is the architecture that makes high-concurrency network servers viable in C++ without moving to Go or to Java’s virtual threads.

Asio has provided this model for years under its strand and executor abstractions. The Networking TS attempted to standardize a subset of it, and P2300 (std::execution) is the broader effort to give the standard library a foundation that covers this pattern.

The Colored Function Problem

Neither model is ergonomically free. Coroutine functions are typed differently from regular functions: a function returning Task<T> is a different animal from one returning T, and you cannot transparently substitute one for the other. Calling a blocking function from within a coroutine stalls the entire OS thread, along with every other coroutine scheduled on it, defeating the point. Bob Nystrom’s essay "What Color Is Your Function?" describes the structural issue: when suspension is explicit and typed, code that calls coroutines must be aware it is doing so, and that awareness propagates upward through the call stack.

Go and Java 21 virtual threads hide this by making all code look synchronous from the developer’s perspective; the runtime handles the multiplexing transparently. C++ and Rust trade that ergonomic smoothness for control: you know exactly where suspension can occur, you control the executor, and the runtime imposes no overhead you did not explicitly request. For embedded systems, real-time applications, or code with strict latency budgets, that control matters more than the ergonomic convenience.

When to Use Each

Threads are the right tool when work is CPU-bound and you need true parallelism, when you are calling blocking APIs without async alternatives (certain legacy C libraries, syscalls that lack io_uring or IOCP equivalents), or when total concurrency is small and bounded. A thread per CPU core doing compute-heavy work has no coroutine equivalent.

Coroutines are the right tool when work is I/O-bound and most time is spent waiting, when you need to sustain thousands or millions of concurrent operations with minimal memory overhead, when you want lazy generator-style evaluation, or when you need precise control over scheduling policy.

The CppCon 2025 talk is worth watching for the hardware-level walkthrough of how thread scheduling, context switching, and synchronization primitives actually behave. The practical summary is that threads are the unit of OS-level preemptive scheduling and coroutines are the unit of cooperative task management within a thread. They are tools at different levels of the stack, and most production systems eventually reach for both.