Preemptive vs Cooperative: The Hardware-Level Case for C++ Threads and Coroutines

A recent CppCon 2025 talk by Conor Spilsbury frames C++ threads and coroutines as tools that solve different problems. The framing is correct. The question is what “different” means at the level where the distinction matters: the operating system, the compiler, and the hardware.

How the OS Handles a Thread

std::thread, introduced in C++11, is a thin wrapper around the platform thread primitive: pthread_create on POSIX systems, CreateThread on Windows. On Linux with glibc, constructing one reserves a stack for the new thread (8MB of virtual address space by default, with physical pages committed lazily as they are touched) and registers the thread with the kernel scheduler, which then multiplexes it onto available CPU cores via preemptive time-slicing, with slices typically in the range of 1 to 10 milliseconds.

Context switching is the recurring cost. Suspending a thread requires saving all general-purpose registers, the program counter, the stack pointer, and any floating-point or SIMD state to the thread’s kernel control block, then restoring the next thread’s saved state. On modern Linux, a same-process context switch costs roughly 1 to 5 microseconds. For a server with 200 active threads, this is background noise. At 50,000 concurrent connections, the stack footprint alone reaches 400GB of virtual address space, and scheduler overhead starts appearing in profiles.

Synchronization compounds the picture. std::mutex on Linux is implemented as a futex: the uncontended path is a single atomic compare-and-swap in userspace and costs almost nothing. The contended path invokes the futex() syscall, causing a kernel entry and a thread park. std::condition_variable::wait() atomically releases the mutex and parks the thread. Each of these operations is cheap in isolation; they accumulate in systems with many waiting threads.
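The wait-and-notify pattern described above can be sketched as follows. Mailbox and producer_consumer_demo are hypothetical names chosen for illustration; only std::mutex, std::condition_variable, and std::thread come from the standard library.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical one-shot mailbox: one thread publishes a value, another waits for it.
struct Mailbox {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
    int value = 0;

    void put(int v) {
        {
            std::lock_guard<std::mutex> lock(m);  // uncontended lock: userspace atomic path
            value = v;
            ready = true;
        }
        cv.notify_one();  // wakes a parked waiter, if there is one
    }

    int take() {
        std::unique_lock<std::mutex> lock(m);
        // wait() releases the mutex and parks the thread until notified;
        // the predicate guards against spurious wakeups.
        cv.wait(lock, [this] { return ready; });
        assert(ready);
        return value;
    }
};

int producer_consumer_demo() {
    Mailbox box;
    std::thread producer([&box] { box.put(42); });
    int got = box.take();  // blocks the calling OS thread until put() has run
    producer.join();
    return got;
}
```

The waiting side occupies a whole OS thread while parked, which is the per-waiter cost the paragraph refers to.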

For CPU-bound work across independent cores, OS threads remain the right tool. Four threads running independent matrix multiplications on four cores do not pay for each other’s overhead. The kernel handles preemption, load balancing, and NUMA topology. Threads give genuine parallelism with minimal extra machinery.
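As a rough sketch of that kind of independent CPU-bound work, the following splits a summation across a fixed number of std::thread workers. The names parallel_sum and sum_first_n are made up for this example, and a real workload would be something heavier than addition.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Sums `data` by giving each worker thread its own contiguous chunk.
std::uint64_t parallel_sum(const std::vector<std::uint32_t>& data, unsigned workers) {
    if (workers == 0) workers = 1;
    std::vector<std::uint64_t> partial(workers, 0);
    std::vector<std::thread> threads;
    threads.reserve(workers);

    const std::size_t chunk = (data.size() + workers - 1) / workers;  // ceiling division
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(data.size(), w * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        threads.emplace_back([&data, &partial, w, begin, end] {
            // Accumulate locally, then publish with a single store to this worker's slot.
            const auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
            partial[w] = std::accumulate(first, last, std::uint64_t{0});
        });
    }
    for (auto& t : threads) t.join();
    assert(partial.size() == workers);
    return std::accumulate(partial.begin(), partial.end(), std::uint64_t{0});
}

// Convenience wrapper: sums 1..n using `workers` threads.
std::uint64_t sum_first_n(std::uint32_t n, unsigned workers) {
    std::vector<std::uint32_t> data(n);
    std::iota(data.begin(), data.end(), std::uint32_t{1});
    return parallel_sum(data, workers);
}
```

No locks are needed because each worker writes only its own slot and the main thread reads the slots after join(). Passing std::thread::hardware_concurrency() as the worker count is a common starting point, though the best value depends on the workload.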

What the Compiler Builds for a Coroutine

C++20 coroutines take a fundamentally different path. Any function containing co_await, co_yield, or co_return is transformed at compile time into a state machine whose state lives in a coroutine frame, which is heap-allocated by default. On Clang, this happens through @llvm.coro.* intrinsics and the CoroSplit pass, which splits the coroutine function into resume, destroy, and cleanup subfunctions.

The coroutine frame holds resume and destroy function pointers at a fixed offset, a promise_type instance, a resume index recording which suspension point was last reached, copies of parameters, and only the local variables that survive across suspension points. A simple async function frame is typically 100 to 200 bytes. Frames with only a few scalar locals can be as small as 24 bytes.

Suspension is a return. When execution reaches co_await expr, the generated code calls the awaiter’s await_ready(). If it returns false, the resume index is saved, await_suspend() is called with the coroutine’s handle, and the function returns to its caller or resumer. No setjmp, no fiber stack switch, no kernel call. Resumption is an indirect function call through the frame’s resume pointer. The round trip costs approximately 10 to 50 nanoseconds, roughly 100 times cheaper than an OS thread context switch.

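Heap allocation can be elided entirely via HALO (Heap Allocation eLision Optimization) when the compiler can prove the frame’s lifetime is bounded by the caller’s scope and the coroutine body is visible at the call site. When HALO fires, the frame is placed inside the caller’s own stack frame or enclosing coroutine frame, and the indirect call through the frame’s resume pointer can often be devirtualized and inlined. HALO is an optimization, not a guarantee, so code that must never allocate cannot rely on it.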

One structural detail worth understanding is symmetric transfer, proposed by Gor Nishanov in P0913R0 and explained in depth by Lewis Baker. When await_suspend returns a std::coroutine_handle<> rather than void, the returned handle is resumed via a tail call instead of the coroutine returning to its resumer first. Without this, a chain of N coroutines that resume one another builds O(N) real stack frames. With it, the stack depth stays constant regardless of how many coroutines are linked. Lewis Baker’s blog remains the most precise secondary documentation for these coroutine internals.

The Scheduling Difference

The OS-level mechanics make the tradeoff concrete. Threads are preemptive: the kernel interrupts them at timer intervals regardless of what they are executing. Coroutines are cooperative: they yield only at explicit co_await or co_yield sites. A coroutine that calls a blocking system call blocks the OS thread it runs on, stalling every other coroutine scheduled on that thread.

Cooperative yielding is a constraint by design. It eliminates involuntary context switches and lets the scheduler live entirely in userspace. The consequence is that async APIs are required throughout. Calling read() inside a coroutine that should remain non-blocking defeats the model; the correct call is to an async read primitive backed by io_uring or epoll.

For I/O-bound concurrency, this tradeoff pays off. A server handling 50,000 keep-alive connections on OS threads needs 50,000 stacks. With coroutines and 200-byte frames, the same connections need 10MB. The coroutines run on a small thread pool; the scheduler is an event loop that wakes coroutines when I/O completes.
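The arithmetic behind those two figures, using decimal units and the assumed sizes from the text (an 8MB default stack reservation and a 200-byte frame), can be checked at compile time. The constant and function names are made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Assumed figures from the text; real values depend on platform and workload.
constexpr std::uint64_t kStackBytes = 8ull * 1000 * 1000;  // 8 MB reserved per thread
constexpr std::uint64_t kFrameBytes = 200;                 // typical small async frame

constexpr std::uint64_t thread_footprint(std::uint64_t connections) {
    return connections * kStackBytes;
}

constexpr std::uint64_t coroutine_footprint(std::uint64_t connections) {
    return connections * kFrameBytes;
}

// 50,000 thread stacks: 400 GB of reserved virtual address space.
static_assert(thread_footprint(50'000) == 400'000'000'000ull);
// 50,000 coroutine frames: 10 MB of heap.
static_assert(coroutine_footprint(50'000) == 10'000'000ull);
// The ratio between the two is 40,000 to 1.
static_assert(thread_footprint(50'000) / coroutine_footprint(50'000) == 40'000);
```

The thread figure is reserved virtual address space rather than resident memory, since stack pages are committed only when touched, so the practical gap is smaller than the raw ratio suggests. The coroutine figure also excludes per-connection buffers and socket state, which both models need.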

Composition in Practice

Threads and coroutines serve complementary roles, and production systems routinely combine both. A video processing service might use coroutines to handle async job ingestion and response tracking while delegating compute-bound transcoding to a thread pool. The coroutine submits work and co_awaits a future; the thread pool executes in parallel. The concurrency model handles coordination; the parallelism model handles throughput.

Boost.Asio, which has had solid C++20 coroutine support since Boost 1.75, makes this concrete:

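#include <boost/asio.hpp>

namespace asio = boost::asio;
using asio::ip::tcp;

// Echo loop for one connection. Each co_await suspends this coroutine, not the thread.
asio::awaitable<void> handle_connection(tcp::socket socket) {
    char buf[1024];
    for (;;) {
        // Yields the byte count; throws on error or EOF, which ends the coroutine.
        std::size_t n = co_await socket.async_read_some(
            asio::buffer(buf), asio::use_awaitable);
        co_await asio::async_write(
            socket, asio::buffer(buf, n), asio::use_awaitable);
    }
}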

Each connection is a coroutine with a small heap frame. The Asio executor provides the thread pool. OS thread count is determined by hardware concurrency, not connection count.

cppcoro, Lewis Baker’s reference implementation, defined the vocabulary types that Asio and Meta’s folly::coro drew from: task<T> for lazy async computations, generator<T> for synchronous pull sequences, async_mutex, when_all. It introduced the symmetric transfer pattern before standardization. It is largely unmaintained now, but its designs are what C++ library authors mean when they reference coroutine idioms.

C++23 added std::generator<T> (P2502), the first coroutine type in the standard library. It models std::ranges::input_range, composes with <ranges> algorithms, and handles recursive generators with co_yield std::ranges::elements_of(subrange) without O(depth) stack cost.

The executor and scheduler gap remains unfilled in shipping standard libraries. P2300 std::execution, adopted into the C++26 working draft in 2024, defines senders, receivers, and scheduler concepts that let code express “run on this thread pool” or “complete when this I/O finishes” without tying to a specific library. Until implementations are widely available, choosing Asio, cppcoro conventions, or folly::coro means accepting library-specific scheduler coupling.

How Go and Rust Chose Differently

Go goroutines are stackful, starting at 2KB in current Go releases (some earlier releases used 8KB) and growing dynamically. The built-in M:N runtime scheduler multiplexes goroutines onto OS threads with work stealing, and parks goroutines transparently when they block on I/O. A goroutine context switch costs around 100 to 300 nanoseconds, between a C++ coroutine resumption and an OS thread context switch. The benefit is no async function coloring: you write blocking-style code and the runtime handles scheduling. The cost is 2KB minimum per goroutine versus roughly 200 bytes for a C++ coroutine frame, which matters at one million concurrent instances.

Rust’s async/await is structurally closest to C++20. Both are stackless, and both are lowered by the compiler into state machines, although the transformation happens at different stages: rustc performs it on its own intermediate representation before LLVM sees the code, while in C++ it depends on the compiler (Clang does it in LLVM passes; GCC and MSVC have their own implementations). The models diverge in scheduling direction: Rust uses a pull-based model where an executor polls futures and futures register a Waker for re-polling. C++ coroutines are push-based: the awaiter receives the coroutine handle and is responsible for scheduling resumption. Rust also requires Pin<&mut Self> for self-referential futures because futures can be moved in memory; C++ sidesteps this because the heap-allocated frame has a stable address. More significantly, Rust’s borrow checker catches dangling references in async code at compile time.

In C++, a coroutine taking const std::string& copies the reference into the heap frame. If the referent goes out of scope before resumption, the result is silent undefined behavior. Clang-Tidy’s cppcoreguidelines-avoid-reference-coroutine-parameters checker flags this pattern, but it is opt-in static analysis, not a language guarantee.

What the Standard Did Not Settle

Two structural limitations of C++20 coroutines cannot be fixed at the library level. The dangling reference problem described above requires lifetime analysis that C++ does not have. The second is that coroutines are syntactically indistinguishable from regular functions at their declaration sites: Task<int> compute(int n) could be a coroutine or a regular function, and only the body tells you which. This was a deliberate committee choice, documented in P0973R0, on the theory that the return type carries sufficient information. The [[nodiscard]] convention on coroutine return types catches silently dropped coroutines in practice, but it is a library pattern rather than a language guarantee.

Neither issue is on the C++26 roadmap. The language gave programmers a powerful and efficient primitive for cooperative scheduling; it left the safety and ergonomics work to compilers, libraries, and future standards revisions.