Two Concurrency Models, One Language: Why C++ Kept Both Threads and Coroutines

The CppCon 2025 talk by Conor Spilsbury starts from the right premise: threads and coroutines are not competing for the same job. The reason C++ ended up with both has less to do with committee politics and more to do with the fact that two genuinely distinct scheduling models are needed in practice, often within the same program.

What a Thread Actually Is

A std::thread in C++11 is a thin wrapper around an OS thread. On Linux it is created through pthread_create, which issues a clone() syscall; on Windows, the call ends up in CreateThread. The OS reserves a stack (8MB of virtual address space by default on Linux), registers the thread with the kernel scheduler, and adds it to the run queue.

From that point, the OS is in charge. The kernel preempts threads using timer interrupts, typically every 1-10ms. When a thread’s time slice expires or it blocks on I/O or a mutex, the kernel performs a context switch: saves all CPU registers (general-purpose, floating-point, SIMD), updates the thread control block, selects the next runnable thread from the scheduler’s queues, and restores its register state. On a modern Linux kernel, a same-process thread context switch costs roughly 1-5 microseconds. That sounds negligible, but for a server handling tens of thousands of concurrent connections with a thread per connection, scheduling overhead becomes a measurable fraction of CPU time. Stack memory is the harder constraint: 100,000 threads at 8MB each demands 800GB of virtual address space. The OS overcommits, but resident memory and TLB pressure are real.
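The address-space figure is plain arithmetic, and a compile-time check makes the units explicit. This is only a sketch: the helper name stack_reservation and the constants are made up for illustration, and the 8 MiB stack size is the default quoted above, not something the code queries from the system.

```cpp
#include <cstdint>

constexpr std::uint64_t kMiB = 1024ULL * 1024ULL;
constexpr std::uint64_t kGiB = 1024ULL * kMiB;

// Virtual address space reserved for thread stacks, in bytes.
// (Hypothetical helper; it only multiplies the two figures from the text.)
constexpr std::uint64_t stack_reservation(std::uint64_t threads,
                                          std::uint64_t stack_bytes) {
    return threads * stack_bytes;
}

// 100,000 threads x 8 MiB each = 800,000 MiB of reserved address space.
static_assert(stack_reservation(100'000, 8 * kMiB) == 800'000 * kMiB);

// In binary units that is 781.25 GiB (integer division truncates to 781).
static_assert(stack_reservation(100'000, 8 * kMiB) / kGiB == 781);
```

The round 800GB figure is the same quantity with 1GB taken as 1000MB. Either way, this is reserved address space, not resident memory.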

Synchronization adds its own costs. On Linux, std::mutex is built on pthread_mutex_t, which in turn is implemented with a futex (fast userspace mutex). The uncontended path is a single atomic compare-and-swap in userspace, which is cheap. The contended path calls futex() to block, which requires a kernel entry. Condition variables compound this: each wait() atomically releases the mutex and parks the thread, and each notify_one() may require a syscall to wake it.

#include <condition_variable>
#include <mutex>

std::mutex mu;
std::condition_variable cv;
bool ready = false; // guarded by mu

// Thread A waits
{
    std::unique_lock lock(mu);
    cv.wait(lock, [] { return ready; }); // may trap to kernel
    // process...
}

// Thread B signals
{
    std::lock_guard lock(mu);
    ready = true;
    cv.notify_one(); // may trap to kernel
}

None of this is a flaw. Threads are powerful precisely because the OS manages them. The kernel can schedule work across multiple cores, handle blocking I/O transparently, and preempt runaway computations. For CPU-bound parallelism, threads running on separate cores execute genuinely in parallel, and as long as they work on independent data they need no synchronization until the results are collected. The model is appropriate; it just has costs that become visible at scale.
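To make the CPU-bound case concrete, here is a minimal sketch of independent work split across threads. The function name parallel_sum and the chunking scheme are assumptions for illustration, not something taken from the talk. Each thread reads its own slice and writes its own result slot, so the only synchronization is the join() at the end.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Sums `data` by giving each of `num_threads` threads one contiguous chunk.
// Hypothetical helper for illustration only.
std::uint64_t parallel_sum(const std::vector<std::uint32_t>& data,
                           std::size_t num_threads) {
    assert(num_threads > 0 && "need at least one worker");
    std::vector<std::uint64_t> partial(num_threads, 0);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    // Round up so the last chunk picks up any remainder.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(t * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        workers.emplace_back([&data, &partial, t, begin, end] {
            // Accumulate in a local so the shared array is written once per thread.
            std::uint64_t local = 0;
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;
        });
    }
    // join() is the only synchronization: it makes each thread's write visible.
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), std::uint64_t{0});
}
```

Calling parallel_sum(values, 4) on a four-core machine lets the kernel place one thread per core; nothing in the loop body touches a mutex or an atomic.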

What a Coroutine Actually Is

C++20 coroutines take a fundamentally different approach. Rather than delegating scheduling to the OS, they make interleaving explicit in the code itself through three new keywords: co_await, co_yield, and co_return.

A coroutine is a function that can suspend execution and return control to its caller, preserving enough state to resume later from the same point. C++20 coroutines are stackless. A coroutine does not have its own stack. Instead, the compiler allocates a coroutine frame, usually on the heap, that holds the promise object, copies of the parameters, and the local variables whose lifetimes span a suspension point. When the compiler can prove the frame’s lifetime is contained within the caller’s scope, it can eliminate the heap allocation entirely, a transformation known as HALO (Heap Allocation eLision Optimization).

The machinery involves three pieces. The promise type, defined by whoever authors the coroutine type, controls behavior at the start, each suspension, the return, and on exceptions. The coroutine handle is a lightweight pointer that can be passed around to schedule resumption. The awaitable/awaiter protocol defines what happens when co_await is applied to an expression.

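#include <coroutine>
#include <exception>
#include <utility>

struct Task {
    struct promise_type {
        Task get_return_object() {
            return Task{std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() noexcept { return {}; } // lazy: the owner must resume()
        std::suspend_always final_suspend() noexcept { return {}; }   // keep the frame alive until ~Task
        void return_void() noexcept {}
        void unhandled_exception() noexcept { std::terminate(); }     // do not swallow exceptions silently
    };

    explicit Task(std::coroutine_handle<promise_type> h) noexcept : handle(h) {}
    Task(Task&& other) noexcept : handle(std::exchange(other.handle, {})) {}
    Task(const Task&) = delete;            // a copy would destroy the same frame twice
    Task& operator=(Task&&) = delete;
    ~Task() { if (handle) handle.destroy(); }

    std::coroutine_handle<promise_type> handle;
};

// Placeholder awaitable for illustration: it always suspends.
// A real operation would store the handle and resume it on completion.
std::suspend_always some_async_operation() { return {}; }

Task fetch_and_process() {
    co_await some_async_operation(); // suspends here
    // resumes here when the operation completes
    co_return;
}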

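When a coroutine reaches co_await, it evaluates the awaitable expression and calls await_ready() to check whether the result is already available. If it is not, the coroutine’s state is saved in its frame and await_suspend() is called with the coroutine’s own handle. The coroutine is now suspended: control returns to whoever called or last resumed it, and that code continues running. Nothing is unwound in the exception-handling sense; the coroutine’s locals stay alive in the frame. When the async operation completes, whoever holds the coroutine handle calls handle.resume(), await_resume() produces the value of the co_await expression, and execution picks up immediately after it.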

The cost of suspend and resume is on the order of 10-50 nanoseconds. There is no syscall, no register file dump to a kernel structure, no scheduler queue manipulation. The cost is approximately that of a small number of function calls, which puts it roughly 100x below a thread context switch.

The Scheduling Math

For CPU-bound workloads, threads are the correct tool. Coroutines are cooperative: they yield only where co_await or co_yield appears. A coroutine runs on whichever thread resumes it, so unless you explicitly resume coroutines from a thread pool, they all take turns on a single thread. If you have four cores and four independent matrix multiplications, four threads running in parallel is exactly what you need. Coroutines do not change that picture.

For I/O-bound workloads with high concurrency, the numbers shift considerably. An HTTP server handling 50,000 concurrent keep-alive connections cannot afford 50,000 OS threads. With coroutines, each connection becomes a coroutine: a frame that typically fits in a few hundred bytes, with a context switch that carries no kernel involvement. The server’s thread count stays proportional to CPU cores, not connections.

Go’s success with network servers is instructive here. Go goroutines are stackful (starting at 2KB, growing as needed), with an M:N scheduler built into the runtime that multiplexes goroutines onto a pool of OS threads via work-stealing. A goroutine context switch costs roughly 100-300 nanoseconds: slightly more than a C++20 coroutine switch, but with work-stealing and multi-core scheduling included at no extra API cost. C++20 provides the coroutine building blocks; it leaves the runtime to library authors.

Rust’s async/await is the closest structural equivalent: also stackless, also cooperative, also dependent on a runtime. The difference is that the Rust ecosystem has largely converged on Tokio as a standard async runtime, whereas C++ has not converged on anything equivalent. Asio has had solid C++20 coroutine support for several years and is the most widely deployed option. Lewis Baker’s cppcoro defined much of the conceptual vocabulary for C++ coroutine libraries before going dormant. The longer-term answer is std::execution (P2300), which standardizes schedulers and sender/receiver chains; it was voted into the C++26 working draft in 2024 and should finally give coroutines a standard scheduling substrate.

Composing the Two Models

The most instructive case is not the pure I/O scenario or the pure CPU scenario but the mixed one: CPU-intensive processing triggered by async events. A video transcoding service receiving jobs over a network, for instance, needs async coordination at the ingestion layer and parallel CPU work at the compute layer. Coroutines handle the coordination: waiting for incoming jobs, tracking per-request state, writing responses. A thread pool handles the compute: the coroutine submits work and co_awaits an awaitable that completes when the job finishes (std::future itself is not awaitable, so this is usually a library-provided type), and the thread pool executes the transcoding across available cores.

The two models compose rather than compete. The threading model provides genuine parallelism across cores. The coroutine model provides lightweight concurrency for the coordination layer that would otherwise require one thread per in-flight request.

The CppCon talk makes this point from the OS upward: understanding why each model costs what it costs is what tells you which to reach for. Preemptive scheduling with OS-managed threads is appropriate when you want the kernel to handle fairness and parallel execution across hardware. Cooperative scheduling with coroutines is appropriate when you want deterministic switching points, minimal overhead per concurrent unit of work, and the ability to manage hundreds of thousands of logical tasks without thread-stack memory for each one.

C++ having both is not a design failure. It reflects that production systems routinely need both, often at different layers of the same program.