Where std::ranges Stops Being Zero-Cost

The C++20 ranges library arrived with a strong pitch: composable, lazy, expressive pipelines with zero overhead. In practice, four years of production use and a November 2025 analysis by Daniel Lemire on isocpp.org have clarified what that promise actually covers. The short version is that std::views::transform earns the zero-overhead label, and std::views::filter does not. The longer version is worth understanding, because the reason maps directly onto how auto-vectorization works and what the ranges design deliberately chose to sacrifice.

The Benchmark That Shows the Split

The canonical comparison looks like this:

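// Assumes v is a contiguous container of integers, e.g. std::vector<int>.
// Headers: <algorithm>, <functional>, <ranges>, <vector>.

// Raw loop
long sum_loop = 0;
for (auto x : v) {
    if (x % 2 == 0) sum_loop += x;
}

// Ranges pipeline (note: std::ranges::fold_left is a C++23 addition)
long sum_ranges = std::ranges::fold_left(
    v | std::views::filter([](auto x){ return x % 2 == 0; }),
    0L, std::plus{});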

The two versions are semantically identical. On SIMD-capable hardware with GCC 14 or Clang 18/19 at -O3, the raw loop auto-vectorizes: the compiler identifies the contiguous array, recognizes the masked-reduce pattern, and emits vpaddq or equivalent instructions that process multiple elements per instruction. The ranges pipeline does not. Lemire measured the filtered ranges version running roughly 2 to 5 times slower depending on compiler and dataset size.

Swap filter for transform alone and the picture changes entirely. A transform_view over a contiguous range compiles down to essentially the same SIMD loop as std::transform at -O3 on GCC and Clang. The penalty belongs to filter_view specifically, not to ranges as a whole.

Why filter_view Breaks Vectorization

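Auto-vectorization requires that the loop stride be statically knowable. A SIMD unit processes N elements at once, which means the compiler must know, at compile time, which N elements are next. A data-dependent condition inside the loop body is a separate matter: the raw loop above still visits every element in order, so the compiler can turn the if into a per-lane mask. What it cannot do is vectorize a loop whose iteration pattern itself, meaning which element comes next, depends on runtime data.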

filter_view::iterator::operator++ advances through the underlying range until it finds an element satisfying the predicate. Given a vector of integers and a filter on even numbers, a single increment might skip one element or skip ten, depending entirely on what the data contains. The stride is data-dependent, and that property is not something the compiler can reason away with inlining or constant folding. There is no static information to act on.

Contrast this with transform_view::iterator::operator++, which simply calls ++underlying_iterator. The stride is always one. The compiler inlines through the adapter, sees a pointer increment over contiguous memory, and produces vectorized output. The same analysis that blocks vectorization for filter_view poses no obstacle for transform_view.
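The difference between the two increments can be sketched with simplified stand-ins. These are hypothetical helpers over raw const int* pointers, not the standard library's actual iterator classes, and they assume the iterator passed in is not already at the end.

```cpp
// Simplified model of transform_view's increment: always exactly one step.
const int* advance_transform(const int* it) {
    return it + 1;
}

// Simplified model of filter_view's increment: step at least once, then keep
// stepping until the predicate holds or the end is reached. How far this
// moves depends on the data, not on anything known at compile time.
template <class Pred>
const int* advance_filter(const int* it, const int* end, Pred pred) {
    do {
        ++it;
    } while (it != end && !pred(*it));
    return it;
}
```

Over the array {2, 3, 5, 7, 8, 10} with an is-even predicate, the first filter increment from 2 moves four slots to reach 8, and the next moves one slot to reach 10. The transform increment moves one slot every time.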

This is not a compiler implementation gap waiting to be closed. It is a structural property of what filtering through an iterator means: you are asking the iterator to seek the next matching element, and seeking is inherently sequential when matches are unpredictable.

The Inlining Prerequisite

Before vectorization can be attempted, the compiler has to inline through the entire view iterator hierarchy. Each adapter wraps the previous iterator in a new type. A pipeline like v | filter | transform | take requires inlining through take_view::iterator, then transform_view::iterator, then filter_view::iterator, and finally the raw pointer increment.

GCC’s inliner operates under a cost model whose size budgets are set by parameters such as max-inline-insns-single and max-inline-insns-auto; -finline-limit adjusts them coarsely. Deep template instantiation stacks can exceed this budget at -O2, leaving a non-inlined function call inside the hot loop. At -O3, inlining becomes more aggressive, but it is not unconditional. For short lambdas and shallow pipelines, the optimizer usually succeeds. For longer lambdas or four-plus composed adapters, it can stop partway through and produce scalar code even for patterns that would otherwise vectorize.

The non-obvious consequence is that performance does not degrade linearly with pipeline depth. Adding one more adapter can push the inlining cost past a threshold and produce a sudden, large regression. This makes pipelines with many composed adapters worth testing explicitly at both optimization levels rather than assuming -O3 fixes everything.

What the Design Chose

The ranges library design, formalized in P0896R4 by Eric Niebler, Casey Carter, and Christopher Di Bella, deliberately prioritized composability, lazy evaluation, and a uniform iterator/sentinel model. These are coherent choices with real value.

Lazy evaluation means v | filter | transform produces no intermediate allocations. Values flow through the pipeline one at a time as the consumer requests them. For large datasets where you process only a fraction of elements, this can outperform a strategy that materializes intermediate results into heap-allocated vectors.

The uniform model means the same range-based for loop works identically over contiguous arrays, linked lists, infinite integer sequences from views::iota, and lazily-evaluated recursive structures. That generality matters when you want algorithms that are genuinely agnostic to their input source. The cost of that generality is that the iterator abstraction is shaped for correctness and composability across all these cases, not optimized for the specific case of filtering a dense, cache-resident vector.

The designers were aware of the vectorization implications when P0896 was adopted. Their expectation was that compiler technology would advance to handle these patterns automatically. That expectation has partially been met: transform_view over contiguous ranges vectorizes well on modern GCC and Clang. Predicate-based filtering represents a harder category that current auto-vectorizers do not handle, and there is no clear mechanism by which they could without additional annotations in the language or standard library.

C++23 Gives You an Escape Hatch

C++23 added std::ranges::to<>, which materializes a range pipeline into a concrete container before further processing:

// Requires <numeric> (std::reduce), <ranges>, and <vector>.
// Materialize the filtered range into a vector,
// then run a vectorizable reduction over contiguous memory.
auto evens = v | std::views::filter([](auto x){ return x % 2 == 0; })
               | std::ranges::to<std::vector>();
long sum = std::reduce(evens.begin(), evens.end(), 0L);

This adds an allocation, so it is not always a win. But for pipelines where the filtered result is consumed multiple times, or where the downstream computation is expensive enough to justify separating the filter step from the reduction step, materializing first can yield better total throughput. The filter runs scalar over the input; the downstream std::reduce runs with SIMD over the smaller, contiguous result. Whether that beats the lazy path depends on filter selectivity, dataset size, and the relative cost of allocation versus repeated scalar iteration.

std::views::stride and std::views::chunk (both C++23) offer another angle. If the structure of your data allows you to express the filtering logic as a regular stride rather than a predicate-based filter, you recover the uniform-stride property that vectorizers need. This requires rethinking the problem, but for patterns like “every even-indexed element” or “blocks of N”, it is worth considering before defaulting to filter.
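A sketch of the "every even-indexed element" pattern. Standard library support for views::stride still varies, so the example checks the __cpp_lib_ranges_stride feature-test macro and otherwise falls back to an index loop with the same fixed step. The function name is hypothetical.

```cpp
#include <cstddef>
#include <ranges>
#include <vector>

// Sum the elements at even indices (0, 2, 4, ...).
long sum_even_indexed(const std::vector<int>& v) {
    long sum = 0;
#if defined(__cpp_lib_ranges_stride)
    // C++23: the step between visited elements is a fixed 2,
    // independent of the values stored in v.
    for (int x : v | std::views::stride(2)) sum += x;
#else
    // Fallback for libraries without views::stride: the same fixed-step
    // walk written as an index loop.
    for (std::size_t i = 0; i < v.size(); i += 2) sum += v[i];
#endif
    return sum;
}
```

Either branch selects elements by position rather than by value, which is what distinguishes it from a predicate-based filter.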

Practical Guidance

The useful framing is not that ranges are slow, but that the zero-overhead guarantee is conditional on pipeline shape and must be verified rather than assumed.

views::transform over contiguous ranges, and simple loops driven by views::iota, can be treated as performance-equivalent to raw loops at -O3 on GCC and Clang. MSVC has historically produced worse codegen for ranges pipelines, though it has improved across recent releases; verify there rather than assuming parity.

views::filter in any pipeline should be treated as a scalar operation until benchmarks demonstrate otherwise. The code is still correct and often still fast relative to alternatives involving virtual dispatch or heap allocation. It simply will not leverage SIMD width, which matters when you are filtering large vectors of small integers or floats in a hot path.

Deep pipelines, meaning four or more composed adapters, are worth benchmarking explicitly at both -O2 and -O3. Inlining thresholds can cause discontinuous performance drops that surface-level benchmarks at a single optimization level will miss entirely.
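For quick checks of this kind, a crude timing helper can be enough to spot a discontinuity. The sketch below is a hypothetical helper, not a substitute for a proper benchmarking framework. It assumes reps is positive and that the callable returns an assignable value, and it does nothing to stop the optimizer from hoisting or eliminating work, so results should be cross-checked against the generated assembly.

```cpp
#include <chrono>
#include <utility>

// Runs `fn` once to warm up, then `reps` more times, and returns
// {last result, average nanoseconds per timed run}.
template <class Fn>
auto time_average_ns(Fn&& fn, int reps) {
    using clock = std::chrono::steady_clock;
    auto result = fn();  // warm-up run; also fixes the result type
    const auto start = clock::now();
    for (int i = 0; i < reps; ++i) result = fn();
    const auto stop = clock::now();
    const auto total =
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    return std::pair{result, static_cast<double>(total) / reps};
}
```

Building the same translation unit at -O2 and at -O3, then timing a pipeline and its hand-written loop with this helper, gives a first indication of whether an inlining threshold has been crossed.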

Lemire’s retrospective is a useful corrective to the marketing around zero-overhead abstractions in C++. Zero-overhead is a property of specific abstractions under specific conditions, not a blanket guarantee that applies to everything in the standard library. Ranges are well-designed for their stated goals: composability, laziness, and range-category generality. The gap between those goals and the goal of guaranteed auto-vectorization is not a defect in the design; it is an explicit trade-off that has real consequences for performance-critical filtering over dense collections.