std::ranges Has Two Faces, and Only One of Them Can Vectorize

Daniel Lemire published a pointed benchmark article on isocpp.org in November 2025 showing that std::ranges pipelines can fall measurably short of equivalent hand-written loops. Looking back at that piece now, the interesting question is not whether the finding holds (it does) but why the gap exists where it exists and not everywhere.

The std::ranges library has two largely separate halves that most introductions treat as a unified API. The first is the algorithm half: std::ranges::copy_if, std::ranges::sort, std::ranges::fold_left, std::ranges::find_if, and their siblings. The second is the view half: std::views::filter, std::views::transform, std::views::take, and the pipe operator that composes them. Both halves live under the std::ranges namespace and handle sequences, so they are easy to conflate. They do not behave the same way under optimization.

What the Algorithm Half Actually Receives

When you call std::ranges::copy_if, you hand it a source range and a destination. The algorithm receives the range object, queries it for iterators, and, if the source is a contiguous range, iterates through contiguous memory. The predicate is a separate argument applied by the algorithm’s internal loop. Crucially, once the call is inlined, the compiler sees a single loop with direct access to the source data’s underlying pointer.

std::vector<int> src = /* ... */;
std::vector<int> dst;
dst.reserve(src.size());
std::ranges::copy_if(src, std::back_inserter(dst), [](int x) { return x % 2 == 0; });

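The algorithm’s internal loop operates over a contiguous range. With a simple arithmetic predicate and a compiler set to -O2 or higher, that loop is at least a candidate for SIMD: the conditional-copy pattern is one GCC and Clang can in principle recognize. On x86-64 with AVX-512, the natural lowering is VPCOMPRESSD, which batch-processes elements and packs the matching ones into the destination without a scalar branch per element. On AVX2, which has no compress instruction, the compiler may fall back to a mask-accumulate approach that still achieves partial vectorization. Whether a given toolchain actually does this depends on its version and on the output iterator: std::back_inserter calls push_back, whose per-element capacity check tends to get in the way, so writing through a pointer into presized storage gives the optimizer a better chance. The key observation is that the algorithm operates directly on the underlying data layout, so the optimizer has enough information to attempt these transformations.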

What the View Half Wraps

Now consider the view equivalent:

auto dst = src
    | std::views::filter([](int x) { return x % 2 == 0; })
    | std::ranges::to<std::vector>();

This goes through a filter_view. The filter_view iterator’s operator++ must advance the underlying iterator until the predicate is satisfied. As specified in the C++20 standard (and visible in any standard library implementation), the increment operator for a filter view looks structurally like this:

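// Structural sketch of the filter_view iterator increment (member names are illustrative)
constexpr auto& operator++() {
    ++current_;                                    // step past the current match
    while (current_ != end_ && !pred_(*current_))  // skip elements the predicate rejects
        ++current_;
    return *this;
}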

This is a conditional loop with a data-dependent trip count. The compiler cannot determine at compile time how many underlying elements will be skipped per output element. This kills vectorization not because of a compiler limitation that might improve in a future toolchain version, but because there is no SIMD representation of “advance until a condition is met” that maps to contiguous output. The filter view’s iterator also degrades the iterator category to at most bidirectional, even when the source is a random-access contiguous vector. The compiler that receives this iterator type cannot infer that the source data is contiguous, so the optimization opportunity that copy_if exploits is not visible.

std::ranges::to<std::vector>() (added in C++23) resolves ergonomic friction but does not change this. The materialization of a filter view into a vector still iterates through the filter_view iterator one element at a time.

The Reduction Case Shows the Same Split

The algorithm-vs-view distinction applies equally to reductions. A hand-written reduction with an inline conditional is something compilers have vectorized for years:

int total = 0;
for (int x : src) {
    if (x % 2 == 0) total += x;
}

The compiler processes this with masked SIMD: compute the predicate for eight elements at once, apply a mask to zero out non-matching values, add all eight (masked) values to an accumulator. The branch is gone. On modern GCC with -O3 and AVX2, this loop runs at nearly full SIMD throughput.
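
To make the masking step concrete, here is a scalar model of what each SIMD lane does. It is a sketch of the idea rather than the code a compiler emits; sum_even_masked is a hypothetical name, and like the loop above it assumes the sum fits in an int.

```cpp
#include <vector>

// Scalar model of the masked reduction: every element takes the same path,
// and non-matching values are zeroed by a mask instead of skipped by a branch.
int sum_even_masked(const std::vector<int>& src) {
    int total = 0;
    for (int x : src) {
        int mask = -static_cast<int>(x % 2 == 0);  // all ones if even, zero if odd
        total += x & mask;                         // adds x or adds 0
    }
    return total;
}
```

A vectorized build applies the same two steps to a whole register of elements at once, which is why the loop body no longer needs a branch.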

std::ranges::fold_left is the C++23 standard answer for reductions, and it is a proper algorithm, not a view. But composing it with a filter view still breaks vectorization:

// Does not vectorize: fold_left sees a filter_view iterator
auto total = std::ranges::fold_left(
    src | std::views::filter([](int x) { return x % 2 == 0; }),
    0,
    std::plus<int>{}
);

The fold_left algorithm correctly calls operator++ on the range’s iterator, which here is a filter_view iterator. The sequential, conditional increment is back. The algorithm half cannot rescue a computation whose input is the view half’s filtered iterator.

The version that can vectorize avoids the filter view entirely:

// Vectorizable: fold_left sees the full contiguous range
auto total = std::ranges::fold_left(
    src,
    0,
    [](int acc, int x) { return (x % 2 == 0) ? acc + x : acc; }
);

Here the predicate is inside the accumulator function rather than inside a view. The fold_left algorithm iterates over the raw contiguous range, and the compiler can apply the same masked-SIMD reduction pattern it uses for the hand-written loop. The functional structure is similar; the optimizer’s visibility into memory layout is entirely different.

Why `copy_if` Can Beat `filter_view | to<vector>`

std::ranges::copy_if on a contiguous range can, under the right conditions, use VPCOMPRESSD or equivalent instructions to batch-copy matching elements. This instruction takes a vector register and a mask and writes only the masked lanes to contiguous output. On processors with AVX-512 (server-class Intel and AMD silicon, and AMD’s desktop parts since Zen 4; recent Intel consumer chips ship without it), this is a single instruction that does what the filter_view iterator does one element at a time.
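
The semantics of a compress-store are easy to model in scalar code. The sketch below is not the intrinsic and not what a compiler generates; it only mirrors the behavior described above, assuming a 512-bit register holding sixteen 32-bit lanes. The names even_mask, compress_store, and keep_even are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kLanes = 16;  // 512 bits / 32-bit elements
using Block = std::array<std::int32_t, kLanes>;

// Step 1: evaluate the predicate for every lane; bit i is set when lanes[i] is even.
std::uint16_t even_mask(const Block& lanes) {
    std::uint16_t mask = 0;
    for (std::size_t i = 0; i < kLanes; ++i)
        if (lanes[i] % 2 == 0) mask |= static_cast<std::uint16_t>(1u << i);
    return mask;
}

// Step 2: compress. Lanes whose mask bit is set are written back to back;
// the return value is the number of lanes written.
std::size_t compress_store(const Block& lanes, std::uint16_t mask, std::int32_t* out) {
    std::size_t written = 0;
    for (std::size_t i = 0; i < kLanes; ++i)
        if (mask & (1u << i)) out[written++] = lanes[i];
    return written;
}

// Hypothetical driver: filter one block of sixteen values.
std::vector<std::int32_t> keep_even(const Block& lanes) {
    std::vector<std::int32_t> out(kLanes);
    out.resize(compress_store(lanes, even_mask(lanes), out.data()));
    return out;
}
```

In hardware, both loops collapse into one compare and one compress instruction per block, and the output pointer advances by the population count of the mask.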

The compiler can emit this instruction when it can prove:

The source is a contiguous range with a known element type.
The predicate is simple enough to emit as a SIMD comparison.
The output is written sequentially.

All three conditions can hold for copy_if on a std::vector<int>. Through a filter_view they cannot all be established: the iterator hides the contiguity of the source behind a category-downgraded interface, and the predicate runs inside a sequential skip loop rather than across a block of elements.

Lemire’s benchmarks capture this gap empirically. The explanation is in the iterator model.

Where Views Are Not the Problem

std::views::transform over a contiguous range keeps random access: it drops the contiguous guarantee, since the results are computed rather than stored, but a transform view over a std::vector<int> is still a sized, random-access range. The compiler can inline a simple lambda and produce SIMD instructions for the transformed loop. Pipelines that use only transform views, take, or drop over contiguous data often do match hand-written loop performance, and this is the context where the zero-overhead claim is accurate.

The performance ceiling appears specifically when a filter-class view, one that produces a variable and unpredictable number of output elements from a given input, enters the pipeline. std::views::filter is the primary example, but std::views::chunk_by and std::views::take_while create similar patterns when the predicate is not trivially bounded.

The Practical Distinction

For code on a hot path that processes large contiguous datasets with conditional selection, reaching for the algorithm half of std::ranges rather than the view half is the practical change:

// View half (does not vectorize the filter):
auto results = data
    | std::views::filter(pred)
    | std::views::transform(fn)
    | std::ranges::to<std::vector>();

// Algorithm half only (the transform over the contiguous intermediate buffer can vectorize):
std::vector<int> filtered;
filtered.reserve(data.size());
std::ranges::copy_if(data, std::back_inserter(filtered), pred);
std::vector<int> results(filtered.size());
std::ranges::transform(filtered, results.begin(), fn);

The second version allocates an intermediate buffer. That allocation has a cost, especially when the filtered output is a small fraction of the input. Measure before substituting. When the dataset is large enough that SIMD throughput dominates allocation time, and when the predicate is cheap enough that the filter pass itself is fast, the two-pass version with copy_if can outperform the single-pass view pipeline.

For reductions, avoid putting the filter predicate inside a view when the reduction can absorb it as a conditional accumulator instead. The loss is some pipeline elegance; the gain is that the compiler can vectorize the loop.

The broader lesson from Lemire’s benchmarks is worth stating precisely. std::ranges is not uniformly slow or uniformly fast. The algorithm half operates on the data layout directly and participates in the same optimization passes as hand-written code. The view half abstracts the iteration itself, which is the thing the compiler needs to see to apply SIMD transformations. Using the two halves interchangeably in performance-sensitive code is where the gap appears. Treating them as distinct tools with different performance contracts is how you close it.