The Vectorization Wall Inside std::views::filter

std::ranges arrived in C++20 with a compelling promise: composable, lazy pipelines that express intent clearly without trading away performance. The appeal is genuine. Code like views::filter(pred) | views::transform(fn) | views::take(n) reads like what it does, and C++’s zero-overhead abstraction principle makes it tempting to assume the compiled result is equivalent to a hand-written loop.

Daniel Lemire’s November 2025 piece on the ISO C++ blog puts benchmarks behind the intuition that something is off, particularly for filter-heavy pipelines. The numbers are real, and the cause is specific enough to be worth examining closely. This is a retrospective look at that piece and what it points toward.

Why filter specifically

Not all range views have performance problems. views::transform, views::take, and views::drop, like the views::iota factory, generally preserve the structural regularity that compilers need for vectorization. Over a contiguous or random-access source they keep access patterns contiguous or strided, they do not introduce data-dependent branches in the hot path, and the loop bounds remain predictable.

views::filter is categorically different. Its job is to skip elements, and skipping requires a branch per element. The resulting iterator cannot tell you in advance how many output elements there will be, what the stride between passing elements is, or whether consecutive inputs will both pass the predicate. That uncertainty is not a compiler limitation; it is a property of the operation itself.

Here is the simple case that illustrates the problem:

std::vector<int> data = /* large vector */;

// Ranges version
int sum_ranges = 0;
for (int x : data | std::views::filter([](int x){ return x % 2 == 0; })) {
    sum_ranges += x;
}

// Hand-written loop a reader would expect the pipeline to compile down to
int sum_manual = 0;
for (int x : data) {
    if (x % 2 == 0) sum_manual += x;
}

A reader would expect identical performance. The compiler, given the raw loop version, can vectorize it: load 8 integers at a time with AVX2, test all 8 against the predicate using a comparison mask, apply the mask to zero out non-matching values, and accumulate. The branch disappears from the inner loop entirely.
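The mask-and-accumulate strategy can be written out in scalar form to make the idea concrete. This is only an illustrative sketch: sum_even_masked is a hypothetical name, and the bit trick assumes two's-complement int, which C++20 guarantees. A vectorizer effectively does the same thing across several lanes at once.

```cpp
#include <vector>

// Hypothetical helper: branch-free sum of the even values in `data`.
// Each element is ANDed with a mask that is all ones when the element is
// even and zero otherwise, so the loop body contains no branch.
int sum_even_masked(const std::vector<int>& data) {
    int sum = 0;
    for (int x : data) {
        const int mask = -static_cast<int>((x & 1) == 0);  // -1 (all bits set) or 0
        sum += x & mask;                                    // contributes x or 0
    }
    return sum;
}
```

Every element is processed in exactly the same way regardless of whether it passes, which is the regularity the raw loop exposes and the filter iterator hides.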

The ranges version typically does not get that treatment. The filter_view iterator must advance by calling operator++, which internally loops calling the predicate until it finds a passing element. The compiler sees iteration driven through an iterator adaptor, not a flat loop over a buffer. The abstraction, despite being notionally equivalent, hides the structure the vectorizer depends on.

The iterator contract is the constraint

The specific issue lives in what operator++ on filter_view’s iterator is required to do. It must find the next element satisfying the predicate, which means a loop, which means a data-dependent branch, which means variable-length advancement through memory. There is no way to express “advance by 8 elements, apply predicate mask, report which passed” within the forward iterator contract. The contract is inherently scalar.
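The increment step can be sketched as a free function. The names next_match and next_even_index are hypothetical and exist only for illustration; the standard specifies the real operation as an increment followed by a find_if over the rest of the underlying range, which has the same shape as the loop below.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical free-function version of the filter iterator's increment.
// Precondition: `it != last`.
template <class It, class Pred>
It next_match(It it, It last, Pred pred) {
    ++it;                               // step past the current element
    while (it != last && !pred(*it)) {  // trip count depends on the data
        ++it;
    }
    return it;
}

// Usage: index of the next even value after position `from`,
// or v.size() if there is none.
std::size_t next_even_index(const std::vector<int>& v, std::size_t from) {
    auto it = next_match(v.begin() + static_cast<std::ptrdiff_t>(from), v.end(),
                         [](int x) { return x % 2 == 0; });
    return static_cast<std::size_t>(it - v.begin());
}
```

The number of iterations of the inner while loop is unknown until the data is inspected, so each call advances by a different distance and there is no fixed-width chunk for a vectorizer to work on.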

For bidirectional ranges (like std::vector), the situation adds another layer. filter_view models bidirectional_range when the underlying range does, so operator-- must work too. That means scanning backwards one element at a time looking for the previous element that satisfies the predicate. The standard specifies this as a plain do-while loop: step back, test the predicate, and repeat until it passes. The predicate is therefore invoked once for each element skipped plus once for the element landed on, and the behavior is necessarily scalar.

There is also a less obvious issue: filter_view caches the result of begin(). The first call to begin() may need to scan from the start of the underlying range to find the first passing element, and that result is stored in the view object. The standard requires this when the underlying range is a forward range, to keep begin() amortized O(1). It also means filter_view carries mutable state: begin() is a non-const member, so a const filter_view cannot be iterated, and compilers have to reason conservatively about that state when deciding what can be optimized across loop iterations.

What end() returns depends on the underlying range. filter_view::end() returns an iterator of the same type as begin() only when the underlying range is a common_range, as std::vector is; otherwise it returns a sentinel type, and the result is not a common_range unless you force it with views::common. The sentinel design is elegant for composition, but classic iterator-pair algorithms, and any optimizations tied to them, require matching iterator types, so which case you are in has downstream effects.

The alternative that does vectorize

The pattern that gets you both composability and vectorization is to materialize the filtered result eagerly before processing it:

std::vector<int> filtered;
filtered.reserve(data.size());
std::ranges::copy_if(data, std::back_inserter(filtered),
                     [](int x){ return x % 2 == 0; });
int sum = std::reduce(filtered.begin(), filtered.end(), 0);

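std::ranges::copy_if makes a single flat pass over a contiguous input buffer, so the predicate is evaluated in a simple loop rather than behind an iterator adaptor. The compaction itself is harder to vectorize than the comparison: writing only the passing elements needs compress-style stores, which exist in AVX-512 but not in SSE4.2 or AVX2, and the push_back behind std::back_inserter adds a capacity check per element, so GCC and Clang often leave this step scalar. The reliable gain is in the second step: std::reduce runs over a contiguous output buffer and vectorizes the accumulation straightforwardly.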

The trade-off is an extra allocation and a two-pass approach. For large datasets in hot paths, especially when the filtered data is processed more than once or by a heavier kernel than a sum, the throughput improvement from vectorization can compensate. For small datasets or predicates with side effects, the lazy approach remains appropriate. The point is that the choice is meaningful, and the default lazy approach is not unconditionally faster.

C++23’s std::ranges::to<> makes the eager materialization more composable:

// C++23
auto filtered = data
    | std::views::filter([](int x){ return x % 2 == 0; })
    | std::ranges::to<std::vector>();
int sum = std::reduce(filtered.begin(), filtered.end(), 0);

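This reads clearly and, because filtered is a concrete std::vector, the subsequent reduce can vectorize over contiguous memory. The materialization step itself still walks the filter_view iterator one element at a time; the benefit is confined to whatever runs on the resulting vector.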
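When the consumer is a fold, as with the sum used throughout this section, materialization can sometimes be avoided altogether by folding the predicate into the transform step of std::transform_reduce. This is a sketch of that alternative, not something the original benchmarks cover; sum_even_single_pass is a hypothetical name, and whether a given compiler vectorizes it should be confirmed in the generated assembly.

```cpp
#include <functional>
#include <numeric>
#include <vector>

// Single pass, no allocation: elements that fail the predicate contribute 0.
int sum_even_single_pass(const std::vector<int>& data) {
    return std::transform_reduce(
        data.begin(), data.end(),
        0,              // initial value
        std::plus<>{},  // reduction
        [](int x) { return x % 2 == 0 ? x : 0; });  // selection as a transform
}
```

This only works when rejected elements can be mapped to an identity value for the reduction (0 for addition). Operations that need the surviving elements themselves still require either the lazy view or an eager copy.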

Pipeline ordering has measurable consequences

There is a related performance consideration in pipelines that combine filter and transform. The composition order matters:

// Option A: filter then transform
auto a = data | std::views::filter(pred) | std::views::transform(fn);

// Option B: transform then filter
auto b = data | std::views::transform(fn) | std::views::filter(pred);

Option A applies fn only to elements that pass pred, so it does less work when the filter is selective. Option B applies fn to every element and then discards some. Because the filter iterator still drives the traversal, fn is evaluated lazily, one element at a time, when the filter dereferences the transform iterator to test pred. For elements that pass, fn then runs a second time when the caller dereferences the result, since neither view caches the transformed value.

Neither option gets the ideal outcome. Filter-then-transform skips work but prevents transform from vectorizing because the filter iterator drives the entire traversal; transform-then-filter, in its lazy form, neither skips work nor vectorizes. Only an eager transform over the whole buffer, followed by a filter, gets fn into a vectorizable loop, at the cost of computing values that will be discarded. A sufficiently smart compiler with full visibility into the pipeline could theoretically fuse and reorder these, but current compilers generally do not do this through the ranges abstractions. The transformation is not transparent enough.

When pred can be expressed on the untransformed input, filter-first is usually the better lazy ordering, and its advantage grows as pred becomes more selective (say, passing fewer than 20% of elements). For dense pipelines where most elements pass and fn vectorizes well, transforming eagerly into a buffer and filtering afterwards may produce better code overall. Neither rule holds universally, and measuring on your actual data and hardware is the only reliable method.

Rust has the same problem

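Rust’s iterator adapters face structurally similar constraints. When iter().filter(pred).map(fn) is consumed through next(), for example in a for loop, execution is scalar and element-at-a-time for the same fundamental reasons: the filter adapter calls next() on its source and checks the predicate per element, and LLVM, the backend for both Rust and Clang, cannot vectorize this pattern in the general case. Consuming methods such as sum() are built on fold, which adapters can override to run as a single internal loop, so simple cases sometimes fare better than the equivalent external iteration; that is worth checking in the generated assembly rather than assuming either way.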

Where Rust differs is in making batch-oriented alternatives idiomatic earlier. The chunks and windows methods on slices are part of the standard library and designed for batch processing. Crates like rayon provide parallel iterators that partition work across threads. The ecosystem pushes you toward thinking about data layout, which makes the gap between “elegant pipeline” and “fast pipeline” more visible earlier in development.

C++ has similar facilities (std::for_each with execution policies, std::transform_reduce, std::inclusive_scan), but they are less composable with the ranges pipeline model. The cultures are different even when the underlying constraint is the same.

What this means in practice

For most application code, none of this matters. If a filter pipeline runs once over a few thousand elements and is not in a hot path, the difference between a vectorized and scalar loop is noise. The expressiveness that ranges provide is worth the minor overhead in those cases.

Where it matters is in inner loops: image processing, signal processing, numerical kernels, anywhere you iterate over millions of elements multiple times per frame or per request. In those contexts, do not assume the elegant pipeline compiles to the efficient loop. The assembly output on Compiler Explorer with -O3 -march=native for a filter-based loop versus a raw loop over the same data is often immediately informative. The scalar instruction sequences for filter-based code stand out against the vectorized output for the equivalent raw loop.

Lemire’s benchmarks are a reminder that the zero-overhead abstraction claim comes with conditions. The overhead is real in filter-heavy pipelines, and knowing the mechanism, the iterator contract that makes filter inherently scalar, gives you a principled basis for choosing when the abstraction is worth it and when to materialize eagerly and process the result.