The Vectorization Wall Inside std::ranges::filter

The C++20 ranges library arrived with considerable enthusiasm. The ability to compose transformations as pipeline expressions, data | views::filter(pred) | views::transform(fn), promised the expressive clarity of functional programming with none of the performance cost. The abstraction overhead was supposed to compile away.

Daniel Lemire’s November 2025 post on isocpp.org revisits that promise with benchmarks. The conclusion is measured but pointed: std::ranges may not deliver the performance profile you expect, particularly in tight numerical loops. Looking back at that article now, what strikes me is that the performance gap is not arbitrary or fixable by a future compiler version. It is structural, and understanding why requires looking at the specific iterator model that ranges is built on.

How Lazy Views Actually Work

std::ranges divides the world into two pieces: algorithms (std::ranges::sort, std::ranges::copy) that operate eagerly on complete ranges, and views (std::views::filter, std::views::transform) that adapt ranges lazily. When you write:

auto result = data | std::views::filter(is_even) | std::views::transform(square);

No computation happens at that line. What you get back is a transform_view<filter_view<ref_view<vector<int>>, ...>, ...> object (an lvalue vector is wrapped in a ref_view rather than stored by reference). Each element is computed on demand when the range is iterated. This laziness is the point: you pay only for the elements you consume, and no intermediate vector is allocated between stages.

The iteration model uses sentinels instead of matching iterator pairs. Each increment of the outer iterator calls through to the inner view’s iterator, which in the case of filter_view calls the predicate on each underlying element until it finds one that qualifies. This is a conditional branch on every step.

The Filter View and Its Iterator Category

The iterator category determines what optimizations are possible. std::views::filter degrades any input to at most a bidirectional range, even if the underlying data is a contiguous vector. This is not a flaw in the implementation; it is a direct consequence of what filter means semantically. The output positions are not predictable in advance. You cannot jump to element N of a filtered range in O(1) because you do not know how many elements satisfy the predicate before position N.

This matters enormously for auto-vectorization. SIMD instructions (SSE, AVX, NEON) operate on multiple elements simultaneously, typically 4 to 16 integers at a time. For this to work, the compiler must be able to see a predictable, strided memory access pattern. A filter pipeline offers none of that. The compiler sees a loop that increments one element at a time, checks a condition, and branches. The scalar branchy inner loop is the correct implementation of filter_view. It is not a missed optimization.

For reference, here is what the filter view’s increment operator conceptually does internally:

// Simplified inner loop of filter_view::iterator::operator++()
do {
    ++current_;
} while (current_ != end_ && !pred_(*current_));

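Even if the compiler inlines this perfectly, what remains is a loop with an unknown trip count and data-dependent control flow. Auto-vectorizers generally need the opposite: a trip count that can be computed before the loop starts, and control flow that can be converted into branch-free, masked operations.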

The begin() Cache and Its Side Effects

There is a second design decision in filter_view worth examining: it caches the result of begin() after the first call. The rationale is sound. Finding the first element that passes the predicate requires scanning from the start of the underlying range, which is O(n). Without caching, calling begin() twice would scan twice, violating the range concept’s requirement that begin() run in amortized constant time.

The cache is state stored inside the view, which is why filter_view::begin() is a non-const member function. This means filter_view is not const-iterable in the way a transform_view is, and it creates subtle hazards: modifying an element through the view so that it no longer satisfies the predicate is undefined behavior, and the cached begin can silently point into freed memory if the underlying container reallocates. More concretely for performance, a write to view state on first access makes it harder for the compiler to prove that the memory touched by the loop body does not alias.

These constraints were known during the standardization process. The P0789 and related papers from Eric Niebler’s range-v3 work document the trade-offs explicitly. The committee accepted the cache as necessary for conformant iterator behavior. That decision has been stable across three C++ standards cycles now, which suggests it is not going to change.

Transform Alone Is a Different Story

std::views::transform applied to a contiguous range preserves more of the original structure:

auto squares = data | std::views::transform([](int x){ return x * x; });

The view stays random-access and sized, and the underlying storage is still walked linearly. (The view itself is not a contiguous range, since it yields computed values rather than references into memory, but that does not get in the optimizer’s way.) The transform view wraps element access in a function call, but if the compiler inlines the lambda (and it typically does for simple lambdas at -O2 and above), the resulting loop is equivalent to a hand-written one. You can verify this on Compiler Explorer: GCC and Clang at -O3 produce SIMD instructions for a simple transform over a vector. The zero-overhead claim holds here.

The problem is composition:

auto result = data | std::views::filter(pred) | std::views::transform(fn);

Once filter is in the pipeline, the transform view receives a filter_view as its input, not the original contiguous vector. The iterator category drops to bidirectional. The transform can no longer assume contiguous, predictable access. Vectorization is gone for the entire pipeline.

What the Assembly Shows

Lemire’s benchmarks fit a pattern visible across independent compiler explorations. A hand-written loop that the compiler can vectorize looks like this:

int sum = 0;
for (int x : data) {
    if (x % 2 == 0) sum += x * x;
}

Compilers can if-convert this loop and apply SIMD operations to it. They compute squares for a full vector of elements at once, generate a mask for the even lanes, and accumulate only the masked lanes. On x86 this typically shows up as packed compares such as VPCMPEQD combined with bitwise AND or blend instructions. The scalar branch-per-element version of the same computation is measurably slower on large arrays because each iteration handles a single element instead of a full vector.

The equivalent ranges pipeline:

auto sum = std::ranges::fold_left(
    data | std::views::filter([](int x){ return x % 2 == 0; })
         | std::views::transform([](int x){ return x * x; }),
    0, std::plus{});

produces scalar code with GCC 14 at -O3. The filter’s non-random-access iterator is the ceiling, and no compiler version available today lifts it. Note that std::ranges::fold_left is a C++23 addition.

How Rust’s Iterator Chains Compare

Rust’s iterator model is worth comparing here, because it is often cited as doing better. A Rust chain like:

let sum: i32 = data.iter()
    .filter(|&&x| x % 2 == 0)
    .map(|&x| x * x)
    .sum();

has the same structural limitation: filter() produces an iterator with unknown output count, and LLVM cannot vectorize the resulting loop in the general case. Where Rust sometimes outperforms the C++ equivalent is in the ownership model’s aliasing information. LLVM’s loop optimizer can make stronger no-alias assumptions in Rust code, which occasionally opens vectorization paths that are closed in C++ due to pointer aliasing ambiguity. But for a filter chain specifically, both languages produce scalar loops, and the performance gap between the chained version and a hand-written branch-inside loop exists in both.

The range-v3 library, the direct ancestor of std::ranges, exhibits the same characteristics. It predates the standard by several years and has had extensive optimization attention. The ceiling has not moved, because the ceiling is the design, not the implementation.

Materialization as an Escape Hatch

For numerical inner loops where SIMD throughput matters, a few practical approaches restore vectorizability:

The first is a raw loop with a branch inside, which the compiler can usually if-convert and vectorize when pred and fn are inlinable and free of side effects:

int sum = 0;
for (int x : data) {
    if (pred(x)) sum += fn(x);
}

The second is to materialize the filtered results into a contiguous container before transforming:

std::vector<int> filtered;
filtered.reserve(data.size());
std::ranges::copy_if(data, std::back_inserter(filtered), pred);

// Now filtered is contiguous; the std::ranges::transform loop over it can be vectorized
std::vector<int> result(filtered.size());
std::ranges::transform(filtered, result.begin(), fn);

This allocates intermediate memory but gives the compiler a contiguous, random-access range for the transform step. C++23’s std::ranges::to<std::vector>() makes the materialization step cleaner syntactically, though the allocation cost is the same.

The third option, available where the data layout permits, is to restructure the problem as a transform-then-reduce, processing all elements and masking rather than filtering:

int sum = std::transform_reduce(
    data.begin(), data.end(), 0, std::plus{},
    [](int x){ return (x % 2 == 0) ? x * x : 0; });

This formulation visits every element but contributes zero for the ones the predicate rejects. The loop now has a fixed trip count equal to data.size(), the memory access is contiguous, and compilers vectorize it readily. The trade-off is that a vectorized loop pays for the transform computation on rejected elements too; whether that is acceptable depends on how expensive fn is relative to the predicate.

The Broader Pattern

Ranges are not unusual in this tension between abstraction and optimization. Java streams have a documented performance ceiling for sequential filter pipelines relative to explicit loops. C# LINQ is similarly constrained in tight numerical code. The pattern in all three cases is the same: the staged lazy abstraction creates an optimization boundary at each stage. When the compiler sees through all of them, you get zero overhead. When it cannot, you pay per element for the abstraction.

C++23 added std::views::stride, std::views::chunk, std::views::zip, and std::views::enumerate. These extend the vocabulary but do not change the filter story. The std::simd proposal adopted for C++26 (P1928) gives programmers explicit SIMD types, which could eventually allow range algorithms to have SIMD-aware overloads for contiguous data, but that requires opt-in from the programmer and is not automatic.

The right reading of Lemire’s retrospective is neither that ranges are broken nor that they are fine as-is. They are a tool with a specific performance envelope. For composing transformations over moderate-sized data in application code, they are clear wins. For tight loops where the vectorizer is the deciding factor, views::filter is a ceiling you need to know about before you commit to the design. Measure with your compiler, your data, and your flags. The zero-overhead principle in C++ is a design intention, not a property you can assume without verification.