When LLM-Generated C++ Looks Optimized But Runs Slower Than a For Loop

There is a specific kind of bad code that is harder to catch than obviously broken code. It compiles cleanly, it passes tests, and it looks, at a glance, like the author knew what they were doing. It uses the right vocabulary: SIMD intrinsics, manual loop unrolling, careful data alignment. Then you benchmark it and discover a plain for loop over the same data is faster.

That is the story Andrey Karpov tells in his recent ISOCpp post about the markus project, a small C++ tool generated with Claude Opus. The model produced code with SSE2 intrinsics, extra scaffolding, and the general shape of performance-conscious systems programming. In benchmarks, it was the slowest option tested. The simplest loop implementation beat it.

This is not primarily a story about AI being bad at code. It is a story about a specific and well-understood failure mode in systems optimization, one that humans fall into too, now industrialized by language models that pattern-match on the surface appearance of high-performance code.

What SSE2 Optimizations Are Supposed to Do

SSE2 (Streaming SIMD Extensions 2) is an x86 instruction set extension that has been baseline on x86-64 since AMD64 launched in 2003. It provides 128-bit XMM registers that can hold four 32-bit floats, two 64-bit doubles, or various integer widths, and operate on all lanes simultaneously.

The promise is straightforward: if you are adding two arrays of floats, instead of processing one element per instruction, you process four at once. Done correctly, this is a real speedup. Done incorrectly, the overhead of the setup, the alignment handling, the pre-loop and post-loop scalar tails, and the loss of compiler visibility into what you are doing can eat the benefit and then some.

Manual SIMD via intrinsics, functions like _mm_add_ps or _mm_loadu_si128 from <immintrin.h>, trades compiler flexibility for direct control. The compiler stops seeing a loop it can freely restructure and starts seeing a fixed sequence of 128-bit operations. It can still schedule and simplify those operations, but in practice it will rarely re-vectorize them at a different width or decide that a different vectorization strategy would be better for the actual hardware and surrounding code.

// What an LLM might generate: explicit SSE2 with manual loop structure
#include <cstddef>      // size_t
#include <emmintrin.h>  // SSE2 intrinsics; also pulls in the SSE float intrinsics used below

void process_sse2(const float* src, float* dst, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);
        v = _mm_mul_ps(v, _mm_set1_ps(2.0f));
        _mm_storeu_ps(dst + i, v);
    }
    // scalar tail
    for (; i < n; ++i)
        dst[i] = src[i] * 2.0f;
}

// The plain loop, which the compiler can often auto-vectorize into equivalent or better code:
void process_simple(const float* src, float* dst, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * 2.0f;
}

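With optimization enabled, a modern compiler targeting x86-64 will usually auto-vectorize the simple loop. Clang’s loop vectorizer runs at -O2 and above; GCC has long vectorized at -O3, and newer releases also do so at -O2 under a more conservative cost model, so it is worth checking what your toolchain actually does. The compiler will emit SSE2 or AVX instructions depending on the target, handle alignment automatically, insert a runtime overlap check where src and dst might alias, unroll based on its own cost model, and sometimes produce better code than the manual version because it retains the full context of surrounding operations. The manual version with intrinsics locks in specific choices that may be suboptimal for the actual call site.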
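One way to make the simple loop easier for the auto-vectorizer, and to confirm what it did, is sketched below. The function name scale_by_two is hypothetical, and __restrict is a compiler extension accepted by GCC, Clang, and MSVC rather than standard C++; it promises that the two arrays do not overlap, so the caller is responsible for keeping that promise.

```cpp
#include <cstddef>

// Hypothetical example function, not taken from the markus project.
// __restrict (a non-standard extension) tells the compiler that src and dst
// never overlap, so it does not need a runtime overlap check before
// vectorizing the loop.
void scale_by_two(const float* __restrict src, float* __restrict dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * 2.0f;
}

// To see whether the loop was vectorized, ask the compiler to report it:
//   g++     -O3 -fopt-info-vec-optimized -c file.cpp
//   clang++ -O3 -Rpass=loop-vectorize    -c file.cpp
```

If the report shows the loop was vectorized, the burden of proof shifts to the hand-written intrinsic version.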

Why Language Models Generate This Pattern

Large language models are trained on code. A meaningful portion of high-quality C++ performance code includes SIMD intrinsics, because human engineers writing deliberately optimized code reach for them. The model learns that “optimized C++” and “SSE2 intrinsics” correlate. When asked to produce optimized code, it produces code that looks like the optimized code it has seen.

This is not hallucination in the usual sense. The model is not making things up. SSE2 intrinsics are real, the code is syntactically correct, the logic is usually sound. What the model lacks is the judgment to know when the technique applies. Specifically:

Whether the auto-vectorizer would handle this equally well or better
Whether the data access pattern is actually vectorizable without expensive shuffles
Whether the loop body is compute-bound or memory-bound (SIMD helps compute-bound loops; memory-bound loops are limited by bandwidth regardless)
Whether the added complexity creates maintenance cost that outweighs speculative performance gains
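The compute-bound versus memory-bound question in the list above can be estimated on paper before any intrinsics are written. The sketch below uses a roofline-style bound: attainable throughput is the smaller of the peak arithmetic rate and memory bandwidth multiplied by arithmetic intensity. The function names and the machine numbers (100 GFLOP/s peak, 20 GB/s bandwidth) are illustrative assumptions, not measurements of any real system, and the byte count ignores cache effects such as write-allocate traffic.

```cpp
#include <algorithm>

// Arithmetic intensity: floating-point operations per byte moved to or from memory.
constexpr double arithmetic_intensity(double flops_per_elem, double bytes_per_elem) {
    return flops_per_elem / bytes_per_elem;
}

// Roofline-style upper bound on throughput in GFLOP/s.
constexpr double attainable_gflops(double peak_gflops, double bandwidth_gb_per_s,
                                   double intensity) {
    return std::min(peak_gflops, bandwidth_gb_per_s * intensity);
}

// dst[i] = src[i] * 2.0f does 1 multiply per element and moves 8 bytes
// (one 4-byte load, one 4-byte store), assuming the arrays do not fit in cache.
constexpr double scale_intensity = arithmetic_intensity(1.0, 8.0);  // 0.125 FLOP/byte

// With the assumed machine, bandwidth caps the loop at 2.5 GFLOP/s.
static_assert(attainable_gflops(100.0, 20.0, scale_intensity) == 2.5,
              "memory-bound on the assumed machine");
// Doubling the peak arithmetic rate (for example, wider SIMD) changes nothing.
static_assert(attainable_gflops(200.0, 20.0, scale_intensity) == 2.5,
              "wider SIMD does not lift a bandwidth limit");
```

Under these assumptions the scaling loop sits far below the arithmetic ceiling, which is the situation where hand-written SIMD has the least to offer.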

Karpov’s framing in the article cuts to the point: the value of a skilled developer is shifting toward the ability to evaluate generated code, not just produce it. Generating code is now cheap. Knowing whether the generated code is actually correct, efficient, and appropriate for the context is still a human skill, and one that requires understanding what the code is actually doing at the hardware level.

The Benchmark Gap and What It Means

The article’s central finding, that the simplest loop implementation beat the Claude Opus SSE2 version, is something experienced C++ developers will recognize immediately. The performance gap between naive-looking and intrinsic-heavy code is notoriously unpredictable, and the direction often surprises people.

Consider what a modern compiler’s auto-vectorizer does. GCC’s tree-loop-vectorize pass and Clang’s loop vectorizer both analyze loop bodies for vectorizability, determine safe widths, handle alignment padding, and produce SIMD code calibrated to the specific target. When you compile with -march=native, they also have access to AVX2 or AVX-512 if available, widening to 256 or 512 bits where the manual SSE2 code stays at 128. An LLM generating SSE2 intrinsics for a machine that supports AVX2 has left half the SIMD width on the table, while also disabling the compiler’s ability to use wider instructions.

The 64-bit scalar path, as Karpov notes, is sometimes competitive with SSE2 for certain data types. A 64-bit integer operation that handles two 32-bit values or four 16-bit values packed into a single general-purpose register is not true SIMD, but the throughput is real. Compilers can sometimes exploit this on their own. Manual SSE2 code that the compiler will not restructure competes against a compiler that has full optimization freedom over a scalar loop, and sometimes loses.
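One reading of the packed 64-bit idea is the technique sometimes called SWAR (SIMD within a register). The sketch below adds four 16-bit lanes held in one 64-bit integer using ordinary scalar instructions. The helper name add_4x16 is hypothetical, and this illustrates the general technique rather than the code discussed in Karpov’s article.

```cpp
#include <cstdint>

// Hypothetical helper: adds four 16-bit lanes packed into one 64-bit word.
// Each lane wraps around independently; no carry crosses a lane boundary.
constexpr std::uint64_t add_4x16(std::uint64_t a, std::uint64_t b) {
    constexpr std::uint64_t high_bits = 0x8000800080008000ULL;  // top bit of each lane
    // Add the low 15 bits of every lane; a carry lands in that lane's own top bit.
    const std::uint64_t low_sum = (a & ~high_bits) + (b & ~high_bits);
    // Fold the original top bits back in with XOR, which adds them without a carry-out.
    return low_sum ^ ((a ^ b) & high_bits);
}

// Lanes (low to high): 0xFFFF+1 wraps to 0, 1+1 = 2, 2+1 = 3, 3+1 = 4.
static_assert(add_4x16(0x000300020001FFFFULL, 0x0001000100010001ULL)
                  == 0x0004000300020000ULL,
              "each lane wraps independently");
```

Three logical operations and one add process four values at once, which is why a scalar path like this can land close to a 128-bit SSE2 version for narrow integer types.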

Vibe Coding as a Systematic Risk in Systems Software

The term “vibe coding” is a useful shorthand for code written by feel: code where the author (human or model) has a sense of what correct or optimized code looks like and produces something in that style without verifying that the specific implementation achieves the goal. Vibe coding has always existed. Junior engineers cargo-cult patterns they have seen senior engineers use. Senior engineers sometimes apply patterns from one context to a situation where they do not fit.

What language models do is scale this pattern. The output looks confident. It uses the right vocabulary. It has comments explaining what the SSE2 registers contain. It has a scalar tail loop for the elements left over when the length is not a multiple of the vector width. It looks like the work of someone who knew what they were doing. The only way to know whether it actually is correct and efficient is to benchmark it, analyze the compiler’s output, and understand the access patterns involved.

For a Discord bot or a web service, most of this does not matter. Correctness matters; a 20% performance gap in a string operation that runs in microseconds is irrelevant. But systems software, codecs, network packet processing, databases, and anything else that runs on the critical path at scale are exactly where the gap between code that looks fast and code that is fast becomes expensive.

What Code Review Actually Requires Now

Karpov’s broader point is that the shift toward AI-assisted code generation does not reduce the need for engineering judgment; it redirects it. The task changes from writing the right code to recognizing whether the generated code is right.

For performance-critical C++ specifically, that means several things:

Benchmark before trusting. The presence of SIMD intrinsics is not evidence of performance. Measure the candidate implementation against a simple loop compiled with -O3. If the simple loop is competitive, it should be preferred unless there is a documented reason otherwise.

Read compiler output. Tools like Compiler Explorer (godbolt.org) let you compare assembly output between implementations immediately. A simple loop that the compiler has successfully auto-vectorized will show SSE2 or AVX instructions in the output. If the manual intrinsic version and the auto-vectorized version emit similar assembly, the manual version has no advantage and substantial cost in readability and maintainability.

Understand the bottleneck first. SIMD is relevant when a loop is compute-bound. If the operation is memory-bound (reading and writing arrays that do not fit in cache), wider SIMD does not help because the bottleneck is the memory bus, not arithmetic throughput. Profilers like Linux perf or Intel VTune can distinguish these cases.

Recognize the false complexity signal. Generated code that is longer, more complex, and harder to read than a simple alternative is not necessarily better. Length and complexity are not proxies for correctness or performance. A reviewer who does not know this will approve the sophisticated-looking code and reject the simple one.
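The first item above, benchmark before trusting, needs very little machinery. The sketch below is a minimal timing harness built on std::chrono, not a substitute for a proper benchmarking library. The names median_ns and scale_simple are hypothetical, and reps is assumed to be at least 1.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs fn() reps times (reps >= 1) and returns the median
// wall-clock time of one run in nanoseconds (the upper median when reps is even).
template <typename Fn>
long long median_ns(Fn&& fn, int reps) {
    std::vector<long long> samples;
    samples.reserve(static_cast<std::size_t>(reps));
    for (int r = 0; r < reps; ++r) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}

// Baseline candidate: the plain loop every other version has to beat.
void scale_simple(const float* src, float* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * 2.0f;
}

// Example use (inside some function):
//   std::vector<float> src(1 << 20, 1.5f), dst(src.size());
//   long long t = median_ns(
//       [&] { scale_simple(src.data(), dst.data(), src.size()); }, 21);
```

Time every candidate with the same compiler flags, the same input sizes, and the same harness, and read something from the output buffer afterwards so the optimizer cannot discard the work being measured.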

The markus project is a small example of a problem with large implications. As code generation becomes more prevalent, the ability to read generated code critically, to notice that the hand-rolled SSE2 block is slower than the loop it replaced, to ask why the code is the way it is rather than accepting its surface appearance, is the skill that separates useful AI-assisted development from a steady accumulation of confident, well-formatted, underperforming code.

Karpov’s analysis is grounded in the kind of low-level work his team at PVS-Studio has done for years: finding bugs and inefficiencies that automated tools generate and humans miss. The lesson has not changed. The code still needs to be read.