The Compiler Already Knows: How LLMs Reach for Optimizations the Toolchain Solved Years Ago

Andrey Karpov’s recent analysis, published on isocpp.org, of markus, a project generated with Claude Opus, landed on a result that will feel familiar to anyone who has maintained performance-critical C++ for more than a few years: the AI-generated code used SSE2 intrinsics, compiled cleanly, ran correctly, and was slower than a plain for loop. The simplest possible implementation won.

The immediate question is why a model with enough knowledge to write syntactically correct SIMD intrinsics would produce code that performs worse than the naive version. The answer is not really about AI being bad at code. It is about a temporal mismatch baked into the training data, one that shapes which techniques look like “optimized C++” to the model.

When Manual SIMD Was the Right Answer

SSE2 shipped with the Pentium 4 in late 2000 and became part of the x86-64 baseline when AMD64 launched in 2003. For the following decade, reaching for _mm_loadu_si128 and friends was often justified because compiler auto-vectorization was absent or unreliable. GCC 3.x had no auto-vectorizer at all, and the one that arrived with GCC 4.0 was limited in scope and required coaxing through the early 4.x releases. The safe assumption for performance-critical code was: if you want SIMD, write SIMD.

So engineers wrote it. That code went into open-source projects, performance blogs, Stack Overflow answers, and internal codebases. When those codebases were eventually published or discussed online, the patterns went with them. A substantial portion of the high-quality, expert-authored C++ performance code on the internet reflects the constraints of hardware and toolchains from 2005 to 2015.

LLMs are trained on that internet. When the model produces “optimized C++,” it is, in a real sense, producing code that looked like optimized C++ under toolchain assumptions that may no longer hold.

What the Auto-Vectorizer Does Now

GCC’s tree-loop-vectorizer has been under active development since the early 2000s, but the practical quality difference between GCC 4.8 (released 2013) and GCC 13 (released 2023) is significant. Clang’s loop vectorizer, introduced around 2012, has similarly matured. Given a clean loop and -O3 (or -O2 on Clang and on GCC 12 and later, which enable vectorization at that level), both compilers will:

Detect loops with no loop-carried dependencies
Determine safe vector widths for the data types involved
Insert alignment handling and scalar preamble/tail loops automatically
Choose between SSE2, SSE4.1, AVX2, or AVX-512 based on the target architecture
Apply loop unrolling calibrated to the target’s throughput and latency characteristics
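To make the first item in that list concrete, here is a minimal sketch of two loops, one without and one with a loop-carried dependency. The function names add_arrays and prefix_sum are illustrative and do not come from the markus code.

```cpp
#include <cstddef>

// No loop-carried dependency: each out[i] depends only on a[i] and b[i],
// so iterations can be executed several at a time. (The compiler may still
// guard the vector loop with a runtime check that out does not overlap a or b.)
void add_arrays(float* out, const float* a, const float* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// Loop-carried dependency: iteration i reads the value that iteration i - 1
// just wrote, so the iterations cannot simply run side by side.
void prefix_sum(float* data, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i)
        data[i] += data[i - 1];
}
```

The first loop is the kind auto-vectorizers typically handle well. The second needs a different algorithm before SIMD can help, whether a compiler or a person writes it.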

The instruction-set point in that list matters for the markus situation specifically. An LLM writing SSE2 intrinsics in 2026 is targeting 128-bit registers because that is what appears in the reference code it learned from. A compiler targeting the same machine with -march=native may emit AVX2 (256-bit) or AVX-512 (512-bit) instructions, processing twice or four times as many elements per instruction. The hand-rolled SSE2 code does not just fail to match the compiler’s output. It also locks in the 128-bit width: compilers generally translate each intrinsic into the operation it names rather than re-vectorizing it, so the optimizer works with fixed-width vector operations instead of a loop it is free to widen.

You can verify this in about thirty seconds on Compiler Explorer. Take a simple loop:

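#include <cstddef>  // std::size_t

// Multiply each element of src by factor and store the result in dst.
void scale(float* dst, const float* src, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * factor;
}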

Compile with gcc -O3 -march=skylake and read the output. You will see vmulps operating on YMM registers, which are 256-bit. The compiler handles the scalar tail, the loop unrolling, and the instruction selection. Now paste in equivalent code using _mm_mul_ps (a 128-bit SSE intrinsic) and compare. The manual version is narrower and the compiler will not widen it.
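For the comparison, a hand-written 128-bit version might look like the sketch below. It is an illustration written for this comparison, not the code from markus, and the name scale_sse is an assumption. It needs an x86 target.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE: __m128, _mm_set1_ps, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps

// Hand-written 128-bit counterpart of scale(): four floats per iteration.
void scale_sse(float* dst, const float* src, std::size_t n, float factor) {
    const __m128 f = _mm_set1_ps(factor);  // broadcast factor into all 4 lanes
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);          // unaligned load of 4 floats
        _mm_storeu_ps(dst + i, _mm_mul_ps(v, f));  // multiply, then unaligned store
    }
    for (; i < n; ++i)  // scalar tail for the remaining 0 to 3 elements
        dst[i] = src[i] * factor;
}
```

Note how much of this is bookkeeping the compiler generates for the plain loop anyway: the broadcast, the tail loop, and the choice of unaligned loads.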

The Pointer Aliasing Barrier

There is a legitimate reason compilers sometimes fail to auto-vectorize and the programmer needs to intervene: pointer aliasing. When two pointers could legally refer to overlapping memory regions, the order of reads and writes matters, so the compiler must either guard the vectorized loop with a runtime overlap check or fall back to scalar code. This is why C has the restrict keyword and why GCC and Clang offer __restrict__ as an extension in C++, which has no standard equivalent.
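A small sketch shows why overlap changes the answer. The helper names are hypothetical. The loop is the same scalar loop as scale(), called with dst pointing one element past src inside a single buffer.

```cpp
#include <cstddef>

// Same scalar loop as scale(), with no promise about aliasing.
void scale_may_alias(float* dst, const float* src, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * factor;
}

// Hypothetical demonstration: dst overlaps src, shifted by one element.
// Each store becomes the input of the next iteration, so the factor
// compounds: the buffer ends up as 1, 2, 4, 8, 16.
float overlapping_call_result() {
    float buf[5] = {1.0f, 1.0f, 1.0f, 1.0f, 1.0f};
    scale_may_alias(buf + 1, buf, 4, 2.0f);
    return buf[4];  // 16.0f, not the 2.0f a "load everything first" order would give
}
```

A vectorized loop that loaded four inputs before storing any outputs would compute 2, 2, 2, 2 here, which is why the compiler has to rule out overlap before it can use that order.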

The fix is not to reach for intrinsics. It is to tell the compiler what it needs to know:

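#include <cstddef>  // std::size_t

// __restrict__ promises the compiler that dst and src never overlap.
void scale(float* __restrict__ dst, const float* __restrict__ src,
           std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * factor;
}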

With __restrict__, the compiler is told the pointers do not alias, and auto-vectorization proceeds. The result is often better than a hand-written version because the compiler can apply its full knowledge of the target architecture and surrounding call context.

LLMs rarely generate __restrict__, plausibly because it appears less frequently in the training data than intrinsics-heavy code. The model’s learned notion of “making a loop faster” is to replace the loop with SIMD calls, not to give the compiler more information about the existing loop.

The Memory-Bound vs. Compute-Bound Distinction

The markus benchmarks showed the simplest implementation winning, which points to another dimension the model cannot reason about: whether the loop is compute-bound or memory-bound.

SIMD instruction throughput helps when the CPU is limited by arithmetic operations. If the bottleneck is loading and storing data, wider SIMD does not help, because the memory bus is the constraint and it is already saturated. A loop that reads a large array, does minimal arithmetic, and writes it back is almost always memory-bound on modern hardware. Replacing scalar arithmetic with SSE2 arithmetic does not move the bottleneck; it just adds complexity.
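A back-of-the-envelope version of this argument, in the style of a roofline estimate, can be sketched as follows. The constant and function names are illustrative, and the hardware figures in the note after the code are made-up round numbers, not measurements of any real machine.

```cpp
#include <algorithm>

// Arithmetic intensity of a loop like dst[i] = src[i] * factor:
// per element, one multiply, one 4-byte load and one 4-byte store.
// (This ignores cache effects such as write-allocate traffic.)
constexpr double kFlopsPerElement = 1.0;
constexpr double kBytesPerElement = 2.0 * sizeof(float);            // load + store
constexpr double kIntensity = kFlopsPerElement / kBytesPerElement;  // FLOP per byte

// Upper bound on throughput: whichever limit is hit first, the arithmetic
// peak or the memory bandwidth multiplied by the arithmetic intensity.
constexpr double attainable_gflops(double peak_gflops,
                                   double bandwidth_gb_per_s,
                                   double intensity) {
    return std::min(peak_gflops, bandwidth_gb_per_s * intensity);
}
```

With an assumed peak of 100 GFLOP/s and an assumed 40 GB/s of memory bandwidth, attainable_gflops(100.0, 40.0, kIntensity) is 5 GFLOP/s. The loop would be limited by memory long before the arithmetic units are, so widening the arithmetic changes nothing.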

For a small project like markus, the working set may fit in L1 or L2 cache, which shifts the bottleneck toward instruction overhead rather than memory bandwidth. In that regime, a simple loop with minimal per-iteration overhead can outperform a complex SSE2 loop that loads registers, executes parallel operations, and then has to shuffle results back out. The fixed cost of the SIMD setup exceeds the savings from parallelism when the loop body is short.

This is exactly the kind of analysis that requires a profiler and knowledge of the hardware. It is not derivable from how the code looks. A model generating code based on pattern recognition cannot perform it.
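A profiler is the right tool, but even a crude timing harness answers the first question, which is whether one version is actually faster. The sketch below is an assumption-laden minimum: ScaleFn and best_time_ns are hypothetical names, and a serious benchmark would also control for warm-up, CPU frequency scaling, and working-set size.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Signature shared by the candidate implementations being compared.
using ScaleFn = void (*)(float*, const float*, std::size_t, float);

// Runs fn over n elements `repeats` times and returns the fastest run in
// nanoseconds. Returns 0 if there is nothing to time.
double best_time_ns(ScaleFn fn, std::size_t n, int repeats) {
    if (n == 0 || repeats <= 0) return 0.0;
    std::vector<float> src(n, 1.5f);
    std::vector<float> dst(n, 0.0f);
    volatile float sink = 0.0f;  // reading a result keeps the work observable
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        fn(dst.data(), src.data(), n, 2.0f);
        const auto t1 = std::chrono::steady_clock::now();
        sink = dst[n / 2];
        const double ns =
            std::chrono::duration<double, std::nano>(t1 - t0).count();
        if (best < 0.0 || ns < best) best = ns;
    }
    return best;
}
```

Passing the plain loop and the intrinsics version through the same harness, at several values of n that fit in cache and several that do not, shows which regime the code is in.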

What the Training Data Bias Produces in Practice

The broader pattern here is worth naming: LLMs inherit the assumptions of the code they were trained on, including assumptions about what the compiler can and cannot do. Code written in 2010 for GCC 4.4 assumed the compiler would not vectorize automatically. Code written in 2026 for GCC 14 can assume it will, given the right conditions. The model does not know which assumption applies because it does not know what year it is generating code for, or what compiler version is in use.

This produces a systematic bias toward over-engineering. The model reaches for explicit intrinsics because that is what cautious, expert engineers did when they could not trust the auto-vectorizer. That caution was appropriate then. Applied now, without the context that motivated it, it adds complexity and can subtract performance.

Karpov’s framing in the article is precise: the value of a skilled developer is shifting toward the ability to effectively review code. But effective review of AI-generated performance code requires something specific: knowing the history of why certain patterns exist, so you can distinguish patterns that are still appropriate from patterns that solved a problem the toolchain no longer has.

A reviewer who sees SSE2 intrinsics in generated code and approves them because they look sophisticated is being fooled by historical residue. The reviewer doing the right work compiles the simple version with -O3 -march=native, checks the assembly on Compiler Explorer, and asks whether the intrinsics-heavy version offers anything the compiler does not produce automatically.

Writing Code That Lets Modern Compilers Help

The corrective to cargo-culted intrinsics is not to avoid optimization entirely. It is to write code that gives the compiler maximum room to do its job:

Use __restrict__ when pointers genuinely do not alias. This is the single most valuable hint for enabling auto-vectorization in loops with pointer arguments.

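Keep loop bodies simple and predictable. Branches inside loops can defeat vectorization, especially when they guard stores or calls, although compilers do convert simple conditionals into selects. If you need conditional processing, look at branchless formulations or separate the cases into distinct loops.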

Use standard library algorithms where they exist. Algorithms like std::transform and std::inner_product express the intent as a single call over a range, which keeps the underlying loop in a simple, regular form that the vectorizer can usually handle.

Measure, then optimize. Profiling identifies actual bottlenecks. Manual intrinsics are justified when a profiler points to a specific loop, the compiler’s output is suboptimal for a known reason, and the optimization is documented with a benchmark that quantifies the gain.
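As a sketch of the second and third suggestions, here is one operation (replace negative values with zero) written three ways. The function names are illustrative. Whether a given compiler vectorizes each form is something to check in the generated assembly, not something this example guarantees.

```cpp
#include <algorithm>
#include <cstddef>

// Branchy form: the store happens only on some iterations.
void clamp_negative_branchy(float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        if (data[i] < 0.0f)
            data[i] = 0.0f;
}

// Select form: every iteration does the same load, select and store,
// which maps directly onto a vector compare-and-blend.
void clamp_negative_select(float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] = data[i] < 0.0f ? 0.0f : data[i];
}

// The same select expressed through a standard algorithm, in place.
void clamp_negative_transform(float* data, std::size_t n) {
    std::transform(data, data + n, data,
                   [](float x) { return x < 0.0f ? 0.0f : x; });
}
```

All three produce the same values. The difference is how much the compiler has to prove before it can process several elements at once.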

The markus project is a compact illustration of what happens when that process runs in reverse: optimization is applied speculatively, based on appearance rather than measurement, using a technique that predates the toolchain improvements that made it unnecessary. The lesson is not that AI cannot help with performance code. It is that AI-generated performance code, like any other kind, requires a developer who understands what is actually happening, at the instruction level, on the hardware where it runs.