Optimization Without a Profiler: What AI Learns From Optimized Code

The code that lands in a public repository is the end of a process, not the process itself. When an experienced systems programmer commits manual SIMD intrinsics, they have already run the profiler, identified the function on the critical path, verified the data sizes in production, checked whether the compiler was auto-vectorizing, and measured the before and after. None of that is in the source file. The file contains the conclusion. The reasoning that justified the conclusion is in a Godbolt tab that was closed weeks ago.

This is the problem at the center of Andrey Karpov’s analysis of the markus project on isocpp.org. Claude Opus generated C++ with SSE2 intrinsics that was slower than a simple loop and longer to express. The model produced code that resembles the output of a profiling session without having the input to one. It learned the conclusion that SIMD intrinsics appear in optimized C++ without learning the preconditions under which that conclusion is true.

What Training Data Cannot Contain

A language model trained on source code sees an enormous volume of high-performance C++. A meaningful fraction of that code contains SSE2 intrinsics: _mm_loadu_si128, _mm_cmpeq_epi8, _mm_movemask_epi8, and their relatives. The model learns that these tokens correlate with performance-sensitive code, because the correlation is real. Engineers who write manual SIMD do so because they measured and found it helped.

What the training data does not contain is the measurement. The commit does not include: “SSE2 was 18% faster than the compiler’s auto-vectorized version at 64KB buffer sizes on the target Skylake machines, measured with perf stat across 10,000 iterations, controlling for branch prediction warmup.” It might include a comment. It will not include the benchmark harness, the profiler output, the specific -march= flag that caused the compiler to under-vectorize the simple version, or the size distribution of the actual production workload.
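To make the missing piece concrete, here is a minimal sketch of the kind of harness that usually stays out of the commit. The names Measurement and measure are hypothetical, and this version only records wall-clock time with std::chrono; it does not read hardware counters the way perf stat does, so treat it as an illustration of what a measurement has to record rather than a replacement for a real profiling session.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record of what a single measurement would need to capture.
struct Measurement {
    std::size_t bytes;   // buffer size the kernel ran on
    int reps;            // number of timed repetitions
    double median_ns;    // median wall-clock time per call, in nanoseconds
    long long checksum;  // last kernel result, kept so the call stays observable
};

// Times kernel(data, size) `reps` times after one untimed warm-up call.
// Assumes reps >= 1 and that the kernel returns an integer result.
template <class Kernel>
Measurement measure(Kernel&& kernel, const std::vector<std::uint8_t>& buf, int reps) {
    using Clock = std::chrono::steady_clock;
    volatile long long sink = kernel(buf.data(), buf.size());  // warm-up run

    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(reps));
    for (int r = 0; r < reps; ++r) {
        const auto t0 = Clock::now();
        sink = kernel(buf.data(), buf.size());
        const auto t1 = Clock::now();
        samples.push_back(std::chrono::duration<double, std::nano>(t1 - t0).count());
    }

    // The median is less sensitive to scheduler noise than the mean.
    const auto mid = samples.begin() + samples.size() / 2;
    std::nth_element(samples.begin(), mid, samples.end());
    return Measurement{buf.size(), reps, *mid, sink};
}
```

A result from this sketch only means something alongside the compiler version, the flags, the buffer size, and the machine it ran on, which is exactly the context the committed source file drops.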

The model learns: SIMD intrinsics appear in fast C++ code. It cannot learn: SIMD intrinsics are appropriate when these specific conditions hold. That information does not survive into the artifact the model trains on.

The Pattern-Completion Mechanism

When asked to produce optimized C++ code, the model is not reasoning about whether SSE2 is appropriate for the given loop. It is sampling from a probability distribution over tokens conditioned on the surrounding context. The context signals “performance-critical,” and the model has learned that performance-critical C++ contains manual intrinsics. The output is token-probable, not justified.

The code that results is not wrong in a shallow sense. It compiles. The intrinsics operate as documented. The loop structure matches patterns from the training data. The scalar tail handles the remainder after the SIMD bulk. Someone reviewing it quickly might approve it: it looks like the work of an engineer who knew what they were doing. The problem is that an engineer who actually knew what they were doing would have checked the preconditions first.

SSE2 is worth reaching for when a loop is compute-bound and the compiler is not auto-vectorizing it adequately. Whether either of those is true depends on the specific loop body, the surrounding code, the compiler flags, and the target hardware. The model has none of that context at generation time. It produces SSE2 code because SSE2 code is what the completion looks like, not because the situation calls for it.

The Compiler Has the Context the Model Lacks

The contrast with how a compiler handles the same decision is worth examining. GCC’s tree-loop-vectorize pass and Clang’s loop vectorizer inspect the loop body for loop-carried dependencies that would prevent vectorization. They check whether pointers may alias, because vectorizing loads and stores through possibly overlapping pointers is unsafe without additional analysis. They have a cost model for the target: cycles per instruction, throughput of SIMD loads across cache line boundaries, the width of the SIMD units available. With -march=native, they also know whether the machine has AVX2 or AVX-512, which doubles or quadruples the register width relative to SSE2.
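The dependency and aliasing checks are easier to see on small loops. The three functions below are illustrative sketches with hypothetical names. The __restrict qualifier is a compiler extension accepted by GCC, Clang, and MSVC, not standard C++, and whether any given loop is actually vectorized depends on the compiler version and flags, so the comments describe what a vectorizer has to consider rather than guaranteed output.

```cpp
#include <cstddef>

// Independent iterations, and the caller promises dst and src do not overlap.
// A vectorizer can process several elements per instruction here.
void scale_no_alias(float* __restrict dst, const float* __restrict src,
                    std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = src[i] * k;
}

// Same loop without the promise. dst may overlap src, so the compiler has to
// preserve element-by-element order, for example by adding a runtime overlap
// check or by staying scalar.
void scale_may_alias(float* dst, const float* src, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = src[i] * k;
}

// Loop-carried dependency: iteration i reads the value iteration i - 1 wrote,
// so the iterations cannot simply run side by side in SIMD lanes.
void prefix_sum(float* a, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) a[i] += a[i - 1];
}
```

Calling scale_may_alias(b + 1, b, 3, 1.0f) on {1, 2, 3, 4} must leave {1, 1, 1, 1}, because each store feeds the next load. A vectorized version that loaded all of src first would produce {1, 1, 2, 3} instead, which is why the compiler cannot skip the aliasing question.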

A simple loop compiled with -O3 -march=native gives the compiler full latitude. It can choose SSE2, AVX2, AVX-512, scalar unrolling, or any combination, and it selects based on the cost model for the actual target. Manual SSE2 intrinsics pin the code to 128-bit XMM registers on a machine that may have 256-bit YMM registers available, and they prevent the compiler from changing its choice as the surrounding code changes. An LLM generating SSE2 for a Zen 4 machine, which supports both AVX2 and AVX-512, has left at least half the SIMD width on the table.

// Manual SSE2: locked to 128-bit vectors, whatever the target supports
#include <cstddef>      // std::size_t
#include <cstdint>      // std::uint8_t
#include <emmintrin.h>  // SSE2 intrinsics
void count_bytes_sse2(const std::uint8_t* buf, std::size_t n, std::uint8_t target, int* out) {
    const __m128i needle = _mm_set1_epi8(static_cast<char>(target));
    int count = 0;
    std::size_t i = 0;
    // Bulk: compare 16 bytes at a time; each set mask bit marks one match
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf + i));
        __m128i eq = _mm_cmpeq_epi8(v, needle);
        count += __builtin_popcount(_mm_movemask_epi8(eq));  // GCC/Clang builtin
    }
    // Scalar tail: the remaining n % 16 bytes
    for (; i < n; ++i) count += buf[i] == target;
    *out = count;
}

// Simple version: compiler chooses width, emits AVX2 if available
void count_bytes_simple(const std::uint8_t* buf, std::size_t n, std::uint8_t target, int* out) {
    int count = 0;
    for (std::size_t i = 0; i < n; ++i) count += buf[i] == target;
    *out = count;
}

Compile the second version with -O3 -march=native on a machine with AVX2 and the compiler will typically emit 256-bit vpcmpeqb instructions, comparing 32 bytes per instruction instead of 16. The SSE2 version provides no path to that outcome regardless of the hardware. This is the mechanism behind Karpov’s finding that 64-bit scalar code outperformed the AI’s SSE2 version in many cases: scalar loops compiled with full optimization flags give the compiler maximum flexibility, and the compiler uses it.
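One quick way to see what the build flags actually allow is to ask the preprocessor. The sketch below relies on the __SSE2__, __AVX2__, and __AVX512F__ macros that GCC and Clang predefine when the corresponding instruction sets are enabled; the function name widest_simd_bits is hypothetical. It reports what the compiler was permitted to target for this translation unit, not what the machine running the binary supports.

```cpp
// Widest vector register width, in bits, that the current compile flags
// allow the compiler to use. Returns 0 if none of the checked sets is enabled.
constexpr int widest_simd_bits() {
#if defined(__AVX512F__)
    return 512;  // ZMM registers
#elif defined(__AVX2__)
    return 256;  // YMM registers with integer operations
#elif defined(__SSE2__)
    return 128;  // XMM registers
#else
    return 0;
#endif
}
```

If this returns 256 or 512 for a build that also contains hand-written SSE2 intrinsics, the intrinsic code is using a narrower register width than the compiler is free to use for the simple loop in the same file.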

Cargo Culting at Scale

This failure mode predates AI. Engineers have cargo-culted SIMD for decades. Someone reads a blog post about SSE2 making a string-scanning loop four times faster. They apply SSE2 to their own loop without checking whether their loop has the same structure, data sizes, or compiler behavior. Knuth’s observation about premature optimization is directly applicable: the assumption that a particular section of code is the bottleneck is frequently wrong, and adding complexity to the wrong section does not help.

What language models do is industrialize this. The model has processed every tutorial about SSE2 performance gains, every GitHub repository with hand-rolled intrinsics, every Stack Overflow thread recommending manual vectorization for hot paths. It synthesizes code that resembles all of them. The output is confident in its surface appearance: the intrinsics are correctly named, the loop structure is recognizable, the scalar tail is present. The only thing missing is the specific context that made those examples valid, which was never in the training data to begin with.

Human cargo-culters at least have a reason for their choice, even if the reason does not transfer. The model has a probability distribution over tokens. It cannot distinguish the case where SSE2 is appropriate from the case where it actively makes things worse, because the information that would allow that distinction was not preserved when the original code was committed.

What Reviewing This Code Requires

Karpov frames the broader implication well: reviewing generated code requires recovering the information that the code itself does not contain. For SIMD code specifically, that means asking what would have justified this choice, and verifying whether the justification actually holds for the current situation.

A concrete starting point is Compiler Explorer. Paste the simple loop and the intrinsic version side by side with the same compiler flags your project uses and compare the assembly. If the compiler auto-vectorized the simple version and produced comparable or wider SIMD code, the manual intrinsics are adding complexity without benefit. This takes about five minutes and answers the core question directly.

The next question is whether the loop is compute-bound. SIMD does little for memory-bound loops: if the operation streams through arrays much larger than the caches, the bottleneck is memory bandwidth, not arithmetic throughput, and a wider SIMD register does not change the bandwidth limit. Tools like perf stat, with hardware counters for cache misses and memory bandwidth, can answer this. An AI model generating intrinsics for a memory-bound loop will make the code more complex without affecting the bottleneck at all.
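Short of hardware counters, a rough first check is to time the same loop over working sets of different sizes and see whether the cost per byte climbs once the buffer outgrows the caches. The sketch below does that for the simple counting loop. The names SweepPoint and sweep_working_set are hypothetical, the timing is wall-clock only, and the numbers will vary from run to run, so it is a coarse indicator and not a substitute for perf stat.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

struct SweepPoint {
    std::size_t bytes;        // working-set size for this run
    double ns_per_byte;       // wall-clock time divided by bytes scanned
    long long total_matches;  // summed results, kept so the work is observable
};

// Same shape as the simple counting loop discussed in the article.
static int count_bytes(const std::uint8_t* buf, std::size_t n, std::uint8_t target) {
    int count = 0;
    for (std::size_t i = 0; i < n; ++i) count += buf[i] == target;
    return count;
}

// Scans a buffer of each requested size `passes` times.
// Assumes every size is greater than zero and passes >= 1.
std::vector<SweepPoint> sweep_working_set(const std::vector<std::size_t>& sizes, int passes) {
    using Clock = std::chrono::steady_clock;
    std::vector<SweepPoint> out;
    for (std::size_t bytes : sizes) {
        std::vector<std::uint8_t> buf(bytes);
        for (std::size_t i = 0; i < bytes; ++i) buf[i] = static_cast<std::uint8_t>(i % 251);

        long long total = 0;
        const auto t0 = Clock::now();
        for (int p = 0; p < passes; ++p) {
            // A different target per pass makes it harder to hoist the scan out of the loop.
            total += count_bytes(buf.data(), bytes, static_cast<std::uint8_t>(p % 251));
        }
        const auto t1 = Clock::now();

        const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        out.push_back({bytes, ns / (static_cast<double>(passes) * static_cast<double>(bytes)), total});
    }
    return out;
}
```

Calling it with sizes that range from a few kilobytes up to hundreds of megabytes gives one ns_per_byte figure per size. If the figure for the large buffers is clearly worse than for the small ones, memory is likely the limit at production sizes, and wider vectors are unlikely to move it.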

None of this is novel due diligence. These are the same questions you would ask about hand-written intrinsics. The difference is volume and confidence. A human who writes bad SIMD code usually had some reason; the reason did not generalize, but it existed. The model has no reason, and the output looks equally confident whether the intrinsics are appropriate or not.

The markus project is the benign version of this problem: the code was measurably slower, which means a benchmark catches it before it causes damage in production. The more difficult version is AI-generated SIMD that is slightly faster on the development machine, slightly slower on the production hardware, and never measured in production because the local benchmark looked adequate. The model will not flag this. It generated the code with the same confidence either way.

The shift Karpov identifies, toward reviewing generated code as the core engineering skill, is real. For SIMD code specifically, that review requires more than reading the intrinsics. It requires reconstructing the profiling session that would have justified them, and verifying that the session would have reached the same conclusion for this code, this data, and this hardware. If that session never happened, the intrinsics are speculation written in the style of certainty.