When AI Code Looks Fast but Isn't: The SIMD Trap in Vibe Coding

There is a pattern that keeps showing up as AI-generated code becomes more common in C++ projects: the model writes something that looks sophisticated, uses the right vocabulary, reaches for low-level intrinsics or clever bit manipulation, and compiles cleanly. Then you benchmark it, and a simple for loop beats it.

Andrey Karpov documented exactly this in a recent post on isocpp.org, examining a small project called markus that was generated using Claude Opus. The conclusion was blunt: the AI-generated code was the worst performer of all the options tested, slower than the simplest possible implementation, and longer. The extra lines bought nothing.

This is worth unpacking carefully, because the failure mode here is subtle and has implications for how we should be using AI tools in systems programming.

What SIMD Promises and When It Delivers

SSE2 (Streaming SIMD Extensions 2) has been available on x86 processors since the Pentium 4, which launched in late 2000, and it is the baseline SIMD instruction set that every x86-64 processor supports. The premise is straightforward: instead of operating on one value at a time, you pack multiple values into a 128-bit register and operate on all of them simultaneously. For floating-point or integer work that maps cleanly onto that model, the gains are real and substantial.

The compiler already knows this. Modern GCC, Clang, and MSVC will auto-vectorize loops that have predictable access patterns, no loop-carried dependencies, and enough iterations to amortize the setup cost. When you write:

for (size_t i = 0; i < n; ++i) {
    result[i] = a[i] + b[i];
}

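…a competent compiler targeting x86-64 will typically emit packed instructions such as PADDD or ADDPS, depending on the element type, once its vectorizer is enabled (Clang does this at -O2; with GCC, -O3 is the safer assumption). You do not need to call _mm_add_epi32 manually to get that.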
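To make the comparison concrete, here is a minimal sketch of the same element-wise addition written both ways, assuming 32-bit integer elements. The names add_scalar and add_sse2 are illustrative and do not come from the markus project. The intrinsic version is roughly what a vectorizer would produce on its own, which is the point: it is longer, it needs a hand-written tail, and nothing about it is automatically faster.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Plain loop: the compiler is free to vectorize this on its own.
void add_scalar(const std::int32_t* a, const std::int32_t* b,
                std::int32_t* result, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        result[i] = a[i] + b[i];
    }
}

// Hand-written SSE2 version of the same operation (illustrative only).
void add_sse2(const std::int32_t* a, const std::int32_t* b,
              std::int32_t* result, std::size_t n) {
    std::size_t i = 0;
    // Bulk: four 32-bit lanes per iteration, unaligned loads and stores.
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(result + i),
                         _mm_add_epi32(va, vb));
    }
    // Scalar tail for the remaining 0 to 3 elements.
    for (; i < n; ++i) {
        result[i] = a[i] + b[i];
    }
}
```

Whether the second version buys anything is a question for the compiler output and a benchmark, not for the source listing.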

Where manual intrinsics actually pay off is when the compiler cannot prove that vectorization is safe or beneficial: pointers that might alias, non-trivial control flow inside the loop, data structures that cross cache lines unpredictably, or algorithms where the SIMD formulation is fundamentally different from the scalar one (like horizontal reductions, gather/scatter operations, or vectorized population counts, as opposed to the scalar POPCNT instruction).
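A horizontal reduction is a small example of a formulation that does not look like its scalar counterpart. The sketch below sums the four 32-bit lanes of an SSE2 register with two shuffle-and-add steps. The helper name horizontal_sum is hypothetical, and this is one common way to write it, not the only one.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics; also provides _MM_SHUFFLE

// Sum of the four 32-bit integer lanes of v.
inline std::int32_t horizontal_sum(__m128i v) {
    // Swap the two 64-bit halves, then add: lanes become [v0+v2, v1+v3, v0+v2, v1+v3].
    __m128i swapped_halves = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i pair_sums = _mm_add_epi32(v, swapped_halves);
    // Swap adjacent 32-bit lanes, then add: every lane now holds the total.
    __m128i swapped_pairs = _mm_shuffle_epi32(pair_sums, _MM_SHUFFLE(2, 3, 0, 1));
    __m128i total = _mm_add_epi32(pair_sums, swapped_pairs);
    // Extract lane 0.
    return _mm_cvtsi128_si32(total);
}
```

For a plain sum over an array, compilers often vectorize the reduction on their own, so even this pattern deserves a look at the generated assembly before it is written by hand.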

AI models learn the surface pattern: “optimized C++ uses SSE2 intrinsics.” They do not learn the preconditions under which that trade-off is worth making.

The Cargo Cult Problem

Cargo cult optimization in systems code predates AI. Programmers have been manually unrolling loops, inserting __builtin_expect, and sprinkling volatile for decades without understanding the cost model. The difference is that a human developer who writes bad intrinsics usually had some reason to try it, even if that reason turned out to be wrong. They read a blog post, saw a profiler result, or were following a pattern from a codebase where the conditions were different.

AI-generated intrinsic code often has no such lineage. The model is sampling from a probability distribution over tokens, and the result looks like high-performance C++. It knows that _mm_loadu_si128, _mm_cmpeq_epi8, and _mm_movemask_epi8 appear together in fast string-scanning code. It will emit them together because that is what the training data showed. Whether the surrounding loop structure actually enables those instructions to run faster than a scalar equivalent is a different question, one that requires understanding memory access patterns, branch predictor behavior, pipeline depth, and the specific CPU’s throughput characteristics for each instruction.
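For reference, the idiom those three intrinsics usually form looks roughly like the sketch below: a search for the first occurrence of a byte. This is not the markus code; find_byte_sse2 is a hypothetical name, and __builtin_ctz assumes GCC or Clang. Seeing this shape in generated code says nothing about whether it beats a scalar loop or a library call such as memchr on the inputs that matter.

```cpp
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// Index of the first occurrence of needle in data[0..n), or n if absent.
std::size_t find_byte_sse2(const unsigned char* data, std::size_t n,
                           unsigned char needle) {
    const __m128i pattern = _mm_set1_epi8(static_cast<char>(needle));
    std::size_t i = 0;
    // Bulk: compare 16 bytes at a time.
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        __m128i equal = _mm_cmpeq_epi8(chunk, pattern);
        int mask = _mm_movemask_epi8(equal);  // one bit per matching byte
        if (mask != 0) {
            // Lowest set bit is the first match in this chunk (GCC/Clang builtin).
            return i + static_cast<std::size_t>(
                           __builtin_ctz(static_cast<unsigned>(mask)));
        }
    }
    // Scalar tail for the last 0 to 15 bytes.
    for (; i < n; ++i) {
        if (data[i] == needle) {
            return i;
        }
    }
    return n;
}
```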

The markus project is an example of what that looks like in practice. The generated code has the right ingredients: SIMD loads, packed comparisons, mask extraction. It probably looks impressive in a code review if you scan it quickly. It compiled. It produced correct output. But it was slower, because the actual bottleneck was elsewhere, or the loop was too short to amortize the SSE2 setup, or the memory layout prevented effective vectorization, or all three.

Why the 64-bit Comparison Matters

Karpov’s benchmark included a comparison that is worth dwelling on: 64-bit scalar code was faster than the AI’s SSE2 code except in cases where SSE2 was genuinely involved. This is a known phenomenon. On modern out-of-order processors, the integer execution units are highly pipelined, and a loop that processes 8 bytes at a time with a single 64-bit load and some bitwise operations can match or exceed narrow SIMD code that has higher instruction overhead.

The classic example is a byte-scanning loop. A naive byte-at-a-time scan of a string is obviously slow. But instead of reaching for SSE2, you can load 8 bytes at once into a uint64_t (with memcpy rather than a pointer cast, which would run into alignment and strict-aliasing rules) and use a technique like Mycroft’s null-byte detection:

#include <cstdint>

// Detect whether any byte in a 64-bit word is zero.
// The result is nonzero exactly when at least one byte of v is 0x00.
inline bool has_zero_byte(std::uint64_t v) {
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}

This runs entirely in the integer pipeline, has no SIMD register pressure, needs no special alignment as long as the word is loaded with memcpy, and on many workloads can outperform naive SSE2 code. The point is not that this technique is always better. It is that reaching for SIMD intrinsics is not the only way to write fast code, and it is frequently not the right first move.
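As a minimal sketch of how that test fits into a scan, the function below looks for the first zero byte in a buffer of known length, eight bytes at a time. The name find_zero_swar is hypothetical, and the bounded-buffer interface is an assumption made to keep the example free of out-of-bounds reads. The word is loaded with memcpy, which compilers typically turn into a single 64-bit load.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Nonzero-byte test from the text: true if any byte of v is 0x00.
inline bool has_zero_byte(std::uint64_t v) {
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}

// Index of the first zero byte in buf[0..n), or n if there is none.
std::size_t find_zero_swar(const unsigned char* buf, std::size_t n) {
    std::size_t i = 0;
    // Bulk: examine 8 bytes per iteration.
    for (; i + 8 <= n; i += 8) {
        std::uint64_t word;
        std::memcpy(&word, buf + i, sizeof word);  // safe for any alignment
        if (has_zero_byte(word)) {
            break;  // the exact position is found by the scalar loop below
        }
    }
    // Scalar loop: pins down the byte inside the flagged word, or handles the tail.
    for (; i < n; ++i) {
        if (buf[i] == 0) {
            return i;
        }
    }
    return n;
}
```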

AI models are unlikely to make this trade-off analysis. They will pattern-match to SIMD because SIMD is what “fast C++” looks like in the training data.

The Shift in What Developers Need to Know

Karpov frames the broader point well: the value of a skilled developer is shifting toward the ability to effectively review code. This is true, and it points to a particular kind of knowledge becoming more important.

Generating code is easy. Evaluating whether generated code is correct under all inputs, efficient on the target hardware, secure against adversarial input, and decomposed in a way that remains maintainable as requirements change is not easy, and it is not getting easier as the volume of generated code increases.

For C++ specifically, the review burden is unusually high because the language offers a very large surface area for subtle errors. A piece of generated code can be correct for the test cases the model imagined, incorrect for edge cases it did not consider, and have undefined behavior that only manifests at a particular optimization level or on a particular compiler version. None of that is visible from reading the code casually.

The markus case is relatively benign: the code was slow but correct. The more dangerous version of the same pattern produces code that passes all tests, ships to production, and fails in the field because of an alignment assumption that holds on the developer’s machine but not on the target hardware, or a signed integer overflow check that the optimizer removes because signed overflow is undefined behavior.
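The signed overflow case is easy to illustrate. A check written in terms of the overflowing expression itself may legally be folded away, because the compiler is allowed to assume signed overflow never happens. A minimal sketch of a check that avoids this, with the hypothetical name add_would_overflow, tests the operands before adding:

```cpp
#include <limits>

// Anti-pattern, shown only as a comment:
//     bool overflowed = (a + b < a);   // for b > 0
// Signed overflow is undefined behavior, so an optimizer may treat this as always false.

// True if a + b would not fit in an int. No overflowing arithmetic is performed.
inline bool add_would_overflow(int a, int b) {
    if (b > 0) {
        return a > std::numeric_limits<int>::max() - b;
    }
    return a < std::numeric_limits<int>::min() - b;
}
```

Code like the commented-out line can pass every test in a debug build and change behavior at a higher optimization level, which is exactly the kind of defect casual reading does not catch.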

What Good Review Looks Like Here

When reviewing AI-generated C++ that uses intrinsics or other low-level optimizations, a few concrete questions cut through most of the noise:

Does the compiler already do this? Compile the simple version with -O2 or -O3 and -march=native, then check the assembly. If the compiler already vectorizes the loop, the manual intrinsics are adding complexity without benefit.

Is this loop actually the bottleneck? Profiling is not optional. AI models do not have a profiler; they guess at what is slow based on how the code looks. That guess is frequently wrong.
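A full profiler is the right tool for finding the bottleneck, but even a crude timing helper is better than judging speed from the source. The sketch below is one minimal way to compare two candidate implementations; best_time_ns is a hypothetical name, and taking the minimum over several runs is an assumption about how to reduce noise, not a substitute for a proper benchmarking library.

```cpp
#include <chrono>
#include <limits>

// Runs work() the given number of times and returns the fastest run in nanoseconds.
// Returns infinity if repetitions is zero or negative.
template <typename F>
double best_time_ns(F&& work, int repetitions) {
    using clock = std::chrono::steady_clock;
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repetitions; ++r) {
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        const double ns =
            std::chrono::duration<double, std::nano>(stop - start).count();
        if (ns < best) {
            best = ns;
        }
    }
    return best;
}
```

When using something like this, the work being timed should consume its result (for example by accumulating into a variable that is printed afterwards), otherwise the optimizer may delete the very loop being measured.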

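What are the alignment assumptions? _mm_load_si128 requires 16-byte alignment and will typically fault on unaligned data, which usually shows up as a segfault. _mm_loadu_si128 handles unaligned data; it was noticeably slower on older CPUs, but on most current ones the penalty is small unless a load straddles a cache line. AI code often uses the former where the latter is needed, or mixes the two without tracking which pointers are actually aligned.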
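One way to make an alignment assumption reviewable is to state it in code. The sketch below wraps the aligned load behind a debug-time check; is_aligned_16 and load_aligned_checked are hypothetical helper names, and the assert disappears in builds that define NDEBUG.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// True if p is a multiple of 16, as _mm_load_si128 requires.
inline bool is_aligned_16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Aligned 16-byte load with the precondition checked in debug builds.
inline __m128i load_aligned_checked(const void* p) {
    assert(is_aligned_16(p) && "_mm_load_si128 needs a 16-byte aligned pointer");
    return _mm_load_si128(static_cast<const __m128i*>(p));
}
```

A buffer declared with alignas(16) satisfies the check at its start, but a pointer one byte into it does not, and that second case is where _mm_loadu_si128 belongs.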

What happens at the bounds? SIMD loops commonly process the bulk of the data in chunks and then need a scalar tail to handle the remaining elements. Generated code often gets the tail wrong, producing off-by-one errors or, worse, reading past the end of a buffer.
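Tail bugs tend to hide at lengths that are not a multiple of the chunk size, so a review can lean on a test that sweeps every length around those boundaries. The sketch below checks a byte-search function against a trivial reference; FindFn, find_byte_reference, and agrees_at_all_lengths are hypothetical names, and the function signature is an assumption. Each buffer is allocated at exactly the tested length so that a tool such as AddressSanitizer has a chance to flag reads past the end.

```cpp
#include <cstddef>
#include <vector>

// Signature assumed for the function under review.
using FindFn = std::size_t (*)(const unsigned char*, std::size_t, unsigned char);

// Obviously correct reference: index of the first needle, or n if absent.
std::size_t find_byte_reference(const unsigned char* data, std::size_t n,
                                unsigned char needle) {
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] == needle) {
            return i;
        }
    }
    return n;
}

// True if candidate matches the reference for every length up to max_len
// and every needle position, including "needle absent".
bool agrees_at_all_lengths(FindFn candidate, std::size_t max_len) {
    for (std::size_t len = 0; len <= max_len; ++len) {
        for (std::size_t pos = 0; pos <= len; ++pos) {  // pos == len: no needle
            std::vector<unsigned char> buf(len, 'a');   // exactly len bytes
            if (pos < len) {
                buf[pos] = 'x';
            }
            if (candidate(buf.data(), len, 'x') !=
                find_byte_reference(buf.data(), len, 'x')) {
                return false;
            }
        }
    }
    return true;
}
```

Sweeping up to a few multiples of the vector width (for example 48 or 64 bytes for a 16-byte chunk) covers the empty input, the pure-tail case, and every tail length after at least one full chunk.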

None of these questions are new. They are the same questions you would ask about hand-written intrinsics. The difference is that with AI-generated code, the volume is higher and the author is not available to explain their intent.

Where This Leaves Vibe Coding

The term “vibe coding” captures something real: a style of development where you describe what you want, accept what the model gives you, and move on without deeply engaging with the implementation. For throwaway scripts, prototypes, and glue code, this is often a reasonable trade-off. The cost of a slow script that processes a file once is trivial.

For performance-critical systems code in C++, it is not a reasonable trade-off. The gap between code that looks fast and code that is fast is large, the consequences of getting it wrong are serious, and the model’s output does not come with a label indicating which category it falls into.

The markus project is a useful reminder that surface-level correctness and syntactic sophistication are not the same as verified correctness and measured performance. AI can generate the former reliably. The latter still requires a developer who understands what the code is actually doing, at the instruction level, on the actual hardware.