
AI Agents Pass Your Tests and Fail Your Codebase

Source: martinfowler

Erik Doernenburg has been maintaining CCMenu, a macOS menu bar app that polls CI/CD servers and displays build status, since around 2007. He recently rewrote it as CCMenu2 in Swift and SwiftUI, and when he added a feature using a coding agent, he did something most developers skip: he measured what the agent actually did to the code. His write-up on martinfowler.com is one of the more careful pieces of technical writing about AI-assisted development in recent memory, because it looks at a dimension that the standard productivity literature ignores.

The measures he used are long-established structural metrics: cyclomatic complexity, which McCabe introduced in 1976; LCOM (lack of cohesion in methods) and coupling between objects, both from the CK suite that Chidamber and Kemerer published in 1994; and code duplication. SonarQube reports complexity and duplication automatically in CI. Lizard is a solid open-source CLI alternative that handles Swift along with many other languages. The metrics are not controversial. What matters is what they showed.
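To make the first of these metrics concrete, here is a minimal sketch of how cyclomatic complexity is counted: one for the function itself, plus one for every decision point. It is an approximation written against Python source using the standard `ast` module, not the rule set that SonarQube or Lizard apply, and the `status_label` function is a made-up example rather than anything from CCMenu2.

```python
from __future__ import annotations

import ast

# Node types that each add one independent path through a function.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)


def cyclomatic_complexity(func: ast.AST) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(func):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra short-circuit branches.
            complexity += len(node.values) - 1
    return complexity


def complexity_by_function(source: str) -> dict[str, int]:
    """Map each function name in `source` to its approximate complexity."""
    tree = ast.parse(source)
    return {
        node.name: cyclomatic_complexity(node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


# Hypothetical example: two ifs, one `and`, one loop, one nested if.
EXAMPLE = '''
def status_label(build):
    if build is None:
        return "unknown"
    if build.failed and build.retries > 0:
        return "retrying"
    for step in build.steps:
        if step.failed:
            return "failed"
    return "ok"
'''

print(complexity_by_function(EXAMPLE))  # {'status_label': 6}
```

Every branch an agent adds to an existing method instead of extracting a new one raises this number by one, which is why the metric tracks the "methods grew longer" pattern so directly.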

The agent’s code passed the tests and the feature worked correctly, but by every structural metric the code was worse. Methods grew longer instead of being extracted into smaller units. Classes took on responsibilities beyond their original scope. Coupling to other classes increased as the agent reached across module boundaries. Logic that already existed elsewhere got inlined rather than reused, producing duplication. None of this appeared in the test results.

The major productivity studies that get cited constantly, such as the GitHub Copilot controlled experiment that reported roughly 55% faster task completion and the McKinsey estimates of 20-45% developer productivity gains, measured how quickly developers finished tasks and whether the resulting code passed tests. None of them tracked cyclomatic complexity trends, cohesion, coupling, or duplication over time. They were measuring the output agents are optimized to produce, which is code that runs and passes verification. The structural properties that determine whether code is maintainable a year from now simply weren’t in the measurement.

Why Agents Specifically Degrade Structure

There are four mechanisms worth understanding, because they follow from how these systems work rather than from any particular implementation choice.

The first is context window limits. A coding agent working on a feature sees the files in the current session, the task description, and maybe some retrieved context from a vector search of the codebase. It does not carry a coherent model of the entire system. When it needs to compute something, it cannot reliably know whether that computation already exists in a utility class three directories away. The duplication follows from a visibility problem: the agent can’t see what it can’t see, so it writes the logic again.
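Tooling can supply the visibility the agent lacks. As a rough illustration of what a duplication check does, the sketch below slides a fixed-size window over whitespace-normalized lines and reports any block that appears in more than one place. It is deliberately naive and purely textual: real clone detectors work on tokens and also catch copies with renamed variables. The function names, the file names, and the four-line window are all assumptions made for the example.

```python
from __future__ import annotations

from collections import defaultdict


def normalize(line: str) -> str:
    """Collapse whitespace so indentation differences do not hide clones."""
    return " ".join(line.split())


def find_duplicate_blocks(files: dict[str, str], window: int = 4):
    """Return {block_text: [(file, first_line_number), ...]} for every run of
    `window` consecutive non-blank lines that occurs in more than one place."""
    seen = defaultdict(list)
    for name, text in files.items():
        # Keep original line numbers while dropping blank lines.
        numbered = [(i, normalize(raw))
                    for i, raw in enumerate(text.splitlines(), 1)]
        lines = [(i, line) for i, line in numbered if line]
        for start in range(len(lines) - window + 1):
            chunk = lines[start:start + window]
            key = "\n".join(line for _, line in chunk)
            seen[key].append((name, chunk[0][0]))
    return {block: places for block, places in seen.items() if len(places) > 1}


# Hypothetical files: the same four statements, indented differently.
FILES = {
    "a.py": "def f(x):\n    y = x * 2\n    z = y + 1\n    w = z - 3\n    return w\n",
    "b.py": ("class K:\n  def g(self, x):\n      y = x * 2\n"
             "      z = y + 1\n      w = z - 3\n      return w\n"),
}

for block, places in find_duplicate_blocks(FILES).items():
    print(places)  # [('a.py', 2), ('b.py', 3)]
```

Because the check scans the whole repository rather than the files in a session, it can flag the utility three directories away that the agent never saw.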

The second is training data distribution. These models learned from code on the internet, which skews heavily toward tutorial examples, Stack Overflow answers optimized for brevity in isolation, and production code of wildly varying quality. The structural patterns of a well-maintained codebase, where responsibilities are carefully separated and coupling is minimized, are present in the training data but underrepresented relative to expedient code that just works.

The third is additive bias. Agents add code; they rarely delete it. When asked to add a feature, the path of least resistance is to extend existing methods and classes rather than restructure them. Restructuring requires understanding intent, not just behavior. It requires knowing why a class was kept small, or why a particular separation of concerns was maintained. Without that context, the agent stacks new behavior onto existing structures.

The fourth is design amnesia. Even if architectural decisions are documented in the project, the agent doesn’t carry a continuous understanding of those decisions across sessions. It can follow instructions stated in the current context, but it cannot reason about the implicit constraints that accumulated over years of development. The developer who built CCMenu2 over months understood why certain classes stayed small. The agent approached each task fresh.

The Scale Evidence

GitClear’s 2024 study analyzed 153 million changed lines of code and found that code churn, meaning lines reverted or rewritten shortly after being committed, rose markedly between 2021 and 2023 and was on track to roughly double against its 2021 baseline, a period that maps closely to Copilot’s adoption curve. They also found increased copy-paste patterns and reduced refactoring activity. The data is correlational, but it is consistent with developers who use AI tools writing more duplicate code and doing less of the structural cleanup that keeps codebases healthy. A 2024 University of Alberta study found that GPT-4-generated code showed elevated rates of Long Method, Large Class, Feature Envy, and Shotgun Surgery smells compared to human-written code for equivalent tasks.

These are different methodologies pointing at the same underlying pattern. The GitClear data suggests that at aggregate scale, AI-assisted codebases are accumulating more code that gets written and then rewritten, which is what you’d expect if agents are adding rather than restructuring. The Alberta results suggest that the smell patterns Doernenburg measured in CCMenu2 are not peculiar to his project or his agent; they’re reproducible characteristics of AI-generated code across different models and contexts.

The churn finding carries particular weight because churn is expensive in ways that extend beyond the immediate cost of rewriting. Churned code is harder to reason about, harder to review, and more likely to introduce subtle behavioral differences between versions. If AI tools are doubling churn rates, some of the productivity gains measured at task-completion time are likely offset by downstream costs that the productivity studies didn’t track.

What To Do About It

Prompt-level constraints help at the margins. Files like .cursorrules or AGENTS.md can instruct agents to keep methods short, avoid duplicating logic, and prefer extracting over extending. Doernenburg notes these have some effect. But they’re instructions competing with the task description and the agent’s trained priors within a single context, and they don’t fully counteract the four mechanisms above.

The more robust intervention is automated structural quality gates in CI. SonarQube supports quality gates that fail when measures such as complexity or duplication cross a threshold, and a failed gate can be wired up to block the merge. SonarCloud is free for open-source projects. For projects where SonarQube is overkill, Lizard can be scripted into a CI step that fails the build when a function’s complexity exceeds a configured limit. The configuration is straightforward, so the barrier to entry is low.

The point of putting this in CI rather than relying on human review is that structural degradation accumulates incrementally and is easy to miss in individual reviews. A method that grows from 30 lines to 45 lines across three agent-assisted commits doesn’t raise an alarm in any single review, because each diff adds only a few lines. Automated tooling that compares against a fixed baseline catches the cumulative increase that humans reliably miss.
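One way to catch that kind of drift is a ratchet: compare each function’s metrics against a fixed baseline, such as the last release, rather than against the previous commit. The sketch below assumes the per-function numbers have already been collected by whatever tool the project uses. The `FunctionMetrics` type, the thresholds, and the `FeedReader.update` name are hypothetical and chosen only to mirror the 30-to-45-line example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionMetrics:
    nloc: int        # lines of code in the function
    complexity: int  # cyclomatic complexity


def ratchet_violations(baseline: dict[str, FunctionMetrics],
                       current: dict[str, FunctionMetrics],
                       max_complexity: int = 10,
                       max_growth: int = 10) -> list[str]:
    """Compare per-function metrics against a fixed baseline.

    Both arguments map a function's qualified name to its metrics.
    Returns human-readable messages; an empty list means the gate passes.
    """
    problems = []
    for name, now in sorted(current.items()):
        before = baseline.get(name)
        # Over the limit, and either new or worse than it was at baseline.
        if now.complexity > max_complexity and (
                before is None or now.complexity > before.complexity):
            problems.append(
                f"{name}: complexity {now.complexity} exceeds {max_complexity}")
        # Total growth since baseline, however many commits it took.
        if before is not None and now.nloc - before.nloc > max_growth:
            problems.append(
                f"{name}: grew from {before.nloc} to {now.nloc} lines since baseline")
    return problems


# Hypothetical numbers: three small commits, one large cumulative change.
BASELINE = {"FeedReader.update": FunctionMetrics(nloc=30, complexity=6)}
CURRENT = {"FeedReader.update": FunctionMetrics(nloc=45, complexity=9)}

for message in ratchet_violations(BASELINE, CURRENT):
    print(message)  # FeedReader.update: grew from 30 to 45 lines since baseline
```

A CI step would exit non-zero whenever the list is non-empty. The baseline is only refreshed deliberately, so growth cannot slip through five lines at a time.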

The shift required in human review is also worth naming. When reviewing AI-generated code, correctness is largely handled by the test suite. The reviewer’s attention is better spent on structure: whether the agent extracted logic or inlined it, whether a class has taken on a new responsibility it shouldn’t have, whether coupling to other components increased without justification. This requires looking at the shape of the code, not just its behavior, and it’s a different reading mode than most developers default to.

Doernenburg’s work on CCMenu2 is valuable because it’s concrete and measured rather than impressionistic. He ran the metrics before and after and published the numbers. That kind of before-and-after structural audit, ideally automated to run on every pull request, is something any team adopting AI-assisted development should establish early. The agents are improving at the things we measure. Codebases that age well under AI assistance will be the ones that also measure what the agents don’t optimize for.
