
What Happens to Your Codebase After an AI Agent Touches It

Source: martinfowler.com

Erik Doernenburg published a retrospective assessment in January 2026 of what happened to CCMenu, his macOS CI/CD status bar app, after he used AI coding agents to add features. The experiment is notable because it uses a real maintained codebase rather than toy problems, and because Doernenburg measured what happened using static analysis rather than just vibes.

The short version: the features worked, the tests passed, and the internal code quality measurably declined.

That pattern is worth understanding in detail, because it is not random. The specific ways AI-generated code degrades quality are predictable once you understand what LLMs optimize for.

The CCMenu Codebase

CCMenu has been around since roughly 2007, when Erik wrote it in Objective-C to display CruiseControl build status in the Mac menu bar. The cc.xml feed format it pioneered became a de facto standard adopted by Jenkins, GoCD, CircleCI, and others. Around 2021, he rewrote it as CCMenu2 in Swift using SwiftUI, targeting macOS 12+. The codebase uses async/await for networking, the Combine framework for reactive state, and SwiftUI view models following the ObservableObject pattern.

It is a mature, well-structured codebase with genuine architectural decisions embedded in it: protocols over concrete types, extracted view models, consistent naming conventions built up over years. That architectural intent is exactly what AI agents cannot read from the source alone.

The Correctness Trap

The term “correctness trap” describes something specific: AI coding agents optimize for observable outcomes. Tests pass. The feature works. The build is green. These are the signals the agent can measure and get feedback on. Internal quality metrics like cyclomatic complexity, cohesion, coupling, and duplication are invisible to the agent unless you explicitly feed them back into the loop.

This is not a flaw in the agent so much as a consequence of how it learns. LLMs trained on code see the output of human programming but not the reasoning behind structural decisions. They see that if branches and switch statements work; they don’t model the cognitive load of reading code with fifteen branches vs. four. They see that copying a pattern into a new method works; they don’t model what happens to a codebase when that pattern gets copied twelve times instead of extracted once.

The result, as Doernenburg documented, is code that passes every check you have set up while quietly accumulating internal debt.

What the Metrics Showed

Doernenburg used static analysis to compare the codebase before and after AI-assisted development. The findings align with what other researchers have found in less rigorous settings.

Cyclomatic complexity increased in AI-generated methods by roughly 15-20% compared to the pre-experiment baseline. Cyclomatic complexity counts the number of linearly independent paths through a function, essentially measuring how many branches the code has. AI agents tend toward exhaustive conditional handling: they account for every case they can think of by adding branches rather than restructuring. A human author might refactor toward polymorphism or a lookup table; the agent adds an else if.
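The two shapes are easy to see side by side. This is a hypothetical sketch, not code from CCMenu; the status names and functions are illustrative:

```swift
// The agent's typical shape: one branch per case, so cyclomatic
// complexity grows linearly with every status that gets added.
func colorBranchy(for status: String) -> String {
    if status == "success" { return "green" }
    else if status == "failure" { return "red" }
    else if status == "building" { return "yellow" }
    else if status == "paused" { return "gray" }
    else { return "unknown" }
}

// The restructured shape: a lookup table keeps complexity flat
// no matter how many statuses exist.
let statusColors = [
    "success": "green",
    "failure": "red",
    "building": "yellow",
    "paused": "gray",
]

func colorLookup(for status: String) -> String {
    statusColors[status] ?? "unknown"
}
```

Both functions behave identically, and both pass the same tests, which is exactly why the difference is invisible to an agent optimizing for observable outcomes.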

Code duplication roughly doubled in AI-assisted sections. This is the most predictable failure mode. LLMs are pattern-completion machines. When a similar pattern exists nearby in the context window, the model completes it rather than abstracting it. The pragmatic move from the agent’s perspective is to copy the working code and adapt it; the correct architectural move is to extract a shared function or protocol. The agent picks the pragmatic path every time.
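A minimal sketch of the difference, with illustrative names rather than CCMenu's actual API:

```swift
import Foundation

// The copy-and-adapt move: the same parsing logic completed into
// two call sites, the way a pattern-completion model produces it.
func parseBuildTime(_ raw: String) -> Date? {
    let formatter = ISO8601DateFormatter()
    return formatter.date(from: raw)
}

func parseDeployTime(_ raw: String) -> Date? {
    let formatter = ISO8601DateFormatter()  // copied, not shared
    return formatter.date(from: raw)
}

// The architectural move: extract once, reuse at both call sites.
enum Timestamps {
    static let iso8601 = ISO8601DateFormatter()
    static func parse(_ raw: String) -> Date? { iso8601.date(from: raw) }
}
```

Two copies are cheap; the cost shows up on the twelfth copy, when a format change has to be found and fixed in twelve places.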

Coupling increased. In the CCMenu2 SwiftUI view models, the AI introduced direct dependencies on concrete types where the existing codebase used protocols. This is subtle: the code works, the tests pass, but you have lost the abstraction boundary that would have made the component testable in isolation and replaceable later. Efferent coupling (the number of types a module depends on) crept upward across several modules.
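The difference is worth seeing concretely. The types below are a hypothetical sketch of the pattern, not CCMenu's actual view models:

```swift
import Combine

protocol StatusFetching {
    func fetchStatus() async -> String
}

final class HTTPStatusClient: StatusFetching {
    func fetchStatus() async -> String { "success" /* real networking elided */ }
}

// What the agent tends to write: a direct dependency on the
// concrete client, which cannot be faked in a unit test.
final class PipelineViewModelTight: ObservableObject {
    @Published var status = "unknown"
    private let client = HTTPStatusClient()
    func refresh() async { status = await client.fetchStatus() }
}

// What the codebase's existing convention calls for: depend on the
// protocol, inject the implementation, keep the boundary.
final class PipelineViewModel: ObservableObject {
    @Published var status = "unknown"
    private let client: StatusFetching
    init(client: StatusFetching) { self.client = client }
    func refresh() async { status = await client.fetchStatus() }
}
```

Both versions compile, work, and pass any behavioral test; only the second can be tested with a stub conforming to StatusFetching, or rewired later without touching the view model.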

Test branch coverage dropped even while line coverage stayed roughly constant around 80%. The AI writes tests, and it writes enough of them to maintain the line coverage metric. But the tests tend to cover the happy path. Conditional branches, error conditions, and edge cases require the test author to reason about what could go wrong. The AI reasons about what should work. The result is a test suite that looks healthy by one metric while hiding significant gaps in another.
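The gap between the two metrics fits in a few lines. A hypothetical sketch:

```swift
import XCTest

// One line of code, three branches.
func menuBarColor(isFailure: Bool, isBuilding: Bool) -> String {
    isFailure ? "red" : (isBuilding ? "yellow" : "green")
}

final class MenuBarColorTests: XCTestCase {
    // This single happy-path test gives 100% *line* coverage of
    // menuBarColor -- the one line executes -- but exercises only
    // one of its three branches. Branch coverage reports 33%.
    func testHealthyPipelineIsGreen() {
        XCTAssertEqual(menuBarColor(isFailure: false, isBuilding: false), "green")
    }
}
```

A test suite full of tests like this looks healthy on the line-coverage dashboard while the failure and in-progress branches go unverified.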

Why These Failure Modes Specifically

These four patterns (complexity, duplication, coupling, weak branch coverage) are not independent failures. They share a common cause: the AI agent solves the problem that is visible in its context window without modeling the full cost of the solution in the broader system.

When you ask an agent to add a feature, its context includes the immediate file, maybe some adjacent files you have provided, and the description of what the feature should do. It does not have a running model of the coupling graph, the duplication ratio, or the average cyclomatic complexity of the codebase. It certainly does not have the architectural intent that drove the protocol-over-concrete-type decisions Erik made when he rewrote CCMenu in Swift.

A 2023 arXiv study analyzing ChatGPT-generated code found that LLM-generated solutions averaged 30-40% higher cyclomatic complexity per function than human-written equivalents for the same functionality. This aligns with what Doernenburg found on a real codebase: the AI’s complexity overhead is consistent and structural, not incidental.

The Thoughtworks Technology Radar (2024) flagged AI-generated code as a technical debt risk for exactly these reasons, recommending that teams treat AI contributions like any other external code: read it, review it, and measure it before merging.

Static Analysis as a Feedback Signal

One concrete implication of Doernenburg’s experiment is that static analysis needs to be part of the AI coding workflow, not just a CI check at the end.

If you are using an agent and the only feedback loop is “do the tests pass,” you are optimizing for functional correctness and accepting whatever internal quality the agent delivers. You can change that by integrating tools like SonarQube or SwiftLint (for Swift codebases) into the process, either by running them after each agent commit and reviewing the diff, or, in some agent frameworks, by feeding the results back as a prompt to the agent.

For Swift specifically, a few tools are worth knowing:

# SwiftLint: style and complexity rules
swiftlint analyze --compiler-log-path xcodebuild.log

# Periphery: finds unused code (coupling symptom)
periphery scan

# Sourcery: template-driven code generation; can be scripted to emit custom metrics
sourcery --sources Sources --templates Templates/Metrics.stencil

For the cyclomatic complexity problem specifically, you can configure SwiftLint with a threshold:

# .swiftlint.yml
cyclomatic_complexity:
  warning: 10
  error: 15

This will flag methods that exceed the threshold, which is useful as a gate but more useful as a prompt: when a generated method triggers this rule, it is a signal to review whether the complexity is essential or whether the agent just took the easy path through a series of else if branches.

The Tacit Knowledge Problem

Beyond the measurable quality metrics, Doernenburg’s series surfaces a harder problem. Some of the quality degradation in CCMenu was not detectable by any static analysis tool. The AI added code that was structurally reasonable by generic standards but wrong for this codebase specifically, because it violated conventions and architectural decisions that existed only in the head of the original author.

This is what Doernenburg called the institutional knowledge problem in an earlier installment of his series: the AI cannot absorb the implicit reasoning behind existing structure. It sees that a concrete type works here; it does not see that the protocol boundary was placed deliberately to enable a future extension point that has not been built yet.

There is no tool that fixes this. The fix is human review by someone who holds that context, which means the value proposition of AI-assisted development shifts: it saves you time writing the code, but it does not save you the review time, and in some cases the review is harder because you are auditing unfamiliar code rather than reading your own.

What to Do With This

The CCMenu experiment does not argue against using AI coding agents. Erik Doernenburg uses them deliberately and with clear eyes about the tradeoffs. What the experiment provides is a useful calibration.

For a greenfield project or a throwaway script, internal quality drift from AI contributions is a manageable risk. For a long-lived, maintained codebase, it is a compounding cost. Code that passes its tests today, but carries substantially more cyclomatic complexity than it needs, uses concrete types where protocols belong, and holds six copies of a pattern that should have been extracted once, will be harder to change in six months.

The practical response is to treat internal quality metrics as first-class signals in your AI coding workflow:

  • Run static analysis on AI-generated diffs, not just on CI after merge.
  • Review coupling changes specifically, not just functional behavior.
  • Treat duplicated code from the agent as a refactoring opportunity, not just accepted output.
  • Write branch-coverage checks with the same rigor as line-coverage checks, because the AI will game line coverage without meaning to.

The full series on Martin Fowler’s site is worth reading as a whole. It is some of the more rigorous practitioner-level analysis of what AI coding agents actually do to real codebases, as opposed to the productivity benchmarks run on isolated task completion. The quality assessment article is the culminating piece, and the finding is straightforward: AI agents shift the quality risk from functional correctness to structural integrity, and that shift requires you to adapt your review process accordingly.
