Erik Doernenburg’s assessment of how a coding agent affects internal code quality, published in January 2026 as part of Martin Fowler’s ongoing Exploring Gen AI series, is worth revisiting for a reason that wasn’t its central claim. The study is often cited for what the agent got wrong: coupling violations, cohesion problems, testability regressions, duplicated logic. Those findings are real and important. But there’s a more structural observation embedded in the results, one that points to a gap that isn’t about capability at all.
The agent added a working feature to CCMenu, Doernenburg’s long-maintained macOS menu bar application for CI/CD build status. The tests passed. The feature shipped. What the agent did not do, and what any experienced developer working in a familiar codebase would have done, is maintenance work: noticing the existing helper that the new code should have used, observing that the addition created a coupling that warranted cleanup, recognizing that the new struct was taking on responsibilities already distributed elsewhere. The agent wrote code. It did not maintain the codebase.
This distinction matters because it reframes what’s actually happening when teams use AI assistance to move faster.
The Velocity/Maintenance Asymmetry
A 2024 analysis by GitClear examined over 150 million lines of code changes across repositories where AI tool adoption was tracked. The headline finding was that code duplication patterns nearly doubled year-over-year. But a second finding deserves equal attention: refactoring activity, measured as commits that restructure code without adding net functionality, declined over the same period. Simultaneously, overall code volume and velocity increased.
This is not a coincidence. It’s the same phenomenon Doernenburg observed, just measured at scale. When AI assistance increases the speed of code generation, it doesn’t proportionally increase the speed of code maintenance. Maintenance requires a different set of behaviors: reading code to understand what’s already there, recognizing opportunities to reduce duplication, cleaning up coupling after an addition changes the structure of the system. These are behaviors that experienced developers perform continuously, often without explicitly deciding to, because they’ve built up enough context to notice when something is off.
An agent working within a single session doesn’t have that accumulated context. It has whatever files were read during the session, whatever code it generated, and the immediate task. Maintenance work requires the kind of ambient familiarity that comes from living in a codebase, and agents don’t live anywhere between sessions.
What Doernenburg’s Position Made Visible
The CCMenu experiment was credible partly because of who ran it. Doernenburg has maintained CCMenu2, the Swift rewrite targeting macOS 12+, through a complete architectural transition. He knows the dependency injection conventions, the module boundaries, where responsibility is supposed to live. When the agent placed network-aware logic in a component that should be concerned only with model state, he recognized it immediately as wrong.
A team using AI assistance on a shared codebase might not have that recognition distributed across all reviewers. The coupling violation looks like a local decision. The duplicated utility looks like a reasonable new function. The cohesion problem looks like an acceptable trade-off for getting the feature done. Each individual decision is defensible in isolation. Collectively they represent a codebase that is drifting away from its own architecture.
This is exactly the dynamic the GitClear churn metric captures: code added and then revised or removed within two weeks. Churn is a leading indicator of code that was locally correct but globally misfit. The revision happens when someone, eventually, notices that the new code doesn’t belong where it landed. The cost is paid later, by different people, in a different context.
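The metric itself is simple to sketch. The following is a minimal illustration of the two-week churn idea, assuming a simplified change log where each entry records when a line was added and, if ever, when it was next revised or removed; the field names are hypothetical and this is not GitClear's actual methodology or schema.

```javascript
// Two-week churn, sketched against a hypothetical per-line change log.
const CHURN_WINDOW_MS = 14 * 24 * 60 * 60 * 1000; // two weeks

function churnRate(lineChanges) {
  // A line "churned" if it was revised or removed within two weeks
  // of being added; revisedAt is null for lines still in place.
  const churned = lineChanges.filter(
    (c) => c.revisedAt !== null && c.revisedAt - c.addedAt <= CHURN_WINDOW_MS
  ).length;
  return churned / lineChanges.length;
}

// Example: three lines added at time 0; one revised on day 3 (churn),
// one on day 30 (revised, but outside the window), one never touched.
const day = 24 * 60 * 60 * 1000;
const changes = [
  { addedAt: 0, revisedAt: 3 * day },
  { addedAt: 0, revisedAt: 30 * day },
  { addedAt: 0, revisedAt: null },
];
console.log(churnRate(changes)); // 1 of 3 lines churned
```

The point of the window is exactly the one made above: churn distinguishes code that was wrong on arrival, and had to be reworked almost immediately, from ordinary evolution of stable code.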
The Benchmark Gap
The standard evaluation framework for coding agents, SWE-bench, measures whether generated patches make failing tests pass on real GitHub issues from projects like Django and Flask. Frontier agentic systems cleared over 60 percent of SWE-bench Verified tasks by late 2025. Those numbers are used as proxies for agent capability, and for the specific task they measure (producing a patch that satisfies existing tests) they're reasonable ones.
But a March 2026 analysis by METR found that a substantial fraction of SWE-bench-passing patches from frontier models would not pass real code review by project maintainers. The rejection reasons align closely with what Doernenburg found: duplicated logic, wrong abstraction layer, missing documentation, architectural mismatch. The patches satisfied tests. They did not fit the codebase.
The missing evaluation is maintenance-oriented: did the agent notice and use existing utilities? Did it clean up coupling its addition created? Did it refactor in a way that left the codebase in better shape than it found it? No current benchmark measures this, because it requires domain knowledge about what “better shape” means for a specific codebase.
Why Maintenance Is Structurally Hard for Agents
Refactoring requires recognizing redundancy across the codebase and acting on it even when it isn’t strictly necessary for the task. An agent optimizing to satisfy a prompt has no incentive to do this. The feature works without it. The tests pass without it. Nothing in the feedback loop signals that maintenance was needed.
Duplication is particularly persistent because it compounds. When an agent reimplements a utility that already exists, the codebase now has two implementations of the same logic. The next agent session sees both. The proliferation of near-duplicate implementations makes the discovery problem worse over time, not better. The architecture drifts faster as the baseline becomes noisier.
SwiftLint can catch some of this in Swift codebases: cyclomatic_complexity, function_body_length, and file-level rules provide structural signals. SonarQube offers coupling metrics and duplication detection at the project level. CodeClimate tracks maintainability scores over time, making regression visible. These tools exist. Most teams configure them for style enforcement, not structural health, and most default thresholds were calibrated for human-written code.
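Recalibrating for structural health rather than style is mostly a matter of configuration. As an illustration, a `.swiftlint.yml` that treats these rules as hard structural constraints might look like the following; the thresholds are illustrative, not recommendations:

```yaml
# Structural-health rules; thresholds here are illustrative, and should
# be tuned per codebase rather than copied.
cyclomatic_complexity:
  warning: 8
  error: 15
function_body_length:
  warning: 40
  error: 80
file_length:
  warning: 300
type_body_length:
  warning: 200
```

The design choice that matters is the `error` level: a warning an agent (or a hurried reviewer) can ignore, but an error fails the build, which is the only signal that reliably reaches an automated workflow.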
For architectural boundary enforcement, dependency-cruiser lets TypeScript and JavaScript projects encode valid import relationships as configuration, failing CI when an import crosses a layer it shouldn’t. ArchUnit provides a Java API for writing architectural fitness functions as unit tests. The concept, described in Building Evolutionary Architectures by Ford, Parsons, and Kua, is that architectural intent should be encoded as executable constraints, not documentation that becomes stale. AI-assisted development makes this more urgent, not less, because the agent will not consult documentation and will not notice when its output violates structural norms that aren’t enforced anywhere.
The Missing Phase
The practical implication is that AI-assisted development needs a maintenance phase that currently doesn’t exist in most workflows. After an agent adds a feature, before the changes are committed, someone with codebase context needs to ask the questions the agent didn’t: is there existing code this should have used? Did this addition create coupling that should be cleaned up? Is the new code in the right place by the standards of the rest of the system?
This isn’t a call for less AI assistance. It’s a description of what full-stack AI-assisted development actually requires. The generation phase is faster. The maintenance phase still requires judgment from someone who has accumulated enough context to exercise it. Martin Fowler’s Design Stamina Hypothesis holds that internal quality pays off over time through sustained development speed. AI assistance increases short-term velocity but does not, by itself, sustain internal quality. That requires deliberate maintenance work on top of it.
Doernenburg’s experiment ran on a codebase he knew deeply, which is exactly why he could see what went wrong. Most teams don’t have that depth distributed evenly. The maintenance gap is widest precisely where it’s hardest to see.
The GitClear data and the CCMenu experiment are both pointing at the same structural reality: coding agents are excellent at the generation half of software development. The maintenance half (the refactoring, the reuse, the cleanup, the structural stewardship) requires a different kind of engagement that agents don't currently provide and workflows haven't yet been redesigned to ensure.