
What Happens to Your Codebase's Internal Quality When an Agent Writes the Feature

Source: martinfowler

When Erik Doernenburg added a feature to CCMenu using a coding agent and then measured what happened to the codebase, the result was not a dramatic failure. The code worked. The tests passed. The feature shipped. What changed was subtler, and that subtlety is exactly what makes the experiment worth understanding.

CCMenu is Doernenburg’s own project, a Mac menu bar application that polls CI/CD servers using the cctray XML feed format and shows build status at a glance. He has maintained it for nearly two decades across a rewrite from Objective-C to Swift, so he knows every corner of the codebase. He published his findings in January 2026 as part of Martin Fowler’s Exploring Generative AI series, and their value lies in the methodological care. He picked a project he understood deeply, added a feature with an agent, and then assessed the result against concrete internal quality metrics rather than just checking whether it built.

That framing matters, because most accounts of AI coding productivity skip the second measurement entirely.

Internal Quality Is Not the Same as Correctness

Software quality has two faces that are easy to conflate. External quality is what users observe: does the feature do what it should, are there bugs, does it crash? Internal quality is what the people maintaining the code observe: is the design coherent, do modules have clear responsibilities, is coupling kept low?

The distinction matters because internal quality is what determines the cost of the next feature, not the current one. A codebase with high coupling and low cohesion can work perfectly from the outside while becoming progressively harder to change from the inside. Martin Fowler has written about this dynamic extensively under the label design stamina: good internal design pays for itself over time because it keeps the cost of change from compounding.

Coding agents optimise for the first face, not the second. They receive a task, reason about the immediate context, and produce code that satisfies the task. They do not naturally ask whether the solution is the right fit for the existing design, whether a new abstraction should be extracted, or whether a similar capability already exists somewhere else in the codebase and should be reused.

The Structural Pattern That Emerges

That tendency is not a criticism of any specific tool; it follows from how current large language models work. A model reasoning about how to implement a feature in a file sees that file and the surrounding context it has been given. It does not have a holistic view of the codebase’s conceptual architecture the way a developer who has spent months in the code would. The result tends toward a recognisable set of patterns.

Duplication increases. Rather than identifying an existing function that does something similar and refactoring to share it, the agent writes a new implementation for the specific case at hand. The code is correct, but two things now exist where one refined thing could.

Coupling creeps up. When a new piece of functionality needs to coordinate across modules, the agent wires them together directly. A developer with a broader view might introduce a mediating abstraction or use an existing one. The agent takes the path that solves the problem with the least structural invention.

Classes or modules that were previously focused acquire new responsibilities. A component that existed to handle one concern gains methods to handle another because the agent saw it as a convenient place to add the needed behaviour. Cohesion, the degree to which a module does one thing well, declines.
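The cohesion point can be made concrete with a small sketch. The Python below computes a simplified variant of LCOM4: methods are linked when they touch a shared field, and the score is the number of disconnected groups. The class, its method names, and its fields are hypothetical, and this version ignores method-to-method calls, which fuller definitions of LCOM4 also count.

```python
from collections import defaultdict


def lcom4(method_fields: dict[str, set[str]]) -> int:
    """Count groups of methods connected through shared fields.

    A score of 1 suggests a single responsibility; higher scores suggest
    the class bundles unrelated concerns. Simplified: call edges are ignored.
    """
    # Union-find over method names.
    parent = {method: method for method in method_fields}

    def find(method: str) -> str:
        while parent[method] != method:
            parent[method] = parent[parent[method]]  # path compression
            method = parent[method]
        return method

    # Group methods by the fields they use, then merge each group.
    users = defaultdict(list)
    for method, fields in method_fields.items():
        for field in fields:
            users[field].append(method)
    for methods in users.values():
        for other in methods[1:]:
            parent[find(other)] = find(methods[0])

    return len({find(method) for method in method_fields})


# Hypothetical class before the change: every method works on the feed state.
before = {
    "fetch": {"url", "session"},
    "parse": {"session", "projects"},
    "status": {"projects"},
}
# The same class after gaining notification methods that share no state with the rest.
after = {**before, "notify": {"sound"}, "mute": {"sound"}}

print(lcom4(before), lcom4(after))  # prints: 1 2
```

A jump from 1 to 2 does not prove the design is wrong, but it is the kind of shift that a passing test suite will never show.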

These are the same patterns that show up when code is written under time pressure, when a codebase is handed off to developers unfamiliar with its intended design, or when a piece of software is maintained by many hands without a shared conceptual model. AI coding agents are not unique in producing these patterns; they are just a new mechanism that can produce them at high speed.

Metrics That Surface What Working Code Hides

Doernenburg’s approach to measuring this is grounded in a tradition of using structural metrics to assess codebases objectively. The relevant measures here are mostly graph-theoretic: afferent and efferent coupling at the module level, the instability and abstractness measures from Robert Martin’s component coupling principles (instability is the share of a component’s couplings that point outward; abstractness is the share of its types that are abstract), and cohesion measures like LCOM (Lack of Cohesion of Methods).
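As a rough illustration of the coupling measures, the sketch below computes afferent coupling (Ca), efferent coupling (Ce), instability I = Ce / (Ca + Ce), abstractness A, and distance from the main sequence D = |A + I - 1| from a module dependency graph. The module names and type counts are invented for the example, and couplings are counted per module rather than per class, which is a simplification of Martin's definitions. It is not a description of how Doernenburg took his measurements.

```python
def martin_metrics(
    deps: dict[str, set[str]],
    types: dict[str, tuple[int, int]],
) -> dict[str, dict[str, float]]:
    """Compute Martin-style component metrics.

    deps maps each module to the modules it depends on.
    types maps each module to (abstract type count, total type count).
    """
    metrics = {}
    for module, targets in deps.items():
        # Efferent coupling: modules this one depends on.
        ce = len(targets - {module})
        # Afferent coupling: other modules that depend on this one.
        ca = sum(
            1 for other, other_targets in deps.items()
            if other != module and module in other_targets
        )
        instability = ce / (ca + ce) if (ca + ce) else 0.0
        abstract, total = types[module]
        abstractness = abstract / total if total else 0.0
        metrics[module] = {
            "afferent": ca,
            "efferent": ce,
            "instability": instability,
            "abstractness": abstractness,
            "distance": abs(abstractness + instability - 1),
        }
    return metrics


# Hypothetical three-module app; the numbers are illustrative only.
deps = {"Model": set(), "Network": {"Model"}, "UI": {"Model", "Network"}}
types = {"Model": (2, 4), "Network": (1, 5), "UI": (0, 6)}

for name, values in martin_metrics(deps, types).items():
    print(name, {key: round(value, 2) for key, value in values.items()})
```

In this made-up graph, UI is maximally unstable (I = 1.0) and fully concrete, which places it on the main sequence, while Model is stable but only half abstract.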

These metrics are not infallible. High coupling in one design might be fine; low cohesion in another might be acceptable. But they give you a compass. If coupling metrics shift upward after an agent adds a feature, that is a signal worth investigating, even if you ultimately decide the structure is fine.
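Using the metrics as a compass mostly means comparing snapshots. A minimal sketch, assuming you can export the module dependency graph before and after the agent's change (the graphs and module names here are hypothetical):

```python
def coupling_changes(
    before: dict[str, set[str]],
    after: dict[str, set[str]],
) -> tuple[list[tuple[str, str]], dict[str, int]]:
    """Return dependency edges that are new in `after`, plus the modules
    whose efferent coupling (outgoing dependency count) went up."""
    new_edges = sorted(
        (source, target)
        for source, targets in after.items()
        for target in targets - before.get(source, set())
    )
    increased = {}
    for module in sorted(set(before) | set(after)):
        delta = len(after.get(module, set())) - len(before.get(module, set()))
        if delta > 0:
            increased[module] = delta
    return new_edges, increased


# Hypothetical snapshots taken before and after an agent-written feature.
before = {"UI": {"Model"}, "Network": {"Model"}, "Model": set()}
after = {"UI": {"Model", "Network"}, "Network": {"Model"}, "Model": {"UI"}}

new_edges, increased = coupling_changes(before, after)
print(new_edges)   # [('Model', 'UI'), ('UI', 'Network')]
print(increased)   # {'Model': 1, 'UI': 1}
```

The output is a list of things to look at, not a verdict. Here the new Model to UI edge would deserve a closer look, since it makes a previously independent module depend on the presentation layer.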

For Swift projects specifically, tools like SwiftLint cover style and some complexity rules, but deeper structural analysis typically requires something like swiftdependencies or static analysis via the compiler’s module graph. Doernenburg, given his background building code visualisation tools at ThoughtWorks, is better positioned than most to apply these measures rigorously to a Swift codebase.

The Velocity Trade-off Nobody Quotes

The productivity case for AI coding agents is usually framed around lines of code per day, or the time saved on boilerplate, or the reduction in context-switching when you can describe a change in natural language and get an implementation back. These are real benefits and the measurements that support them are not fabricated.

What the productivity numbers do not capture is the maintenance cost that accumulates downstream. A feature that takes an agent ten minutes to implement but that leaves the codebase 15% harder to understand is not a neutral trade. The ten minutes shows up in your velocity dashboard. The 15% degradation in understandability does not.

This is the same dynamic that played out with offshore development in the 2000s and with deadline-driven coding in every era. Code that was written fast and shipped successfully still left a maintenance burden that eventually slowed everything down. The phenomenon is old. What is new is the speed at which it can now accumulate.

Fowler’s design stamina argument is that teams that invest in internal quality move faster over medium and long horizons because each new change does not require untangling the previous ones. AI tools can accelerate delivery in the short term while undermining that stamina, if the code they produce is not reviewed with internal quality in mind.

What This Should Change About How You Review Agent-Generated Code

The practical implication of Doernenburg’s experiment is not that you should use coding agents less. It is that code review for agent-produced code needs to explicitly check things that human-written code review often skips or handles implicitly through shared context.

A developer who has been in a codebase for six months and writes a new feature will generally not introduce gratuitous duplication or wire modules together in ways that violate the established architecture, because they have that architecture in their head. They also have colleagues who will flag it in review if they do. Those implicit checks do not apply to an agent.

Reviewing agent output should include questions like: does this duplicate something that already exists? Does this introduce a dependency between modules that did not previously need to know about each other? Did the agent find and use the right abstraction, or did it solve the problem locally in a way that is semantically correct but structurally redundant?
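The first of those questions can be partly mechanised. The sketch below uses Python's standard difflib to flag new functions whose bodies closely resemble existing ones. The function names, the bodies, and the 0.8 threshold are all assumptions made for illustration; textual similarity will miss duplicates that are written differently, so treat it as a prompt for the reviewer rather than a detector.

```python
import difflib


def normalise(body: str) -> str:
    """Collapse whitespace so formatting differences do not affect the score."""
    return " ".join(body.split())


def likely_duplicates(
    new_functions: dict[str, str],
    existing_functions: dict[str, str],
    threshold: float = 0.8,  # assumed cut-off; tune for your codebase
) -> list[tuple[str, str, float]]:
    """Pair each new function with existing ones whose text is similar."""
    hits = []
    for new_name, new_body in new_functions.items():
        for old_name, old_body in existing_functions.items():
            ratio = difflib.SequenceMatcher(
                None, normalise(new_body), normalise(old_body)
            ).ratio()
            if ratio >= threshold:
                hits.append((new_name, old_name, round(ratio, 2)))
    return hits


# Hypothetical function bodies, treated as opaque text.
existing = {
    "parse_feed": "let data = try fetch(url)\nreturn decode(data)",
    "format_date": "let f = DateFormatter()\nreturn f.string(from: date)",
}
added_by_agent = {
    "load_pipeline": "let data = try fetch(url)\n    return decode(data)",
}

print(likely_duplicates(added_by_agent, existing))
```

Here the agent's load_pipeline differs from parse_feed only in indentation, so it is flagged with a similarity of 1.0, while format_date falls well below the threshold.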

None of this requires exotic tooling. It requires adjusting the mental model you bring to review. Agent-generated code is not junior developer code, but it shares one important property with it: it solves the stated problem without necessarily understanding the design sensibility that the rest of the codebase reflects.

Doernenburg’s experiment is useful precisely because it takes a real project, applies real measurements, and reports honestly on what happened. The field has too many claims about AI productivity that are measured only at the moment of delivery. Measurements of the state of the codebase after delivery, and again after several rounds of agent-assisted changes, are the data we actually need.
