The Internal Quality Problem That Coding Agents Cannot See

Back in January 2026, Erik Doernenburg published an experiment on Martin Fowler’s site that cuts to something the AI coding discourse mostly sidesteps. Erik is the maintainer of CCMenu, a Mac menu bar app that polls CI/CD servers and shows build status at a glance. He used a coding agent to add a new feature, then stepped back and asked: what did this do to the code?

Not “does it work” — that’s the easy question, answered by tests. The harder question is internal quality: is this code readable, well-structured, appropriately coupled, and consistent with the patterns already in the codebase? The finding, in broad strokes, is that agents produce working code that can quietly degrade the internal health of a project. This is worth taking seriously, and not just as a caveat about current tools.

What Internal Quality Actually Means

The term gets used loosely, so it helps to be concrete. Internal quality covers things like cyclomatic complexity (how many independent paths run through a function), coupling between modules (how many things a given class or struct knows about), cohesion (whether the responsibilities of a module stay tightly related), and the signal-to-noise ratio of tests. None of these are directly observable by running the program. They only show up when you read the code, trace a bug, or try to extend the system six months later.
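
To make the first of those metrics concrete, here is a minimal sketch of how cyclomatic complexity is counted. The function statusLabel is hypothetical and not taken from CCMenu; the count uses the common rule of one plus the number of decision points.

```swift
// Hypothetical function, not CCMenu code. Each branching construct adds one
// independent path; cyclomatic complexity = decision points + 1.
func statusLabel(succeeded: Bool?, isBuilding: Bool, hasError: Bool) -> String {
    if hasError {                              // decision point 1
        return "error"
    }
    if isBuilding {                            // decision point 2
        return "building"
    }
    guard let succeeded = succeeded else {     // decision point 3
        return "unknown"
    }
    return succeeded ? "success" : "failure"   // decision point 4 (ternary)
}
// 4 decision points + 1 = cyclomatic complexity of 5.
```

Running this function tells you nothing about that number. It only becomes visible when someone, or some tool, reads the source.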

External quality (does the software do what it’s supposed to, does it handle error cases, does it perform acceptably) is largely testable. You can automate a feedback loop around it. Assessing internal quality cannot be fully automated. That distinction is load-bearing in understanding what coding agents are good at and where they fall short.

CCMenu’s codebase is a useful subject for this kind of experiment precisely because it’s a real production application, not a toy. It’s written in Swift, uses SwiftUI for its modern interface, and fetches CI status from a range of providers: GitHub Actions, Jenkins, Travis CI, CircleCI, and others. Each provider has a different API shape — some return XML, some JSON — and the app maps these into a unified internal model of pipelines and builds. Adding support for a new provider means writing fetching logic, a parser, error handling, and tests, while fitting all of that into the patterns already established in the project.
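
The shape of that work can be sketched roughly as follows. Every name here (PipelineStatus, StatusParser, JSONProviderParser, and the payload fields) is an assumption made up for illustration; none of it is CCMenu’s actual code or any real provider’s API. An XML-based provider would conform to the same protocol with a different parsing body.

```swift
import Foundation

// All names below are hypothetical; they are not CCMenu's actual types.

/// Unified internal model that every provider is mapped into.
struct PipelineStatus: Equatable {
    enum Result: Equatable { case success, failure, unknown }
    let name: String
    let result: Result
}

/// The seam each provider implements, whatever its wire format.
protocol StatusParser {
    func parse(_ data: Data) throws -> [PipelineStatus]
}

/// Parser for a made-up JSON payload shape.
struct JSONProviderParser: StatusParser {
    private struct Run: Decodable {
        let workflow: String
        let conclusion: String?
    }

    func parse(_ data: Data) throws -> [PipelineStatus] {
        let runs = try JSONDecoder().decode([Run].self, from: data)
        return runs.map { run in
            let result: PipelineStatus.Result
            // A missing or null conclusion falls through to .unknown.
            switch run.conclusion ?? "" {
            case "success": result = .success
            case "failure": result = .failure
            default: result = .unknown
            }
            return PipelineStatus(name: run.workflow, result: result)
        }
    }
}

let payload = Data(#"[{"workflow":"build","conclusion":"success"},{"workflow":"deploy","conclusion":null}]"#.utf8)
let statuses = (try? JSONProviderParser().parse(payload)) ?? []
```

The point of the protocol is that the rest of the app only ever sees PipelineStatus values. Whether a new provider respects that seam is precisely the kind of thing a test run does not report.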

The Feedback Loop Asymmetry

When a coding agent writes code and runs the test suite, it gets a clear signal: pass or fail. The agent can iterate on that signal. This is exactly the inner loop that makes agents useful for tasks with well-defined acceptance criteria.

Internal quality has no equivalent signal. There is no binary output from running swift test that tells you whether the new GitLabFetcher struct is following the same error propagation conventions as GitHubFetcher, or whether it’s introduced a leaky abstraction that will make the next provider harder to add. The agent cannot observe these things, so it cannot optimize for them.
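
A small sketch of what such a divergence might look like. The type names are borrowed from the paragraph above, but the bodies, the FetchError type, and the status(forHTTPCode:) method are invented for illustration and are not CCMenu’s implementation.

```swift
// Illustrative only: the bodies are invented, not CCMenu's code.
enum FetchError: Error, Equatable {
    case httpStatus(Int)
}

struct GitHubFetcher {
    // Convention assumed in this sketch: failures are thrown as FetchError.
    func status(forHTTPCode code: Int) throws -> String {
        guard code == 200 else { throw FetchError.httpStatus(code) }
        return "success"
    }
}

struct GitLabFetcher {
    // Works, and a happy-path test passes, but failures are swallowed into nil,
    // so callers lose the reason and must now handle two error styles.
    func status(forHTTPCode code: Int) -> String? {
        guard code == 200 else { return nil }
        return "success"
    }
}

// On the happy path the two are indistinguishable to a pass/fail test run.
let gitHubOK = try? GitHubFetcher().status(forHTTPCode: 200)
let gitLabOK = GitLabFetcher().status(forHTTPCode: 200)
```

Both fetchers would turn a test suite green. Only a reader comparing the two signatures notices that the second one has quietly introduced a different convention.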

This creates a systematic skew. Agents are effective at the things they can measure and iterate on. They are unreliable at the things that require reading the existing codebase for patterns, conventions, and architectural intent. The code they produce is not randomly bad — it often looks reasonable in isolation. The problem is that it does not fit cleanly into the surrounding structure, or it makes local choices that conflict with established conventions, or the tests it writes cover execution rather than behavior.

This is not a new failure mode. It’s exactly the same problem that shows up with scaffolding tools, code generation frameworks, and copy-paste development. The difference is velocity. An agent can produce and integrate a substantial amount of code faster than a developer can review it for internal quality. The asymmetry between production speed and review speed is what makes this a structural concern rather than an edge case.

What Degrades and How

In Swift codebases specifically, the failure modes tend to cluster around a few patterns. Agents will often reach for concrete types where an existing codebase uses protocols, or vice versa. In a macOS app following MVVM or a coordinator pattern, the placement of logic in the wrong layer is common: business logic drifting into a view model when it belongs in a domain type, or network logic scattered across multiple layers when the project has a clear fetching abstraction.
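
As a rough illustration of the first pattern, assume a project whose convention is that view models depend on a fetching protocol. All names below are hypothetical and do not describe CCMenu’s architecture.

```swift
// Hypothetical names throughout; not CCMenu's architecture.
protocol StatusFetching {
    func fetchStatus() -> String
}

struct LiveFetcher: StatusFetching {
    func fetchStatus() -> String { "live" }  // stands in for a real network call
}

struct StubFetcher: StatusFetching {
    let canned: String
    func fetchStatus() -> String { canned }
}

// Follows the assumed convention: depends on the protocol, so it can be stubbed.
struct PipelineViewModel {
    let fetcher: any StatusFetching
    var title: String { "Status: \(fetcher.fetchStatus())" }
}

// Typical drift: behaves the same in production, but hard-wires the concrete
// type, bypassing the abstraction the rest of the project relies on.
struct DriftedViewModel {
    let fetcher = LiveFetcher()
    var title: String { "Status: \(fetcher.fetchStatus())" }
}
```

Nothing about DriftedViewModel is wrong in isolation. It only becomes a problem once there are several view models and one of them cannot be tested or reconfigured the way the others can.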

Test quality is a subtler issue. Agents write tests that pass, but tests can pass while telling you very little. A test that instantiates a type, calls a method, and asserts that the result is not nil has technically covered the code. It has not tested any behavior. The distinction between coverage and behavioral specification is something that requires intentional prompting and review to preserve.
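
The difference is easier to see side by side. This sketch uses plain Boolean checks rather than a test framework so that it stays self-contained; the parser, the BuildResult type, and the accepted strings are all made up for the example.

```swift
// Hypothetical parser used only for this illustration.
enum BuildResult: Equatable { case success, failure, unknown }

typealias Parser = (String) -> BuildResult?

func parseResult(_ raw: String) -> BuildResult? {
    switch raw.lowercased() {
    case "success", "passed": return .success
    case "failure", "failed": return .failure
    case "": return nil
    default: return .unknown
    }
}

// Coverage-style check: executes the code and asserts only "not nil".
func coverageOnlyCheck(_ parse: Parser) -> Bool {
    return parse("passed") != nil
}

// Behavioral check: pins down what the mapping is supposed to be.
func behavioralCheck(_ parse: Parser) -> Bool {
    return parse("passed") == .success
        && parse("FAILED") == .failure
        && parse("queued") == .unknown
        && parse("") == nil
}

// A deliberately wrong parser: maps every input to .unknown.
let brokenParser: Parser = { _ in BuildResult.unknown }
```

The broken parser satisfies coverageOnlyCheck and fails behavioralCheck. A suite made of the first kind of check stays green while the behavior it supposedly protects is gone.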

There’s also the naming and consistency problem. An existing codebase accumulates naming conventions over time: how types are named, how error types are structured, how async work is expressed. An agent working without explicit instruction about these conventions will produce code that works but reads as foreign. This is not a cosmetic issue. Inconsistent naming increases cognitive load and makes onboarding harder, and in a codebase like CCMenu that a single maintainer looks after over years, that load accumulates.

The Implications for Review

One response to this is to say that code review catches it. That’s true, and it matters, but it raises the question of what code review looks like when the agent has already produced a working implementation. The psychological dynamics shift. It is harder to question the internal structure of code that already passes tests than it is to shape code as it’s being written. The working implementation exerts pressure on the reviewer to accept it.

This is an argument for front-loading quality concerns into the prompting and scaffolding phase rather than relying on review to catch them after the fact. Providing the agent with explicit examples of how a similar feature was previously implemented, specifying the naming conventions in use, and reviewing the generated code against the existing patterns before running tests all shift the quality gate to a point where intervention is cheaper.

Some teams are experimenting with running static analysis tools or complexity metrics as part of the agent’s feedback loop, so that the agent gets signal on internal quality the same way it gets signal on test results. This approach works to a degree, but it’s bounded by what static analysis can actually detect. Cyclomatic complexity thresholds catch obvious problems; they do not catch architectural drift or misuse of existing abstractions.
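
A deliberately crude sketch of such a gate follows. It approximates cyclomatic complexity by counting branching keywords in the source text, which is an assumption made to keep the example short: real tools work on the syntax tree. The function names and the threshold are hypothetical.

```swift
/// Crude approximation: counts branching keywords in source text and adds 1.
/// Unlike a real analyzer, it also counts keywords inside comments and string
/// literals, and it ignores `&&`, `||`, and the ternary operator.
func approximateComplexity(of source: String) -> Int {
    let branchKeywords: Set<String> = ["if", "guard", "for", "while", "case", "catch"]
    let words = source.split(whereSeparator: { !$0.isLetter })
    let decisionPoints = words.filter { branchKeywords.contains(String($0)) }.count
    return decisionPoints + 1
}

/// Hypothetical gate an agent loop could run alongside the test suite.
func complexityGate(source: String, threshold: Int) -> (passed: Bool, score: Int) {
    let score = approximateComplexity(of: source)
    return (score <= threshold, score)
}

let snippet = """
func label(ok: Bool?, building: Bool) -> String {
    if building { return "building" }
    guard let ok = ok else { return "unknown" }
    if ok { return "success" }
    return "failure"
}
"""

let result = complexityGate(source: snippet, threshold: 10)
print("score: \(result.score), passed: \(result.passed)")
```

A gate like this gives the agent a number to iterate against, which is its whole appeal. It also shows the limit: the score says nothing about whether label belongs in this layer or duplicates an existing abstraction.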

The Broader Context

Erik’s experiment is useful partly because it’s grounded in a specific, real project rather than a benchmark. The general claim that coding agents reduce code quality is too coarse. The more precise claim is that coding agents optimize for external correctness and produce inconsistent results on internal quality, with the variance depending heavily on the specificity of the context provided and the vigilance of the developer reviewing the output.

For a project like CCMenu, where the maintainer has deep familiarity with the codebase, catching and correcting internal quality issues is feasible. For a larger team where no single person knows the full system, the same dynamics produce technical debt that compounds quietly until it becomes expensive.

The skill this puts pressure on is not prompt writing or tool selection. It’s the ability to read generated code critically and evaluate it against the standards of the surrounding system, which is exactly the skill that years of code review builds. The irony is that the people best positioned to maintain quality in an agent-assisted workflow are the ones with the most experience doing it without agents.