
Coding Agents Optimize for Tests, Not Architecture

Source: martinfowler

Erik Doernenburg is the maintainer of CCMenu, a macOS menu bar application that displays CI/CD pipeline status. It’s a mature, real-world Swift codebase with over a decade of architectural decisions built into it. Doernenburg ran a careful experiment: he used a coding agent to add a feature, then measured what happened to the code’s internal quality. His findings, published in January 2026 as part of Martin Fowler’s “Exploring Generative AI” series, point to a structural problem that benchmarks and pass rates don’t capture.

The feature worked, tests passed, and by every automated measure the contribution was a success; the internal quality was a different matter.

What Internal Quality Means

Martin Fowler has written carefully about the distinction between internal and external quality. External quality is what users observe: does the software do what it’s supposed to do, reliably and correctly? Internal quality is what developers observe: is the code easy to understand, modify, and extend? The two are related but not the same. A codebase can be externally correct while internally decaying.

The Design Stamina Hypothesis makes the stakes concrete: high internal quality keeps development velocity from degrading as a codebase grows. Low internal quality pays short-term dividends and exacts long-term costs. The cost accumulates through hundreds of small delays when every change touches more code than it should.

Coding agents optimize almost entirely for external quality. Their feedback loop consists of tests: tests pass or fail, and the agent adjusts accordingly. No analogous signal exists for coupling, cohesion, or duplication.

What Degraded in CCMenu2

Doernenburg’s experiment documented degradation across four dimensions.

Coupling increased. The agent did not respect the architectural boundaries already established in CCMenu2, which is built as a SwiftUI application with protocols used in place of concrete types across layer boundaries. The new code introduced direct dependencies on concrete types where protocols existed, effectively hardwiring module relationships that had been kept flexible.
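The difference between the two styles can be sketched in a few lines. The example below uses Python's `typing.Protocol` as a rough stand-in for a Swift protocol; every name in it (`PipelineSource`, `HttpPipelineSource`, `StatusViewModel`, `FakeSource`) is hypothetical and is not taken from CCMenu.

```python
from typing import Protocol


class PipelineSource(Protocol):
    """Abstraction the view layer is meant to depend on (hypothetical)."""

    def fetch_status(self, pipeline: str) -> str: ...


class HttpPipelineSource:
    """Concrete implementation; a real one would call a CI server."""

    def fetch_status(self, pipeline: str) -> str:
        return "success"  # placeholder instead of a network call


class StatusViewModel:
    """Respects the boundary: accepts anything that satisfies the protocol."""

    def __init__(self, source: PipelineSource) -> None:
        self.source = source

    def label(self, pipeline: str) -> str:
        return f"{pipeline}: {self.source.fetch_status(pipeline)}"


class CoupledStatusViewModel:
    """Same behavior, but hardwired to the concrete type."""

    def __init__(self) -> None:
        # The dependency cannot be swapped without editing this class.
        self.source = HttpPipelineSource()

    def label(self, pipeline: str) -> str:
        return f"{pipeline}: {self.source.fetch_status(pipeline)}"


class FakeSource:
    """Test double; only the protocol-based view model can accept it."""

    def fetch_status(self, pipeline: str) -> str:
        return "failure"


print(StatusViewModel(HttpPipelineSource()).label("main"))  # main: success
print(CoupledStatusViewModel().label("main"))               # main: success
print(StatusViewModel(FakeSource()).label("main"))          # main: failure
```

Both view models return the same label on the production path, so a functional test cannot tell them apart. Only the first one can be handed a different source.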

Cohesion decreased. The agent placed new logic in the nearest available container rather than the architecturally appropriate one. Methods landed in classes with the wrong responsibility. The behavior was correct; the placement was wrong.

Duplication appeared. Logic that already existed elsewhere in the codebase was reimplemented rather than found and reused. This is not a laziness issue; it’s a limitation of how agents navigate a codebase. They retrieve context based on the immediate task and may not surface the existing helper that covers the same case.
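Exact structural duplicates are among the few cases here that a machine can catch cheaply. The following is a minimal sketch, not a production clone detector: it works on Python source only because the standard library ships a parser, it flags only functions whose bodies have identical syntax trees, and a renamed variable is enough to defeat it. The sample functions are hypothetical.

```python
import ast
from collections import defaultdict


def duplicate_function_bodies(source: str) -> list[list[str]]:
    """Group functions whose bodies have identical syntax trees."""
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # ast.dump omits line numbers by default, so position and
            # formatting differences do not affect the fingerprint.
            fingerprint = "".join(ast.dump(stmt) for stmt in node.body)
            groups[fingerprint].append(node.name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


SAMPLE = '''
def is_green(build):
    return build["result"] == "success"

def build_passed(build):
    return build["result"] == "success"

def is_red(build):
    return build["result"] == "failure"
'''

print(duplicate_function_bodies(SAMPLE))  # [['build_passed', 'is_green']]
```

A check like this only reports that two helpers are identical. Deciding which one should survive, and where it belongs, is still a design judgment.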

Testability dropped. The agent generated enough tests to maintain line coverage around 80 percent, but branch coverage declined. Happy-path tests satisfied the coverage threshold while edge cases remained untested. The new code also diverged from the dependency injection patterns the existing codebase used to make unit testing tractable.
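The gap between line coverage and branch coverage is easy to reproduce. In the sketch below, `status_label` and both tests are hypothetical and only illustrate the mechanism; they are not code from the experiment.

```python
def status_label(build: dict) -> str:
    """Map a build record to a menu label (hypothetical helper)."""
    label = "unknown"
    if build.get("result") == "success":
        label = "passing"
    return label


def test_happy_path() -> None:
    # Runs every line of status_label, so a line-coverage tool
    # reports the function as fully covered.
    assert status_label({"result": "success"}) == "passing"


def test_missing_result() -> None:
    # The only test that takes the `if` in its false direction.
    assert status_label({}) == "unknown"


test_happy_path()
test_missing_result()
```

With only `test_happy_path` in the suite, every line executes but just one of the two outcomes of the `if` is exercised. Tools that support branch measurement, for example coverage.py run with `--branch`, report that difference; a line-coverage threshold alone does not.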

None of these failures surfaced as compiler errors. None showed up as failing tests. A default-configured linter would have passed all of them.

The Root Cause: Asymmetric Feedback

The core issue is structural rather than a prompt engineering problem. Coding agents work against a feedback signal that measures external quality. A test passes or fails; the agent receives a clear reward or penalty. No equivalent signal exists by default for architectural fit. Few teams write a test that fails when a new class introduces a cycle in the dependency graph, or when a function is placed in a class with the wrong responsibility.
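The first of those cases, a dependency cycle, is mechanically simple to check once the dependency graph has been extracted. A minimal sketch follows, assuming the graph is already available as a dictionary from module name to the modules it imports; the layer names are hypothetical, and extracting the graph from real source is the harder part left out here.

```python
from typing import Optional


def find_cycle(graph: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one dependency cycle as a list of modules, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    path: list[str] = []

    def visit(node: str) -> Optional[list[str]]:
        color[node] = GRAY
        path.append(node)
        for dep in graph.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path: that closes a cycle.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


LAYERS = {
    "Views": ["Domain"],
    "Domain": ["Network"],
    "Network": [],
}
assert find_cycle(LAYERS) is None

# A hypothetical change that makes the network layer import a view.
LAYERS["Network"] = ["Views"]
print(find_cycle(LAYERS))  # ['Views', 'Domain', 'Network', 'Views']
```

The second case, a function placed in a class with the wrong responsibility, has no comparably simple check, which is where the asymmetry becomes hard to close.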

The agent sees the code it was given as context. It does not see the architectural principles that shaped the existing code, the reasons certain abstractions exist at certain levels, or the long-range consequences of placing logic in one location versus another. This information exists in the maintainer’s head, accumulated over years of deliberate decisions, and is not encoded anywhere the agent can read.

Michael Polanyi’s observation from epistemology applies directly here: we can know more than we can tell. The knowledge that makes CCMenu2’s architecture coherent is largely tacit. Doernenburg can look at a method and know it’s in the wrong class because he designed the class and remembers what it was supposed to do. An agent has no access to that layer.

Supporting Evidence at Scale

The CCMenu experiment is a controlled case study, but the pattern appears at larger scale. A 2024 analysis by GitClear across 150 million lines of code changes found that copy-paste and duplicate code patterns nearly doubled over a period that coincided with rising AI tool adoption. Code churn, measured as lines committed and then substantially revised or removed within two weeks, also increased, while refactoring activity declined.

A March 2026 study from METR examined SWE-bench-passing patches from frontier models and found that a significant fraction would not pass real code review by project maintainers. The rejection reasons match Doernenburg’s findings: duplicated logic, wrong abstraction layer, exception suppression to satisfy a test rather than fix the underlying issue, and modifications that overfit to the specific test case rather than addressing the general behavior.

Martin Fowler also flagged a Carnegie Mellon University study in his December 2025 notes on AI’s effects on open-source software, which concluded that AI code contributions probably reduced overall code quality in those projects. The open-source context amplifies the problem because contributor volume increases with AI tooling while maintainer bandwidth does not scale proportionally.

A 2023 paper on arXiv analyzing ChatGPT-generated code found that LLM-generated solutions had roughly 30 to 40 percent higher cyclomatic complexity per function than human-written code with the same functionality. But cyclomatic complexity is not the most important signal in Doernenburg’s findings. Tools like SwiftLint, SonarQube, and CodeClimate measure what is present in the code. They cannot measure whether it belongs where it was placed. A method can have a cyclomatic complexity of one and still be in exactly the wrong place relative to the system’s design.
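To make that last point concrete, here is a simplified complexity counter, offered as a sketch rather than a faithful reimplementation of any of the tools named above. It approximates McCabe complexity as one plus the number of decision points, ignores constructs such as comprehensions, and works on Python source; both sample snippets are hypothetical.

```python
import ast

# Node types that add a decision point in this simplified count.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 plus the number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two short-circuit decisions.
            complexity += len(node.values) - 1
    return complexity


MISPLACED = '''
class Pipeline:  # a model type
    def refresh(self, session, url):
        self.status = session.get(url)  # network call inside the model
'''

BRANCHY = '''
def label(build):
    if build is None or not build.get("result"):
        return "unknown"
    return "passing" if build["result"] == "success" else "failing"
'''

print(cyclomatic_complexity(MISPLACED))  # 1
print(cyclomatic_complexity(BRANCHY))    # 4
```

The metric scores the misplaced method as the simplest possible code. Nothing in the number reflects that a model type is now making network calls.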

Structural Responses

The most direct response to the feedback loop problem is encoding architectural constraints as automated tests, so agents encounter structural requirements alongside functional ones in CI. The concept comes from Neal Ford, Rebecca Parsons, and Patrick Kua’s work on architectural fitness functions: automated tests for architectural properties that run alongside unit tests in every build.

For a Java project, ArchUnit provides a straightforward way to express these constraints:

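import com.tngtech.archunit.core.domain.JavaClasses;
import com.tngtech.archunit.core.importer.ClassFileImporter;
import com.tngtech.archunit.lang.ArchRule;
import org.junit.jupiter.api.Test;
import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses;

// Goes inside a JUnit 5 test class; fails the build if networking code references view code.
@Test
void networkingLayerShouldNotDependOnViews() {
    JavaClasses classes = new ClassFileImporter().importPackages("com.example.ccmenu");
    ArchRule rule = noClasses()
        .that().resideInAPackage("..networking..")
        .should().dependOnClassesThat().resideInAPackage("..views..");
    rule.check(classes);
}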

For Swift/SwiftUI, Swift Package Manager target separation enforces module boundaries at the compiler level. A target that does not declare a dependency on another target cannot import it, regardless of what any agent attempts to generate:

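// Package.swift (excerpt from the targets array)
// ViewLayer declares only DomainLayer, so `import NetworkLayer` inside it is rejected by the build.
.target(name: "NetworkLayer", dependencies: [], path: "Sources/Network"),
.target(name: "DomainLayer", dependencies: [], path: "Sources/Domain"),
.target(name: "ViewLayer", dependencies: ["DomainLayer"], path: "Sources/Views")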

Custom SwiftLint rules add a second layer for violations that compile successfully but break conventions:

custom_rules:
  no_urlsession_in_models:
    name: "No URLSession in Model layer"
    regex: 'URLSession'
    included: '.*/Model/.*\.swift'
    message: "Network calls belong in the Service layer, not in Model types."
    severity: error

For TypeScript projects, dependency-cruiser serves a similar purpose. What all of these share is that they translate tacit architectural knowledge into explicit, machine-checkable constraints. When those constraints fail in CI, the agent receives structural feedback in the same channel as functional feedback.

Context engineering also helps, to a point. Birgitta Böckeler’s Harness Engineering article from February 2026 argues that files like CLAUDE.md and .cursorrules function as executable specifications, encoding not just style preferences but architectural intent. An agent told that the networking layer must not be imported by view models has at least the opportunity to comply. The limit is that tacit knowledge not yet written down cannot be conveyed through any context file.

What Changes About Code Review

The right frame is to treat coding agent output like a PR from a skilled contractor who is new to the codebase. The code may be correct and the feature may work; the review question shifts from “does this do what it’s supposed to do” to “does this belong where it was put, and does it preserve the structural properties the codebase depends on.”

These are different cognitive tasks. Correctness review looks at what the code does; structural review looks at what the code is and where it sits. Both need to happen, and at most organizations only the first is explicitly required.

The CCMenu experiment makes the cost of skipping structural review concrete. Green tests and a passing build are necessary conditions for merging; they are not sufficient for maintaining a codebase that stays easy to work with as it grows. The agent wrote working code. Working code and well-structured code are not the same thing, and the difference compounds over time in ways that no CI dashboard will flag.
