
Encoding Architecture as Tests: What AI-Assisted Development Demands of Your Codebase

Source: martinfowler.com

Erik Doernenburg’s experiment with CCMenu, published in January 2026 on Martin Fowler’s site, documents something that practitioners who use coding agents regularly recognize but rarely measure: working code and well-structured code diverge. The agent ships the feature. Whether the feature belongs where the agent placed it, or whether the logic duplicates something that already exists three files away, is a different question, and there is no automatic feedback on it.

The problem has two components. First, the agent does not have access to the implicit architectural knowledge embedded in the codebase’s structure, because that knowledge was never encoded anywhere. Second, even if the agent guessed at that structure correctly, there is no signal in its feedback loop when it gets it wrong. Tests pass or fail; architecture violations produce no signal at all.

Architectural Rules Have Always Been Implicit

The uncomfortable truth the CCMenu experiment surfaces is that architectural rules have always lived in developers’ heads. A senior developer who has worked on a codebase for two years knows that feed parsers do not know about UI types. She knows that date utilities already exist in the utilities module. She knows the naming conventions because she established them. When she reviews a pull request, she applies that mental model. When a new team member submits a pull request that violates it, she explains the rule in the review.

This has always been a fragile arrangement. The rule exists in prose in a code review comment, and if the reviewer is absent or the team changes, the rule is re-violated. The AI coding agent just makes the fragility visible, because the agent is always the new team member who has no institutional knowledge.

The traditional response is documentation: architecture decision records, README files, team wikis. These are legible to humans and ignored by CI. The more durable response, which predates AI coding tools by nearly a decade, is the architectural fitness function.

What Fitness Functions Are

Neal Ford, Rebecca Parsons, and Patrick Kua introduced the term in Building Evolutionary Architectures (2017). A fitness function is any mechanism that assesses how well an architecture meets its specified objectives. The useful class is the executable fitness function: a test that runs in CI, fails the build when architectural properties degrade, and gives you the same continuous feedback on structure that unit tests give you on behavior.

A package dependency rule in ArchUnit (Java) is an executable fitness function:

import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses;

import com.tngtech.archunit.core.domain.JavaClasses;
import com.tngtech.archunit.core.importer.ClassFileImporter;
import com.tngtech.archunit.lang.ArchRule;
import org.junit.jupiter.api.Test;

@Test
void networkingLayerShouldNotDependOnViews() {
    // Load all classes under the application's root package
    JavaClasses classes = new ClassFileImporter()
        .importPackages("com.example.ccmenu");

    // Networking code must not reach into the view layer
    ArchRule rule = noClasses()
        .that().resideInAPackage("..networking..")
        .should().dependOnClassesThat()
        .resideInAPackage("..views..");

    rule.check(classes);
}

This is a normal JUnit test. It runs in CI alongside behavior tests. If a developer, or an agent, writes networking code that imports a view type, the build fails with a specific, actionable message. The rule that lived in the senior developer’s head is now enforced by the test suite.

The equivalent exists for every major ecosystem. In JavaScript and TypeScript, dependency-cruiser validates the project’s dependency graph against allowed and forbidden relationships declared in a configuration file, and can run as a CI check. In Swift, directly relevant to the CCMenu codebase, SwiftLint supports custom rules that can enforce naming patterns and module relationships:

# .swiftlint.yml
custom_rules:
  model_ui_separation:
    name: "Model-UI Separation"
    regex: "^import.*SwiftUI"
    included: ".*/Model/.*\\.swift"
    message: "Model types should not import SwiftUI directly"
    severity: error

This specific rule would catch the class of violation the CCMenu experiment identified: model types accumulating UI-framework dependencies because the agent took the shortest path to a working result.
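The dependency-cruiser configuration mentioned earlier is worth seeing concretely. A minimal sketch of a rule equivalent to the ArchUnit example; the `src/networking` and `src/views` paths are hypothetical, not taken from any particular project:

```javascript
// .dependency-cruiser.js — a sketch of one forbidden-dependency rule.
// Path regexes are illustrative; adjust them to your layout.
module.exports = {
  forbidden: [
    {
      name: "networking-not-to-views",
      severity: "error",
      comment: "Networking code must not depend on view components",
      from: { path: "^src/networking" },
      to: { path: "^src/views" },
    },
  ],
};
```

As with the ArchUnit test, a violation fails the CI check with the rule’s name and comment, so the feedback names the architectural rule rather than just the offending import.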

Closing the Loop in an Agentic Workflow

Fitness functions were designed for CI, with humans in the loop. The interesting question is what happens when you connect them to the agent’s inner loop rather than just the build.

Claude Code can run tests. An agent tasked with implementing a feature in a test suite that includes architecture tests will encounter failing architecture tests if its generated code violates layering rules, and it will see those failures the same way it sees failing unit tests: as structured feedback that requires a fix. This is not guaranteed to produce correct results, but it changes the feedback signal. The agent is no longer optimizing only for feature tests passing. It is also optimizing for architecture tests passing, which encodes some of the structural knowledge that would otherwise be invisible to it.
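Concretely, the wiring can be as simple as making the project’s canonical check command run both kinds of verification, so an agent that runs it sees structural violations as ordinary failures. A sketch of such a CI step for a Swift package with SwiftLint configured (it assumes both tools are installed):

```shell
# One command for behavior and structure: run unit tests, then lint
# rules (including custom architecture rules) with warnings as errors.
swift test && swiftlint lint --strict
```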

The research on context engineering for coding agents from the same Martin Fowler series describes the agent’s context as the primary lever for improving output quality. Fitness functions fit into this framing as constraint-layer context engineering: rather than telling the agent in natural language “do not let networking code import from views,” you give it an executable test that will fail if it does. The rule is enforced the same way behavior is enforced, through the test suite, rather than through the agent’s interpretation of a written instruction. The METR study from March 2026, which found that a substantial fraction of SWE-bench-passing patches would be rejected in real code review, is the empirical grounding for this concern: test passage does not predict structural acceptability.

What Different Languages Already Offer

Some ecosystems encode structural constraints more deeply than others. Go’s flat package visibility (nesting one package under another confers no special access) and its treatment of unexported identifiers as a hard boundary give you some module encapsulation by default: the compiler enforces what would otherwise be a convention. Rust’s module visibility rules, combined with its nominal type system, make certain architectural violations fail at compile time rather than at test time. The Elm Architecture separates model, update, and view into a structure the runtime itself expects, so it is difficult to violate.

Swift sits between these extremes. The language has access control (internal, private, fileprivate, public) that can enforce module boundaries when the modules are actually separate compilation units. For a single-target application like CCMenu, the compiler’s enforcement is lighter, which is why SwiftLint custom rules and explicit architectural tests matter more.

This suggests a reframing: the degree to which a language enforces its structural contracts determines how much explicit fitness function work an AI-assisted team needs to do. Working in a language with strong structural enforcement, you get some guarantees for free. Working in a single-module Swift application, you need to encode the rules yourself.
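Encoding the rules yourself need not be heavyweight. The SwiftLint rule above can also be expressed as a plain script that CI (or an agent) runs and that fails the build on violation. A minimal sketch, assuming a hypothetical layout where model sources live under a Sources/Model directory:

```shell
# Fail when any Swift file in the given directory imports SwiftUI.
# Returns 0 when the rule holds, 1 when it is violated.
check_model_imports() {
  dir="$1"
  # grep -r: recurse into the directory; -l: list matching files only
  violations=$(grep -rl '^import SwiftUI' "$dir" 2>/dev/null)
  if [ -n "$violations" ]; then
    echo "Model files must not import SwiftUI:" >&2
    echo "$violations" >&2
    return 1
  fi
  return 0
}
```

Invoked as `check_model_imports Sources/Model` in a CI step, the nonzero exit code fails the build exactly as a failing test would.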

The Honest Tradeoff

Writing architectural fitness functions takes time. Encoding rules that previously existed only as shared developer knowledge requires that you articulate them clearly enough to express as executable tests. That articulation is itself a design activity: you will discover that some rules are hard to state precisely because the team’s understanding of them was fuzzy.

This overhead exists independent of AI assistance. The fitness function work is worth doing on any codebase where architectural coherence matters. AI adoption makes it more urgent because it multiplies the rate at which architectural rules are tested against the codebase. A team that reviews three pull requests per day has three opportunities for architectural drift. A team where several developers each work with a coding agent might see fifteen pull requests per day. The rules that lived in three heads are now tested against fifteen feature implementations by agents that do not share the mental model.

Doernenburg’s experiment shows what someone who has maintained a codebase for years is positioned to notice. Architectural fitness functions are a partial answer for when that person is not in the review, or when review happens at a speed that does not permit careful structural assessment. The rules become legible to the CI system and, through the test suite, to the agent itself. That is not a substitute for expert judgment, but it is more reliable than informal convention. Encoding the architecture explicitly has always been worth doing; AI coding tools just make the case urgent, because the conventions are now exercised by actors with no access to the shared mental model they depend on.
