Why the CCMenu Experiment Worked: Tacit Knowledge and the Limits of AI Code Review
Source: martinfowler
The question Erik Doernenburg set out to answer in his CCMenu quality assessment — does using a coding agent degrade the internal quality of your codebase — is one that many teams are trying to answer right now. What makes his experiment unusually credible is something that most AI coding benchmarks lack entirely: the evaluator built the thing being evaluated.
That is not a small distinction. It is the entire methodological basis for why the findings mean anything.
The Benchmark Problem
Most assessments of AI coding quality work by constructing a problem with a known solution, running the AI against it, and measuring the output against that solution. This works well for external quality: does the function return the right values, do the tests pass, does the output match the spec. It works poorly for internal quality, because internal quality requires judgment that is contextual, not reducible to a checklist.
When a benchmark says an AI-generated solution scores 4 out of 5 on “maintainability,” that score comes from some combination of automated metrics — cyclomatic complexity, method length, module coupling — applied by someone who did not design the codebase in question. The assessment is general. It has no relationship to the specific architectural intentions behind the code, the conventions that have accumulated over the project’s lifetime, or the ways in which a particular structural decision would make future changes harder.
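The metrics named above are easy to compute mechanically, which is exactly why benchmarks lean on them. As an illustration of how little context such numbers carry, here is a rough sketch (my own, not from Doernenburg's experiment) that approximates cyclomatic complexity and method length using Python's standard `ast` module; the decision-node set is a simplified assumption, not the full McCabe definition.

```python
import ast
import textwrap

def cyclomatic_complexity(func: ast.FunctionDef) -> int:
    """Rough McCabe-style count: 1 + number of decision points."""
    decisions = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.BoolOp, ast.IfExp)
    return 1 + sum(isinstance(node, decisions)
                   for node in ast.walk(func))

def summarize(source: str) -> dict[str, tuple[int, int]]:
    """Map each function name to (complexity, length in lines)."""
    tree = ast.parse(textwrap.dedent(source))
    out = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            length = node.end_lineno - node.lineno + 1
            out[node.name] = (cyclomatic_complexity(node), length)
    return out

sample = """
def classify(n):
    if n < 0:
        return "negative"
    for d in (2, 3, 5):
        if n % d == 0:
            return f"divisible by {d}"
    return "other"
"""
print(summarize(sample))  # → {'classify': (4, 7)}
```

The score is the same whether `classify` belongs in this module or violates every layering rule in the project — which is the point: the metric is general, the judgment is not.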
Doernenburg’s experiment does not have this problem. He has maintained CCMenu for well over a decade. The macOS menu bar application, which displays CI/CD pipeline build status from Jenkins, GitHub Actions, CircleCI, and similar systems, has been actively developed through multiple macOS versions, through an Objective-C era and a Swift rewrite, through a transition to SwiftUI. Doernenburg made the architectural choices. He knows why they were made. When he looks at agent-generated code and says it introduced a duplication that should not be there, or placed logic in the wrong layer, that assessment carries a weight that no automated metric can replicate.
What Tacit Knowledge Means in a Codebase
The philosopher Michael Polanyi described tacit knowledge as the dimension of understanding that cannot be fully articulated. His formulation — “we can know more than we can tell” — applies with particular force to the architectural knowledge embedded in a long-running codebase.
A project that has been actively maintained for ten or more years accumulates decisions that are not written down anywhere. Why is a certain abstraction defined at this level rather than one level up? Why does this component not depend on that service directly, even though a direct dependency would simplify two specific code paths? Why is a particular feature implemented with a protocol rather than a concrete class, even though the protocol currently has only one conforming type? The answers to these questions exist in the maintainer’s head. They might surface in a commit message, or in a code comment, or in a pull request discussion, but more often they do not surface anywhere, because they are the background assumptions against which all other decisions are made.
An AI coding agent has no access to this layer of knowledge. It sees the code that exists, the tests that exist, and whatever context appears in the prompt. It makes inferences from patterns in the training data about what “good” code generally looks like. It produces code that is structurally consistent with common patterns in similar codebases, which is not the same as being structurally consistent with this codebase’s specific intent.
The agent is not failing to access a document. There is no document. The knowledge is tacit.
The Evaluation Advantage
This creates an asymmetry in how AI coding contributions can be reviewed. When a contributor who is not deeply familiar with a codebase reviews agent-generated code, they can verify that the code passes tests, that it handles edge cases, that the naming is clear. They cannot easily verify that the code fits the existing design, not because they are incapable, but because understanding whether code fits the design requires the same tacit architectural knowledge that the agent lacks.
Doernenburg is one of the few people who can reliably answer that question for CCMenu. He does not need to reverse-engineer the intent behind each design decision; he made the decisions. When agent-generated code violates a convention, he recognizes it with the immediate clarity of someone who owns the codebase, not someone who has to infer ownership from the evidence.
This is not a limitation specific to Doernenburg. It is a property of maintainership. For any codebase that has been in active development long enough to accumulate architectural character, the original maintainer or core team has access to a layer of structural judgment that is not available to anyone coming to the code fresh, including an AI agent.
What This Means for Assessing Your Own Tools
The practical implication is uncomfortable: if you are not the expert on your own codebase, your assessment of what AI agents are doing to its internal quality is unreliable. You can measure specific metrics (complexity scores, duplication ratios, inter-module dependency counts), and those measures will tell you something. But the metrics will not tell you whether a new method belongs where it was placed, whether a newly introduced abstraction fits or conflicts with the existing vocabulary, or whether a structural shortcut made sense in isolation but will cost you in six months.
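To make concrete what the mechanizable part of that assessment looks like, here is a sketch of a duplication ratio: slide a window over normalized lines and count repeated windows. This is my illustrative construction, not a tool from the experiment, and it flags textual repetition only — it cannot say whether the repetition is a design problem or a deliberate convention.

```python
from collections import Counter

def duplication_ratio(source: str, window: int = 3) -> float:
    """Fraction of `window`-line chunks that appear more than once.

    A crude clone detector: strips whitespace, slides a window over
    the non-empty lines, and counts windows that recur verbatim.
    """
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    if len(lines) < window:
        return 0.0
    chunks = [tuple(lines[i:i + window])
              for i in range(len(lines) - window + 1)]
    counts = Counter(chunks)
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / len(chunks)

code = """
status = fetch(url)
if status is None:
    return DEFAULT
status = fetch(url)
if status is None:
    return DEFAULT
"""
print(f"{duplication_ratio(code):.2f}")  # → 0.50
```

A reviewer without the maintainer's context can read that number; only the maintainer can say whether the duplicated block is drift or a convention the codebase intends.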
This problem grows as codebases age and as the knowledge gap between the AI tool and the codebase’s history widens. A six-month-old project, with most of its architectural decisions still fresh in the team’s memory and often still recoverable from git history, is a different evaluation target than a decade-old project whose conventions are distributed across the team’s collective memory and nowhere else.
For teams using AI agents heavily on established codebases, there is a compounding risk that is harder to see than duplication or complexity growth: the gradual erosion of the architectural knowledge itself. When agent-generated code is accepted and built upon, and when the original conventions are no longer being reinforced by human-written code that embodies them, those conventions become harder to articulate and easier to violate. The next contribution, from an agent or from a new team member, has a slightly less coherent baseline to work from.
The Argument for Making Conventions Explicit
One response to the tacit knowledge problem is to make more of the knowledge explicit. Architecture decision records, documented conventions, and style guides that go beyond formatting all give an AI agent something to reason from. Some agent workflows support injecting this context directly; a CONVENTIONS.md or ARCHITECTURE.md file can give the model a place to look before generating code. This does not eliminate the gap between explicit documentation and tacit understanding, but it narrows it in ways that matter.
When I work on my own projects — Discord bots with layers of state management and external API integrations that have accumulated their own structural logic — the most useful thing I can do before asking an agent to add a feature is write down the relevant conventions in the prompt. Not only “here is the code,” but “here is why the code is structured this way, and here are the constraints you should not violate.” The agent produces significantly more coherent output when the tacit knowledge is made explicit, even partially.
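A minimal sketch of that workflow, under the assumption that your agent accepts a plain-text prompt: gather whatever convention files exist in the repository and prepend them to the feature request. The file names and the `build_prompt` helper are illustrative, not any particular agent's API.

```python
from pathlib import Path

def build_prompt(task: str, repo: Path) -> str:
    """Prepend explicit conventions to a feature request.

    CONVENTIONS.md and ARCHITECTURE.md are hypothetical files of
    written-down architectural intent; whichever agent CLI or API
    you use would receive the combined text as its context.
    """
    sections = []
    for name in ("CONVENTIONS.md", "ARCHITECTURE.md"):
        doc = repo / name
        if doc.exists():
            sections.append(f"## {name}\n{doc.read_text()}")
    sections.append(
        "## Task\n" + task +
        "\nDo not violate the constraints documented above."
    )
    return "\n\n".join(sections)

prompt = build_prompt("Add a retry policy to the status poller.", Path("."))
print(prompt.splitlines()[0])
```

The mechanism is trivial; the hard part is the writing, which is why the next paragraph's cost argument is the real obstacle.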
The effort required to do this is non-trivial. Writing down your architectural intent in a form an agent can use is slower than just adding a feature, and it competes with every other priority on the board. But for a project that matters over the long term, it is the investment that prevents the slow drift toward incoherence.
The Real Value of Practitioner Evaluation
Doernenburg’s experiment, published as part of the Exploring Gen AI series on Martin Fowler’s site and worth treating as a retrospective on a question that is still live, is the work of someone willing to do what most AI assessments avoid: measure something real on something real, using judgment that is not synthetic.
The finding, that agents introduce systematic pressure on internal quality, is not surprising. What the experiment clarifies is why that pressure is so difficult to detect and resist. The knowledge required to detect it is the same knowledge the agent does not have. And the person best positioned to provide that knowledge, the original maintainer, is also the person who may be least likely to pause and write it down when there is a feature to ship.
Keeping architectural knowledge explicit, alive, and part of the review process is the work that does not go away just because code generation got faster. If anything, it becomes more important when the pace of code generation increases, because the gap between how fast code appears and how carefully it fits the existing design is where structural debt accumulates.