Harness Engineering: When the Codebase Is the Interface

When Birgitta Böckeler wrote about harness engineering on Martin Fowler’s blog in February, she was responding to a framing that OpenAI had put forward: that working effectively with AI coding tools is less about writing clever prompts and more about engineering the environment those tools operate in. The term borrows from software testing, where a test harness is the scaffold that holds your tests together. Applied to AI, it covers everything a team builds and maintains to make AI-generated code reliable, coherent, and actually useful.

The framing matters because it shifts responsibility from the individual developer at the keyboard to the team and the codebase itself. Prompt engineering was always a somewhat awkward term because it implied the skill you needed was verbal cleverness at interaction time. Harness engineering implies something more structural: you’re building infrastructure, not finding the right words.

Böckeler identifies three components of the harness: context engineering, architectural constraints, and garbage collection of the code base. Each has a different cost, a different leverage point, and a different set of people it implicates. Understanding that distinction matters for any team thinking seriously about AI adoption beyond individual productivity gains.

Context Engineering Is the Entry Point

Context engineering is the most immediately accessible of the three. It covers what you put in front of the model: the system prompt, project-level instruction files like CLAUDE.md or .cursorrules, relevant type definitions, interface contracts, architectural decision records, and examples of code you actually want the AI to follow.

The practical insight here is that these files are a new kind of executable specification. They’re instructions that run inside a context window rather than a runtime, but they have real effect on outputs. A well-maintained CLAUDE.md that explains the module structure, the conventions the team has settled on, and the libraries in use produces substantially different results than the same model running blind.

The tools have made this more systematic. Cursor reads .cursorrules. Claude Code reads CLAUDE.md. GitHub Copilot reads repository-level custom instructions, typically a .github/copilot-instructions.md file. What started as ad-hoc practices around “how do I get the AI to write code in our style” is becoming a structured discipline. Teams are starting to version these files, review changes to them, and treat them with the same care as configuration.

Where context engineering gets harder is at scale. A monorepo with fifteen services in different languages and three different generations of architectural philosophy cannot be fully described in a single context file. Maintaining per-module context files accurately requires ongoing effort. And if those files drift from reality, they become misleading rather than helpful, which may be worse than no context at all.
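One way to catch that drift early is a small check that runs alongside the tests. The sketch below is illustrative rather than a feature of any tool: it assumes context files mention repository paths in backticks, and the names extractPaths, findStalePaths, and the pathExists callback are hypothetical.

```typescript
// Hypothetical sketch: flag paths mentioned in a context file that no longer exist.
// Assumes paths are written in backticks and contain at least one "/".

/** Collect backtick-quoted tokens that look like relative paths, without duplicates. */
function extractPaths(contextText: string): string[] {
  const pattern = /`([^`\s]+\/[^`\s]*)`/g;
  const paths: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(contextText)) !== null) {
    const path = match[1];
    if (paths.indexOf(path) === -1) {
      paths.push(path);
    }
  }
  return paths;
}

/** Return the referenced paths for which pathExists reports false. */
function findStalePaths(
  contextText: string,
  pathExists: (path: string) => boolean
): string[] {
  return extractPaths(contextText).filter((p) => !pathExists(p));
}
```

In a real repository, pathExists would wrap a file system lookup, and a non-empty result would fail the build. The check cannot tell whether a description is still accurate, only whether the files it points to are still there, so it supplements review of these files rather than replacing it.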

Architectural Constraints Are the Leverage Point

The second component is the one most developers underestimate: architectural constraints. The argument is that well-structured code, with clear types, consistent module boundaries, predictable naming, and enforced conventions, acts as a natural guide for AI-generated code. The model reads the existing patterns and extends them coherently.

The inverse is also true, and more important. A codebase with three different patterns for handling database transactions, two competing approaches to error handling, and inconsistent naming between modules does not confuse the AI into refusing to proceed. It confuses the AI into confidently interpolating between those patterns in ways that are locally plausible and internally inconsistent. The result is code that compiles, passes superficial review, and fails in production.

This gives new economic justification to practices that were already good engineering: strong typing, clear interface boundaries, enforced linting, consistent abstractions. The argument for these practices used to be human readability and long-term maintainability. The argument now also includes AI legibility. A well-typed codebase is one where the compiler and the model both have enough information to stay within the intended design space.
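To make "enforced" concrete, here is a minimal sketch of a boundary rule expressed as data and checked mechanically. The three layer names and the allowedImports table are assumptions chosen for illustration, and in practice a team would more likely reach for an existing lint rule than write its own checker.

```typescript
// Hypothetical sketch: a layering convention that a script can verify.
type Layer = "domain" | "application" | "infrastructure";

// Which other layers each layer may import from (an assumed convention).
const allowedImports: Record<Layer, Layer[]> = {
  domain: [],
  application: ["domain"],
  infrastructure: ["domain", "application"],
};

interface ImportEdge {
  fromFile: string;
  fromLayer: Layer;
  toLayer: Layer;
}

/** Return every import that crosses a boundary the table does not allow. */
function findBoundaryViolations(edges: ImportEdge[]): ImportEdge[] {
  return edges.filter(
    (edge) =>
      edge.fromLayer !== edge.toLayer &&
      allowedImports[edge.fromLayer].indexOf(edge.toLayer) === -1
  );
}
```

The point of the design is that the rule lives in one place where both reviewers and tooling can read it. A convention that exists only as a habit is one the model has to infer from examples, and the examples may disagree.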

For teams using TypeScript, this means going beyond permissive types. Typing a value as any is not just a compromise for human developers; it is an invitation for the AI to assume anything about that value’s shape, because the type system has communicated that anything is acceptable. The same applies to dynamic dispatch, magic strings, and implicit contracts expressed only in comments that may or may not be current.
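A small contrast shows the difference. The Invoice shape, the status values, and the function names below are invented for illustration; the only thing taken from the language itself is the behavior of any, unknown, and literal union types.

```typescript
// Illustrative only: the Invoice shape and all names here are assumptions.

// With `any`, the compiler accepts every property access, correct or not.
function readTotalLoose(payload: any): number {
  return payload.total; // compiles even when `total` does not exist
}

// A union of string literals replaces magic strings; a typo becomes a compile error.
type InvoiceStatus = "draft" | "sent" | "paid";

interface Invoice {
  total: number;
  status: InvoiceStatus;
}

// With `unknown`, the shape has to be proven before the value can be used.
function isInvoice(value: unknown): value is Invoice {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const candidate = value as Record<string, unknown>;
  const status = candidate["status"];
  return (
    typeof candidate["total"] === "number" &&
    (status === "draft" || status === "sent" || status === "paid")
  );
}

function readTotalStrict(payload: unknown): number {
  if (!isInvoice(payload)) {
    throw new Error("payload is not an Invoice");
  }
  return payload.total; // narrowed to Invoice by the type guard
}
```

The strict version is longer, but it states the contract in a form that both the compiler and the model can read. The loose version leaves the contract to be guessed.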

Codebase Garbage Collection Is the Hardest Part

The third component is the one nobody wants to schedule. Böckeler’s phrase “garbage collection of the code base” is apt. It refers to the active removal of dead code, deprecated patterns, duplicate implementations, and anything else that does not represent how the team intends to write software today.

This is hard because it has always been hard. Deleting code is uncomfortable. Resolving competing patterns requires making a decision, which requires organizational alignment. There is always higher-priority work. What harness engineering adds is a new argument for why this cannot be deferred indefinitely.

When a codebase contains two implementations of the same utility, a human developer knows from context which one is current. The model does not. It will use both, sometimes in the same file, because both exist and both appear functional. The longer a deprecated pattern persists, the more new AI-generated code will be written in that style. Technical debt compounds faster with AI in the loop because AI scales the throughput of code generation without scaling the judgment about which patterns to use.
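Measuring how far a retired pattern still reaches is a reasonable first step toward removing it. The following is a rough sketch, not a real tool: it matches import statements with a regular expression rather than parsing the code, it misses side-effect imports such as import "x", and the names ScannedFile and findDeprecatedImports are hypothetical.

```typescript
// Hypothetical sketch: list files that still import from modules the team has retired.
interface ScannedFile {
  path: string;
  text: string;
}

interface DeprecatedUsage {
  path: string;
  module: string;
}

function findDeprecatedImports(
  files: ScannedFile[],
  deprecatedModules: string[]
): DeprecatedUsage[] {
  const usages: DeprecatedUsage[] = [];
  // Captures the module specifier in `import ... from "x"` and `require("x")`.
  const pattern = /(?:from\s+|require\(\s*)["']([^"']+)["']/g;
  for (const file of files) {
    pattern.lastIndex = 0; // reset the shared regex before scanning each file
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(file.text)) !== null) {
      if (deprecatedModules.indexOf(match[1]) !== -1) {
        usages.push({ path: file.path, module: match[1] });
      }
    }
  }
  return usages;
}
```

Run regularly, a count like this turns "we should clean that up" into a number that either shrinks or grows. If it grows after AI-assisted work begins, the old pattern is being copied forward.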

The specific consequence is that the return on AI tools tracks codebase cleanliness more directly than it tracks almost anything else. Teams with clean, modern codebases see AI tools providing consistent value. Teams with heavily accumulated technical debt see AI tools producing large volumes of code that lands in the debt pile. This is not a problem the model solves on its own.

The Platform Engineering Parallel

There is a structural similarity between harness engineering and what platform engineering teams call “golden paths.” A golden path is a standardized, well-supported route through the platform for accomplishing a common task: building a service, wiring up observability, deploying to production. The value of a golden path is not that it is the only way to do something, but that it is the way automation understands and can support reliably.

The harness is a golden path for AI. Context files describe the conventions the model should follow. Architectural constraints make those conventions machine-readable beyond documentation. Codebase GC removes the competing paths that cause the model to deviate. The result is a development environment where AI can participate predictably in the workflow rather than occasionally doing the right thing by coincidence.

Platform engineering required significant organizational investment to make infrastructure automation reliable. Harness engineering is asking teams to make an analogous investment at the code level, and the organizational dynamics are similar: the work is invisible until it is absent, and the cost of its absence falls on everyone while the labor falls on a few.

What This Means Organizationally

The most significant thing about Böckeler’s framing is that it distributes the work across the full team and into engineering leadership. Prompt engineering was an individual skill. Harness engineering requires decisions about what the codebase should look like, maintained over time, by multiple people.

That means teams need to treat the harness as a product. Context files need owners. Architectural constraints need enforcement mechanisms, not just guidelines. Codebase GC needs a budget and a schedule, not just goodwill. None of this happens spontaneously, and none of it is captured in per-seat license costs when evaluating AI tooling.

The framing also implies that teams with existing investments in code quality will see better returns from AI tools than teams without them, which runs counter to the narrative that AI tools democratize software development by lowering the bar. They lower the bar for generating code. They raise the stakes for everything that determines whether generated code is good.

The field has moved fast even in the weeks since February. But the core argument holds: the returns on AI coding tools are not evenly distributed, and the distribution follows codebase health more closely than it follows model capability or prompt sophistication. Building the harness is the work that determines how much of the capability any given team actually accesses.