The Framing That Changes Everything
There is a common assumption built into how most teams adopt AI coding tools: the bottleneck is figuring out how to ask the right question. Better prompts, more specific instructions, the right amount of context injected at the right moment. Prompt engineering as the primary skill.
Birgitta Böckeler’s Harness Engineering article on Martin Fowler’s site, published in February 2026, challenges that assumption directly. It draws on OpenAI’s own framing of the work involved in making AI-enabled software development productive, and the conclusion is that the bottleneck is not the prompt. It is the environment, the harness, surrounding the model.
The metaphor is worth sitting with. A harness does not change the horse’s strength or intelligence. It directs energy, constrains motion to useful paths, and prevents chaos. Harness engineering, applied to AI development, means shaping your codebase, tooling, documentation, and workflows so the AI produces consistently useful output regardless of how any individual developer phrases their request on a given day.
This reframes the entire problem. The question is not “how do I write a better prompt?” It is “what does my codebase communicate to the AI?”
Context Engineering vs. Prompt Engineering
Andrej Karpathy argued in 2025 that “prompt engineering” was always an undersell. The real skill was context engineering: systematically managing everything the AI sees, not just the words in a single query.
The distinction matters because context engineering operates at a different level of abstraction. A prompt is ephemeral, existing for one interaction. Context is structural. It includes the files open in the editor, the instruction files the AI reads before it starts (CLAUDE.md, .cursorrules, .github/copilot-instructions.md, AGENTS.md), the examples embedded in the codebase, the naming conventions and patterns the AI infers from existing code, and the test suite, which implicitly defines what “correct” looks like.
A well-engineered context means every AI interaction in your codebase starts from a well-informed state. A poorly engineered one means every developer is individually compensating for the AI’s ignorance with longer, more elaborate prompts, and getting inconsistent results anyway.
Here is what a minimal but effective CLAUDE.md looks like for a Node.js service:
```markdown
# Project Instructions

## Architecture

This is a hexagonal architecture. Domain logic lives in `src/domain/`.
Infrastructure adapters live in `src/adapters/`. Never import from `src/adapters/`
inside `src/domain/`.

## Conventions

- Services are named `*Service.ts` and contain only business logic
- Repositories are named `*Repository.ts` and handle all DB access
- Use `Result<T, E>` from `src/lib/result.ts` for error handling; never throw

## Testing

Run `npm test` before submitting. All new code requires unit tests in `__tests__/`.
```
This is not magic. It is documentation written for an AI reader rather than a human one: concise, structured, unambiguous. The difference from a README is intentional. A README explains context to someone who will ask follow-up questions. An instruction file has to stand alone.
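The `Result<T, E>` convention named in the instruction file is worth making concrete, because it is exactly the kind of pattern an AI will copy from examples. The actual `src/lib/result.ts` is not shown in the article; the sketch below assumes the common discriminated-union idiom, and `parsePort` is an invented illustration:

```typescript
// A minimal Result<T, E> of the kind the instruction file refers to.
// Assumption: the real src/lib/result.ts uses a discriminated union;
// this is the standard TypeScript shape for that idiom.
type Result<T, E> =
  | { ok: true; value: T }
  | { ok: false; error: E };

const ok = <T>(value: T): Result<T, never> => ({ ok: true, value });
const err = <E>(error: E): Result<never, E> => ({ ok: false, error });

// Example: a parser that reports failure as a value instead of throwing,
// matching the "never throw" rule in the instruction file.
function parsePort(raw: string): Result<number, string> {
  const n = Number(raw);
  if (!Number.isInteger(n) || n < 1 || n > 65535) {
    return err(`invalid port: ${raw}`);
  }
  return ok(n);
}
```

With even a handful of functions written this way, the AI has unambiguous examples to imitate, which is more reliable than restating the rule in every prompt.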
Architectural Constraints That Cannot Be Ignored
The most durable form of context engineering is not documentation. It is tooling that makes the right pattern structurally easier than the wrong one.
When an AI generates code, it follows the path of least resistance based on patterns it has seen. If your codebase has three different ways of handling database errors, the AI will pick one, not necessarily the one your team prefers. Consistency is not a natural output of AI assistance; it is a property you have to engineer.
One concrete approach: use a linter to enforce architectural boundaries rather than relying on the AI to remember rules stated in a prompt. dependency-cruiser for JavaScript and TypeScript lets you define and enforce module dependency rules:
```json
{
  "forbidden": [
    {
      "name": "no-domain-to-adapter",
      "comment": "Domain must not depend on infrastructure adapters",
      "from": { "path": "^src/domain" },
      "to": { "path": "^src/adapters" }
    }
  ]
}
```
With this rule in CI, the AI cannot produce code that violates the architecture regardless of how the developer phrases their request. The constraint is structural, not conversational. ArchUnit serves the same purpose in Java; import-linter fills the role in Python.
The same principle applies to scaffolding generators, schema-first API design, and test-first workflows. If you define an OpenAPI spec before asking the AI to implement a handler, the schema acts as a hard constraint on the solution space. The AI cannot ignore a schema it is given; it can ignore a rule stated in prose.
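One way this plays out in TypeScript: if handler types are generated from the spec (tools like openapi-typescript do this), the compiler itself becomes the enforcement mechanism. The types and handler below are invented for illustration, standing in for what a generator might emit:

```typescript
// Hypothetical types standing in for output of an OpenAPI type generator;
// the names and fields are illustrative, not from the article.
interface CreateOrderRequest {
  sku: string;
  quantity: number;
}

interface CreateOrderResponse {
  orderId: string;
  status: "pending" | "confirmed";
}

type Handler<Req, Res> = (req: Req) => Res;

// Any implementation the AI produces must satisfy this signature or fail
// to compile: the schema constrains the solution space structurally.
const createOrder: Handler<CreateOrderRequest, CreateOrderResponse> = (req) => {
  return {
    orderId: `order-${req.sku}-${req.quantity}`,
    status: "pending",
  };
};
```

A handler that returns the wrong shape, or reads a field the schema does not define, is a compile error rather than a review comment.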
Garbage Collection of the Codebase
This is the concept from Böckeler’s framing that has the most immediate practical weight: dead code, outdated comments, and inconsistent patterns are not just technical debt. They are noise that actively degrades AI output quality.
Human developers reading code develop editorial judgment. They see an outdated comment or a deprecated helper and mentally discount it, understanding from surrounding context that it is legacy. The AI has no such judgment. Everything in the context window is treated as equally valid signal.
Consider what this means concretely. An AI asked to implement a new feature learns from the examples already in your codebase. If you have forty well-structured service files following your current patterns, fifteen older files using a deprecated pattern you never cleaned up, and five files somewhere in between, the AI produces output that reflects all three patterns non-deterministically. It is learning from a noisy training set that is your own codebase.
Technical debt has always had a cost in human productivity. Harness engineering makes explicit that it also has a direct cost in AI output quality. Deleting dead code, removing outdated comments, eliminating inconsistent patterns: these are now infrastructure maintenance tasks, not optional cleanup. The term “garbage collection” is precise because the goal is the same as in memory management, which is to reclaim resources held by data that is no longer useful. In this case, the resource is context window space and model attention.
A practical heuristic: if you would not want a new developer to learn from a piece of code, do not leave it in the codebase for the AI to learn from either.
Measurement and Feedback Loops
A harness without feedback is scaffolding. To know whether your context engineering is working, you need mechanisms to observe AI output quality over time.
The simplest feedback loop is your existing CI pipeline. If AI-generated code consistently fails linters, type checks, or tests, that is a signal that your harness is insufficiently constraining. The failures tell you where to add more explicit guidance. A spike in linter violations after a team starts using AI tools more heavily is diagnostic information, not just noise.
More sophisticated teams instrument this deliberately: tracking what percentage of AI suggestions are accepted without modification, running periodic audits of AI-generated code, and comparing output quality across different task types. The SWE-bench benchmark, which measures whether an AI can resolve a GitHub issue against a real repository, has seen state-of-the-art scores go from roughly 2% in 2023 to over 50% by mid-2025. But the benchmark tests isolated, well-defined tasks against a single repository. Production development involves ambiguity, evolving requirements, and multi-week context that benchmarks do not capture. Your own codebase, measured over time, is the right benchmark for your own harness.
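The acceptance-rate metric mentioned above needs almost no machinery. A sketch, assuming you log each suggestion as an event (the event shape here is invented; real telemetry from any given tool will differ):

```typescript
// Hypothetical telemetry record; field names are illustrative only.
interface SuggestionEvent {
  taskType: string;
  acceptedUnmodified: boolean;
}

// Acceptance rate per task type: a crude but useful harness-quality signal.
// A low rate on one task type suggests the harness lacks guidance there.
function acceptanceRates(events: SuggestionEvent[]): Map<string, number> {
  const totals = new Map<string, { accepted: number; total: number }>();
  for (const e of events) {
    const t = totals.get(e.taskType) ?? { accepted: 0, total: 0 };
    t.total += 1;
    if (e.acceptedUnmodified) t.accepted += 1;
    totals.set(e.taskType, t);
  }
  const rates = new Map<string, number>();
  for (const [task, t] of totals) {
    rates.set(task, t.accepted / t.total);
  }
  return rates;
}
```

Tracked over weeks, a metric like this shows whether changes to instruction files or lint rules actually move output quality, rather than leaving the question to anecdote.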
What This Asks of Teams
The shift Böckeler is pointing to is not subtle: it asks teams to treat the codebase as a communication medium with AI, not just with each other. That changes what “good code” means.
Good code has always meant readable, maintainable, and well-structured. Harness engineering adds another criterion: legible to an AI that will read it and produce more code like it. Naming conventions matter more, because the AI infers rules from names. Consistency matters more, because inconsistency creates noise. Completeness of documentation matters more, because gaps get filled with the AI’s best guess.
None of this requires adopting new tools or frameworks immediately. It requires changing how you think about the relationship between your codebase and the AI operating within it. The model’s capability is largely fixed from your perspective as a team. The harness is the variable you control, and the variance in AI productivity gains across teams is explained more by harness quality than by which model or IDE plugin they chose.