Context, Constraints, and Dead Code: What Harness Engineering Asks of Your Codebase

Back in February, Birgitta Böckeler published a piece on Martin Fowler’s site laying out why OpenAI’s framing of “Harness Engineering” deserves serious attention from software teams using AI tools. A month on, the framing holds up well. It names something practitioners have been learning through friction but rarely formalized: the quality of AI-assisted development is shaped more by the environment surrounding the model than by the model itself.

The word “harness” borrows deliberately from test harness terminology. A test harness doesn’t change the code under test; it creates the conditions under which that code behaves reliably and predictably. The AI development harness works the same way. It doesn’t change what the model can do. It determines what the model reliably does in your specific project, on your specific codebase, for your specific team’s conventions.

Böckeler identifies three pillars: context engineering, architectural constraints, and codebase garbage collection. Each represents a different kind of work, and teams tend to neglect them in different ways.

Context Engineering

Context engineering is the discipline of deciding what information is present before you type a single prompt. This is structurally different from prompt engineering, which focuses on what you say in a given interaction. Context engineering is about the system that determines what the model already knows when the interaction begins.

The most concrete expression of this today is the instruction file. Claude Code reads CLAUDE.md, Cursor processes .cursorrules, and Codex loads AGENTS.md. These files exist because every project has conventions that are obvious to the people working in it and completely invisible to a model encountering the codebase fresh. A minimal but effective CLAUDE.md eliminates a category of errors that no amount of re-prompting will fix:

## Stack
- Go 1.23, pgx/v5 for Postgres
- No ORM. SQL lives in /internal/db

## Error handling
- Wrap at package boundaries: fmt.Errorf("functionName: %w", err)
- Do not use errors.New inside handlers

## Constraints
- Do not add dependencies without discussion
- Propagate request context; never use context.Background() in handlers

That file doesn’t make the model more capable. It makes the model’s capability available in a form that fits your project. The difference between a team with a well-maintained CLAUDE.md and one without isn’t model quality; it’s the signal-to-noise ratio of the context the model reasons from.
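To make those conventions concrete, here is a minimal Go sketch of a handler that follows them. It is an illustration only: userStore, memStore, getUser, and statusFor are hypothetical names introduced for this example, and the in-memory fake stands in for the pgx/v5 queries a real project would keep in /internal/db.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// errNotFound is a package-level sentinel, so no errors.New call sits inside a handler.
var errNotFound = errors.New("user not found")

// userStore is a hypothetical stand-in for the SQL code in /internal/db.
type userStore interface {
	LookupName(ctx context.Context, id string) (string, error)
}

// memStore is an in-memory fake so the sketch runs without Postgres.
type memStore map[string]string

func (m memStore) LookupName(ctx context.Context, id string) (string, error) {
	// Respect cancellation carried by the caller's context.
	if err := ctx.Err(); err != nil {
		return "", fmt.Errorf("LookupName: %w", err)
	}
	name, ok := m[id]
	if !ok {
		return "", fmt.Errorf("LookupName: %w", errNotFound)
	}
	return name, nil
}

// getUser wraps the error with its own function name, as the instruction file asks.
func getUser(ctx context.Context, s userStore, id string) (string, error) {
	name, err := s.LookupName(ctx, id)
	if err != nil {
		return "", fmt.Errorf("getUser: %w", err)
	}
	return name, nil
}

func userHandler(s userStore) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Propagate the request context; never context.Background() here.
		name, err := getUser(r.Context(), s, r.URL.Query().Get("id"))
		if errors.Is(err, errNotFound) {
			http.Error(w, "not found", http.StatusNotFound)
			return
		}
		if err != nil {
			http.Error(w, "internal error", http.StatusInternalServerError)
			return
		}
		fmt.Fprintln(w, name)
	}
}

// statusFor runs the handler once in-process and returns the HTTP status code.
func statusFor(s userStore, id string) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/user?id="+id, nil)
	userHandler(s)(rec, req)
	return rec.Code
}

func main() {
	store := memStore{"42": "Ada"}
	fmt.Println(statusFor(store, "42"), statusFor(store, "7")) // prints "200 404"
}
```

Because every layer wraps with %w, the handler can still match the sentinel with errors.Is after two levels of wrapping, which is what makes the wrapping convention cheap to follow.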

Context engineering extends beyond instruction files. In agentic workflows, the schema and description of every tool the agent can call is also context. A tool described as “searches the repository” gives the model far less to work with than one described as “searches Go source files using ripgrep, returns file paths and matching line numbers, limited to the src/ directory.” Tool design is context design. Teams building AI-powered development workflows are doing context engineering whether they call it that or not.
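The contrast between the two tool descriptions can be sketched as data. The ToolSpec type and its field names below are assumptions made for illustration, not the schema of any particular agent framework; the point is only how much more the second definition tells the model.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ToolSpec is a hypothetical shape for a tool definition handed to an agent.
// Real frameworks define their own schemas; these field names are assumptions.
type ToolSpec struct {
	Name        string            `json:"name"`
	Description string            `json:"description"`
	Parameters  map[string]string `json:"parameters"`
}

// vagueTool leaves the model guessing about scope, engine, and output format.
func vagueTool() ToolSpec {
	return ToolSpec{
		Name:        "search",
		Description: "Searches the repository.",
		Parameters:  map[string]string{"query": "string"},
	}
}

// specificTool states what is searched, how, what comes back, and the limits.
func specificTool() ToolSpec {
	return ToolSpec{
		Name: "search_go_source",
		Description: "Searches Go source files using ripgrep. Returns file paths " +
			"and matching line numbers. Limited to the src/ directory.",
		Parameters: map[string]string{
			"pattern": "regular expression passed to ripgrep",
		},
	}
}

func main() {
	// Print both definitions as the JSON an agent might receive.
	for _, t := range []ToolSpec{vagueTool(), specificTool()} {
		out, err := json.MarshalIndent(t, "", "  ")
		if err != nil {
			fmt.Println("marshal:", err)
			return
		}
		fmt.Println(string(out))
	}
}
```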

Session management is the third dimension of this. Long-running conversations accumulate context that dilutes signal. A three-hour debugging session may carry a dozen false starts, abandoned approaches, and superseded assumptions. The model reasons over all of it. Knowing when to start a fresh context, when to inject a retrieved document rather than relying on drifting conversation history, and when to summarize and compact are all context engineering decisions applied to the session layer rather than the project layer.
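One way to picture the summarize-and-compact decision is a small function over the conversation history. This is a sketch under stated assumptions: the Message type, the word-count token estimate, the budget, and the placeholder summarizer are all hypothetical, and a real tool would use its own tokenizer and a model-written summary.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is a hypothetical conversation turn.
type Message struct {
	Role string
	Text string
}

// estimateTokens is a crude proxy: one token per whitespace-separated word.
func estimateTokens(msgs []Message) int {
	n := 0
	for _, m := range msgs {
		n += len(strings.Fields(m.Text))
	}
	return n
}

// compact returns history unchanged while it fits the budget. Otherwise it
// replaces everything except the last keep messages with one summary message.
func compact(history []Message, budget, keep int, summarize func([]Message) string) []Message {
	if keep < 0 {
		keep = 0
	}
	if estimateTokens(history) <= budget || keep >= len(history) {
		return history
	}
	cut := len(history) - keep
	out := []Message{{Role: "system", Text: summarize(history[:cut])}}
	return append(out, history[cut:]...)
}

// countSummary is a placeholder; a real summarizer would preserve the
// decisions reached and drop the false starts.
func countSummary(old []Message) string {
	return fmt.Sprintf("Summary of %d earlier messages (placeholder).", len(old))
}

func main() {
	history := []Message{
		{Role: "user", Text: "the test fails with a nil pointer"},
		{Role: "assistant", Text: "maybe the cache is uninitialized"},
		{Role: "user", Text: "no, the cache is fine"},
		{Role: "assistant", Text: "then check the config loader"},
		{Role: "user", Text: "yes, the loader returns nil on missing file"},
	}
	for _, m := range compact(history, 10, 2, countSummary) {
		fmt.Printf("%s: %s\n", m.Role, m.Text)
	}
}
```

The keep parameter reflects the judgment call in the paragraph above: recent turns usually hold the live hypothesis, while older turns are where the superseded assumptions accumulate.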

Architectural Constraints as Legibility

The second pillar connects software architecture to AI reasoning in a way that should reframe how teams think about technical investment. The argument is that architectural constraints that improve human readability (small modules, clear interfaces, explicit dependencies) also improve AI legibility, and for similar reasons.

When an AI coding assistant modifies a 2500-line file with mixed concerns, it has to build a model of which parts of that file are relevant to the current task and which are not. Errors arise not because the model lacks the capability to understand the code, but because reasoning about a large, tangled context introduces ambiguity about intent. A 180-line module with a single clear purpose eliminates that class of reasoning error before the session begins.
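A team that wants to act on this could start by listing the files that have grown past a size it considers workable. The sketch below is one hedged way to do that in Go; the 400-line limit and the file paths are assumptions chosen for illustration, and line count is only a rough proxy for mixed concerns.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"sort"
	"strings"
)

// countLines counts the lines in r, including a final unterminated line.
func countLines(r io.Reader) (int, error) {
	sc := bufio.NewScanner(r)
	// Allow long lines (up to 1 MiB) so generated code does not abort the scan.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	n := 0
	for sc.Scan() {
		n++
	}
	if err := sc.Err(); err != nil {
		return n, fmt.Errorf("countLines: %w", err)
	}
	return n, nil
}

// oversized returns the names of files whose line count exceeds limit, sorted.
func oversized(lineCounts map[string]int, limit int) []string {
	var out []string
	for name, n := range lineCounts {
		if n > limit {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	n, err := countLines(strings.NewReader("package db\n\nfunc Ping() {}\n"))
	if err != nil {
		fmt.Println(err)
		return
	}
	// Hypothetical paths; the second count is hard-coded for the example.
	counts := map[string]int{
		"internal/db/ping.go":            n,
		"internal/billing/everything.go": 2500,
	}
	fmt.Println(oversized(counts, 400)) // prints "[internal/billing/everything.go]"
}
```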

This gives teams a new lens for evaluating architecture decisions. “Should we split this module?” used to be answered in terms of human readability and the rate at which different concerns change independently. Both remain valid. There is now a third consideration: whether the current shape of the code lets an AI assistant work in it confidently, or whether the mixed concerns require it to infer intent it should not have to infer.

The corollary is that technical debt has acquired an additional cost. Teams incurring debt used to accept slower future development as the price, paid by humans who had to understand and route around the accumulated mess. The same debt now also degrades AI assistance in those areas, because the model encounters the same ambiguity the debt creates for humans, without the accumulated project knowledge that lets a senior engineer navigate it anyway. The debt is still paid; the interest has increased.

Codebase Garbage Collection

The third pillar is the least discussed and probably the most immediately actionable. Dead code, lingering feature flags, commented-out blocks, duplicate implementations from half-finished migrations, legacy modules kept around after a rewrite: all of these create noise in the signal the model reads when it tries to understand the codebase’s current state.

Humans navigate this noise using contextual cues that are partly cultural. A function named legacyProcessPayment or a flag named ENABLE_NEW_ONBOARDING_2022 carries a signal that practitioners read quickly. Models can read those signals too, but they cannot be certain what “legacy” or “2022” means in the project’s current context. Is legacyProcessPayment a dead path from a migration three years ago, or a fallback that still handles 8% of transactions? Without additional context, the model may incorporate it into its reasoning, generating code that handles a case that no longer exists, or carefully avoiding a function it needed to call.

Deleting dead code is therefore not just cleanliness. It reduces the decision space the model has to navigate on every AI-assisted edit. Every removed dead path is one fewer source of ambiguity in every future session. Teams that treat codebase garbage collection as a regular maintenance activity are building a better harness in the same way that teams writing good instruction files are, just through a different mechanism: reducing the surface area of things the model might misread.
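A first pass at garbage collection can be as simple as surfacing the names that carry those ambiguous signals so a human can decide what is actually dead. The following Go sketch does this with two regular expressions; both patterns are assumptions about one project's naming habits, and a match is a candidate for review, not proof that the code is unused.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// staleHints are hypothetical naming patterns that often mark dead or dying
// code: identifiers containing "legacy" and upper-case flags ending in a year.
var staleHints = []*regexp.Regexp{
	regexp.MustCompile(`(?i)\w*legacy\w*`),
	regexp.MustCompile(`\b[A-Z][A-Z0-9_]*_20\d\d\b`),
}

// Finding records a suspicious name and the 1-based line it appeared on.
type Finding struct {
	Line  int
	Match string
}

// scan reports every staleHints match in src, in line order.
func scan(src string) []Finding {
	var out []Finding
	for i, line := range strings.Split(src, "\n") {
		for _, re := range staleHints {
			for _, m := range re.FindAllString(line, -1) {
				out = append(out, Finding{Line: i + 1, Match: m})
			}
		}
	}
	return out
}

func main() {
	src := strings.Join([]string{
		`func legacyProcessPayment() {}`,
		`if flags.On("ENABLE_NEW_ONBOARDING_2022") {}`,
		`func processPayment() {}`,
	}, "\n")
	for _, f := range scan(src) {
		fmt.Printf("line %d: %s\n", f.Line, f.Match)
	}
}
```

The scanner deliberately stops at reporting. Whether legacyProcessPayment is a dead path or a live fallback is exactly the question the name cannot answer, so the decision to delete stays with someone who knows.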

The same principle applies to comments describing intent that no longer matches the code. An outdated comment is worse than no comment because it introduces a contradiction the model has to resolve. When a refactor touches the surrounding code but not the comment, the comment becomes a lie. The model cannot know that. It may treat the comment as authoritative and generate code that aligns with the old intent rather than the current implementation.
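A small, invented Go example shows the shape of the problem. The discount functions and thresholds below are hypothetical; the two functions behave identically, and only the comment differs.

```go
package main

import "fmt"

// discountStale returns a 10% discount, in cents, for orders over 10000 cents.
//
// The sentence above is deliberately stale: a refactor raised the threshold
// to 25000 and left the comment behind, so comment and code now disagree.
func discountStale(totalCents int) int {
	if totalCents > 25000 {
		return totalCents / 10
	}
	return 0
}

// discount returns a 10% discount, in cents, for orders over 25000 cents,
// and 0 otherwise. Here the comment matches the code.
func discount(totalCents int) int {
	if totalCents > 25000 {
		return totalCents / 10
	}
	return 0
}

func main() {
	// Same behavior, different claims: a reader who trusts the first comment
	// would expect a discount of 2000 for a 20000-cent order.
	fmt.Println(discountStale(20000), discount(20000)) // prints "0 0"
	fmt.Println(discountStale(30000), discount(30000)) // prints "3000 3000"
}
```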

What This Formalizes

The deeper observation Böckeler draws out is that none of these three activities is a new engineering discipline. Clear module boundaries, maintained documentation of project conventions, and timely deletion of dead code were always the practices of high-functioning software teams. What harness engineering contributes is the observation that AI makes the consequences of neglecting them immediate rather than deferred.

Teams doing these things were already building harnesses before the term existed. Teams that were not are now encountering the cost directly, in AI assistants that consistently miss the project’s patterns and in sessions that produce plausible but wrong code regardless of how the prompt is revised. The model is identical in both cases; the environment shaping how it performs is not.

That is the genuinely useful thing about this framing. It locates the leverage where it belongs. Choosing a different model, writing a more elaborate prompt, or switching AI coding tools are all lower-return interventions than improving context quality, module boundaries, and codebase hygiene. The harness sets the ceiling, and the model operates within it.