Why Coding Agents Forget Your Instructions

The core loop that Simon Willison documents in his guide on agentic engineering patterns is not complicated. The model emits a tool call, scaffolding executes it, the result goes back into context, and the model runs again. For a 10-step task with a human reviewing each step, this works reliably. Push the same loop toward autonomy, toward 60 or 80 tool calls without interruption, and structural properties of the architecture start to matter in ways that short-session benchmarks do not capture.

The Model Has No Memory Beyond Its Context Window

The most important thing to understand about an agent’s “memory” is that it does not exist outside the context window. The model has no persistent representation of what happened earlier in a session. Everything it knows about the current task is the token stream: the system prompt, prior assistant turns, tool results, and user messages, all assembled into a single sequence passed to the model on each API call.

This sounds obvious but has consequences that are easy to miss. When an agent is partway through a 60-step task and you want it to remember a constraint you mentioned at the start, the model does not retrieve that constraint from memory. It attends to it through the mechanism it attends to everything: its position in the token sequence.

The model is stateless between calls. Each invocation receives the full context window and produces output. There is no hidden state that persists, no internal “memory” of prior constraints that sits outside the token stream. If a constraint is not in the context at inference time, it does not exist.

The Lost-in-the-Middle Effect

In 2023, researchers at Stanford and UC Berkeley published findings on what they called the “lost in the middle” effect (Liu et al., 2023). In multi-document question answering tasks, models with long contexts performed reliably when the relevant document was at the beginning or end of the context window, and substantially worse when it was in the middle. The effect was not subtle. Models with the relevant document in the 10th position out of 20 underperformed models with the same document in position 1 by enough to affect practical reliability.

The implication for coding agents is direct. System prompts and CLAUDE.md instructions load at the beginning of each session, which puts them in the most reliably attended position. That is not accidental.

The less obvious implication is about mid-session instructions. If you tell an agent 30 tool calls into a session “do not modify any files in the migrations/ directory,” that constraint lands somewhere in the middle of a growing context window. The model may attend to it reliably, or may not, and you cannot know from the model’s response whether the constraint has been encoded with the same weight as a system prompt instruction. You will find out when the agent touches a migration file.

What Context Compaction Actually Loses

All major coding agents handle context overflow through some form of compaction: summarizing earlier turns to free up space for new tool calls. Claude Code calls it compaction explicitly. Aider manages context through dynamic resizing of the repository map. The implementation varies; the structural problem is the same.

Compaction collapses a sequence of tool calls and results into prose. The summary captures the gist of what happened: what files were read, what changes were made, what the current task state is. What it loses is specificity. A concrete error message from a test run twenty tool calls back becomes something like “the initial test attempt revealed an import error that was resolved.” If that error message becomes relevant again later in the session, the agent has to re-derive the information through fresh tool calls rather than referencing the original.

More significant is the loss of negative constraints. A system prompt instruction survives compaction because it lives in the system prompt, not the conversation history. A constraint introduced in conversation, such as “leave the nginx config alone for now,” does not survive compaction in its original form. It might become something like “the user deferred changes to certain configuration files,” which is less precise and less binding than the original phrasing.

Claude Code addresses this directly: CLAUDE.md content is re-injected after compaction. If your project constraints live in CLAUDE.md, they survive the session boundary. If they live in conversational instructions, they may not.

Advisory Instructions vs. Enforced Constraints

This points to a distinction that matters for anyone building automations on top of coding agents.

A system prompt or CLAUDE.md instruction is advisory. It tells the model what to do. The model will follow that instruction most of the time, particularly early in a session when the instruction is near the beginning of the context. But there is no mechanism that guarantees compliance. The instruction can drift out of reliable attention over a long session, and compaction can reduce its precision.

A scaffolding-level constraint is enforced. When a PreToolUse hook intercepts a file write, the hook runs regardless of what the model decided. The model cannot reason its way around a hook that exits non-zero. It receives an error and has to try a different approach.

The practical difference looks like this:

## CLAUDE.md (advisory)
Do not write to the migrations/ directory without user confirmation.

#!/bin/bash
# scripts/guard-migrations.sh (enforced via PreToolUse hook)
# Receives the tool input as JSON via stdin
FILE=$(python3 -c "import sys,json; print(json.load(sys.stdin).get('file_path',''))")
if echo "$FILE" | grep -q '/migrations/'; then
  echo 'Blocked: migrations directory requires explicit confirmation'
  exit 1
fi
exit 0

The CLAUDE.md version relies on the model attending to that instruction across a long session, through potential compaction boundaries, and under the lost-in-the-middle attention penalty. The hook version executes at the scaffolding level. If the model generates an edit to a migrations file on tool call 73, the hook intercepts it regardless of what the model remembered from the beginning of the session.

This pattern matters most in exactly the conditions where coding agents are most appealing: long autonomous sessions, CI pipeline integration, background refactors running without supervision. The shorter and more supervised the session, the less the distinction matters. The longer and more autonomous, the more it matters.

Compute State vs. Context State

There is a related problem that the context layer cannot address at all. The model’s context window holds the conversation. The execution environment holds compute state: installed packages, modified files, running processes, environment variables, git state.

When an agent reads a file, that content enters the context window. When the agent modifies the file, the file changes on disk. The context window now contains an outdated version. If the agent reasons from the version it read earlier rather than the current on-disk state, its reasoning may be wrong.

This is manageable for simple cases. It becomes subtle in multi-file refactors where the agent edits file A, changing an interface, and later generates code for file B based on a remembered version of A rather than the modified one. Well-engineered scaffolding reduces this through explicit re-reading before editing and through test runs that catch inconsistencies across files. None of it eliminates the problem, and all of it adds to the context budget.

What This Means for Long-Running Automations

SWE-bench Verified, the standard benchmark for evaluating coding agents, tests individual GitHub issues: read an issue, write a fix, see if the tests pass. It measures bounded repair task performance well. It does not measure performance on 80-step autonomous sessions with constraint structures that evolve through the session.

Agents that work reliably in autonomous operation share a few properties.

They push permanent constraints into CLAUDE.md or tool definitions rather than conversational instructions. Anything that needs to survive the full session belongs in the system prompt layer, where it is re-injected after compaction and sits in the highest-attention position of each context.

They use hooks for hard constraints rather than natural language prohibitions. If a behavior is unacceptable under any circumstances, enforcing it at the scaffolding level is more reliable than relying on the model to apply a prohibitive instruction across dozens of tool calls and multiple compaction cycles.

They scope individual tasks to minimize session length. A task that can be broken into five 20-turn subtasks is more reliable than a single 100-turn session, because each subtask starts fresh, with all constraints at position zero of the context and no accumulated state from prior compaction.

The agent loop is simple. Keeping the model’s behavior predictable across the full length of a real autonomous session is almost entirely a scaffolding problem. The model is doing what it is designed to do: attending to what is in its context, with more reliable attention toward the beginning and end of the window. Building reliable agent automations means working with that constraint, not expecting natural language instructions to carry the weight of enforcement indefinitely.