Simon Willison’s guide to how coding agents work describes the core loop clearly: a language model reads the conversation history, decides what to do, calls a tool, gets the result appended to the conversation, and repeats. The loop structure is well-understood at this point. What is less discussed is a property of that loop that explains a specific, predictable failure pattern on longer tasks.
The model is stateless. Every API call sends the complete conversation history from the beginning. The model has no memory that persists between calls; everything it “knows” at any moment is the text currently in that conversation. This is also what makes the loop work: the model can always reconstruct context from the full history.
But the conversation history has a different property: it is append-only. Tool results get added. Corrections get added. New observations get added. Nothing is ever removed. The model can say “I was wrong earlier” and add that to the conversation, but the wrong reasoning it’s correcting is still there, unchanged, earlier in the same context.
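The two properties together, statelessness and append-only history, can be sketched in a few lines. `call_model` and `run_tool` below are stand-ins for a real LLM API and tool executor, stubbed here purely for illustration:

```python
# Minimal sketch of the agent loop. call_model and run_tool are hypothetical
# stand-ins for a real LLM API call and tool executor.

def call_model(history):
    # A real implementation sends the ENTIRE history to the model on every
    # call -- the model keeps no state of its own between calls.
    if len(history) > 4:
        return {"tool": "done"}
    return {"tool": "read_file", "args": "src/app.py"}

def run_tool(action):
    return f"result of {action['tool']}"

def agent_loop(task):
    history = [{"role": "user", "content": task}]
    while True:
        action = call_model(history)  # full history every time
        history.append({"role": "assistant", "content": str(action)})
        if action["tool"] == "done":
            return history
        # Results are only ever appended; nothing is removed or rewritten.
        history.append({"role": "tool", "content": run_tool(action)})

history = agent_loop("fix the bug in src/app.py")
```

Note that the only mutation `agent_loop` ever performs is `append`. Everything the rest of this piece describes follows from that one line.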
For short tasks, this does not matter. Read a file, fix a bug, run the tests, done. Four tool calls, clean context, small window for drift.
For longer tasks, it accumulates.
The Wrong Premise Problem
Consider an agent asked to find and fix a performance issue. It reads a profiler trace, draws a conclusion about where the bottleneck is, and starts exploring that area of the code:
Turn 1: bash("python -m cProfile -s cumulative server.py > profile.txt")
Turn 2: read_file("profile.txt") → model concludes bottleneck is in auth.middleware
Turn 3: read_file("src/auth/middleware.py")
Turn 4: read_file("src/auth/tokens.py")
Turn 5: read_file("src/auth/session.py")
Turn 6: bash("python -m pytest tests/test_auth.py -x") → tests pass
Turn 7: write_file("src/auth/middleware.py", ...) → cache token validation results
Turn 8: bash("python -m cProfile -s cumulative server.py > profile2.txt")
Turn 9: read_file("profile2.txt") → bottleneck unchanged, still at 2100ms
At turn 9, the agent has new evidence: the optimization did nothing. The actual bottleneck is somewhere else. But the conversation context now contains three full file reads from the wrong module, the original wrong conclusion from turn 2, and the now-useless edit from turn 7. When the model reasons at turn 10 about what to try next, it reads all of that first.
Adding a correction at turn 10, “the bottleneck was not in auth.middleware, let me reconsider,” does not remove the wrong prior reasoning from context. It appends a realization to a history that still contains the wrong path. The model now has to reason correctly while managing a context contaminated with irrelevant files and a deprecated hypothesis.
This compounds turn by turn. Each step taken from a wrong premise adds more noise. A ten-turn wrong direction leaves more damage to work around than a two-turn one.
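A back-of-envelope tally makes the compounding concrete. The token counts below are invented round numbers for the nine-turn trace above, not measurements from any real session:

```python
# Rough tally of how much of the context the abandoned hypothesis occupies.
# Token counts are illustrative guesses, not measured values.
turns = [
    ("profile run",                      200, "neutral"),
    ("profile read + wrong conclusion", 1500, "wrong"),
    ("read middleware.py",              3000, "wrong"),
    ("read tokens.py",                  2500, "wrong"),
    ("read session.py",                 2000, "wrong"),
    ("run auth tests",                   400, "wrong"),
    ("edit middleware.py",              1200, "wrong"),
    ("second profile run",               200, "neutral"),
    ("second profile read",             1500, "neutral"),
]
total = sum(t for _, t, _ in turns)
wrong = sum(t for _, t, tag in turns if tag == "wrong")
print(f"{wrong}/{total} tokens ({wrong / total:.0%}) belong to the abandoned path")
```

Under these assumed numbers, the large majority of the context at turn 10 is wrong-path material, and everything the model reasons about next is read through that weighting.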
Why This Is Structural, Not Accidental
The ReAct pattern, which formalizes the reason-act-observe loop, assumes that observations consistently move the agent toward the correct solution. That holds when early observations are correct. When they’re not, the loop is running on a contaminated foundation.
Human programmers handle this differently. When you realize you went down the wrong path, you discard the mental state associated with that path and start fresh from a different hypothesis. You do not have to reason around your own wrong earlier conclusions; those are just gone from working memory. An agent’s “working memory” is the conversation log. Starting fresh requires a new conversation, because the old wrong reasoning is still in context and cannot be cleanly removed.
The append-only property is not a design flaw. It is a direct consequence of how stateless LLM inference works. You could build a system that edits the conversation history (trimming wrong branches, splicing in corrections), but that would require either the model itself to flag which of its prior reasoning is invalid, or the human operator to identify and remove bad segments. Neither is clean at scale.
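A pruning pass of the kind described would look roughly like this. This is a sketch of the idea under the assumption that something, model or human, has already flagged which messages belong to the invalidated branch; it is not a feature of any real agent framework:

```python
# Hypothetical history-pruning pass. Assumes invalid message ids have
# already been identified by the model or a human operator.

def prune(history, invalid_ids):
    kept = [m for m in history if m["id"] not in invalid_ids]
    # Splice in a short note so the model knows a branch was removed,
    # rather than silently losing turns.
    kept.append({"id": "note", "role": "system",
                 "content": f"{len(history) - len(kept)} messages from an "
                            "abandoned hypothesis were removed."})
    return kept

history = [
    {"id": 1, "role": "user",      "content": "find the bottleneck"},
    {"id": 2, "role": "assistant", "content": "it is in auth.middleware"},
    {"id": 3, "role": "tool",      "content": "contents of middleware.py"},
    {"id": 4, "role": "tool",      "content": "profile2: bottleneck unchanged"},
]
pruned = prune(history, invalid_ids={2, 3})
```

The hard part is not the splice, which is trivial, but producing `invalid_ids` reliably, which is exactly the flagging problem described above.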
Test Execution Helps, But Not Enough
Running the test suite after every edit is one of the strongest practices in agentic coding workflows. Paul Gauthier’s benchmarking work with Aider documents consistently higher success rates for test-driven agentic workflows versus write-only ones, and test execution is widely regarded as one of the biggest performance drivers in SWE-bench results.
Test feedback closes a verification loop that static reasoning cannot close. An agent that writes code and runs tests knows whether the output is functionally correct, not just whether it looks right.
But test feedback arrives after the tool calls it’s meant to validate. If the agent has already explored five wrong files and made three wrong edits, the test failure that arrives at turn 15 says “this approach is wrong” about a direction the agent has been building on since turn 3. The feedback is accurate, but it arrives into a context that is already heavily weighted toward the wrong path.
Catch it early and the feedback is clean. Catch it after twelve turns and the model is correcting itself against a backdrop of its own wrong prior work.
What Partial Mitigations Exist
A few patterns reduce the damage; none eliminates it.
Planning before execution. GitHub Copilot Workspace generates a written plan as a text response before any tool calls execute. The plan can be reviewed and corrected before the agent starts reading files and making edits. Anthropic’s extended thinking makes reasoning visible as a thinking block, which serves a similar purpose: you can see where the model is going before it gets there. The limitation is that once execution starts, you’re back to append-only.
Front-loaded project knowledge. Claude Code reads a CLAUDE.md file at session start and injects its contents into the system prompt before any tool calls run. If that file says “performance-sensitive paths are in the database query layer, not the auth layer,” the agent is less likely to form the wrong initial hypothesis. You cannot inject everything, but you can inject the facts that most commonly cause wrong initial directions on your specific codebase.
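The mechanism is simple enough to sketch. The function below reads a project notes file and prepends it to the system prompt before any tool calls run; the file name follows Claude Code's CLAUDE.md convention, but the prompt wording and function are illustrative, not the tool's actual implementation:

```python
# Sketch of front-loading project knowledge into the system prompt,
# in the spirit of Claude Code's CLAUDE.md. Wording is illustrative.
from pathlib import Path

def build_system_prompt(project_root, base="You are a coding agent."):
    notes = Path(project_root, "CLAUDE.md")
    if notes.exists():
        # Injected before the first tool call, so it shapes the initial
        # hypothesis rather than correcting it after the fact.
        return base + "\n\nProject notes:\n" + notes.read_text()
    return base
```

The timing is the point: this text is in context at turn 1, before any hypothesis forms, which is the one moment a correction costs nothing.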
Short, well-scoped tasks. A task that requires two turns to complete has no opportunity for drift to compound. A task that requires forty turns has thirty-eight opportunities. Breaking large refactors or investigations into sequential, well-defined subtasks, each run as a separate conversation, trades overhead for cleaner context per task. The SWE-bench evaluation suite skews toward well-scoped tasks, which is one reason benchmark performance on it is probably optimistic relative to open-ended real-world usage.
Human checkpoints after planning. Interrupting the agent after it has identified which files are relevant but before it has made any edits gives a human the chance to verify the direction. If the agent has identified the wrong files, correcting it at that point is cheap. If the agent has already made edits in the wrong files, correction requires either reverting those edits or explaining why they should be ignored, both of which add more noise to context.
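The checkpoint can be expressed as a gate between planning and editing. Everything here is hypothetical scaffolding: `approve` stands in for whatever mechanism (a prompt, a CLI confirmation, a review UI) puts a human between the proposed file list and the first write:

```python
# Sketch of a human checkpoint between planning and editing. approve() is
# the human gate: it receives the proposed files and returns a (possibly
# corrected) list, or raises to abort. All names are hypothetical.

def run_with_checkpoint(plan_files, approve):
    confirmed = approve(plan_files)
    edits = []
    for path in confirmed:
        edits.append(f"edit {path}")  # placeholder for real write_file calls
    return edits

# Example: the human redirects a wrongly targeted auth module to the DB layer.
edits = run_with_checkpoint(
    ["src/auth/middleware.py"],
    approve=lambda files: ["src/db/queries.py"],
)
```

Because the gate sits before any write, the correction is a cheap substitution of one file list for another, rather than a revert plus an explanation appended to an already-contaminated context.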
What Tasks to Give Agents
This structural property predicts which tasks agents handle well and which they struggle with. The gap is not about model capability; it’s about how much opportunity there is for a wrong premise to accumulate before grounding feedback arrives.
Agents work consistently well on tasks where the relevant starting point is named or findable in one search: a failing test that points to a specific module, a feature addition to an identified file, a rename or type change where the scope is bounded. These tasks start with a correct premise by construction.
Agents struggle with open-ended investigations: “my app is slow,” “there’s a memory leak somewhere,” “users report intermittent failures.” These require forming and testing hypotheses about where the problem is. Each hypothesis generates tool calls; wrong hypotheses generate wrong-direction tool calls that persist in context. The more hypotheses required, the more contaminated the context becomes before the correct one is confirmed.
This is why good agent use involves front-loading the diagnostic work yourself. Identify the specific failing component before handing the task to the agent. The agent’s loop is well-suited to fixing a known problem; it is less well-suited to discovering an unknown one, because discovery requires branching and backtracking in a system that can only append.