
The Message Trace Is Your Debugger: Diagnosing Coding Agent Failures

Source: simonwillison

When a coding agent does something unexpected, most developers reach for print statements or hope the logs say something useful. The agent already produces a complete execution trace: the messages array it uses for every API call. Every tool call, every model response, every error message is captured there. Understanding how to read that trace is what separates diagnosing a failure in five minutes from spending an hour guessing.

Simon Willison’s guide on how coding agents work covers the loop architecture clearly. This post is about what to do when that architecture produces wrong results.

The Messages Array Is the Execution Trace

A coding agent’s entire working state lives in the messages array passed to each API call. The model has no persistent memory between calls, no side-channel state, no hidden context. When something goes wrong, everything the model saw and every decision it made is in that array.

Adding structured logging to the standard agent loop is a one-time change that pays off repeatedly:

import json
import anthropic

client = anthropic.Anthropic()

def run_agent(task: str, tools: list, log_file: str = "trace.jsonl") -> str:
    messages = [{"role": "user", "content": task}]
    
    with open(log_file, "w") as f:
        f.write(json.dumps({"turn": 0, "event": "start", "task": task}) + "\n")
    
    for turn in range(50):  # hard limit
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=8192,
            tools=tools,
            messages=messages
        )
        
        with open(log_file, "a") as f:
            f.write(json.dumps({
                "turn": turn + 1,
                "stop_reason": response.stop_reason,
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
                "content": [b.model_dump() for b in response.content]
            }) + "\n")
        
        messages.append({"role": "assistant", "content": response.content})
        
        if response.stop_reason == "end_turn":
            return response.content[-1].text
        
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                output = dispatch_tool(block.name, block.input)  # your tool router
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": str(output)
                })
        
        messages.append({"role": "user", "content": tool_results})
        
        with open(log_file, "a") as f:
            f.write(json.dumps({"turn": turn + 1, "tool_results": tool_results}) + "\n")
    
    raise RuntimeError(f"Turn limit reached. Trace saved to {log_file}")

That JSONL file is the closest thing to a debugger you will get for the agent loop. Each line is a turn; each turn records what the model saw, what it decided to do, and how many tokens were consumed getting there.
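Reading the file back does not require tooling. A minimal summarizer, assuming the field names written by the logging code above:

```python
import json

def summarize_trace(path: str) -> list[str]:
    """Produce a one-line summary per logged model turn."""
    lines = []
    with open(path) as f:
        for raw in f:
            entry = json.loads(raw)
            if "stop_reason" not in entry:
                continue  # skip the start event and tool_result records
            tools = [b["name"] for b in entry["content"] if b.get("type") == "tool_use"]
            lines.append(
                f"turn {entry['turn']}: stop={entry['stop_reason']} "
                f"in={entry['input_tokens']} out={entry['output_tokens']} "
                f"tools={tools}"
            )
    return lines
```

Piping its output through less after a failed run is usually the first debugging step.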

What a Clean Run Looks Like

Before you can recognize failure, know what success looks like in the trace. A clean bug-fix session on a TypeScript codebase:

Turn 1: stop_reason=tool_use, input_tokens=3200
  → calls Read on auth/session.ts
Turn 2: stop_reason=tool_use, input_tokens=5100
  → calls Read on auth/middleware.ts
Turn 3: stop_reason=tool_use, input_tokens=5800
  → calls Grep for 'expireSession'
Turn 4: stop_reason=tool_use, input_tokens=6100
  → calls Edit on auth/session.ts
Turn 5: stop_reason=tool_use, input_tokens=6400
  → calls Bash: npx jest auth
Turn 6: stop_reason=end_turn, input_tokens=7200
  → returns final answer

The input token count grows steadily but not explosively. Each tool call serves the task. The model reaches a conclusion in six turns. Compare this to a failure pattern.

Common Failure Signatures

Context drift shows up as the model repeating work it already completed. In the trace, you see a Read call on a file that was already read four turns ago, or an Edit attempting to change something the model should know is already changed. The model has not “forgotten” in any meaningful sense: the content is still in the context window. But the lost-in-the-middle research from Stanford and Berkeley demonstrated that LLM recall degrades substantially for content placed in the middle of long contexts. In a 30-turn session, a file read at turn 3 may as well not exist by turn 28.

The trace makes this diagnosable rather than mysterious. You can see exactly when the redundant read happens and how far back the original read was. When context drift appears consistently on tasks above a certain complexity, the architectural response is task decomposition: sub-agents with fresh context windows handle sub-tasks and return summarized results rather than raw transcripts.
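A redundant-read check over the same trace makes the pattern concrete. This sketch assumes the Read tool takes a `path` input, which may differ in your tool definitions:

```python
import json

def find_redundant_reads(path: str) -> list[tuple[int, int, str]]:
    """Return (turn, original_turn, file) for each Read whose target
    was already read earlier in the session."""
    first_read: dict[str, int] = {}
    redundant = []
    with open(path) as f:
        for raw in f:
            entry = json.loads(raw)
            for block in entry.get("content", []):
                if block.get("type") == "tool_use" and block.get("name") == "Read":
                    target = str(block["input"].get("path"))
                    if target in first_read:
                        redundant.append((entry["turn"], first_read[target], target))
                    else:
                        first_read[target] = entry["turn"]
    return redundant
```

A turn-25 re-read of a turn-3 file is a strong signal that the task has outgrown a single context window.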

Tool call loops appear as the same tool called repeatedly with slight variations, each returning similar errors, without the model changing strategy. Three turns of identical bash failures means the error message is not giving the model enough information to diagnose the problem. Three turns of file path guesses means the model is searching rather than exploring: it needs a glob or directory listing tool to ground itself before attempting reads. Three turns of failed string-match edits usually means the file was modified earlier in the session and the model’s expected content no longer matches reality.

Each of these has a different fix, but all of them are visible in the trace before you start guessing. The tool results contain the error messages the model received; reading them tells you what information the model had when it decided to retry.
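Loops of this kind are also detectable mechanically. A sketch that flags runs of consecutive turns each calling the same single tool, using the trace fields written by the logging code above:

```python
import json
from itertools import groupby

def find_tool_loops(path: str, threshold: int = 3) -> list[tuple[str, int]]:
    """Return (tool_name, run_length) for runs of at least `threshold`
    consecutive model turns that each call exactly one tool, always
    the same one."""
    per_turn = []
    with open(path) as f:
        for raw in f:
            entry = json.loads(raw)
            if "stop_reason" not in entry:
                continue  # skip non-model-turn records
            names = [b["name"] for b in entry["content"] if b.get("type") == "tool_use"]
            per_turn.append(names[0] if len(names) == 1 else None)
    loops = []
    for name, group in groupby(per_turn):
        run = len(list(group))
        if name is not None and run >= threshold:
            loops.append((name, run))
    return loops
```

A hit on Bash usually points at uninformative error output; a hit on Edit usually points at stale expected content.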

Hallucinated inputs show up as tool calls referencing things that do not exist: a path to a file that was never created, a function name not present in the codebase, a bash command with plausible but incorrect flags. These cluster early in sessions where the model attempts to act before grounding itself. A forced orientation step at the start, reading a directory listing or project structure, reduces their frequency. When you see hallucinated inputs mid-session, look at what the model read before that turn: the input it was working from was probably ambiguous or incomplete.
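One way to force that orientation step is to seed the first user message with a real file listing, so the model starts grounded rather than guessing. A sketch; the message framing is an assumption to adapt to your loop:

```python
import os

def project_listing(root: str, max_entries: int = 200) -> str:
    """Walk the project tree (skipping dot-directories) and return a
    sorted relative-path listing to ground the model's first turn."""
    paths: list[str] = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            paths.append(os.path.relpath(os.path.join(dirpath, name), root))
    return "\n".join(sorted(paths)[:max_entries])
```

In the agent loop, the seed becomes `messages = [{"role": "user", "content": f"Project files:\n{project_listing('.')}\n\n{task}"}]`, which costs a few hundred tokens and removes the most common source of hallucinated paths.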

Reading the Token Budget

The input token count per turn is the most useful single metric in the trace. Watch for these patterns:

A sharp jump in input tokens at a specific turn means a tool returned a large output at that turn. If the large output was a file read, check whether the entire file was necessary: if the model was looking for a single function, a grep would have retrieved the relevant code with far fewer tokens. If the large output was command output, check whether it was truncated appropriately or whether the agent received 50,000 characters of test output when the first 2,000 would have been sufficient.

Input tokens growing by small, similar increments across many turns usually means the model is making many small exploratory tool calls rather than committing to a plan. This is the "tool call avalanche" pattern. It often correlates with vague task descriptions: more specific tasks produce more direct action.
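Both patterns fall out of a simple scan over input tokens per turn. A sketch that flags sharp jumps, assuming the field names written by the logging code above:

```python
import json

def token_jumps(path: str, factor: float = 2.0) -> list[tuple[int, int, int]]:
    """Return (turn, prev_input_tokens, input_tokens) wherever input
    tokens grew by more than `factor` between consecutive model turns."""
    jumps, prev = [], None
    with open(path) as f:
        for raw in f:
            entry = json.loads(raw)
            if "input_tokens" not in entry:
                continue  # skip start and tool_result records
            cur = entry["input_tokens"]
            if prev is not None and cur > prev * factor:
                jumps.append((entry["turn"], prev, cur))
            prev = cur
    return jumps
```

Each flagged turn points directly at the tool result that ballooned the context; the trace line before it names the tool.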

For agent loops that run repeatedly against the same codebase, Anthropic's prompt caching is worth enabling and tracking in the trace. Mark stable prefix content, like project documentation or large file reads that do not change between sessions, with cache_control: {"type": "ephemeral"}. The response usage includes cache_read_input_tokens alongside input_tokens, which tells you exactly what was served from cache. Cache reads cost 10% of normal input token pricing on Claude's API. For batch workflows running dozens of agent sessions against a shared codebase context, this reduces costs significantly, and the trace makes the savings visible per run.
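The shape of a cacheable prefix is a system block with cache_control set on the last stable block; the helper name here is illustrative:

```python
def cached_system_prefix(stable_docs: str) -> list[dict]:
    """Build system blocks with the stable prefix marked cacheable.
    On subsequent calls, everything up to this breakpoint is served
    from cache rather than reprocessed."""
    return [
        {
            "type": "text",
            "text": stable_docs,
            "cache_control": {"type": "ephemeral"},
        }
    ]
```

Pass this as `system=cached_system_prefix(docs)` to client.messages.create, then log `response.usage.cache_read_input_tokens` next to `input_tokens` in the trace to confirm the cache is actually being hit.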

External Observability

For anything beyond a personal tool, structured observability at the service level matters. Langfuse and LangSmith both support LLM tracing with turn-level breakdowns: token counts, tool call names, latency per step, and cost estimates. Both have open-source self-hosted options, which matters if your agent runs against sensitive code.

For simpler setups, the Anthropic SDK uses httpx for HTTP transport, which allows request-level hooks:

import httpx
import anthropic

def log_request(request: httpx.Request) -> None:
    # write to your metrics system
    pass

client = anthropic.Anthropic(
    http_client=httpx.Client(
        event_hooks={"request": [log_request]}
    )
)

This gives you raw request/response access without depending on an external tracing library. Combined with the JSONL trace file, it produces enough signal for systematic debugging.

Working Backwards From Failures

The most efficient debugging approach is to start at the end of the trace and work backwards. The last few turns show what the model was doing immediately before it failed, got stuck, or returned a wrong result. That is the immediate cause.
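Extracting the tail of the trace is trivial once the JSONL file exists. A sketch, again assuming the field names from the logging code above:

```python
import json

def tail_turns(path: str, n: int = 5) -> list[dict]:
    """Return the last n model turns, newest first, for backwards reading."""
    turns = []
    with open(path) as f:
        for raw in f:
            entry = json.loads(raw)
            if "stop_reason" in entry:
                turns.append(entry)
    return list(reversed(turns))[:n]
```

Reading those entries newest-first puts the immediate cause in front of you before any scrolling.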

Working backwards, identify the turn where the model made the decision that led to the failure. Was the information it had at that turn accurate? Was the tool result it received complete? Did an earlier tool call return something ambiguous that the model interpreted incorrectly?

Most coding agent failures are not bugs in the loop implementation. The loop is mechanically simple and hard to get wrong. Failures almost always trace back to the model making a plausible decision given incomplete or misleading context. The trace tells you what context the model had at each decision point. That is the information you need, and it is already there.
