
The Engineering Work Hidden in a Coding Agent's System Prompt

Source: simonwillison

The demo version of a coding agent is well-understood by now. You set up a loop, pass tool definitions to the model, dispatch tool calls, append results to the conversation, and repeat. Simon Willison’s guide on how coding agents work covers these mechanics clearly, and the implementation is genuinely straightforward. About forty lines of Python gets you a functioning skeleton. What the forty-line skeleton does not show is the system prompt, and the system prompt is where the actual engineering is.
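That skeleton is worth seeing concretely. The sketch below is a minimal, hedged version of the loop: the model call is a stub standing in for a real LLM API, and the tool names, message shapes, and dispatch logic are illustrative rather than any particular vendor's format.

```python
# Minimal agent-loop skeleton. The model call is stubbed; a real
# implementation would call an LLM API and parse tool-call requests
# out of its response. Tool names and message shapes are illustrative.

def run_tool(name, args):
    """Dispatch a tool call to its implementation."""
    tools = {
        "read_file": lambda a: open(a["path"]).read(),
        "echo": lambda a: a["text"],
    }
    return tools[name](args)

def agent_loop(model, task, max_turns=10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = model(messages)                 # stand-in for the LLM call
        messages.append({"role": "assistant", "content": reply})
        if reply.get("tool_call") is None:
            return reply["text"]                # no tool requested: done
        call = reply["tool_call"]
        result = run_tool(call["name"], call["args"])
        # Append the tool result so the model sees it on the next turn.
        messages.append({"role": "tool", "content": result})
    return "max turns exceeded"
```

Everything interesting about the agent's behavior happens inside `model(...)`, and that behavior is shaped by the system prompt this article is about, not by the loop.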

What the System Prompt Has to Do

In a coding agent, the system prompt is not just an introduction. It is the behavioral specification of the entire agent: what it does when uncertain, how it handles tool failures, when it asks for clarification rather than proceeding, what coding style it enforces, how it formats its output, and which categories of action it should refuse. None of this is handled by the tool loop. The loop is infrastructure; it moves data between the model and the runtime. The model’s behavior at each decision point is determined by what the system prompt has instructed it to do.

A coding agent system prompt has to encode several distinct concerns.

Scope and identity. The model needs to know what kind of agent it is and what its operating context is. For a coding agent, this means something specific: “You are an autonomous software engineering agent. Your job is to implement, debug, and refactor code in response to task descriptions. You operate on a real filesystem and can execute shell commands in a sandboxed environment.” The difference between this and a generic assistant framing is not cosmetic. The model needs to understand that it is taking actions in a real environment, not answering questions.

Decision heuristics for ambiguity. The most consequential behavioral question in any coding agent is whether it asks a clarifying question or proceeds based on its best interpretation. Ask too often and the agent is frustrating to use; ask too rarely and it makes assumptions that waste effort or cause damage. The system prompt has to encode a policy. Claude Code, based on observable behavior, appears to apply something like: proceed when there is a plausible, focused interpretation of the request; ask when multiple materially different interpretations exist, or when a potentially destructive action has ambiguous intent.

Tool usage preferences. Even with well-described tools, a model will use them inconsistently without guidance. Whether to prefer grep over reading a full file when looking for a specific symbol, whether to run tests after every edit or only when asked, whether to verify a file path exists before attempting to read it: these are choices with real consequences for efficiency and correctness. Without explicit guidance in the system prompt, the model makes these choices differently on each run, and the variance is hard to debug because the output is often correct even when the process was wasteful.

Error handling policy. When a shell command fails, what should the agent do? When a file it expected to find is missing? When tests fail after an edit that was supposed to fix them? Without guidance, the default behavior is to loop indefinitely, trying variations until the context fills or a token budget is exhausted. A well-engineered agent has explicit policies: retry this class of error up to N times; surface this class of error to the user; abandon and report this class rather than continuing to iterate.
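The "explicit policies per class of error" idea can be sketched as a small dispatch table. The error class names, retry counts, and actions below are invented for illustration; a real agent would encode this policy in the system prompt as prose, but the logic it describes looks like this:

```python
# Per-error-class policy, sketched as a table. Class names, actions,
# and retry counts are illustrative, not taken from any real agent.

RETRY, SURFACE, ABANDON = "retry", "surface", "abandon"

POLICY = {
    "file_not_found":    {"action": RETRY, "max_attempts": 2},
    "test_failure":      {"action": RETRY, "max_attempts": 1},
    "permission_denied": {"action": SURFACE},
    "sandbox_violation": {"action": ABANDON},
}

def decide(error_class, attempts_so_far):
    """Return what the agent should do for this class of error."""
    rule = POLICY.get(error_class, {"action": SURFACE})
    if rule["action"] == RETRY and attempts_so_far < rule["max_attempts"]:
        return RETRY
    if rule["action"] == RETRY:
        return SURFACE  # retries exhausted: report rather than loop forever
    return rule["action"]
```

The point of the table is the bounded default: every error class eventually resolves to surfacing or abandoning, which is exactly the guarantee an unguided model does not give you.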

Output format. How should the agent communicate progress? Should it explain every step or proceed silently? Should it produce a final summary or let the sequence of tool calls speak for itself? These decisions affect user experience as much as correctness.

What These Concerns Look Like Written Out

A minimal but representative sketch of what a system prompt fragment looks like when these concerns are made explicit:

You are an autonomous coding agent. Your task is to implement, debug,
and refactor code in response to user requests. You operate on a real
filesystem in a sandboxed environment.

When interpreting a task:
- If the task has one clear interpretation, proceed with it.
- If the task is ambiguous in a way that would materially affect your
  approach, ask one focused clarifying question before proceeding.
- If a task requires a destructive or irreversible action, confirm
  the intent explicitly before proceeding.

When using tools:
- Prefer grep or glob over reading whole files when looking for a
  specific symbol.
- When editing files, include enough surrounding context in old_string
  to uniquely identify the location.
- Run the test suite after any edit that changes runtime behavior.

When a tool call fails:
- File not found: verify the path with glob before retrying.
- Test failure: analyze the output, attempt one fix, re-run, and report
  if the second run also fails rather than continuing to iterate.
- Shell errors: read the full error output before deciding what to do next.

The content here is illustrative, but the structure is representative: the fragment is entirely behavioral policy, not a description of what the tools do. Tool descriptions handle the latter separately. The system prompt handles when, why, and how to use them, and what to do when things go wrong.

The Testing Problem

System prompt changes are software changes with no obvious test harness. You cannot unit test a prompt the way you test a function. The output of a model given the same prompt varies between runs, and failure modes are distributional rather than deterministic.

The practical approach is a benchmark suite: a set of representative tasks with defined success criteria, run against each prompt variant multiple times and evaluated by a combination of automated checks (did the tests pass? were the right files modified?) and human review (does the approach look sound?). The SWE-bench benchmark and its successors serve this role for evaluating coding agent capabilities at scale. For a custom agent, you build something smaller and more targeted to your actual use case.
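A benchmark suite of this kind reduces to a small harness: run every task several times under every prompt variant and score the results. The sketch below assumes a `run_agent` callable and per-task `check` functions, both of which are placeholders for whatever your agent and success criteria actually are; human review would sit on top of the automated pass rate.

```python
# Tiny benchmark-harness sketch: run each task N times per prompt
# variant, score with an automated check, report pass rates.
# run_agent and the per-task check functions are stand-ins.

def evaluate(prompt_variants, tasks, run_agent, runs_per_task=5):
    """Return automated pass rate per prompt variant across all tasks."""
    scores = {}
    for name, prompt in prompt_variants.items():
        passed = total = 0
        for task in tasks:
            for _ in range(runs_per_task):
                result = run_agent(prompt, task["input"])
                passed += task["check"](result)  # automated success check
                total += 1
        scores[name] = passed / total
    return scores
```

The multiple runs per task are the important part: because failure modes are distributional, a single run per task tells you almost nothing about whether a prompt change helped.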

The challenge is that prompt changes interact in non-obvious ways. Adding a constraint for one behavior can degrade another. Making the retry policy more persistent reduces the rate at which the agent abandons solvable tasks, but it also means the agent iterates longer on genuinely unsolvable ones before reporting failure. These interactions only become visible across a large and diverse enough sample of runs. A prompt that looks better on your five test cases can be worse across a broader distribution.

This is why the Anthropic documentation on prompt engineering recommends iterating against a representative sample set rather than against individual examples. For a coding agent specifically, the sample set needs to cover the full range of task types the agent will encounter: greenfield implementations, bug fixes, refactors, test writing, documentation, and edge cases like tasks with ambiguous requirements or tasks where the most obvious approach turns out to be wrong.

How System Prompt and Tool Descriptions Interact

The system prompt and tool descriptions are not independent. What the system prompt specifies about tool usage affects how much behavioral specification each tool description needs to carry. If the system prompt instructs the agent to always run type checking before finishing a TypeScript task, the type checker tool description can be shorter, because the model already has context for when to use it. If the system prompt is silent, the tool description has to carry more of the load.

The interaction runs the other direction too. A detailed tool description reduces what the system prompt needs to specify, because the model learns from the description what to expect and how to use the tool effectively. The Anthropic tool use documentation covers how tool definitions are structured; the system prompt design that works alongside them is less documented and usually learned through iteration.

In practice, a maintainable split tends to put behavior that applies across all tools in the system prompt, and behavior specific to one tool in its own description. Cross-cutting concerns like error handling philosophy, output format, and decision heuristics belong in the system prompt. Tool-specific constraints like “this tool only accepts relative paths” or “this tool will time out after 30 seconds, so check the output before declaring success” belong in the tool description.
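Written out, that split looks something like the following. The schema shape loosely follows the tool-definition format in the Anthropic tool use documentation (name, description, JSON Schema input), but the prompt text and the `run_shell` tool are invented for illustration:

```python
# Illustrative split: cross-cutting policy lives in the system prompt,
# tool-specific constraints live in each tool's own description.
# Tool name and all text content are invented for this example.

SYSTEM_PROMPT = """\
You are an autonomous coding agent.
When any tool call fails, read the full error output before retrying.
Report a short summary of changes when the task is complete.
"""

TOOLS = [
    {
        "name": "run_shell",
        # The tool-specific constraint stays with the tool, not the prompt.
        "description": (
            "Run a shell command. Times out after 30 seconds; "
            "check the output before declaring success."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
]
```

Keeping the timeout caveat in the tool description means it travels with the tool if the tool is reused in another agent, while the error-handling philosophy stays in one place in the system prompt.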

The Thousand Tokens Nobody Publishes

Claude Code’s system prompt is not publicly documented, but community attempts to observe and reconstruct it through systematic testing suggest it runs to several thousand tokens. The categories that emerge are predictable: detailed role definition, explicit guidance on when to seek clarification versus when to proceed, tool usage preferences, coding conventions, instructions for handling uncertainty, and a long list of edge-case behaviors. The length is not padding; it is the accumulated product of encountering situations where the model made poor decisions and adding guidance to prevent recurrence.

This is the trajectory for any production coding agent. You start with a short system prompt and add to it as you discover the cases where the model does the wrong thing. Over time the prompt grows, because the space of situations a coding agent encounters in real use is large and the model’s default behaviors do not always align with what you actually want.

What This Means in Practice

For someone integrating a coding agent into an existing development workflow, the system prompt is the primary customization surface. The tool set is relatively stable; the desired behaviors vary considerably by codebase, team convention, and task type. A system prompt tuned for a TypeScript frontend project will make different decisions than one tuned for a systems programming project in C or Rust. A system prompt designed for exploratory refactoring will behave differently than one designed for strict test-first development.

Building a coding agent that actually fits a team’s workflow means building a system prompt that encodes that workflow’s preferences, constraints, and norms, and then testing it against enough representative tasks to trust that it generalizes. The scaffolding for the agent loop is straightforward and well-documented. The system prompt is the engineering.
