What Actually Makes a Coding Agent Work: The Engineering Beneath the Loop

Sebastian Raschka’s Components of a Coding Agent is one of the cleaner technical breakdowns of how these systems are actually built. It covers the agent loop, tool design, edit formats, and context management with the kind of precision you want when you’re trying to build something rather than just talk about it. What follows is my attempt to go a layer deeper on the pieces I find most interesting, particularly the parts where the engineering decisions have outsized effects on whether an agent is useful or just impressive-looking in demos.

The Loop Is Simple; What Runs Inside It Is Not

Every coding agent is, at its core, a ReAct loop: the model reasons about the task, picks a tool to call, observes the result, and repeats until it decides it’s done. This pattern was formalized in the ReAct paper (Yao et al., 2022), but the basic idea is older than that. What changed with modern LLMs is that the “reason” step got good enough to be practically useful.

The loop structure itself is almost not worth discussing. What matters is everything else: which tools you expose, how you describe them, what edit format you use, how you manage the context budget, and how you verify correctness. Those decisions compound.

The Minimal Tool Set

A coding agent needs surprisingly few tools to be functional. The minimal viable set is roughly:

Read a file
Write or edit a file
Run a shell command
Search file contents (grep-like)
List directory structure

Every serious agent has some variant of these five. Claude Code uses Read, Edit, Bash, Grep, and Glob. Aider drives the whole conversation through a structured chat format but wraps similar operations underneath. SWE-agent (Princeton/Stanford, 2024) has open, edit, search_file, find_file, and execute_interactive.

The interesting question is not what tools to include but how to design them. This is where the research gets genuinely useful.

Tool Schema Is LLM UX

The SWE-agent paper introduced the concept of the Agent-Computer Interface (ACI): the idea that the interface layer between an LLM and a computer deserves the same design attention that UI/UX gets for human-facing software. This framing is worth taking seriously.

The paper showed that changing tool names, descriptions, and parameter schemas produced performance swings larger than switching between model versions. A tool called search_file with a clear description of what it returns outperforms a raw bash invocation that happens to run grep, even if the underlying operation is identical. The model’s understanding of what a tool does before it calls it matters enormously.

This has practical implications. A poorly named tool with an ambiguous description creates the equivalent of a confusing UI: the model makes the wrong call, observes a result it didn’t expect, and has to recover. Recovery burns context tokens and introduces drift. Good tool design prevents the problem upstream.

A few concrete examples of what this looks like in practice:

// Weak schema
{
  "name": "execute",
  "description": "Runs a command",
  "parameters": {
    "cmd": {"type": "string"}
  }
}

// Stronger schema
{
  "name": "run_tests",
  "description": "Run the test suite for a specific file or directory. Returns stdout, stderr, and exit code. Use this to verify that your changes pass tests before marking the task complete.",
  "parameters": {
    "path": {
      "type": "string",
      "description": "File or directory to test (e.g. 'src/auth/' or 'src/auth/test_login.py')"
    }
  }
}

The second version tells the model when to use this tool, what it returns, and what the expected outcome looks like. That specificity reduces misuse.

The Edit Format Problem

How an agent edits files is one of the most consequential engineering decisions in the whole stack, and it’s easy to get wrong.

The naive approach is whole-file replacement: read the file, generate a new version, write it back. This works for small files but scales badly. For a 500-line file, the model has to regenerate all 500 lines to change three of them. That costs tokens, introduces transcription errors on the unchanged lines, and creates enormous diffs that are hard to review.

The natural alternative is unified diff format. Diffs are compact and express intent precisely. The problem is that LLMs are unreliable at generating syntactically valid unified diffs. The @@ headers, line counts, and context requirements create enough surface area for subtle errors that the failure rate in practice is high enough to be a real problem.

Aider solved this with its SEARCH/REPLACE block format, which Raschka covers in the article. The model outputs blocks like:

<<<<<<< SEARCH
def authenticate(user, password):
    return check_hash(password, user.hash)
=======
def authenticate(user, password):
    if not user:
        raise ValueError("User not found")
    return check_hash(password, user.hash)
>>>>>>> REPLACE

This is not a real diff format, but it’s much easier for a model to generate correctly. The scaffolding code does the actual file surgery. Aider’s benchmarks on edit format choice show significant variation in task completion rates depending purely on which format is used, with the SEARCH/REPLACE approach consistently outperforming whole-file and unified diff approaches across most models.

Claude Code uses a similar philosophy with its Edit tool: the model specifies an exact old_string to find and a new_string to replace it with. The precision requirement (the old string must match exactly) turns out to be a feature, not a bug. It forces the model to read the file carefully before editing, which reduces hallucinated edits.

Context Budget Is a First-Class Engineering Concern

Real codebases are millions of tokens. Context windows, even at 200K tokens, cannot hold a full repository. Every coding agent has to solve the problem of what to include.

The blunt approach is to inject everything relevant upfront. This works for small projects and fails predictably for large ones. The more principled approach is search-first navigation: before reading any file, the agent uses grep and glob to find what’s relevant, then reads only those files. This mirrors how a developer actually explores an unfamiliar codebase.

Aider takes this further with its repo map: a compact, ranked summary of the codebase’s symbol structure, built with tree-sitter. Every request includes this map so the model always knows what’s in the repository without having to read every file. The map is dynamically sized to fit within a configurable token budget.

A third approach, used in some enterprise tools like Sourcegraph Cody and GitHub Copilot Workspace, is semantic retrieval: embed all code chunks and retrieve by cosine similarity to the current task. This can surface relevant code the agent would never think to grep for, but it adds infrastructure complexity and retrieval latency.

In practice, most agents that work well use a combination: grep and glob for precise lookups, a lightweight repo map or file tree for structural awareness, and explicit file reads only when needed. The key discipline is treating context as a budget and making deliberate choices about what earns a slot in it.

The Verification Step Is What Makes It Real

An agent that edits code but never checks whether the edits work is just an expensive autocomplete. The verification step, running tests or a linter after each meaningful change, is what distinguishes an agent that can be trusted to complete tasks from one that merely attempts them.

The mechanics are straightforward: after editing, call the test runner, parse the output, and loop back if tests fail. What makes it interesting is how the agent interprets failure. A good agent treats test output as structured feedback: which test failed, what the error message says, and what file and line the error points to. This is richer signal than most humans give themselves when debugging.

SWE-bench, the standard benchmark for coding agents on real GitHub issues, measures exactly this: can the agent produce changes that make the tests pass? Early agents in 2024 scored in the 10-15% range. By late 2024 and into 2025, frontier models with well-engineered scaffolding were reaching 40-50% on the verified subset. That gap is not explained by model capability alone; the scaffolding, including the verification loop, accounts for a large portion of it.

What This Adds Up To

The components Raschka enumerates are not individually surprising. The agent loop, tool calls, file editing, context injection: anyone who has played with these systems has a rough mental model of each piece. What the breakdown makes clear is how much the implementation details of each component determine the overall behavior.

A coding agent with weak tool schemas, whole-file edit format, no context budget management, and no verification loop will still feel impressive on simple tasks. It falls apart on anything with real complexity. The agents that hold up under realistic conditions are the ones where someone made considered engineering decisions at each layer, not just assembled the obvious pieces and hoped the model would fill in the gaps.

Building Ralph, my Discord bot with AI capabilities, has made this very concrete for me. The difference between a bot that occasionally does something useful and one that reliably completes tasks comes down to exactly these details: how the tools are described, how context is managed across turns, and whether there is any feedback mechanism to detect when the model has gone wrong. The loop is easy. Everything inside it is the actual work.