The ACI Problem: Tool Design as the Hidden Variable in Coding Agent Performance

The first generation of coding agent prototypes gave language models a shell and told them to use it. bash() did everything: read files with cat, edit with sed, navigate with find. This worked often enough to be compelling and failed often enough to be unreliable. The failure modes clustered around the places where standard Unix tools behave in ways that are difficult to reason about from text alone.

The SWE-agent paper from Princeton in 2024 named this problem and gave it a framework. The authors coined the term “Agent-Computer Interface” (ACI) to describe the tool layer between an LLM and the computer it operates. The parallel to Human-Computer Interface is deliberate: just as UX research showed that interface design shapes what users can accomplish independently of their underlying intelligence, ACI design shapes what models can accomplish independently of their raw capability. Simon Willison’s guide to how coding agents work covers the agent loop and common tool patterns; the ACI framing offers a lens for understanding why similar agents with similar models produce substantially different results.

What Makes a Tool Hard for an LLM to Use

Standard Unix tools were designed for humans who can read terminal output contextually, page through results, and maintain state in their heads across a session. An LLM processing tool output sees text appended to an ever-growing conversation. The differences that follow from this are concrete.

cat file.py dumps the file with no line numbers. An LLM that wants to reference line 47 has to count from the top. If the file is longer than a few hundred lines, the output is likely to be truncated by the harness or later dropped from the conversation history, and subsequent references to specific lines become unreliable. grep -r "function_name" . prints only the matching lines, with no surrounding context unless a flag such as -C is passed, so a model trying to understand whether a match is a definition or a call site has to issue follow-up commands. Each follow-up adds a tool call, extends the conversation, and pushes earlier content further from the model’s effective attention.

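sed -i 's/old/new/g' file.py treats the pattern as a regular expression, so unescaped metacharacters or delimiter characters either break the command or match something other than the intended text. It also succeeds silently whether it replaces zero occurrences, exactly one, or more than intended. The model receives no structured feedback about what changed, only an exit code.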

SWE-agent measured the cost of these friction points and found that purpose-built tools consistently outperformed raw shell access. The key design principles they identified:

Line-numbered output on demand. When the agent opens a file, it sees the file with explicit line numbers. It can then jump to a specific line and see that region with surrounding context. This anchors the model’s references and makes subsequent edits more precise.

Bounded output with explicit truncation markers. A tool that shows 100 lines and says “showing lines 80-179 of 450” gives the model enough information to navigate without filling the context window with irrelevant content.

Edit by range, not by pattern. Specifying edits by line range is unambiguous. There is no failure mode where the wrong occurrence of a string gets modified.

Structured error feedback. When a command fails, the tool returns a structured error with diagnostic context, not just an exit code and stderr dump.
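The principles above can be sketched in a few lines of Python. This is a minimal illustration, not SWE-agent’s actual implementation: the names view_window, edit_range, and ToolResult are hypothetical, and the 100-line window is taken from the example figure in the text.

```python
from dataclasses import dataclass


@dataclass
class ToolResult:
    ok: bool
    message: str  # the text that would be returned to the model


def view_window(lines: list[str], start: int, window: int = 100) -> ToolResult:
    """Show up to `window` lines from 1-indexed `start`, each with its line number."""
    total = len(lines)
    if total == 0:
        return ToolResult(True, "[file is empty]")
    if not 1 <= start <= total:
        # Structured error: say what was wrong and what the valid range is.
        return ToolResult(False, f"error: start={start} is outside 1-{total}")
    end = min(start + window - 1, total)
    header = f"[showing lines {start}-{end} of {total}]"
    body = [f"{n}: {text}" for n, text in enumerate(lines[start - 1:end], start=start)]
    return ToolResult(True, "\n".join([header, *body]))


def edit_range(
    lines: list[str], start: int, end: int, new_lines: list[str]
) -> tuple[list[str], ToolResult]:
    """Replace the 1-indexed inclusive range start..end with new_lines."""
    total = len(lines)
    if not (1 <= start <= end <= total):
        msg = f"error: range {start}-{end} is invalid for a file of {total} lines; no changes made"
        return lines, ToolResult(False, msg)
    updated = lines[:start - 1] + new_lines + lines[end:]
    msg = f"replaced lines {start}-{end} with {len(new_lines)} line(s); file now has {len(updated)} lines"
    return updated, ToolResult(True, msg)


# Hypothetical usage on a 450-line file.
source = [f"line {i}" for i in range(1, 451)]
print(view_window(source, 80).message.splitlines()[0])
source, result = edit_range(source, 2, 3, ["merged line"])
print(result.message)
```

Both functions return the same result shape on success and on failure, so the model always gets a message it can act on instead of a bare exit code.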

The Philosophical Divide

The ACI insight created a genuine split in coding agent design philosophy. Some agents use purpose-built tools tuned for model cognition; others expose raw shell access and trust capable models to figure out appropriate tool usage. Both approaches produce working systems, but they make different bets about where errors originate.

Aider sits firmly in the structured-tools camp. It does not give models a general bash tool. It has specific operations for reading files, editing via its search/replace block format, running tests, and committing to git. The model works within this structured interface, and the structure prevents entire categories of errors.

Claude Code takes the opposite position. It gives models a full Bash tool capable of running arbitrary commands, alongside purpose-built tools for common operations: Read, Write, Edit, Grep, and Glob. The bet is that a capable model in a real environment can navigate with standard tools when necessary, while the structured tools handle the common cases efficiently. This is more flexible but also means the model can default to cat for large files instead of using the bounded Read tool with line range parameters.

SWE-agent itself uses the most opinionated ACI: its open, goto, scroll_down, scroll_up, edit, search_file, and search_dir commands form a complete interface that bears almost no resemblance to a Unix shell. The tool surface is small and well-characterized, which produces very consistent model behavior at the cost of flexibility.

The benchmark numbers illustrate what is at stake. SWE-agent’s original implementation resolved roughly 12-14% of the issues in the full SWE-bench test set. Recent systems with ACI-optimized tools and better underlying models reach 40-60% on SWE-bench Verified, the human-validated subset released later in 2024. The two figures come from different task sets, so the gap is indicative rather than exact, but both model capability and interface design contribute to it; improvements in one without the other produce smaller gains than both together.

Tool Schemas Shape Model Behavior

Beyond the operations themselves, how tools are defined in the API call affects how reliably models invoke them. Tool definitions include a name, a description, and a JSON schema for parameters. These are not just documentation; they are instructions the model reads at inference time.

Compare two definitions for a file-reading tool:

{
  "name": "read_file",
  "description": "Read the contents of a file",
  "input_schema": {
    "type": "object",
    "properties": {
      "path": { "type": "string" }
    }
  }
}

versus:

{
  "name": "Read",
  "description": "Read a file's contents. For large files, specify start_line and end_line to read only the relevant section rather than the whole file.",
  "input_schema": {
    "type": "object",
    "properties": {
      "path": {
        "type": "string",
        "description": "Absolute path to the file"
      },
      "start_line": {
        "type": "integer",
        "description": "First line to read (1-indexed). Omit to read from the beginning."
      },
      "end_line": {
        "type": "integer",
        "description": "Last line to read (inclusive). Omit to read to the end."
      }
    }
  }
}

The second version encodes a usage convention in the schema itself. The model is not just told that line-range reading is possible; the description guides it toward that approach for large files. This shifts behavior through interface design, not training.

The description field in tool schemas is a behavioral contract. Poorly worded descriptions produce inconsistent invocations. If edit_file and write_file both exist and neither description says when to prefer one over the other, the model chooses between them arbitrarily. A tool description noting that old_str must be unique produces fewer ambiguous replacements than the same tool without that constraint documented. These are not minor implementation details; they are the difference between an agent that reliably completes multi-step tasks and one that fails at the third or fourth action in a sequence.
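A documented uniqueness constraint works best when the tool also enforces it. The sketch below shows one way a handler might do that, assuming a hypothetical replace_unique function that takes the old_str and new_str parameters mentioned above; it does not reproduce any particular agent’s edit tool.

```python
def replace_unique(text: str, old_str: str, new_str: str) -> dict:
    """Replace old_str with new_str only if old_str occurs exactly once in text."""
    if not old_str:
        return {"ok": False, "text": text, "error": "old_str must not be empty"}

    # Record the 1-indexed line on which each non-overlapping match starts.
    match_lines = []
    pos = text.find(old_str)
    while pos != -1:
        match_lines.append(text.count("\n", 0, pos) + 1)
        pos = text.find(old_str, pos + len(old_str))

    if len(match_lines) != 1:
        # Refuse to guess: report how many matches there were and where.
        return {
            "ok": False,
            "text": text,
            "error": (
                f"old_str matched {len(match_lines)} times (lines {match_lines}); "
                "it must match exactly once. No changes made."
            ),
        }
    return {
        "ok": True,
        "text": text.replace(old_str, new_str, 1),
        "changed_line": match_lines[0],
    }


# Hypothetical usage: "= 1" is ambiguous, "c = 2" is not.
sample = "a = 1\nb = 1\nc = 2\n"
print(replace_unique(sample, "= 1", "= 9")["error"])
print(replace_unique(sample, "c = 2", "c = 3")["changed_line"])
```

The error message lists the lines where the ambiguous matches start, so the model can retry with a longer, more specific old_str instead of issuing a separate search.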

What This Means for Evaluating and Building Agents

The ACI framing has a practical consequence: evaluating a coding agent by model capability alone misses much of what matters. A weaker model with a better-designed tool interface can outperform a stronger model with raw shell access on many tasks, because the structured interface prevents failure modes that compound across a multi-step session.

For teams building coding agent infrastructure on top of existing scaffolding, the tool layer deserves treatment as a first-class design concern. A Bash tool that truncates output at 8,000 characters and surfaces a truncation marker is more reliable than one that returns unbounded output or truncates silently. The model can navigate around a known limit rather than silently working from incomplete data. A dedicated Grep tool with a configurable output mode is more predictable than asking the model to construct a ripgrep invocation from scratch.
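A truncating shell wrapper of this kind might look like the following. It is a sketch under stated assumptions: the function names run_command and truncate_output, the result dictionary, and the 30-second timeout are all hypothetical, and only the 8,000-character limit comes from the text.

```python
import subprocess

MAX_CHARS = 8_000  # limit taken from the text; tune to the context budget


def truncate_output(output: str, limit: int = MAX_CHARS) -> str:
    """Keep the first `limit` characters and append an explicit marker if any were cut."""
    if len(output) <= limit:
        return output
    omitted = len(output) - limit
    marker = f"\n[output truncated: {omitted} of {len(output)} characters omitted]"
    return output[:limit] + marker


def run_command(command: str, timeout: float = 30.0) -> dict:
    """Run a shell command and return a bounded result for the model."""
    try:
        proc = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return {
            "exit_code": None,
            "output": f"[command timed out after {timeout} s]",
            "truncated": False,
        }
    combined = proc.stdout + proc.stderr
    return {
        "exit_code": proc.returncode,
        "output": truncate_output(combined),
        "truncated": len(combined) > MAX_CHARS,
    }


# Hypothetical usage.
result = run_command("echo hello")
print(result["exit_code"], result["output"].strip())
```

This version keeps the head of the output. Compiler and test failures often surface at the end, so keeping the tail, or a head and tail pair, is an equally reasonable choice. What matters is that the marker tells the model something was cut.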

For teams evaluating existing agents, the diagnostic questions are less “what model does this use?” and more “what does the agent do when a file is 2,000 lines long?” or “what happens when a tool invocation fails partway through a multi-step edit?” These are ACI questions, and the answers reveal a structural reliability ceiling: a limit set by the interface that sits below what the model itself could achieve.

The connection to interface design more broadly is not superficial. The same principles that make an API easy to use correctly and hard to use incorrectly apply to tool schemas. Defaults that match the common case, descriptions that encode usage conventions, parameter names that are unambiguous: all of these reduce cognitive load for the model in the same way they reduce cognitive load for a human developer reading API documentation. The model is not a special case; it is a very fast reader that makes decisions based entirely on what the interface communicates.

Simon Willison’s guide frames coding agents as tools running in a loop. The ACI research adds a layer to that framing: the tools themselves are a design artifact, and the quality of that design determines how much of the model’s capability translates into completed tasks.