Agents Are Mostly Scaffolding: What Agentic Engineering Actually Is
Source: simonwillison
Simon Willison published a guide on agentic engineering patterns that is worth reading carefully, not because it introduces exotic ideas, but because it names something that has been fuzzy for a while. “Agentic engineering” as a term does real work. It carves out a specific discipline from the broader category of “LLM integration” and forces clarity about what is actually different when you build systems where the model drives sequences of actions.
The short answer is: an agent is a loop. The model produces output, something in the environment changes based on that output, and the model gets a new turn with the result. That loop might run twice or two hundred times. The loop might involve a single model or several. The “something in the environment” might be a function call, a subprocess, a web request, or a write to disk. Strip away the marketing and that is what remains.
But the interesting engineering question is not what an agent is. It is what makes building one hard.
The Tool-Calling Loop
Every major LLM API now supports tool use (also called function calling). You describe a set of tools to the model, the model returns a structured call to one of them, your code executes it, and you append the result to the conversation. Repeat.
A minimal Python loop over the Anthropic API looks something like this:
import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "read_file",
        "description": "Read the contents of a file at a given path",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Absolute file path"}
            },
            "required": ["path"],
        },
    }
]

messages = [{"role": "user", "content": "Summarize the contents of /tmp/notes.txt"}]

while True:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        tools=tools,
        messages=messages,
    )
    if response.stop_reason == "end_turn":
        print(response.content[-1].text)
        break

    # Process tool calls
    tool_uses = [b for b in response.content if b.type == "tool_use"]
    if not tool_uses:
        break

    # The assistant turn (including its tool_use blocks) joins the history
    messages.append({"role": "assistant", "content": response.content})

    tool_results = []
    for tool_use in tool_uses:
        if tool_use.name == "read_file":
            try:
                with open(tool_use.input["path"]) as f:
                    result = f.read()
            except Exception as e:
                result = f"Error: {e}"
        else:
            # Without this branch, an unrecognized tool name would leave
            # `result` undefined (or stale) on the next append
            result = f"Error: unknown tool {tool_use.name!r}"
        tool_results.append({
            "type": "tool_result",
            "tool_use_id": tool_use.id,
            "content": result,
        })

    # Tool results go back to the model as a user turn
    messages.append({"role": "user", "content": tool_results})
This is roughly sixty lines. It works. And it reveals something important: the model's role in this system is to decide which tool to call and with what arguments. The rest (loop control, error handling, result formatting) is regular software.
Frameworks like LangChain, LlamaIndex, and CrewAI wrap this pattern with varying levels of abstraction. Anthropic publishes its own Claude Agent SDK. The frameworks are useful for getting started, but they also obscure the fundamental simplicity of the loop. Several engineers I know have replaced framework-based agents with hand-written loops because the framework’s abstractions started fighting them more than helping them.
The Context Window Is the Real Constraint
Every conversation with an LLM is stateless from the model’s perspective. The model sees the messages array, nothing else. As an agent runs, that array grows: the original task, each tool call, each result. In a long-running task, this is where things get expensive and brittle.
At 200,000 tokens of context (which Claude 3.5 and later models support), you can fit a lot. But at $15 per million tokens on the input side for frontier models, a few hundred tool-call cycles with verbose results can cost real money. More importantly, performance tends to degrade as context grows. Models attending over huge windows of prior tool results are not always better at deciding what to do next; sometimes they are worse.
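The cost growth is easy to underestimate because every API call resends the entire messages array, so cumulative input tokens grow roughly quadratically with the number of cycles. A back-of-envelope sketch, assuming a flat $15 per million input tokens, no prompt caching, and made-up per-cycle sizes:

```python
# Back-of-envelope cost model: each turn resends the whole conversation,
# so cumulative input tokens grow roughly quadratically with turn count.
PRICE_PER_MTOK = 15.00      # USD per million input tokens (illustrative)
initial_context = 2_000     # tokens for the task + tool definitions (assumed)
tokens_per_cycle = 1_500    # tokens added per tool call + result (assumed)

total_input_tokens = 0
context = initial_context
for cycle in range(200):            # 200 tool-call cycles
    total_input_tokens += context   # the full context is sent every turn
    context += tokens_per_cycle     # and grows by one call + one result

cost = total_input_tokens / 1_000_000 * PRICE_PER_MTOK
print(f"{total_input_tokens:,} cumulative input tokens = ${cost:.2f}")
# → 30,250,000 cumulative input tokens = $453.75
```

Under these assumptions, two hundred cycles on a task whose context never exceeds the window still burns thirty million cumulative input tokens. That is the pressure behind every context-management pattern below.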
The patterns that emerge to manage this are borrowed from systems programming. You truncate old results. You summarize completed steps. You maintain a separate external memory (a database, a vector store) and retrieve relevant entries rather than keeping everything in-context. You checkpoint state so a crashed agent can resume without replaying everything from the start.
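The truncation pattern can be sketched in a few lines, assuming the Anthropic-style messages list from the loop above; the threshold, placeholder text, and function name here are mine:

```python
def truncate_old_tool_results(messages, keep_last=3, max_chars=200):
    """Replace tool results older than the last `keep_last` result turns
    with a short placeholder, keeping recent results verbatim."""
    # User-role turns whose content is a list are tool-result turns
    result_turns = [
        i for i, m in enumerate(messages)
        if m["role"] == "user" and isinstance(m["content"], list)
    ]
    old_turns = result_turns[:-keep_last] if keep_last else result_turns
    for i in old_turns:
        for block in messages[i]["content"]:
            content = str(block.get("content", ""))
            if block.get("type") == "tool_result" and len(content) > max_chars:
                block["content"] = content[:max_chars] + " ...[truncated]"
    return messages
```

Summarization and external memory follow the same shape: a pass over the messages array between model turns, shrinking what the model no longer needs verbatim.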
This is not LLM-specific problem-solving. It is cache invalidation, log rotation, and state machine design, applied to a context window instead of RAM or disk.
The Scaffolding Thesis
Willison has been consistent on a point that is easy to underestimate: the non-LLM code is where the real engineering happens. The scaffolding (the loop, the tool implementations, the retry logic, the context management, the output parsing) is where you spend most of your time. The model is a component, not the system.
This has a few practical implications.
First, tool design matters more than prompt design for most agents. A tool with a clear, specific description and a well-typed input schema will be called correctly far more reliably than the same tool with a vague description and loose typing. The model is good at following structure; give it structure.
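Concretely (this contrast is illustrative, not taken from the guide), compare a loose tool definition with the same capability specified tightly, using enums and constraints in the JSON schema:

```python
# A loosely specified tool invites wrong calls:
vague_tool = {
    "name": "search",
    "description": "Search stuff",
    "input_schema": {
        "type": "object",
        "properties": {"q": {"type": "string"}},
    },
}

# The same capability, specified tightly: explicit descriptions, an enum
# for the bounded parameter, numeric limits, and required fields the
# model cannot silently omit.
specific_tool = {
    "name": "search_issues",
    "description": (
        "Search the project's issue tracker. Returns up to `limit` "
        "issues matching `query`, newest first."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Full-text search query"},
            "state": {
                "type": "string",
                "enum": ["open", "closed", "all"],
                "description": "Which issues to include",
            },
            "limit": {
                "type": "integer",
                "minimum": 1,
                "maximum": 50,
                "description": "Maximum number of results",
            },
        },
        "required": ["query", "state"],
    },
}
```

Every constraint in the schema is a decision the model no longer has to guess at.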
Second, reliability requires treating tool calls as potentially failing operations. Network requests time out. Files don’t exist. APIs return unexpected status codes. An agent loop that does not handle these cases will get stuck or produce garbage. The model will often try to work around errors in creative and unhelpful ways if you give it a raw exception trace.
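One way to treat every tool call as a fallible operation is a single dispatcher that converts exceptions into short, actionable messages instead of raw tracebacks. A sketch; the function and registry names are mine:

```python
def run_tool(name, args, registry, max_chars=2_000):
    """Execute one tool call, converting failures into short messages the
    model can act on rather than raw exception traces."""
    fn = registry.get(name)
    if fn is None:
        return {"ok": False, "content": f"Unknown tool: {name}"}
    try:
        result = str(fn(**args))
    except FileNotFoundError as e:
        return {"ok": False, "content": f"File not found: {e.filename}"}
    except TypeError:
        return {"ok": False, "content": f"Bad arguments for {name}: got {sorted(args)}"}
    except Exception as e:
        # One line of error class + message, never a full traceback
        return {"ok": False, "content": f"{type(e).__name__}: {e}"}
    # Cap result size so one huge tool output doesn't flood the context window
    if len(result) > max_chars:
        result = result[:max_chars] + " ...[truncated]"
    return {"ok": True, "content": result}
```

The loop forwards `content` as the tool_result either way; the `ok` flag lets the scaffolding decide whether to retry or escalate without involving the model.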
Third, testing agentic systems requires testing the scaffolding. Unit tests that mock the LLM and verify that specific tool outputs produce specific next actions are more valuable than end-to-end agent runs for most bugs. The model behavior is not under your control; your scaffolding is.
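For example, the dispatch step can be factored out of the loop and exercised against a faked model response, with no API call at all. The helper name and the fake objects here are mine:

```python
from types import SimpleNamespace

def next_tool_results(response, registry):
    """The scaffolding under test: map tool_use blocks in a model response
    to tool_result blocks. (Factored out of the agent loop.)"""
    results = []
    for block in response.content:
        if block.type == "tool_use":
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": registry[block.name](**block.input),
            })
    return results

# Fake the model's output with plain objects: no network, no model
fake_response = SimpleNamespace(stop_reason="tool_use", content=[
    SimpleNamespace(type="tool_use", name="read_file", id="tu_1",
                    input={"path": "/tmp/notes.txt"}),
])
registry = {"read_file": lambda path: f"<contents of {path}>"}

results = next_tool_results(fake_response, registry)
assert results == [{"type": "tool_result", "tool_use_id": "tu_1",
                    "content": "<contents of /tmp/notes.txt>"}]
```

A test like this pins down the scaffolding's behavior for a given model output, which is exactly the part of the system you can make deterministic.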
Multi-Agent Systems and When They Help
Orchestrating multiple models is genuinely useful in a narrow set of cases. The most defensible pattern is specialized subagents: one model that plans, others that execute specific subtasks. This works because planning and execution benefit from different prompt contexts, and because you can parallelize independent subtasks.
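The planner/executor shape can be sketched with the model call abstracted behind a callable, which keeps the orchestration itself testable. All names here are assumptions, not an established API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_plan(task, call_model):
    """call_model(system_prompt, user_prompt) -> str wraps the actual LLM."""
    # The planner's context holds only the task, never execution noise
    plan = call_model("You are a planner. Output one subtask per line.", task)
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]

    # Independent subtasks run in parallel, each in a clean context
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(
            lambda sub: call_model("You are an executor. Complete this subtask.", sub),
            subtasks,
        ))

    # The final synthesis sees summaries, not raw execution transcripts
    return call_model("Combine these results into a final answer.",
                      "\n\n".join(results))
```

The structural payoff is visible in the three contexts: the planner never sees execution detail, each executor never sees its siblings, and the synthesis step sees only results.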
The antipattern is adding agents for the sake of perceived sophistication. A single well-prompted model with good tools reliably outperforms a poorly coordinated multi-agent system. The overhead of agent-to-agent communication (which ultimately goes through context windows too) adds latency and cost for uncertain benefit.
The ReAct paper from 2022 established the basic Reason + Act loop that most agentic systems use, and the academic literature since then has explored multi-agent setups extensively. Most production systems that I am aware of are simpler than the papers suggest is necessary.
The Security Problem Nobody Talks About Enough
Prompt injection is the most underappreciated risk in agentic systems. When your agent reads a webpage, processes an email, or queries a database, the content it retrieves is untrusted input. If that content contains text crafted to look like system instructions, a naive agent will sometimes follow them.
This is not theoretical. Researchers have demonstrated that an agent browsing the web can be redirected to exfiltrate data, call unintended tools, or abandon its original task by adversarial content embedded in pages it visits. Willison has written about this extensively, and it remains an open research problem.
The practical mitigations are architectural: do not give agents more permissions than their narrowest required task, log all tool calls for audit, require human confirmation before irreversible actions, and treat tool results as untrusted input at the application level.
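Those mitigations can be enforced at one choke point that every tool call passes through. A sketch; the policy sets, tool names, and function signature are illustrative, not a prescribed design:

```python
# Tools an agent may call freely vs. those that need human sign-off.
READ_ONLY = {"read_file", "search_issues"}
IRREVERSIBLE = {"send_email", "delete_file", "deploy"}

def gate_tool_call(name, args, audit_log, confirm=input):
    """Enforce allowlist + confirmation + audit before any tool runs.
    Returns True if the call may proceed."""
    audit_log.append({"tool": name, "args": args})  # log everything first
    if name in READ_ONLY:
        return True
    if name in IRREVERSIBLE:
        # Human in the loop before anything that cannot be undone
        answer = confirm(f"Agent wants to run {name}({args}). Allow? [y/N] ")
        return answer.strip().lower() == "y"
    return False  # default-deny anything not explicitly classified
```

The default-deny branch is the important line: a tool the policy has never heard of (including one a prompt-injected model invents) simply does not run.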
What This Discipline Actually Requires
Agentic engineering is software engineering applied to systems with a probabilistic, text-in-text-out component at the center. The skills it draws on are not primarily machine learning skills. They are: API design, error handling, state management, security thinking, cost modeling, and observability.
If you have built reliable distributed systems, or maintained long-running background workers, or debugged race conditions in async code, you already know most of what agentic engineering requires. The new part is learning to treat the model’s output as a structured but unreliable signal, and building the scaffolding that makes a system reliable despite that.
That framing, taken from Willison’s guide and consistent with what practitioners are reporting in production, is more useful than the agent-as-magic-autonomous-worker framing that shows up in demos. Agents are tools. Their reliability is your responsibility.