What Makes an Agent an Agent
The term “agentic engineering” has been circulating more in 2026, and Simon Willison’s guide on agentic engineering patterns pins down what it means better than most definitions I’ve seen. But the definition itself is less interesting than what the discipline demands of you as a builder.
At its core, an agent is an LLM that can take actions in a loop. Instead of sending a prompt and receiving a response, you send a prompt, the model decides to use a tool, you run that tool, you feed the result back, the model reasons about it, and so on until the task is complete or something goes wrong. That loop is the distinguishing feature. Everything else (the memory systems, the multi-agent coordination, the specialized prompts) is built on top of it.
What Willison identifies as “agentic engineering” is the discipline of building and managing that loop reliably. It’s not prompt engineering, though prompts matter. It’s not fine-tuning, though that can help in some contexts. It’s the software engineering work of constructing scaffolding around a language model so it can do useful work over time without you holding its hand at every step.
The Scaffolding Problem
When I first started integrating LLMs into Discord bot work, the natural starting point was one-shot completions. You send a message, you get a response, you post it to the channel. That’s straightforward to reason about and trivial to debug. The LLM is a function: input goes in, output comes out.
As soon as you introduce tools, that model breaks. Now the LLM can call a web search, read a file, query a database, or send a message. The execution path is no longer linear. The model might call three tools in sequence, get an error on the second one, try a fallback, and eventually produce a response that depends on state accumulated across five round trips. Reasoning about that system requires a different mental model entirely.
The scaffolding is the code that manages all of that. It handles the tool dispatch loop, surfaces errors back to the model in a useful form, decides when to truncate context, manages retries, and decides when to give up. None of this logic lives inside the LLM. The model doesn’t know your tool has a rate limit or that its JSON output was malformed. Your scaffolding has to handle that, and handle it in a way that doesn’t confuse the model on the next iteration.
This is the part that most writing about agents underplays. The interesting engineering isn’t teaching the model to reason better through clever prompting. It’s building reliable infrastructure around a component that is fundamentally probabilistic and stateless.
What the Tool Loop Looks Like in Practice
The basic pattern for agentic tool use follows what’s often called the ReAct loop, named for the “Reasoning and Acting” paper from 2022. In practice, with the Anthropic SDK, it looks something like this:
```python
import anthropic

client = anthropic.Anthropic()


def run_agent(user_input: str) -> str:
    # tool_definitions and dispatch_tool are defined elsewhere in the bot.
    messages = [{"role": "user", "content": user_input}]
    while True:
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=4096,
            tools=tool_definitions,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})

        tool_uses = [b for b in response.content if b.type == "tool_use"]
        if not tool_uses:
            # No tool calls means the model is done: return its final text.
            text_blocks = [b for b in response.content if b.type == "text"]
            return text_blocks[-1].text

        # Execute each requested tool and feed the results back in.
        tool_results = []
        for tool_use in tool_uses:
            result = dispatch_tool(tool_use.name, tool_use.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": str(result),
            })
        messages.append({"role": "user", "content": tool_results})
```
That’s about 30 lines. A production-ready version is considerably longer, because you need to handle tool execution errors, token budget management (the messages list cannot grow forever), tool call validation (models sometimes produce malformed inputs), infinite loop detection, timeouts, and logging in a form you can actually debug later.
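As one illustration of what that hardening looks like, here is a sketch of a dispatch wrapper that converts every failure mode into a string the model can read on the next iteration instead of crashing the loop. The names (`safe_dispatch`, `MAX_TOOL_CALLS`, the error-message wording) are assumptions for illustration, not part of any SDK:

```python
# Hypothetical hardened wrapper around the dispatch step in the loop above.
# Every failure becomes a readable error string fed back to the model.

MAX_TOOL_CALLS = 25          # circuit breaker against runaway loops
MAX_RESULT_CHARS = 20_000    # truncate oversized tool output

def safe_dispatch(name, args, registry, call_count):
    """Run one tool call; never raise, always return a string."""
    if call_count > MAX_TOOL_CALLS:
        return "error: tool call budget exhausted; summarize and stop"
    tool = registry.get(name)
    if tool is None:
        return f"error: unknown tool {name!r}"
    try:
        result = str(tool(**args))
    except TypeError as e:
        # The model produced malformed or missing arguments.
        return f"error: bad arguments for {name}: {e}"
    except Exception as e:
        # The tool itself failed (network error, rate limit, etc.).
        return f"error: {name} failed: {e}"
    if len(result) > MAX_RESULT_CHARS:
        result = result[:MAX_RESULT_CHARS] + "\n[truncated]"
    return result
```

The key design choice is that errors go back into the conversation rather than up the call stack: the model gets a chance to retry or route around the failure.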
The scaffolding is where most of the work lives. A mature agentic system might have 500 lines of scaffolding for every 50 lines of prompt content.
Memory Is a Systems Design Problem
One of the sharper observations in Willison’s guide is that memory in agentic systems is an architecture problem, not a prompt engineering problem. There are roughly four kinds of memory an agent can use:
- In-context memory: whatever fits in the current context window
- External retrieval: a vector database or search index queried via a tool
- In-weights memory: what the base model learned during training
- Cache memory: KV cache reuse across repeated calls
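The external-retrieval flavor is the one you wire up explicitly. With the Anthropic API it would be exposed to the model as a tool definition in JSON-schema form, something like the sketch below (the `search_notes` name and its backing note store are illustrative, not a real tool):

```python
# Illustrative tool definition for the "external retrieval" memory type,
# in the JSON-schema shape the Anthropic Messages API expects.
search_tool = {
    "name": "search_notes",
    "description": "Search the bot's long-term note store and return "
                   "the most relevant snippets for a query.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query"},
            "limit": {"type": "integer", "description": "Max results to return"},
        },
        "required": ["query"],
    },
}
```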
Most discussions of agent memory focus on retrieval-augmented generation, which suits a particular class of problem well. The subtler issue is deciding what to put into context at all. Context windows are not free. Every token used on prior history is a token unavailable for the current task. Agents that naively accumulate context will hit limits, get expensive, and produce worse results as the window fills with noise from earlier interactions.
The practical answer is an explicit policy for context management. You might keep the last N turns verbatim, summarize older turns with a smaller cheaper model, and retrieve relevant documents on demand. The shape of that policy determines how well your agent handles long-running tasks. No prompt is going to fix a context management policy that’s fundamentally wrong.
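A minimal sketch of that policy, assuming a `summarize` helper that would in practice call a smaller model (here it is a stub):

```python
# "Keep the last N turns verbatim, collapse everything older into a summary."

KEEP_VERBATIM = 6  # most recent turns kept untouched

def summarize(turns):
    # Placeholder: a real system would call a cheap model here.
    return f"[summary of {len(turns)} earlier turns]"

def apply_context_policy(messages):
    """Collapse everything older than the last KEEP_VERBATIM turns
    into a single summary message at the front of the history."""
    if len(messages) <= KEEP_VERBATIM:
        return messages
    old, recent = messages[:-KEEP_VERBATIM], messages[-KEEP_VERBATIM:]
    summary = {"role": "user", "content": summarize(old)}
    return [summary] + recent
```

Running this once per loop iteration bounds the history size; the policy itself (how many turns, what the summary preserves) is where the real design work is.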
This is a familiar problem if you’ve worked on stateful systems before. An agent managing a long task is not unlike a process managing a large working set. You need eviction policies, you need to distinguish hot from cold data, and you need to be deliberate about what you keep versus what you can reconstruct on demand.
Why Multi-Agent Isn’t Just About Parallelism
Multi-agent architectures, where one orchestrator agent delegates subtasks to specialized worker agents, get a lot of attention for their apparent scalability. The motivation that matters most in practice is more mundane: context isolation.
A single agent accumulating context across a long task will eventually have a degraded context full of partial results, error messages, and abandoned branches from earlier attempts. Breaking the task into isolated sub-agents, each with a clean context and a specific scope, avoids that degradation. The orchestrator stays focused on high-level coordination while workers handle details they never need to share upward.
There is a genuine cost: inter-agent communication is a new failure surface. The orchestrator has to accurately specify subtasks, workers have to produce outputs in a format the orchestrator can consume, and errors in either direction require handling. Anthropic’s own research into effective agent patterns from 2024 found that the biggest gains from multi-agent setups came from isolation rather than parallelism, which is a useful frame for deciding when the pattern is worth reaching for.
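Context isolation in code is mostly about what you don't pass along. In the sketch below, each worker starts from a fresh history containing only its subtask, and the orchestrator keeps only the one-line results, never the worker transcripts. `run_agent` stands in for the tool loop shown earlier and is stubbed here; the task-splitting itself is illustrative:

```python
# Sketch of context isolation in an orchestrator/worker split.

def run_agent(prompt: str) -> str:
    """Stub standing in for the agent loop shown earlier."""
    return "done"

def run_subtask(description: str) -> str:
    # Fresh context: the worker sees its subtask and nothing else.
    return run_agent(f"Complete this subtask and report the result:\n{description}")

def orchestrate(task: str, plan: list[str]) -> str:
    # The orchestrator accumulates only compact results, not transcripts.
    results = [run_subtask(step) for step in plan]
    combined = "\n".join(f"- {step}: {r}" for step, r in zip(plan, results))
    return run_agent(
        f"Task: {task}\nSubtask results:\n{combined}\nWrite the final answer."
    )
```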
In Discord bot work, this shows up as command handlers that operate independently, with a thin routing layer deciding which handler gets a message. It’s less exotic than “multi-agent” implies, but the underlying principle is the same.
The Part That Gets Skipped: Observability
The agentic systems that survive production aren’t the ones with the cleverest prompts. They’re the ones you can debug when something goes wrong.
Agentic systems fail in non-obvious ways. The model might call a tool with valid-looking arguments that produce garbage output, then reason confidently from that garbage toward a wrong conclusion. The tool might succeed but return more data than the model can effectively process. The context might accumulate enough noise that the model starts ignoring its earlier instructions. None of these failures produce obvious error messages.
Logging every tool call, every model response, and every state transition is not optional. You need to be able to reconstruct the exact sequence of events that led to a bad outcome. Structured logging with unique trace IDs per agent session is the minimum viable approach. Anything beyond that, replay testing, sampling for human review, automated anomaly detection, is worth adding as the system matures.
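That minimum viable approach can be sketched in a few lines: one JSON line per event, every event in a session tagged with the same trace ID. The field names here are assumptions, not a standard schema:

```python
import json
import logging
import time
import uuid

# Minimal structured logging: one JSON line per event, all events in a
# session sharing a trace_id so a bad run can be reconstructed later.

logger = logging.getLogger("agent")

def new_trace_id() -> str:
    return uuid.uuid4().hex

def log_event(trace_id: str, event: str, **fields) -> str:
    """Emit one JSON log line tagged with the session's trace id."""
    record = {"trace_id": trace_id, "event": event, "ts": time.time(), **fields}
    line = json.dumps(record, default=str)
    logger.info(line)
    return line
```

Because every line is parseable JSON keyed by `trace_id`, reconstructing a session is a filter, not an archaeology project.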
The Engineering Discipline
What Willison’s framing adds is the word “discipline.” Agentic engineering is not a technique or a framework. It’s a set of practices for building reliable systems that include a language model as an active component over time.
Those practices look a lot like regular software engineering: define clear interfaces between components, handle failure explicitly, log everything, test with adversarial inputs, design for observability. The difference is that one of your components is probabilistic and can produce novel outputs at runtime. Your scaffolding needs to be robust to that novelty without being so restrictive that it prevents the model from doing anything useful.
That balance between giving the model enough latitude to be genuinely useful and enough constraint to be reliable is the central design problem in agentic engineering. The field is still working out which patterns hold up at scale and which look good in demos but fail in production. Building with good observability from the start is the best insurance against finding out the hard way which category your design falls into.