
The Tool Loop Is the Program: What Agentic Engineering Actually Requires

Source: simonwillison

The agent loop is where the shift from prompt engineering to agentic engineering becomes concrete. In a standard LLM interaction, you send a prompt, receive a response, done. The model is stateless between calls, the context window is bounded by one exchange, and correctness is easy to reason about. Agentic systems break all three of those properties deliberately.

Simon Willison’s guide on agentic engineering defines the core idea clearly: an agent is a system where an LLM drives a loop of actions, where each action can observe the world, call tools, and produce side effects, and where the loop continues until some goal is met or some limit is reached. That definition is straightforward, but its engineering implications are not.

The Loop Is the Program

In traditional software, your program is the logic. You write the control flow, the conditionals, the loops. In an agentic system, you write scaffolding that runs an LLM repeatedly, and the LLM’s outputs determine what happens next. The LLM is both the decision-maker and the source of nondeterminism. This means the thing you are engineering is not the logic itself, but the environment the logic runs in.

This is analogous to writing a runtime rather than an application. The discipline involved is different. You care less about “what should the code do in step 7” and more about “what invariants must hold across every step, and how do I enforce them.”

A minimal agent loop in Python looks something like this:

def run_agent(system_prompt, user_message, tools, max_turns=20):
    messages = [{"role": "user", "content": user_message}]

    for _ in range(max_turns):  # hard cap so the loop always terminates
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=4096,  # required by the Messages API
            system=system_prompt,
            messages=messages,
            tools=tools,
        )

        # Any stop reason other than tool_use ends the loop:
        # end_turn, max_tokens, stop_sequence, and so on.
        if response.stop_reason != "tool_use":
            return extract_final_response(response)

        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = dispatch_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

    raise RuntimeError("agent exceeded max_turns without finishing")

This is only a couple of dozen lines, but the engineering work is almost entirely in what surrounds it: the tool implementations, the system prompt, the error handling, the context management, and the observability layer. The loop itself is trivial. Everything else is where the discipline lives.

Tool Design Is API Design

The tools you give an agent are its interface with the world. Poorly designed tools produce poor agent behavior, not because the model is bad, but because ambiguous APIs produce ambiguous behavior when an LLM is making the calling decisions.

Willison’s guide makes the point that tool descriptions are the contracts the agent works from. If your description says “search for documents,” the model will have to guess what “search” means, what counts as a document, and what format results come back in. If it says “perform a full-text search across the knowledge base and return up to 5 results with title, URL, and excerpt,” the model has enough to work with.

This is exactly the same discipline as writing good library APIs. You want tight argument types, clear error semantics, and no surprising side effects. The difference is that your consumer is a language model that will interpret the schema literally and the description probabilistically.

Anthropic’s tool use documentation recommends keeping tool names unambiguous and using the description field to explain not just what the tool does but when to use it. That “when to use it” clause matters, because it is guidance that helps the model choose between tools at decision time. Without it, the model will infer intent from name and parameter structure alone, which is rarely sufficient when two tools have overlapping domains.

The JSON schema for a well-specified tool ends up doing real work:

{
  "name": "search_knowledge_base",
  "description": "Full-text search over the internal docs. Use this when the user asks about internal processes, policies, or past decisions. Do not use for general web queries.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "The search query. Use natural language, not keywords."
      },
      "max_results": {
        "type": "integer",
        "description": "Number of results to return (1-10). Default 5.",
        "default": 5
      }
    },
    "required": ["query"]
  }
}

Every field in that schema is a constraint on the model’s behavior. Remove any of them and you introduce ambiguity that manifests as inconsistent agent decisions.
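The dispatch_tool function the loop calls is where that contract gets enforced on the code side. Here is a minimal sketch, assuming a plain dict registry; the validation is hand-rolled rather than using a JSON Schema library, and the search implementation is a placeholder:

```python
def search_knowledge_base(query, max_results=5):
    # Placeholder implementation; clamp to the 1-10 range the schema promises.
    max_results = max(1, min(max_results, 10))
    return f"{max_results} results for: {query}"

# Registry mapping tool names to plain Python functions.
TOOL_REGISTRY = {
    "search_knowledge_base": search_knowledge_base,
}

def dispatch_tool(name, tool_input):
    if name not in TOOL_REGISTRY:
        return f"Error: unknown tool '{name}'"
    try:
        return TOOL_REGISTRY[name](**tool_input)
    except TypeError as exc:
        # Bad or missing arguments: return the error as the tool result so
        # the model can read it and self-correct, instead of crashing the loop.
        return f"Error calling {name}: {exc}"
```

Returning errors as tool results rather than raising is a common choice: the model often recovers when it can see what went wrong.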

Context as Shared Mutable State

An agent’s context window accumulates state across every step of the loop. Each tool call appends its inputs and outputs to the message history. In a long-running task, this history can grow large enough to exceed the context limit, which forces you to think about what to keep and what to drop.

This is structurally similar to the memory management problem in long-running services. You cannot keep everything. You have to decide what is still relevant and what can be summarized or discarded. The engineering approach varies: some systems summarize older turns, some implement retrieval over a persistent store so the context window stays shallow, some chunk the task into shorter sub-tasks that each run with limited context.
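One of those strategies, summarizing older turns, can be sketched in a few lines. This is an illustrative shape only: prune_history and the summarize callback are hypothetical, and a real implementation would also need to respect the API's role-alternation rules.

```python
def prune_history(messages, keep_recent=6, summarize=None):
    """Keep the first message (the original task) and the most recent
    turns, collapsing everything in between into one summary message.
    `summarize` is a stand-in for whatever compression you use; it
    could be an extra LLM call or simple truncation.
    """
    if len(messages) <= keep_recent + 1:
        return messages

    head = messages[0]
    middle = messages[1:-keep_recent]
    tail = messages[-keep_recent:]
    summary_text = summarize(middle) if summarize else f"[{len(middle)} earlier turns elided]"
    return [head, {"role": "user", "content": summary_text}] + tail
```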

The ReAct paper (Yao et al., 2022), which formalized the reason-then-act agent pattern, treats the scratchpad as part of the agent’s working memory. That framing is useful: context is not a log, it is an active working space, and what you include or exclude affects model behavior at every subsequent step. A context that accumulates noisy tool results degrades decision quality. A context that aggressively prunes relevant prior state produces agents that repeat mistakes or lose track of their goals.

Observability Is Not Optional

In a standard application, you can add logging after the fact if something goes wrong. In an agentic system, you need structured traces from the start, because the failure modes are harder to reproduce and harder to reason about from logs alone.

When a multi-step agent produces a wrong result, the cause is almost never in the final step. It is usually in an early tool call that returned bad data, or a context construction choice that left out a key fact, or a model decision two turns back that sent the whole trajectory off course. Without a trace that shows you every message, every tool call, every intermediate state, you are debugging blindly.
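A trace layer can start very small. The record shape below is a sketch, not any particular framework's format, and it assumes responses have already been converted to plain dicts (the SDK's response object would need one conversion step):

```python
import json
import time
import uuid

def trace_step(step, response, tool_results, sink=print):
    """Emit one structured record per loop iteration: what the model
    decided, which tools it called, and what came back."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "step": step,
        "timestamp": time.time(),
        "stop_reason": response["stop_reason"],
        "tool_calls": [
            {"name": b["name"], "input": b["input"]}
            for b in response["content"]
            if b.get("type") == "tool_use"
        ],
        "tool_results": tool_results,
    }
    # One JSON line per step; swap sink for a file handle or collector.
    sink(json.dumps(record))
    return record
```

Even this much is enough to answer "what did the agent see at step 3" without re-running anything.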

Frameworks like LangSmith, Anthropic’s own tracing capabilities, and open-source tools like Weave from Weights and Biases all address this need. The underlying requirement is the same across all of them: deterministic, inspectable records of what the model decided, what it called, and what it got back. That trace is your primary debugging artifact, equivalent to a core dump in systems programming, except that you want it before things go wrong rather than after.

The Minimal Footprint Principle

One of the more important practical ideas in Willison’s guide is what he calls minimal footprint: an agent should acquire only the permissions it needs, produce only the side effects its task requires, and prefer reversible actions over irreversible ones.

This matters for two reasons. First, it limits the blast radius when the agent makes a mistake. A research agent that can only read files cannot delete your database. An agent that stages changes for human review cannot merge to production unilaterally. Second, it makes the system auditable. If an agent’s tool set is minimal and its actions are logged, you can reconstruct exactly what it did and verify that each action was appropriate.

This is the principle of least privilege applied to LLM-driven systems, a concept with decades of precedent in operating system security and network architecture. What makes it interesting in the agentic context is that it requires intentional system design, because the default when using general-purpose frameworks is to give agents broad access since it makes them more capable in the short term. The engineering discipline is knowing when to narrow that access and how to structure checkpoints that preserve human oversight without making the agent useless.

In practice, this means designing tool sets around specific task domains rather than providing a general-purpose toolkit. An agent handling customer support tickets should have tools scoped to reading tickets, querying the knowledge base, and drafting responses, not tools for modifying account data or accessing billing systems. The task boundary should define the tool boundary.
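One lightweight way to encode that boundary is an explicit allowlist per agent role, checked at dispatch time. The tool names here are hypothetical:

```python
# Hypothetical per-role scoping: each agent gets an explicit allowlist,
# and dispatch refuses anything outside it.
SUPPORT_AGENT_TOOLS = {"read_ticket", "search_knowledge_base", "draft_response"}

def scoped_dispatch(allowed_tools, name, tool_input, dispatch):
    if name not in allowed_tools:
        # Refuse loudly so the refusal itself shows up in the trace.
        return f"Error: tool '{name}' is not permitted for this agent"
    return dispatch(name, tool_input)
```

Because the check lives in the scaffolding rather than the prompt, a confused or manipulated model cannot talk its way past it.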

What Makes This a Discipline

Agentic engineering is not prompt engineering at larger scale, nor is it traditional software engineering with an LLM added. It is a distinct practice because the executor is a probabilistic model that interprets instructions rather than follows them, because the state is implicit in a context window rather than explicit in data structures, and because the failure modes are emergent rather than predictable from any single component.

The patterns that have stabilized (tool design as API design, context as managed state, structured tracing, minimal footprint, checkpointed human oversight) all come from taking that nondeterministic executor seriously and engineering around its properties. What Willison’s guide is documenting is not a new category of software so much as the specific engineering practices that reliably produce agentic systems that work at all. The gap between a demo that impresses and a system that holds up under real workloads is exactly this layer of engineering discipline, applied consistently.
