The phrase “agentic engineering” is doing a lot of work right now. Simon Willison’s recent guide on agentic engineering patterns gives it a working definition: the discipline of building systems where language models take sequences of actions, observe results, and iterate toward a goal. That is a useful frame, but it undersells how much the operational concerns differ from conventional software.
When I build a Discord bot, most of the interesting engineering is not in the bot commands themselves but in the infrastructure around them: rate limit handling, state persistence, graceful reconnection, error propagation. The command logic is simple; the scaffolding is where you earn your pay. Agentic systems follow the same pattern, with more complex scaffolding and less predictable failure modes.
The Loop Is the Program
An agentic system at its core is a loop. The model receives a context, produces either a final answer or a request to call a tool, the tool runs, its output gets appended to the context, and the model runs again. Anthropic’s API makes this structure explicit with tool_use and tool_result content blocks:
import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "read_file",
        "description": "Read the contents of a file at a given path",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "The file path to read"}
            },
            "required": ["path"]
        }
    }
]

messages = [{"role": "user", "content": "What's in config.json?"}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        tools=tools,
        messages=messages
    )
    # The model either finishes its turn or asks to call a tool.
    if response.stop_reason == "end_turn":
        print(response.content[0].text)
        break
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = dispatch_tool(block.name, block.input)  # user-supplied dispatcher
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result
            })
    # Append the assistant turn and the tool results, then go around again.
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})
This is the skeleton of every agentic system. The model is not a function you call once; it is a participant in an iterative process. What varies between implementations is everything surrounding this loop: how tools are designed, how context grows, what happens when something goes wrong, and when the loop terminates.
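The dispatch_tool function in the loop above is left undefined; a minimal sketch might route tool names to handler functions and return structured errors for anything unexpected. The path guard and error shapes here are illustrative choices, not part of the API:

```python
import json
from pathlib import Path

def read_file(path: str) -> str:
    # Guardrail: refuse paths that resolve outside the working
    # directory, so the model cannot be steered into arbitrary reads.
    resolved = Path(path).resolve()
    if not resolved.is_relative_to(Path.cwd()):
        return json.dumps({"error": "path_outside_workspace",
                           "message": f"{path} is outside the working directory."})
    try:
        return resolved.read_text()
    except OSError as exc:
        return json.dumps({"error": "read_failed", "message": str(exc)})

TOOL_HANDLERS = {"read_file": read_file}

def dispatch_tool(name: str, tool_input: dict) -> str:
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        # Unknown tool names come back as structured errors the model
        # can read, not as exceptions that crash the loop.
        return json.dumps({"error": "unknown_tool",
                           "message": f"No tool named {name} is registered."})
    return handler(**tool_input)
```

Returning errors as strings rather than raising keeps the loop alive and gives the model something to act on, a point that matters again in the section on tool errors below.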
OpenAI’s function calling API follows the same basic structure. So does the emerging Model Context Protocol, which standardizes how tools are exposed to models across different hosting environments, decoupling tool servers from the specific model or orchestration layer running the loop.
Tools Are APIs You Write for a Language Model
Most of the interesting design work in an agentic system lives in the tool layer. A tool is, at minimum, a function plus a JSON Schema description that the model uses to decide when and how to call it. Getting this right is harder than it looks.
The model does not read your documentation the way a human engineer would. It forms an understanding of a tool from its name, its description field, and the shape of its input schema. A tool named process_data with no description and a single parameter named input will be called inconsistently. A tool named search_customer_records with a clear description of what it returns and explicit parameter descriptions will be called reliably.
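To make the contrast concrete, here is how the hypothetical search_customer_records tool might be specified, with the name, the description, and each parameter description all carrying weight. The field set and default are invented for illustration:

```python
# A tool definition where every field the model reads is explicit:
# what the tool does, what it returns, and what each parameter means.
search_customer_records_tool = {
    "name": "search_customer_records",
    "description": (
        "Search the customer database by name or email. Returns up to "
        "`limit` matching records as JSON objects with id, name, email, "
        "and signup_date fields. Returns an empty list when nothing matches."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Name or email fragment to search for",
            },
            "limit": {
                "type": "integer",
                "description": "Maximum number of records to return",
                "default": 10,
            },
        },
        "required": ["query"],
    },
}
```

Stating what the tool returns, including the empty-result case, is as important as describing the inputs: it tells the model what to expect before it decides to call.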
Tool granularity matters too. A tool that does too much forces the model into rigid patterns; a tool that does too little creates unnecessary round-trips through the loop. The ReAct pattern, introduced by Yao et al. in 2022, frames agentic tool use as interleaved reasoning and acting, and provides a useful heuristic for granularity: each tool call should correspond to a discrete, observable action whose result genuinely informs what to do next.
Error returns from tools deserve special attention. If a tool call fails and returns a raw exception traceback, the model will often attempt to recover by calling the tool again with slightly different arguments. Sometimes this is useful, but more often it produces loops that burn tokens without making progress. Structured error responses with an error field and a message that explains what went wrong in terms the model can act on produce far more predictable behavior.
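A sketch of what that looks like for a hypothetical read-only SQL tool, using an in-memory SQLite table as a stand-in for a real database; the error field names are illustrative conventions, not a standard:

```python
import json
import sqlite3

def run_query(sql: str) -> str:
    # Reject writes up front, with a message that tells the model
    # what to do instead of dumping a traceback into its context.
    if not sql.lstrip().lower().startswith("select"):
        return json.dumps({
            "error": "read_only_violation",
            "message": "Only SELECT statements are allowed; "
                       "rewrite the query as a SELECT.",
        })
    try:
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
        conn.execute("INSERT INTO users VALUES (1, 'Ada')")
        rows = conn.execute(sql).fetchall()
        return json.dumps({"rows": rows})
    except sqlite3.Error as exc:
        # Actionable failure: name the cause and suggest a next step.
        return json.dumps({
            "error": "query_failed",
            "message": f"SQLite rejected the query: {exc}. "
                       "Check table and column names.",
        })
```

Both failure paths return the same shape, an error code plus a human-readable message, so the model can pattern-match on the structure rather than parse a stack trace.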
Context Budget Is a First-Class Constraint
In conventional software, function calls are free in the sense that calling a function does not consume a finite, metered resource. In agentic systems, every round trip through the loop appends to a growing context. Tool results, intermediate reasoning, and prior messages all accumulate. For long-running agents, context budget becomes a genuine design constraint.
Current frontier models support large context windows: Claude Sonnet 4.6 supports 200,000 tokens, GPT-4o supports 128,000. That sounds like a lot until you have a coding agent that loads several source files, generates diffs, runs tests, and iterates on errors. A moderately complex automated task can exhaust 50,000 tokens before producing any useful output.
Practical mitigations include summarizing tool results before appending them to context, using explicit memory tools that let the model store information outside the live context window and retrieve it on demand, and decomposing large tasks into sub-tasks with fresh context windows. Some systems use a hierarchical architecture, where an orchestrator agent delegates to specialized sub-agents, each running in its own context, and communicates results upward through structured summaries. This pattern trades latency for token efficiency and makes the overall system easier to reason about.
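The first of those mitigations can be as simple as head-and-tail truncation before a tool result is appended. The character limit as a token proxy is a rough assumption; a real system would measure tokens:

```python
MAX_RESULT_CHARS = 4000  # rough proxy for a token budget

def compact_tool_result(text: str, limit: int = MAX_RESULT_CHARS) -> str:
    # Keep the head and tail of an oversized result; for logs and
    # file dumps the middle is usually the least informative part.
    if len(text) <= limit:
        return text
    head = text[: limit // 2]
    tail = text[-(limit // 2):]
    omitted = len(text) - len(head) - len(tail)
    return f"{head}\n...[{omitted} characters omitted]...\n{tail}"
```

The omission marker matters: it tells the model that content was dropped, so it can ask for a narrower slice (a specific byte range or line range) instead of assuming it saw everything.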
Reliability Looks Different Here
Conventional software fails deterministically. Given the same inputs, a broken function produces the same error every time. Agentic systems fail probabilistically. The same task, with the same tools and the same starting context, can succeed on one run and fail on another. This property makes traditional debugging less effective and makes testing significantly harder.
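Because the same task can pass on one run and fail on the next, a single green test run proves little; evaluating an agent usually means running the same task repeatedly and tracking a success rate against a threshold. A minimal harness, where run_task is a stand-in for whatever invokes the agent and checks its output:

```python
def success_rate(run_task, trials: int = 20) -> float:
    # Run the same (possibly nondeterministic) task repeatedly and
    # report the fraction of successes instead of a single pass/fail.
    successes = sum(1 for _ in range(trials) if run_task())
    return successes / trials
```

In practice the threshold is a product decision: a research prototype might ship at 70 percent, while a tool that files production changes needs to be much closer to 100 before it runs unattended.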
Some failure modes are familiar from distributed systems. Partial completion, where an agent finishes five of seven steps before hitting a context limit or a tool error, requires idempotent tool design: a tool called twice for the same operation should produce the same result as calling it once, or at least fail safely. Network timeouts, rate limits, and transient API errors all need handling in the tool layer, with enough structure in the error response for the model to decide whether to retry.
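One common way to get that idempotency is an operation key supplied by the caller: a repeated call with the same key replays the recorded result instead of re-executing. A toy sketch, with the ticket store invented for illustration:

```python
class TicketStore:
    """Hypothetical ticket system with idempotent creation."""

    def __init__(self):
        self._completed: dict[str, dict] = {}
        self._next_id = 0

    def create_ticket(self, operation_key: str, title: str) -> dict:
        # Replay: the same operation key returns the original result
        # rather than creating a duplicate ticket.
        if operation_key in self._completed:
            return self._completed[operation_key]
        self._next_id += 1
        result = {"ticket_id": self._next_id, "title": title}
        self._completed[operation_key] = result
        return result
```

If the agent retries after a timeout, whether the first call actually landed or not, the outcome is a single ticket, which is exactly the property partial completion demands.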
Other failure modes are specific to agentic systems. Prompt injection, documented systematically by Riley Goodside in 2022 and expanded on considerably since, becomes a serious concern when agents process untrusted content as part of their tool results. A web-browsing agent that reads a page containing embedded instructions to ignore its original task is vulnerable in a way that a conventional web scraper is not. Mitigating this requires both careful prompt construction and healthy skepticism about actions that diverge from the original task scope.
There is also the problem of cascading tool calls. An agent that searches for files, reads their contents, searches for more files mentioned in those contents, reads those files, and so on can produce an exponential fan-out of work that is expensive and often unproductive. Setting explicit depth limits and requiring confirmation for operations above a certain scope are useful guardrails.
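Such guardrails can live in a small budget object consulted before each tool dispatch; when the budget is exhausted, the loop can return a refusal the model sees, or escalate to a human. A minimal sketch with invented limits:

```python
class ToolBudget:
    """Caps total tool calls and per-tool fan-out so cascading
    call chains are forced to terminate."""

    def __init__(self, max_total: int = 25, max_per_tool: int = 10):
        self.max_total = max_total
        self.max_per_tool = max_per_tool
        self.total = 0
        self.per_tool: dict[str, int] = {}

    def allow(self, tool_name: str) -> bool:
        # Record the attempted call, then check both limits.
        self.total += 1
        self.per_tool[tool_name] = self.per_tool.get(tool_name, 0) + 1
        return (self.total <= self.max_total
                and self.per_tool[tool_name] <= self.max_per_tool)
```

The per-tool limit is what catches the file-reading fan-out specifically: an agent that keeps calling read_file hits that cap long before it exhausts the overall budget.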
The Frameworks and What They Embed
The ecosystem of frameworks for building agentic systems has grown rapidly: LangChain, LlamaIndex, LangGraph, AutoGen, CrewAI. Each imposes a different level of abstraction over the basic loop and embeds opinions about the right way to structure agentic systems.
LangGraph’s graph-based orchestration is notable for making control flow explicit and inspectable. You define nodes and edges, the graph executor manages state transitions, and you can see at any point exactly where in the execution graph the system sits. This helps enormously with the debugging problem. CrewAI’s role-based multi-agent model suits cases where you want specialized agents collaborating on distinct parts of a task, with the model’s context shaped by its declared “role” and “goal”.
The tradeoff with higher-level frameworks is that they can obscure the loop structure in ways that make debugging harder when things go wrong. Understanding what the raw API is doing, before reaching for a framework abstraction, is worth the investment.
The Discipline Is Still Taking Shape
What Willison is capturing with the term “agentic engineering” is that the combination of these concerns (tool design, context management, reliability, security) is becoming a recognizable discipline with its own patterns and best practices. Anthropic’s guide to building effective agents and the Model Context Protocol specification represent attempts to codify emerging consensus, but the field is moving fast enough that any codification is provisional.
The comparison that keeps surfacing is early web development. In 2001, there was no consensus on how to structure a web application. The patterns that held up (MVC, RESTful APIs, middleware pipelines) emerged from practitioners building things and observing what worked in production. Agentic engineering is at a similar inflection point. The core primitive, the loop, is well understood. The higher-level patterns for building reliable, maintainable, secure agentic systems are still being discovered through practice. Documenting what works under real conditions, rather than what looks clean in a demo, is how this discipline matures.