
From Prompt to Pipeline: What Agentic Engineering Actually Demands

Source: simonwillison

Prompt engineering and agentic engineering are not the same thing. Prompt engineering is about crafting the right input to get the right output in a single exchange. Agentic engineering is about building systems where a model takes a sequence of actions, makes decisions across multiple steps, uses tools, manages state, and potentially delegates to other models, all in pursuit of a goal that cannot be satisfied in one shot.

Simon Willison’s guide to agentic engineering patterns frames this distinction well: the moment you introduce a loop, you have crossed from prompt engineering into something that requires a different set of skills. The loop is where the engineering challenges live.

The Agent Loop Is a Control Flow Problem

At the core of every agentic system is some variant of the observe-think-act cycle. The model receives context, reasons about what to do next, emits an action (usually a tool call), the result is fed back into context, and the cycle repeats until a stopping condition is met. This is sometimes called the ReAct pattern, from a 2022 paper that formalized interleaving reasoning and acting in LLM systems.

From a systems perspective, this is just a state machine running inside a prompt. The difference from traditional control flow is that the transition logic is opaque: you cannot read the model weights the way you can read an if-statement. The model decides when to stop, which tool to call, and how to interpret results. This is both the power and the source of most agentic reliability problems.

Here is a minimal Python loop using the Anthropic SDK:

import anthropic

client = anthropic.Anthropic()

# Tool schemas the model can call; definitions elided for brevity.
tools = [search_tool, code_exec_tool, file_write_tool]
messages = [{"role": "user", "content": user_request}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=8192,
        tools=tools,
        messages=messages
    )
    messages.append({"role": "assistant", "content": response.content})

    # Stop unless the model asked to use a tool. This also covers
    # "end_turn" and truncation via "max_tokens".
    if response.stop_reason != "tool_use":
        break

    # Execute each requested tool and feed the results back as a
    # user message, matched to the request by tool_use_id.
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = dispatch_tool(block.name, block.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(result)
            })

    messages.append({"role": "user", "content": tool_results})

This is under forty lines of code, but it encodes a non-trivial number of engineering decisions: which tools to expose, how to serialize results, when to break, and what to do when the model calls a tool that fails. Every one of those decisions affects the reliability of the system in ways that are hard to test exhaustively.

Tool Design Is the Interface Design of Agentic Systems

In traditional software, API design determines how easy it is to build on top of a system. In agentic engineering, tool design plays the same role, except the consumer is a language model rather than a developer. The model reads your tool’s name, description, and schema at inference time and decides whether and how to call it.

This means that the quality of a tool description directly affects the quality of agent behavior. A tool named execute with no description will be used inconsistently. A tool named run_python_in_sandbox with a description that specifies what the sandbox restricts will be used more precisely. The OpenAI function calling documentation and Anthropic’s tool use guide both emphasize this, though in practice it is easy to underinvest in descriptions when the tool works well enough in your own testing.

Tool granularity matters too. Fine-grained tools give the model more precise control but require more decisions per task. Coarse-grained tools reduce the decision surface but make it harder for the model to handle edge cases. There is no universal right answer, but the pattern I have found useful is to start with coarse tools, observe where the model gets stuck or makes poor decisions, and decompose those tools into finer-grained variants.
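To make the description point concrete, here is a sketch of what a well-specified tool definition looks like in the Anthropic tool-use schema. The tool itself is hypothetical; the point is that the name, description, and JSON Schema are the entire interface the model sees.

```python
# A hypothetical sandboxed-execution tool. Everything the model knows
# about this tool comes from these three fields.
run_python_in_sandbox = {
    "name": "run_python_in_sandbox",
    "description": (
        "Execute Python 3 code in an isolated sandbox with no network "
        "access and a 10-second timeout. Returns stdout and stderr. "
        "The filesystem is read-only except for /tmp."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "code": {
                "type": "string",
                "description": "The Python source to execute."
            }
        },
        "required": ["code"]
    }
}
```

Compare this with a tool named execute and an empty description: the model has no way to know about the timeout, the network restriction, or the read-only filesystem, so it will write code that assumes all three are available.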

Context Windows Are the Memory Architecture

Agentic systems accumulate context as they run. Every tool result, every intermediate reasoning step, every prior exchange gets appended to the messages array. On a long-running task with many tool calls, this can exhaust even a large context window, and the cost of each inference step grows linearly with the accumulated context.

This creates an engineering problem that has no clean solution yet. The options are roughly:

  • Summarization: periodically compress prior context into a summary and continue with the summary plus recent messages. This loses fidelity but keeps token counts manageable.
  • Retrieval: store prior context externally and retrieve relevant chunks as needed, similar to RAG. This requires deciding what to retrieve and when, which adds its own failure modes.
  • Structured state: instead of free-form accumulation, maintain explicit state as a structured object that the model reads and updates. This is more predictable but requires designing the state schema upfront.
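The summarization option can be sketched in a few lines. Here summarize is a placeholder for a cheap model call that compresses old messages; the thresholds are illustrative, not recommendations.

```python
MAX_MESSAGES = 20   # compaction threshold (illustrative)
KEEP_RECENT = 6     # recent messages kept verbatim

def summarize(messages):
    # Placeholder: a real implementation would call a model here
    # to compress the old messages into a short summary.
    return f"[summary of {len(messages)} earlier messages]"

def compact(messages):
    # Below the threshold, leave the history untouched.
    if len(messages) <= MAX_MESSAGES:
        return messages
    # Compress everything except the most recent messages.
    old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    summary = {"role": "user", "content": summarize(old)}
    return [summary] + recent

history = [{"role": "user", "content": f"step {i}"} for i in range(30)]
history = compact(history)
print(len(history))  # 7: one summary message plus six recent ones
```

The fidelity loss lives entirely inside summarize: whatever the summary drops, the agent can never recover, which is why the recent messages are kept verbatim.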

The Letta project (formerly MemGPT) explored the structured state approach in depth, treating the context window as virtual memory with explicit paging. It is an interesting architecture for systems that need to maintain coherent state across very long interactions.
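A minimal version of the structured-state idea looks something like the following sketch. The field names and update format are illustrative, not Letta's actual API: the agent reads a rendered state object each turn and updates it through a tool call, instead of relying on free-form message accumulation.

```python
import json

# Explicit task state, designed upfront, instead of an ever-growing
# message history. Field names are illustrative.
state = {
    "goal": "migrate repo to TypeScript",
    "completed": ["inventory JS files"],
    "pending": ["convert utils/", "update build config"],
    "notes": "tsconfig strict mode enabled"
}

def render_state(state):
    # Injected into the prompt each turn in place of the full history.
    return "Current task state:\n" + json.dumps(state, indent=2)

def apply_update(state, update):
    # The model emits a small structured update via a tool call;
    # the host applies it deterministically.
    for task in update.get("done", []):
        if task in state["pending"]:
            state["pending"].remove(task)
            state["completed"].append(task)
    return state

state = apply_update(state, {"done": ["convert utils/"]})
print(state["pending"])  # ["update build config"]
```

The predictability comes from the host owning the update logic: the model proposes changes, but the state transition itself is ordinary code you can test.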

Multi-Agent Architectures Multiply Failure Modes

Once a single agent works reliably for narrow tasks, the temptation is to compose multiple agents into a pipeline. An orchestrator agent receives a high-level goal, decomposes it into subtasks, delegates those subtasks to specialized worker agents, and synthesizes the results. This mirrors how organizations structure complex work.

The challenge is that errors compound. If each agent step has a 90% success rate, a five-step pipeline has roughly a 59% end-to-end success rate. The Claude documentation on multi-agent architectures notes that multi-agent systems are most useful when tasks are too long for a single context window, when parallel execution provides real speedup, or when specialized subagents can outperform a generalist on specific subtasks.

For most problems I have encountered building agentic systems for Discord bots and personal tooling, a single well-designed agent with good tools outperforms a multi-agent pipeline. Multi-agent architectures add value at the edges: very long tasks, tasks that benefit from parallel execution, and tasks where a second agent reviewing the first agent’s output catches meaningful errors.
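The orchestrator-worker shape described above can be sketched with a stub in place of the model call. The key structural property is that each worker gets its own fresh message history, so subtasks do not pollute each other's context.

```python
def call_model(role, prompt):
    # Placeholder for a real LLM call; in practice each call would be
    # a separate messages.create() with its own isolated history.
    return f"<{role}> handled: {prompt}"

def run_pipeline(goal, subtasks):
    # Each worker handles one subtask in an isolated context window.
    results = [call_model("worker", task) for task in subtasks]
    # The orchestrator synthesizes worker outputs into a final answer.
    return call_model("orchestrator", f"{goal}: " + "; ".join(results))

result = run_pipeline("migrate repo", ["audit deps", "write tests"])
print(result)
```

Every boundary in this sketch is a place where errors can compound: the decomposition can be wrong, a worker can fail silently, and the synthesis step sees only the workers' serialized outputs, not their reasoning.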

Observability Is the Hardest Part

With traditional software, debugging means reading logs, adding trace points, and reproducing the failure. With agentic systems, the failure is often not in any single step but in the sequence of decisions the model made across many steps. The model chose to call the wrong tool, or interpreted a tool result incorrectly, or made a reasonable-seeming decision that cascaded into a bad outcome several steps later.

This requires a different kind of observability. Logging tool calls and results is necessary but not sufficient. What you really want is to understand the model’s reasoning at each step, which means capturing not just actions but the scratchpad content that preceded them. Projects like LangSmith and Weights & Biases Weave have built tooling specifically for this, treating each agent run as a trace with annotatable spans rather than a flat log stream.

Cost is intertwined with observability. Because each loop iteration involves at least one inference call, and because context grows with each step, agent runs can be expensive in ways that are hard to predict from testing. Tracking token counts per run, per tool type, and per task category gives you the data to make informed decisions about where to summarize, where to cache, and which tasks are cost-effective to automate at all.
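The token-tracking part of this is mechanical. The Anthropic SDK exposes per-call counts on each response (response.usage.input_tokens and response.usage.output_tokens); the sketch below uses a simple stand-in for that usage object so the accounting logic is runnable on its own.

```python
from dataclasses import dataclass, field

@dataclass
class Usage:
    # Stand-in for the SDK's response.usage object.
    input_tokens: int
    output_tokens: int

@dataclass
class RunLedger:
    # One entry per loop iteration: (label, input tokens, output tokens).
    steps: list = field(default_factory=list)

    def record(self, label, usage):
        self.steps.append((label, usage.input_tokens, usage.output_tokens))

    def totals(self):
        return (sum(s[1] for s in self.steps),
                sum(s[2] for s in self.steps))

ledger = RunLedger()
ledger.record("search", Usage(1200, 150))
ledger.record("code_exec", Usage(2400, 300))  # input grows with context
print(ledger.totals())  # (3600, 450)
```

Aggregating these ledgers by tool type and task category is what turns "this agent feels expensive" into "tool X accounts for 70% of input tokens, so summarize after it".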

What Makes This Engineering Rather Than Prompting

Simon Willison’s framing is that agentic engineering is a genuine engineering discipline, not just prompt optimization. I think this is right, and the reason is that the artifacts you produce are systems with emergent behavior, not prompts with predictable outputs.

Building a reliable agentic system requires the same skills as building any reliable distributed system: thinking about failure modes, designing for observability, managing state carefully, testing at the boundaries, and accepting that you cannot exhaustively verify behavior. The model is a component in the system, not the system itself. How you wire it up, what tools you give it, how you manage context, and how you handle failures determines whether the system works in production.

The field is young and the patterns are still settling. But the core insight, that adding a loop to an LLM call is a qualitative change that requires a different engineering mindset, is one that holds up the more time you spend building these systems.
