The idea of giving a language model access to tools and letting it loop until done is not new. The ReAct paper from Yao et al. formalized it in October 2022, showing that interleaving reasoning traces with tool calls produced better results than either chain-of-thought or action-only approaches. The architecture was simple enough to explain on a whiteboard: think, act, observe, repeat. What happened next taught the field more about engineering than any paper could.
In April 2023, AutoGPT became the fastest-growing GitHub repository in history. BabyAGI followed shortly after. Both gave users a single field: enter your goal. The LLM would then plan sub-tasks, spawn agents to complete them, observe results, revise the plan, and continue until the goal was reached or the API bill became unsustainable. The demos were compelling. The production systems were a different story.
What the Hype Phase Revealed
AutoGPT’s core failure modes were instructive precisely because they were not random. They followed predictably from the architecture.
Unbounded loops were the most visible problem. Without a clear termination condition the model could reason about, agents would work themselves into planning spirals, repeatedly decomposing tasks that could have been done directly. The tool loop accumulated context with every iteration, inflating cost while degrading coherence as older context fell out of effective attention range.
Prompt injection from tool outputs was the security failure nobody had properly anticipated. When an agent reads a webpage to answer a question, the webpage content becomes part of the context. A page crafted to include instructions would redirect the agent mid-task. The attack required no access to the model or the system prompt, only the ability to put content somewhere the agent would read.
The trust hierarchy was undefined. If an agent spawned a sub-agent to complete a task, should the orchestrator trust the sub-agent’s outputs? AutoGPT had no principled answer. Sub-agents that encountered adversarial content could corrupt the parent agent’s state by returning plausible-looking but manipulated results.
These were not edge cases. They were inherent to the architecture of autonomous agents operating with broad tool access and no principled limits on scope. The engineering problems were visible in the failure modes, and the field took a few years to systematically address them.
How the Discipline Responded
The shift from 2023’s autonomous agent experiments to what Simon Willison describes in his agentic engineering guide happened incrementally, not through any single breakthrough.
The first correction was scope reduction. Rather than “give the LLM a high-level goal and let it figure out everything,” teams found more success with narrowly scoped agents that handled one class of task with a small, well-defined tool set. A code review agent that can read files and comment on pull requests performs better and fails more gracefully than a general-purpose agent with access to the same tools plus email, web browsing, and arbitrary shell execution.
The minimal footprint principle emerged from this. An agent should prefer reversible actions over irreversible ones, ask for confirmation before operations with large blast radius, and request only the permissions it actually needs. This is least-privilege applied to autonomous systems. It does not prevent failures, but it makes failures recoverable and auditable.
Context management went from an afterthought to a first-class design concern. The context window is the agent’s only working memory during a run. Long-running tasks fill it. Modern approaches include structured summarization of tool outputs before they enter the context, explicit memory tools that let agents persist facts to an external store and retrieve them when needed, and session checkpointing so a failed run can resume from a known state rather than starting over. The frameworks that handle this well, including LangGraph and CrewAI, treat context architecture as a configuration surface rather than an implementation detail.
Human-in-the-loop checkpoints became standard for high-stakes operations. An agent that autonomously deploys to production, sends emails on your behalf, or commits to a shared repository needs explicit confirmation points. This is not a concession to the model’s limitations; it is the correct engineering choice for any system making irreversible external changes on behalf of a user.
Tool Design Matured Into an API Discipline
The shift from informal tool descriptions to careful API design was one of the quieter improvements in the field. Early agent tools had descriptions like “search the web.” Production tools now read more like precise API documentation: they specify what the tool does, what it does not do, what preconditions must hold, and what the caller should expect in return.
This matters because tool descriptions are load-bearing. When a model chooses which tool to call, it is doing a form of retrieval over the descriptions. Ambiguous descriptions produce creative but wrong invocations. Overlapping tools produce inconsistent choices across runs. A tool named search_codebase with a description that distinguishes it from search_files and specifies it operates over indexed symbol names will be called correctly far more reliably than a tool named search with a vague description.
Both Anthropic’s tool use format and OpenAI’s function calling spec expose parameter-level description fields specifically because they affect model behavior. Using enum types to constrain choices, providing examples in parameter descriptions, and distinguishing required from optional parameters all produce measurable improvements in tool call accuracy without changing the model.
The Named Discipline
Willison’s framing of agentic engineering as a distinct discipline, separate from prompt engineering and separate from traditional software engineering, is useful because the problems genuinely differ. Prompt engineering optimizes single-inference outputs. Traditional software engineering works with deterministic systems. Agentic engineering sits in between: you are building systems with deterministic tool execution around a non-deterministic reasoning core.
The engineering skills transfer, but the mindset does not always. Developers who expect to trace failures through a deterministic call stack find agent debugging unfamiliar. The failure happened somewhere in a probabilistic chain of tool calls and model inferences, and the same sequence might succeed on the next run. Evaluations over many runs, trace inspection, and behavioral contracts replace unit tests as the primary correctness tools.
What the AutoGPT era proved is that the problems are tractable. The failure modes from 2023 are understood well enough now to design around them. Context windows are managed. Tool access is scoped. Injection defenses exist. Multi-agent trust hierarchies have documented patterns. The discipline is still forming around testing and observability, but the foundation is solid enough that teams are shipping production agents that work reliably.
The lesson from the hype phase is that autonomous agency and engineering discipline are not in tension. The agents that work are the ones built with the same rigor applied to any system that manages state, calls external services, and fails in ways that matter.