
What Agentic Engineering Inherited From Five Decades of AI Research

Source: simonwillison

Simon Willison’s guide to agentic engineering patterns defines the field precisely: agentic engineering is the discipline of building systems where a language model drives a loop, taking actions through tools in pursuit of a goal that cannot be resolved in a single prompt exchange. The framing is useful and correct. What it does not address, reasonably for a practical guide, is how these ideas came to be or where they were worked out before large language models existed.

Agentic engineering did not emerge fully formed with GPT-4 function calling. The problems it addresses (goal decomposition, handling partial failure, managing knowledge for reasoning, and deciding when to act versus when to ask) have been central to AI research since at least the 1970s. The LLM revolution solved some of them in ways no prior approach managed. It left others exactly as hard as they were before. Knowing which is which changes how you design these systems.

STRIPS and the Planning Problem

The conceptual ancestor of every agentic system is STRIPS (Stanford Research Institute Problem Solver), developed by Richard Fikes and Nils Nilsson in 1971. Their formulation: an agent exists in some initial state, has a goal state to reach, and possesses a set of operators, each with preconditions that must hold before the operator can apply and effects that define the resulting state change. Planning meant finding an operator sequence that transforms initial state into goal state.

The vocabulary maps directly onto modern agentic design. Initial state is the system prompt and user request. Operators are tools. Goal state is task completion. The difference is that STRIPS required each operator to be explicitly formalized with exact preconditions and effects. That was also its limitation: representing novel situations required encoding them explicitly in advance. When situations arose outside the encoded domain, the planner failed.
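The STRIPS formulation fits in a few lines of code. The sketch below is an illustrative toy, not the 1971 system: states are sets of facts, each operator carries preconditions plus add/delete effects, and planning is a search for an operator sequence that reaches the goal. The logistics domain is invented for the example.

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class Operator:
    name: str
    preconditions: frozenset  # facts that must hold before applying
    add: frozenset            # facts the operator makes true
    delete: frozenset         # facts the operator makes false

def plan(initial, goal, operators):
    """Breadth-first search for an operator sequence reaching the goal."""
    start = frozenset(initial)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:                      # every goal fact holds
            return steps
        for op in operators:
            if op.preconditions <= state:      # operator is applicable
                nxt = (state - op.delete) | op.add
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, steps + [op.name]))
    return None                                # no plan in the encoded domain

# Toy domain: move a package from A to B with a truck.
ops = [
    Operator("drive_A_B", frozenset({"truck_at_A"}),
             frozenset({"truck_at_B"}), frozenset({"truck_at_A"})),
    Operator("load", frozenset({"truck_at_A", "pkg_at_A"}),
             frozenset({"pkg_in_truck"}), frozenset({"pkg_at_A"})),
    Operator("unload", frozenset({"truck_at_B", "pkg_in_truck"}),
             frozenset({"pkg_at_B"}), frozenset({"pkg_in_truck"})),
]
print(plan({"truck_at_A", "pkg_at_A"}, frozenset({"pkg_at_B"}), ops))
# → ['load', 'drive_A_B', 'unload']
```

The failure mode the text describes is visible here: any situation not expressible in the encoded facts and operators simply makes `plan` return `None`.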

Language models sidestep the formalization requirement by using approximate semantic reasoning instead of formal logic. The trade-off is that the model’s understanding of preconditions and effects is probabilistic, not verified. This is why tool descriptions matter as much as they do in modern agentic systems. The description is the closest thing to an explicit precondition specification the model has to work with. A vague tool description is structurally equivalent to a STRIPS operator with an underspecified precondition: the planner will misapply it.
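As a hedged illustration of that equivalence (the tool name and fields below are hypothetical, not any specific API), compare a vague description with one that spells out preconditions and effects the way a STRIPS operator would:

```python
# Hypothetical tool schemas. The description is the model's only
# "precondition specification" for the tool.

vague_tool = {
    "name": "update_record",
    "description": "Updates a record.",
}

explicit_tool = {
    "name": "update_record",
    "description": (
        "Updates an existing record by id. "
        "Precondition: the record must already exist; call find_record "
        "first to obtain its id. "
        "Effect: overwrites only the named fields; unnamed fields are "
        "unchanged. Fails with NOT_FOUND if the id does not exist."
    ),
}
```

The second version gives the model something to check before acting; the first leaves applicability to guesswork, which is exactly the misapplication failure the STRIPS analogy predicts.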

The other inheritance from the planning tradition is the frame problem: how does an agent track what has changed and what has not after each action? In STRIPS-derived planners, this required explicit bookkeeping. In LLM-based agents, the context window plays this role, imperfectly. Research on LLM recall, including the Lost in the Middle study from Stanford and UC Berkeley, showed that information placed in the middle of long contexts is retrieved less reliably than information at the edges. The frame problem did not disappear with LLMs; it became probabilistic.

Expert Systems: The First Deployed Reasoning Agents

Expert systems represent the first wide-scale deployment of goal-directed reasoning in production. MYCIN, developed at Stanford from 1972 to 1976, diagnosed bacterial infections using a backward-chaining inference engine over roughly 600 medical rules. R1/XCON, developed at Carnegie Mellon and deployed at Digital Equipment Corporation in the early 1980s, configured minicomputer orders using over 2,000 production rules and was reportedly saving DEC around $40 million annually by the mid-1980s.
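A minimal backward-chaining sketch shows the shape of this reasoning. This is illustrative only: the rules below are toy stand-ins, and real MYCIN attached certainty factors to every rule rather than treating conclusions as binary.

```python
# Rules map a conclusion to lists of premises that establish it.
RULES = {
    "gram_negative_infection": [["gram_stain_negative", "rod_shaped"]],
    "rod_shaped": [["morphology_is_rod"]],
}

def prove(goal, facts, trace):
    """Return True if goal follows from facts via RULES, logging the chain."""
    if goal in facts:
        trace.append(f"fact: {goal}")
        return True
    for premises in RULES.get(goal, []):
        # Backward chaining: to prove the goal, recursively prove
        # each premise of a rule that concludes it.
        if all(prove(p, facts, trace) for p in premises):
            trace.append(f"rule fired: {premises} -> {goal}")
            return True
    return False

trace = []
ok = prove("gram_negative_infection",
           {"gram_stain_negative", "morphology_is_rod"}, trace)
print(ok)      # → True
print(trace)   # the auditable chain of reasoning, step by step
```

The `trace` list is the property the next section turns on: every conclusion is justified by an inspectable sequence of rule firings.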

These systems were recognizably agentic by modern definitions: they reasoned about a domain, asked clarifying questions when they lacked information, and produced recommendations with an auditable chain of reasoning. The distinguishing limitation was the knowledge acquisition bottleneck. Encoding domain knowledge as explicit rules was expensive, brittle, and required sustained access to domain experts. Updating a deployed expert system when the rules of its domain changed was a significant engineering effort.

Language models solve the knowledge acquisition problem by learning from text at scale. But they introduce the inverse problem: the knowledge is in the weights, not in an auditable rule base. You cannot read an LLM’s inference path the way you can step through an expert system’s rule applications. This is precisely why the observability infrastructure in modern agentic engineering, logging full tool call sequences and intermediate reasoning, is a first-class concern. Expert system practitioners could inspect their knowledge bases directly. Agentic engineers cannot, so the trace log becomes the surrogate.
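A sketch of that surrogate, assuming a simple wrapper that records every tool invocation, its arguments, and its outcome (the tool names here are hypothetical):

```python
import time

TRACE = []  # the surrogate audit trail: one entry per tool call

def logged_call(tool_name, tool_fn, **kwargs):
    """Invoke a tool function and append a structured trace entry."""
    entry = {"ts": time.time(), "tool": tool_name, "args": kwargs}
    try:
        entry["result"] = tool_fn(**kwargs)
        entry["status"] = "ok"
    except Exception as exc:
        entry["status"] = "error"
        entry["error"] = repr(exc)
        raise
    finally:
        TRACE.append(entry)   # logged even when the call fails
    return entry["result"]

result = logged_call("add", lambda a, b: a + b, a=2, b=3)
print(result)                             # → 5
print(TRACE[0]["tool"], TRACE[0]["status"])  # → add ok
```

Routing every tool call through a wrapper like this is the agentic analog of stepping through an expert system's rule applications: the inference path is reconstructed from the log, because it cannot be read out of the weights.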

Behavior Trees and Explicit Failure Handling

Game AI development contributed a pattern that directly addresses a gap in the planning tradition: structured fallback logic. Behavior trees, widely adopted in game development through the mid-2000s, represent agent decision-making as a hierarchy of tasks with explicit priority and fallback ordering. A selector node runs its children in sequence and returns success as soon as one child succeeds. A sequence node runs its children and returns failure as soon as one child fails. This gives you a clean representation for “try this approach first; if it fails, fall back to this alternative.”

The practical value is that failure handling is structural, not improvised. A game AI character navigating around an obstacle has a tree that specifies exactly what happens if the primary route is blocked: try the secondary route, then wait, then communicate an inability to proceed. None of this relies on the character figuring out recovery through general reasoning.
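The selector and sequence semantics described above fit in a few lines. This is a sketch, with the route-finding leaves as hypothetical stand-ins for real behaviors:

```python
def selector(*children):
    # Succeeds on the first child that succeeds (short-circuits).
    return lambda: any(child() for child in children)

def sequence(*children):
    # Fails on the first child that fails (short-circuits).
    return lambda: all(child() for child in children)

# Hypothetical navigation tree: try the primary route, fall back to
# the secondary route, then report the inability to proceed.
def primary_route():   return False   # blocked
def secondary_route(): return True
def report_stuck():    print("cannot proceed"); return True

navigate = selector(primary_route, secondary_route, report_stuck)
print(navigate())   # → True (secondary route succeeded; report_stuck never runs)
```

The point is structural: the fallback ordering lives in the tree, not in any runtime reasoning, so recovery behavior is identical on every run.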

Most agentic systems do not have this. When a tool call fails, the model falls back on whatever recovery heuristics it learned in training. This produces behavior that is sometimes reasonable but not reliably so. Encoding explicit fallback sequences in the system prompt, in the spirit of behavior tree design, is one of the higher-leverage things you can do to improve reliability on tasks with known failure modes. If your agent frequently encounters a specific error from a specific tool, writing out the recovery procedure explicitly in the system prompt produces more consistent behavior than relying on the model’s general problem-solving. The behavior tree tradition suggests that ad-hoc fallback reasoning is a design smell, not a feature.
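As a hedged example, a system-prompt fragment encoding such a recovery procedure might look like the following; the tool names and error code are hypothetical, and the structure mirrors a behavior-tree selector:

```python
# Hypothetical system-prompt fragment: an explicit fallback sequence
# for a known failure mode, written as instructions rather than code.
FALLBACK_POLICY = """\
When search_docs returns RATE_LIMITED:
1. Retry search_docs once.
2. If it fails again, fall back to search_cache.
3. If both fail, tell the user the docs are unavailable. Do not guess.
"""
```

Each numbered step is a child of an implicit selector: try it, and move to the next only on failure.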

Reinforcement Learning and the Limits of Generalization

The other major predecessor is the reinforcement learning tradition. Deep RL systems demonstrated that end-to-end learned policies could reach impressive performance on specific, well-defined tasks: AlphaGo for board games, OpenAI Five for Dota 2. The limitation for general agentic use is that RL policies specialize to their training environment. Generalizing to novel task domains requires either retraining or significant transfer learning work.

Language models bring broad prior knowledge that makes them usable as agentic controllers for novel tasks without task-specific training. The trade-off is that RL policies trained on a well-defined task have provable properties relative to their reward function, while LLM-based agents have empirically-observed properties over a sample of tasks. You can analyze an RL policy’s behavior analytically for the training domain. You can only accumulate evidence about an LLM agent’s behavior through testing.

This is why evaluation is such a central concern in agentic engineering, and why golden traces and LLM-as-judge approaches matter. The ReAct paper from Yao et al. in 2022, which formalized interleaved reasoning and acting in LLM systems, had to evaluate model performance by scoring the quality of reasoning traces, not just final answers. That evaluative complexity is structural to the approach, and it was visible from the first paper that defined it.

What Language Models Genuinely Changed

The ReAct formulation synthesized planning and acting into a clean abstraction: interleave reasoning traces with actions, and feed observations back as context. The interleaving of reasoning and acting was not new; classical AI architectures had done versions of this. What was new was demonstrating that a language model could serve as both the planner and the adaptive reasoner, using the same token prediction mechanism it uses for everything else. It established that the context window could substitute for the explicit state representations that prior planning systems required.
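The loop shape that ReAct formalized can be sketched as follows. This is a minimal sketch: `call_model` and the tool registry are stand-ins, not any specific API, and the scripted model exists only to make the example self-contained.

```python
def react_loop(call_model, tools, task, max_steps=8):
    """Interleave reasoning and acting, feeding observations back as context."""
    context = [f"Task: {task}"]
    for _ in range(max_steps):
        step = call_model("\n".join(context))   # returns a dict per step
        if step["type"] == "final":
            return step["answer"]
        context.append(f"Thought: {step['thought']}")
        observation = tools[step["tool"]](**step["args"])
        context.append(f"Action: {step['tool']}  Observation: {observation}")
    return None   # step budget exhausted without an answer

# Scripted stand-in model: one tool call, then a final answer.
script = iter([
    {"type": "act", "thought": "look it up", "tool": "lookup",
     "args": {"key": "capital_of_France"}},
    {"type": "final", "answer": "Paris"},
])
answer = react_loop(lambda ctx: next(script),
                    {"lookup": lambda key: "Paris"},
                    "capital of France?")
print(answer)   # → Paris
```

Note that `context` is doing all the state management: it is the only place where the growing record of thoughts and observations lives, which is exactly the substitution for explicit state representations described above.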

What changed is the mode of knowledge encoding: from explicit rules and operators to statistical learning from text. This makes LLM-based agents usable in open-ended domains without the knowledge acquisition work that limited expert systems. It makes their behavior approximate and probabilistic rather than formally verified. And it makes the system prompt the primary interface for shaping behavior, in a way that has no direct analog in classical AI architectures.

The problems themselves have not changed. Goal decomposition is the same problem STRIPS addressed. Failure handling is the same problem behavior trees addressed. Auditable reasoning is the same problem expert system designers addressed. Knowledge management is the same problem every planning system faced. The mechanism for approaching those problems is different, and the mechanism introduces specific constraints (context window limitations, probabilistic recall, non-auditable inference) that require different engineering responses than the explicit-representation architectures that came before.

For building anything with agentic capabilities, including Discord bots that loop over tool calls to answer multi-step questions, this history is practically useful. The behavior tree tradition suggests encoding explicit fallback logic for the failure modes you know you will encounter. The expert systems tradition suggests treating the full reasoning trace as a first-class artifact, not an afterthought. The planning tradition suggests thinking carefully about what the model needs to track as state across a long-running task, because the context window is doing the job that explicit state management did in prior systems, and it does that job with less reliability as the context grows.

The engineering discipline that Willison defines exists because LLMs solved the hardest problem in prior agentic architectures (general-purpose knowledge for open-ended domains) while leaving the secondary problems (failure handling, observability, state consistency, evaluation) as open questions. Those secondary problems are not new. They are the same problems that every generation of autonomous AI systems has worked through. The patterns that emerged from that work are still applicable; they just need to be expressed in system prompts and scaffolding code rather than formal operators and rule bases.
