Bassim Eledath’s Levels of Agentic Engineering, published in early March 2026 and gathering 267 points on Hacker News, maps AI agency across five levels: stateless text generation at Level 1, single tool calls at Level 2, multi-step planning at Level 3, persistent memory at Level 4, and multi-agent coordination at Level 5. The taxonomy is useful, but what it underspecifies is the nature of the primary engineering work at each stage and how that work changes character as you move up.
Each level shifts which kind of expertise matters most, and the skills that make you effective at Level 2 apply in progressively narrower ways from Level 3 onward.
Levels 1 and 2: Language Is the Lever
At Level 1, the system is a stateless LLM call. The engineering work is almost entirely in the language: system prompt construction, few-shot example selection, output format specifications, guardrail design. Model selection matters, but the quality of the prompt is the primary factor distinguishing a useful system from a frustrating one.
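To make the Level 1 engineering surface concrete, here is a minimal sketch of a stateless classification call: a system prompt carrying the output format specification, few-shot pairs, and the user input. The ticket-classification task, `build_messages`, and all field names are illustrative, not from any particular product; the message shape matches common chat-completion APIs.

```python
# The Level 1 levers: system prompt, few-shot examples, output format spec.
# There is no state and no tools; the prompt is nearly the whole system.

SYSTEM = (
    "You classify support tickets. Respond with a single JSON object: "
    '{"category": "billing" | "bug" | "feature_request", "urgent": true | false}. '
    "Output JSON only, no prose."
)

FEW_SHOT = [
    ("I was charged twice this month.", '{"category": "billing", "urgent": true}'),
    ("Please add dark mode.", '{"category": "feature_request", "urgent": false}'),
]

def build_messages(ticket: str) -> list[dict]:
    messages = []
    for user, assistant in FEW_SHOT:  # few-shot pairs precede the real input
        messages.append({"role": "user", "content": user})
        messages.append({"role": "assistant", "content": assistant})
    messages.append({"role": "user", "content": ticket})
    return messages
```

Everything that distinguishes a good version of this system from a bad one lives in `SYSTEM` and `FEW_SHOT`, which is the point of the level.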
Level 2 extends this with tool use. Anthropic’s tool use API and OpenAI’s function calling introduced function schemas as a structured extension of prompting. Writing good tool descriptions follows the same discipline: the tool’s name and description shape when the model calls it and how it fills the arguments. Anthropic’s documentation on tool definitions explicitly frames this as a prompting concern, because it is.
Teams strong at prompting can typically move to Level 2 by applying the same instincts to tool schemas. Clearer descriptions reduce wrong-tool calls. Adding examples to descriptions helps. The prompt remains the primary lever.
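A sketch of what "tool descriptions are prompting" looks like in practice, using the JSON-Schema style that current tool-use APIs accept. The tool name, ID format, and example phrasing are all invented for illustration; the structural point is that the description and the per-field `description` strings are doing the same work a prompt does.

```python
# A tool definition whose description constrains both *when* the model
# calls it and *how* it fills the arguments. All names are illustrative.
lookup_order = {
    "name": "lookup_order",
    "description": (
        "Look up a customer order by its ID. Use this only when the user "
        "provides an explicit order ID (format: ORD- followed by 8 digits), "
        "e.g. 'where is ORD-12345678?'. Never guess or fabricate an ID."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "pattern": "^ORD-\\d{8}$",
                "description": "The order ID exactly as the user provided it.",
            }
        },
        "required": ["order_id"],
    },
}
```

Tightening the description ("only when the user provides an explicit order ID") is the same instinct as tightening a system prompt, which is why prompting skill transfers cleanly at this level.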
Level 3: Where Prompt Returns Plateau
Multi-step planning introduces a structural change in how systems fail, not just what they can do. At Level 3, each step’s output feeds the next step’s input. The ReAct pattern (Yao et al., 2022) formalized this structure; LATS (Zhou et al., 2023) extended it with tree search to support backtracking when plans need to revise themselves. Both patterns share a fundamental property: errors do not stay localized.
The failure arithmetic is straightforward. With a 95% per-step success rate, which is optimistic for any operation involving external state, a 10-step chain succeeds with probability 0.95^10 ≈ 0.60. A 20-step chain drops to 0.36. Fifty steps: 0.08. Most of these failures are silent rather than immediate. A subtly wrong tool result at step 3 becomes a plausible premise for step 4, which reasons coherently from that wrong premise, and by step 8 the agent’s model of the situation has drifted far enough from reality that its outputs are wrong in ways that require tracing the full execution to understand.
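The arithmetic above is just independent per-step success multiplied across the chain, which a few lines make explicit:

```python
# Success probability of a linear chain with no error containment:
# every step must succeed, so P(chain) = p_step ** n_steps.
def chain_success(p_step: float, n_steps: int) -> float:
    return p_step ** n_steps

for n in (10, 20, 50):
    print(f"{n:>2} steps at 95%/step: {chain_success(0.95, n):.2f}")
# 10 steps -> 0.60, 20 steps -> 0.36, 50 steps -> 0.08
```

The model assumes independent failures, which is generous: correlated failures (a degraded tool, a drifted context) make real chains worse, not better.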
Better prompts can raise a 90% per-step rate to 95%, which matters. But that improvement doesn’t change the structural property that linear execution chains without error containment allow silent failures to propagate forward. The ReWOO paper (Xu et al., 2023) identifies error compounding as the dominant reliability challenge in multi-step reasoning and points toward planning structure and verification steps, not prompt quality, as the primary mitigation path.
What reduces failure at Level 3 is workflow architecture: steps modeled as named states with explicit failure transitions, checkpointing that allows resumption from partial execution, idempotent tool operations that allow safe retries, and rollback semantics for mid-chain failures that leave external state inconsistent. LangGraph’s checkpointing and Temporal’s durable execution model both address this from different angles. The Anthropic Building Effective Agents guide from late 2024 makes the underlying principle explicit: prefer reversible actions and minimize footprint. Both prescriptions are about workflow design, not prompt design.
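The checkpoint-and-resume idea can be sketched without any framework: each step is a named state, results are persisted after every successful step, and a rerun skips completed states instead of replaying their side effects. This is a minimal illustration of the principle, not LangGraph's or Temporal's actual API; step functions are assumed to be idempotent-safe to retry.

```python
import json
import os

def run_workflow(steps, checkpoint_path):
    """Run (name, fn) steps in order, checkpointing after each success.

    On a rerun, completed steps are skipped, so a failure mid-chain
    resumes from the last checkpoint instead of replaying side effects.
    """
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)  # resume from partial execution
    for name, fn in steps:
        if name in done:  # explicit state: already completed, skip
            continue
        done[name] = fn(done)  # step sees all prior results by name
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)  # persist before moving to the next state
    return done
```

If a step raises, the exception propagates with every earlier result safely on disk; the retry path is "run it again," which is exactly the property linear in-memory chains lack.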
Context management adds another dimension. As a multi-step run accumulates tool results and intermediate outputs, the context window fills with content of varying relevance. The lost-in-the-middle problem (Liu et al., 2023) describes how model attention degrades when relevant content is buried within a long context. Deciding what to keep verbatim, what to summarize, and what to discard across a running execution is a code-level decision with direct consequences for per-step reasoning quality. No prompt instruction compensates for context that has been poorly managed by the surrounding code.
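A keep/summarize/discard policy is ultimately a small piece of code in the harness. The sketch below assumes a `token_count` helper and a `summarize` callable are supplied by the caller (in practice, a tokenizer and a cheap model call); the policy shown, oldest entries summarized first while the most recent few stay verbatim, is one reasonable choice among several.

```python
def prune_context(entries, budget, token_count, summarize, keep_recent=3):
    """Compress a running transcript to fit a token budget.

    Walks from the oldest entry forward, replacing entries with
    summaries until the total fits, but never touches the most
    recent `keep_recent` entries, which stay verbatim.
    """
    total = sum(token_count(e) for e in entries)
    pruned = list(entries)
    i = 0
    while total > budget and i < len(pruned) - keep_recent:
        summary = summarize(pruned[i])
        total -= token_count(pruned[i]) - token_count(summary)
        pruned[i] = summary
        i += 1
    return pruned
```

The interesting decisions, what counts as summarizable, whether tool results get different treatment than reasoning steps, live in this code, not in the prompt.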
These are software engineering problems, and prompt tuning operates within the space they define rather than expanding it.
Levels 4 and 5: Architecture Is the Lever
At Level 4, with persistent memory across sessions, the primary engineering work shifts to storage design and retrieval strategy. Vector databases like Qdrant and Chroma handle semantic retrieval; structured stores handle typed state. The key design question is what the retrieval query surfaces and what the model does when retrieved content conflicts with current observations. No amount of prompt refinement compensates for retrieval that surfaces stale or contradictory memories, because the model can only reason over the context it receives.
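One way to handle the stale-memory problem is to make staleness explicit at retrieval time rather than leaving it for the model to infer. The sketch below assumes retrieval hits arrive as (text, similarity score, write timestamp) tuples from some vector store; the freshness window and flagging scheme are illustrative design choices, not any particular database's feature.

```python
from datetime import datetime, timedelta, timezone

def select_memories(hits, max_age_days=30, now=None):
    """Rank retrieval hits by score and flag stale ones explicitly.

    Flagged entries let the prompt layer instruct the model to defer
    to current observations when a stale memory conflicts with them,
    instead of silently mixing old and new facts.
    """
    now = now or datetime.now(timezone.utc)
    selected = []
    for text, score, written_at in sorted(hits, key=lambda h: -h[1]):
        stale = (now - written_at) > timedelta(days=max_age_days)
        selected.append({"text": text, "stale": stale})
    return selected
```

The retrieval layer decides what the model can possibly know; the `stale` flag is retrieval design compensating for a conflict the prompt alone cannot detect.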
At Level 5, multi-agent coordination introduces distributed-systems concerns directly. Microsoft’s AutoGen and LangGraph both model multi-agent systems as graphs with explicit edge semantics and failure handling. Correctness properties at this level depend on the topology: which agents see which context, how conflicting outputs from parallel agents are reconciled, and what the trust model is between an orchestrator and its subagents. Two teams with identical agent prompts but different graph topologies will get different reliability outcomes, and the gap between them will not close with better prompting.
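The topology-as-artifact point can be made concrete with plain data structures: edges declared explicitly, and context visibility declared per edge rather than implied. Agent names and the two context policies below are invented for illustration; this is the shape of the decision, not any framework's API.

```python
# Who hands off to whom: the orchestrator fans out to two parallel
# workers whose outputs a reviewer reconciles. All names illustrative.
TOPOLOGY = {
    "orchestrator": ["researcher", "coder"],
    "researcher":   ["reviewer"],
    "coder":        ["reviewer"],
    "reviewer":     [],  # terminal node: reconciles conflicting outputs
}

# Per-edge context policy: the parallel workers see only the task brief
# (not each other's transcripts); the reviewer sees everything.
CONTEXT = {
    ("orchestrator", "researcher"): "task_brief",
    ("orchestrator", "coder"):      "task_brief",
    ("researcher", "reviewer"):     "full_transcript",
    ("coder", "reviewer"):          "full_transcript",
}

def downstream(agent):
    """Agents whose input depends on this agent's output."""
    return TOPOLOGY[agent]
```

Changing one entry in `CONTEXT`, say, letting the coder see the researcher's transcript, changes the system's failure modes without touching a single prompt, which is the sense in which topology, not prompting, separates the two hypothetical teams.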
Prompt quality remains necessary at every level, but at Levels 4 and 5 it is no longer the primary source of variance between systems that work reliably and systems that don't.
What This Means in Practice
The levels taxonomy describes a capability trajectory. Its complement is a question the framework doesn’t address directly: where does a team’s effective leverage sit relative to the level they’re trying to operate at?
A team with strong prompting skills and limited software architecture experience can build reliable Level 2 systems. To build reliable Level 3 systems, that team needs engineering discipline around state machines, idempotent operations, and distributed tracing. Those skills don’t transfer from prompting; they have to be developed. A team that spends three weeks tuning prompts on a Level 3 system that lacks checkpointing is applying its strongest skills to the wrong layer of the problem.
Eledath’s framework is most useful as a diagnostic tool for exactly this kind of mismatch. Where is the system failing, and is that failure a prompting problem or an architecture problem? The answer changes at every level, and the tooling you reach for should change with it. Level 3 agent unreliability is rarely a prompting problem. The teams that treat it as one tend to discover that through production incidents rather than test runs, which is a more expensive way to learn it.