Stages of Agency: What Each Level of Agentic Engineering Demands in Practice
Source: hackernews
Bassim Eledath’s Levels of Agentic Engineering landed on Hacker News earlier this month and collected 267 points and 128 comments, which is a sign the framing resonated with people actually shipping these systems. The core move is borrowed from SAE’s six levels of vehicle autonomy: instead of treating AI agency as a binary, map it onto a progression from fully human-controlled to fully autonomous.
The analogy is useful up to a point. Where it starts to mislead is in implying that the engineering difficulty scales with the level the same way that autonomous driving difficulty scales with environmental complexity. In practice, the hard problems in agentic systems aren’t about how complex the environment is. They’re about error propagation and recovery, and those problems hit differently at each stage.
The Levels and What They Actually Require
At Level 1, you have a stateless LLM responding to prompts. No tools, no memory, no external calls. The engineering surface is minimal: system prompt design, model selection, maybe basic output parsing. This is where most teams start and where most demos live.
Level 2 introduces tool use. The LLM can call functions, query APIs, run code. Anthropic’s tool_use content blocks and OpenAI’s function calling are the primitives here. A basic tool definition looks like this:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    tools=[{
        "name": "search_codebase",
        "description": "Search for a pattern across project files",
        "input_schema": {
            "type": "object",
            "properties": {
                "pattern": {"type": "string"},
                "file_glob": {"type": "string"}
            },
            "required": ["pattern"]
        }
    }],
    messages=[{"role": "user", "content": "Find all usages of the deprecated auth function"}]
)
The engineering shifts to writing reliable tool schemas, handling errors from external calls, and managing context growth as results accumulate. A single tool call going wrong is visible to the user and easy to recover from. The blast radius is small.
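At this level, error handling can be as simple as catching the exception and returning it to the model as an errored tool result rather than crashing the loop. A minimal sketch, with a hypothetical `run_tool` dispatcher and in-process tool registry standing in for real tool execution:

```python
# Hypothetical local registry mapping tool names to handlers.
TOOLS = {
    "search_codebase": lambda pattern, file_glob="**/*": (
        f"matches for {pattern!r} in {file_glob}"
    ),
}

def run_tool(name: str, tool_input: dict) -> dict:
    """Execute one tool call and wrap the outcome in a tool_result-style
    block. Errors are returned to the model instead of raised, so a single
    bad call stays visible to the user and easy to recover from."""
    try:
        output = TOOLS[name](**tool_input)
        return {"type": "tool_result", "content": output, "is_error": False}
    except Exception as exc:
        return {
            "type": "tool_result",
            "content": f"{type(exc).__name__}: {exc}",
            "is_error": True,
        }

ok = run_tool("search_codebase", {"pattern": "old_auth"})
bad = run_tool("search_codebase", {"pattern": "old_auth", "bogus": True})
```

Because a failed call comes back as data (`is_error: True`) rather than an exception, the model can see what went wrong and retry, which is all the recovery machinery Level 2 typically needs.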
Level 3 is multi-step planning: the model reasons over a sequence of actions before committing, executes them serially, and uses results from previous steps to inform the next. The ReAct pattern (Reasoning + Acting) from Yao et al. formalized this and it remains the dominant approach. LATS extended it with tree search, which is useful when you need to backtrack.
This is where the phase transition happens. At Level 2, the LLM makes one decision and you see the result. At Level 3, errors compound. A 5% per-step failure rate becomes roughly a 40% failure rate over 10 steps, because each wrong output feeds into the next decision. The LLM doesn’t become less capable; your error budget shrinks with every step you add. Recovering from a mid-chain failure requires infrastructure that most teams don’t build until they’ve been surprised by it: checkpoints, rollback logic, idempotent operations, and dead-letter handling for stuck workflows.
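The arithmetic behind that compounding is worth making explicit. Assuming independent per-step failures, chain reliability is just the per-step success rate raised to the number of steps:

```python
def chain_success(per_step_failure: float, steps: int) -> float:
    """Probability an n-step chain completes with no step failing,
    assuming independent failures at each step."""
    return (1 - per_step_failure) ** steps

# 5% per-step failure over 10 steps leaves ~60% chain success,
# i.e. roughly a 40% chance of at least one failure in the chain.
p = chain_success(0.05, 10)
```

The same formula makes the error-budget point concrete: holding 10 steps fixed, getting chain reliability to 99% requires pushing per-step failure down to about 0.1%.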
Anthropic’s Building Effective Agents post from late 2024 put it concisely: prefer reversible actions, keep the footprint minimal, and build in confirmation steps at high-stakes decision points. That advice is specific to Level 3. It’s unnecessary overhead at Level 2, and it’s insufficient scaffolding at Level 5.
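One way to act on that advice is to tag each action with its reversibility and require explicit confirmation before anything irreversible runs. The `gated` and `confirm` helpers below are illustrative names, not from the post:

```python
from typing import Callable

def gated(action: Callable[..., str], reversible: bool,
          confirm: Callable[[str], bool]) -> Callable[..., str]:
    """Wrap an action so irreversible ones need explicit confirmation
    before running; reversible ones pass straight through."""
    def wrapper(*args, **kwargs):
        if not reversible and not confirm(action.__name__):
            return "skipped: not confirmed"
        return action(*args, **kwargs)
    return wrapper

# An irreversible action with confirmation denied never executes.
delete_branch = gated(lambda name: f"deleted {name}",
                      reversible=False, confirm=lambda a: False)
```

In a real agent, `confirm` would surface the pending action to a human; the key property is that the irreversible path cannot run without that check.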
Level 4 brings persistent memory across sessions. The agent needs to store and retrieve state from external systems: vector databases like Qdrant or Chroma for semantic retrieval, key-value stores for structured state, or hybrid approaches. The engineering questions shift again: what should the agent choose to remember, what triggers retrieval, how do you prevent stale memory from corrupting current reasoning, and how do you manage a growing external store without retrieval latency blowing up your response time.
Level 5 is multi-agent coordination. Multiple LLM instances, each with distinct roles, passing work between them. Frameworks like LangGraph and AutoGen target this level explicitly. The complexity multiplies because you now have distributed state, potential deadlocks, and agents that can produce mutually inconsistent outputs without any single node noticing the contradiction. Observability, which is optional at Level 2, is mandatory at Level 5. You need to trace which agent made which decision, with what inputs, and why.
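The tracing requirement can be made concrete with a small event log: every decision records which agent acted, on what inputs, and the stated reason, so contradictions can be reconstructed after the fact. This is a hypothetical sketch, not any framework’s tracing API:

```python
from dataclasses import dataclass, field
import time

@dataclass
class TraceEvent:
    """One decision record: which agent, what inputs, what it decided, why."""
    agent: str
    inputs: dict
    decision: str
    reason: str
    timestamp: float = field(default_factory=time.time)

class Tracer:
    def __init__(self) -> None:
        self.events: list[TraceEvent] = []

    def record(self, agent: str, inputs: dict, decision: str, reason: str) -> None:
        self.events.append(TraceEvent(agent, inputs, decision, reason))

    def by_agent(self, agent: str) -> list[TraceEvent]:
        """Replay one agent's decisions when debugging an inconsistency."""
        return [e for e in self.events if e.agent == agent]
```

Production systems would ship these events to a tracing backend, but the schema is the important part: without agent, inputs, and reason on every decision, mutually inconsistent outputs are undebuggable.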
What Building Ralph Taught Me
I’ve gone through most of this progression building Ralph, a Discord bot that started as a simple command handler and has grown into something with autonomous workflows, event-driven processing, and scheduled tasks.
The Level 2 to 3 transition was the expensive one. Once I started letting Ralph execute multi-step tasks autonomously, failures that used to be isolated single-call errors became workflow failures that left state inconsistent. A task that started but didn’t finish, a file that got partially written, a message that went out before the follow-up was ready. I had to build explicit state machines for workflows, with defined failure transitions, before I could trust the multi-step behavior in production.
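The shape of those state machines is simple to sketch: enumerate the legal transitions, including the failure and rollback ones, and reject everything else. The states and transition table below are illustrative, not Ralph’s actual workflow:

```python
from enum import Enum, auto

class State(Enum):
    PENDING = auto()
    RUNNING = auto()
    DONE = auto()
    ROLLING_BACK = auto()
    FAILED = auto()

# Explicit legal transitions; anything else is a bug, not a silent pass.
TRANSITIONS = {
    State.PENDING: {State.RUNNING},
    State.RUNNING: {State.DONE, State.ROLLING_BACK},
    State.ROLLING_BACK: {State.FAILED, State.PENDING},  # retry after rollback
    State.DONE: set(),
    State.FAILED: set(),
}

class Workflow:
    def __init__(self) -> None:
        self.state = State.PENDING

    def transition(self, new: State) -> None:
        if new not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new}")
        self.state = new
```

The value is in what the table forbids: a half-written file or half-sent message can only land in `ROLLING_BACK` or `FAILED`, never in a state that looks finished.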
The Level 3 to 4 transition was architecturally simpler but required new infrastructure. Deciding what to persist, building retrieval that was fast enough not to bottleneck responses, and cleaning up stale state were all solvable problems, but they had to be solved before the memory feature was actually useful rather than just a source of confusing outputs.
I haven’t pushed hard into Level 5 territory yet. The multi-agent patterns I’ve read about, and the LangGraph documentation in particular, make it clear that it requires treating agent coordination like distributed systems work: with explicit contracts between agents, retry logic, circuit breakers for pathological loops, and a clear notion of what happens when one agent goes silent.
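A circuit breaker for those pathological loops can be as small as a consecutive-failure counter that refuses further calls once a threshold is hit. A sketch with hypothetical names:

```python
class CircuitBreaker:
    """After `threshold` consecutive failures, the circuit opens and
    further agent calls are refused instead of looping indefinitely."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def call(self, fn):
        if self.open:
            raise RuntimeError("circuit open: agent call refused")
        try:
            result = fn()
            self.failures = 0  # any success resets the count
            return result
        except Exception:
            self.failures += 1
            raise
```

Real breakers add a cooldown and half-open probing before closing again; the essential move is turning "one agent keeps failing" into an explicit refusal rather than an unbounded retry loop.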
What the Framework Gets Right
The levels taxonomy is most useful as a diagnostic tool. Teams regularly try to build Level 4 or 5 systems without having solved Level 3 problems. The symptom is agents that are impressively capable in demos and unreliable in production. The cause is usually that no one built the error recovery infrastructure that multi-step execution requires.
Having a number to point at helps make that conversation concrete. “We’re operating at Level 3 but we don’t have checkpointing” is a more actionable diagnosis than “our agent is flaky sometimes.” The taxonomy also sets expectations about cost and complexity that are genuinely useful for project planning: Level 3 and above is infrastructure work comparable to building a distributed system, not just a prompting exercise.
The SAE parallel does eventually strain under scrutiny. Driving levels describe a continuous environmental complexity spectrum; agentic engineering levels describe discrete changes in the infrastructure you need to support reliable operation. But as shared vocabulary for a field that’s still developing one, it earns its place.
The HN discussion around Eledath’s post reflected this tension. Commenters pushed back on whether the levels were granular enough, whether the ordering was right, and whether some transitions were more discontinuous than others. Those are all legitimate critiques. The framework doesn’t need to be perfect to be useful, and right now practitioners need a way to talk about this that doesn’t start from scratch every time.