Agentic Engineering Has a Phase Transition, and Most Teams Hit It Unprepared

Source: hackernews

Bassim Eledath published Levels of Agentic Engineering in early March 2026, and the Hacker News response (267 points, 128 comments) suggests the framing resonates with practitioners who are actually shipping these systems. The taxonomy maps AI agency onto five levels, borrowing structure from SAE’s six driving automation levels: a stateless prompt-response loop at Level 1, single tool use at Level 2, multi-step planning at Level 3, persistent memory at Level 4, and multi-agent coordination at Level 5. As shared vocabulary for a field that has been operating without it, the framework earns its place. Where it misleads is in implying that engineering difficulty scales evenly as you move up the numbering. There is a discontinuity between Level 2 and Level 3 that a smooth five-step taxonomy obscures, and that discontinuity is where most production deployments reveal their architectural gaps.

The First Two Levels Are Bounded Problems

Level 1 is a stateless LLM call: you send a prompt, you get a completion. Level 2 introduces tool use. The model can invoke functions, call external APIs, read and write to external systems. Anthropic’s tool_use content blocks and OpenAI’s function calling are the standard primitives. The engineering surface at Level 2 is bounded: write reliable tool schemas, handle errors from external services, manage context growth as tool results accumulate. When a Level 2 tool call fails, the failure is visible, isolated, and contained. The blast radius is a single call.
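A minimal sketch of what “bounded” looks like in practice: a single tool invocation whose failure is caught and surfaced at the call site. The `ToolResult` shape and the wrapper below are illustrative, not from Anthropic’s or OpenAI’s SDKs:

```typescript
// A Level 2 tool call: failure is visible, isolated, and handled
// right here. The shapes below are illustrative, not a real SDK.
interface ToolResult {
  ok: boolean;
  value?: string;
  error?: string;
}

async function callTool(
  fn: (input: string) => Promise<string>,
  input: string
): Promise<ToolResult> {
  try {
    return { ok: true, value: await fn(input) };
  } catch (e) {
    // The blast radius is this single call: report it and move on.
    return { ok: false, error: e instanceof Error ? e.message : String(e) };
  }
}
```

Nothing downstream depends on the result unless you explicitly feed it forward, which is exactly the property Level 3 gives up.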

Level 3 is multi-step planning: the model reasons over a sequence of actions, executes them serially, and uses results from earlier steps to inform later decisions. The ReAct pattern from Yao et al. (2022) formalized this and remains the dominant structure for sequential agent execution. LATS extended ReAct with tree search for cases where plans need to backtrack rather than commit to a single chain. Levels 4 and 5 add persistent memory and multi-agent coordination respectively; frameworks like LangGraph and Microsoft’s AutoGen explicitly target the upper end of the stack.
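The ReAct structure is, at its core, a loop that alternates reasoning with action and feeds each observation back into the next reasoning step. A stripped-down sketch, with `llm` and `tools` as stand-ins for whatever model and tool layer you actually use:

```typescript
// Skeleton of a ReAct-style loop: reason, act, observe, repeat.
// `llm` and `tools` are placeholders for real integrations.
type Action = { tool: string; input: string } | { finish: string };

async function reactLoop(
  llm: (transcript: string) => Promise<Action>,
  tools: Record<string, (input: string) => Promise<string>>,
  task: string,
  maxSteps = 10
): Promise<string> {
  let transcript = `Task: ${task}`;
  for (let i = 0; i < maxSteps; i++) {
    const action = await llm(transcript);
    if ("finish" in action) return action.finish;
    // Each observation becomes input to the next reasoning step;
    // this feedback edge is where Level 3 errors compound.
    const observation = await tools[action.tool](action.input);
    transcript += `\nAction: ${action.tool}(${action.input})\nObservation: ${observation}`;
  }
  throw new Error("step budget exhausted");
}
```

The feedback edge in the loop body is the structural difference from Level 2: a wrong observation does not stop the loop, it becomes context.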

What Changes at Level 3 Is Not Scope, It Is Failure Structure

The Level 2 to Level 3 transition changes how the system fails as much as it changes what the agent can do. At Level 2, a failure is isolated. A tool returns an error, the model sees it, you see it, you handle it. The failure has no downstream consequences because there is no downstream. At Level 3, every intermediate output becomes the input to subsequent reasoning. A subtly wrong result at step 3 does not announce itself with a stack trace. It becomes a plausible premise for step 4, which builds on it, and by step 8 the agent’s internal model of the situation may have drifted far enough from reality that its outputs are wrong in ways that require tracing the full execution to understand.

The error arithmetic is uncomfortable. With a 95% per-step success rate, which is optimistic for any operation involving external state, a 10-step execution succeeds with probability 0.95^10 ≈ 0.60. Twenty steps drops that to 0.36. Fifty steps: 0.08. The model does not become less capable as you add steps. The error budget shrinks because each step introduces a new opportunity for silent failure that feeds forward into subsequent reasoning.
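The compounding is a one-liner to verify; the 95% per-step rate is the article’s illustrative assumption, not a measurement:

```typescript
// Probability that every step in an n-step plan succeeds,
// assuming independent failures at a fixed per-step success rate.
function chainSuccess(perStep: number, steps: number): number {
  return Math.pow(perStep, steps);
}

console.log(chainSuccess(0.95, 10).toFixed(2)); // ≈ 0.60
console.log(chainSuccess(0.95, 20).toFixed(2)); // ≈ 0.36
console.log(chainSuccess(0.95, 50).toFixed(2)); // ≈ 0.08
```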

Anthropic’s Building Effective Agents post from late 2024 lays out the mitigation principles: prefer reversible actions, minimize footprint, insert confirmation steps at high-stakes decision points. Those principles are correct, and they are also insufficient without the infrastructure that durable Level 3 execution actually requires.

What Production Level 3 Requires

Three things that are optional at Level 2 become load-bearing at Level 3.

Workflow state machines with explicit failure transitions. A multi-step workflow modeled as a linear sequence that either completes or fails leaves you unable to resume, diagnose, or recover from partial execution. Each step needs to be a named state with defined transitions for success, failure, and timeout. Building Ralph’s autonomous task runner eventually required exactly this: a typed WorkflowStep interface where each step optionally declares its rollback command, and a WorkflowRun that tracks the full set of step statuses across an execution:

export type StepStatus =
  | "pending"
  | "running"
  | "passed"
  | "failed"
  | "skipped"
  | "rolled-back";

export interface WorkflowStep {
  label: string;
  command: string;
  rollback?: string; // command to undo this step if a later step fails
  timeoutMs?: number;
}

// StepResult is referenced below but not shown in the post;
// a minimal plausible shape:
export interface StepResult {
  step: string; // label of the WorkflowStep this result belongs to
  status: StepStatus;
  output?: string;
}

export interface WorkflowRun {
  id: string;
  plan: string;
  steps: WorkflowStep[];
  results: StepResult[];
  startedAt: number;
  completedAt: number | null;
  status: "running" | "completed" | "failed" | "rolled-back";
}

On failure, the executor walks backward through completed steps and runs their rollback commands in reverse order. This is not a sophisticated pattern. It is the minimum necessary to ensure that a failed multi-step execution leaves external state consistent rather than partially modified. LangGraph’s checkpointing provides a higher-level version of this for Python agents, persisting graph state between invocations so that a crash mid-execution leaves you with a recoverable state rather than an unknown one. The Temporal workflow engine builds durability and exactly-once semantics into its execution model at the framework level. The specific tool matters less than the presence of the pattern.
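The reverse-order rollback walk can be sketched in a few lines. The `run` function here is a hypothetical command runner, not Ralph’s actual executor, and the `Step` shape mirrors the `WorkflowStep` interface above:

```typescript
// Hypothetical executor sketch: on failure, undo completed steps
// in reverse order. `run` is a stand-in for a real command runner.
interface Step {
  label: string;
  command: string;
  rollback?: string;
}

async function execute(
  steps: Step[],
  run: (command: string) => Promise<void>
): Promise<"completed" | "rolled-back"> {
  const completed: Step[] = [];
  for (const step of steps) {
    try {
      await run(step.command);
      completed.push(step);
    } catch {
      // Walk backward through completed steps, undoing each one,
      // so external state ends consistent rather than partially modified.
      for (const done of completed.reverse()) {
        if (done.rollback) await run(done.rollback);
      }
      return "rolled-back";
    }
  }
  return "completed";
}
```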

Idempotent tool operations. When a workflow resumes after a crash or timeout, it may re-execute steps that already completed. If tool implementations are not idempotent, resumption causes double-writes, duplicate API calls, or inconsistent external state. The fix requires designing tool implementations with idempotency keys, checking prior execution records before performing mutations, and modeling external state so that repeating an operation is safe. This is the kind of infrastructure that is absent from most demos but determines whether a Level 3 system can be trusted across the failure modes that production environments produce.

Structured execution traces. At Level 2, logging the LLM’s inputs and outputs is sufficient for debugging. At Level 3, you need causal traces: which reasoning step produced which output, which tool result influenced which downstream decision, where in a 20-step plan the agent’s reasoning diverged from reality. OpenTelemetry spans work for this if you instrument every tool call and reasoning transition. Purpose-built agent observability tools like LangSmith and Arize Phoenix handle the trace structure automatically for agents built on LangChain or LlamaIndex. Without some form of this, debugging a failed multi-step execution means reading flat logs backward through a complex execution, which is slow and frequently inconclusive.
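The causal structure can be hand-rolled independently of any observability vendor: each trace event records which earlier event it derived from, so a wrong output at step 8 can be walked back to its source. The shapes below are purely illustrative:

```typescript
// Illustrative causal trace: each event points at the event it was
// derived from, so a bad late output can be traced to its origin.
interface TraceEvent {
  id: string;
  kind: "reasoning" | "tool-call" | "tool-result";
  parentId: string | null; // the event this one was derived from
  payload: string;
}

// Walk parent pointers from a suspect event back to the root cause.
function causeChain(events: TraceEvent[], id: string): TraceEvent[] {
  const byId = new Map(events.map((e) => [e.id, e]));
  const chain: TraceEvent[] = [];
  let cur = byId.get(id);
  while (cur) {
    chain.push(cur);
    cur = cur.parentId ? byId.get(cur.parentId) : undefined;
  }
  return chain.reverse(); // root cause first
}
```

This is the structure OpenTelemetry spans give you via parent span IDs; the point is the parent edge, not the particular tooling.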

What the Framework Gets Right

Eledath’s taxonomy is most valuable as a diagnostic tool. The symptom of teams that attempt Level 4 or Level 5 systems without solving Level 3 problems is agents that perform well in controlled demos and behave unpredictably in production. The cause is consistently the same set of gaps: no checkpointing, no idempotent operations, no execution traces, and silent error compounding across multi-step plans. Having a number to point at makes that conversation concrete. “We’re operating at Level 3 and don’t have checkpointing” is a specific, actionable diagnosis. Without the taxonomy, the same situation is usually described as “the agent is unreliable,” which doesn’t point toward a fix.

The HN discussion around Eledath’s post reflected real tensions. Commenters pushed back on whether the levels were granular enough, whether the ordering was fully accurate, and whether some transitions were more significant than others. Those critiques are fair. The SAE driving analogy does strain under scrutiny: driving levels describe a continuous spectrum of environmental complexity, while agentic engineering levels describe discrete changes in the infrastructure needed for reliable operation. But as shared vocabulary for a field still developing its terminology, the framework works.

The Level 2 to Level 3 boundary deserves a sharper signal than the linear numbering suggests. That transition is not an incremental capability addition. It is the point where the engineering model has to change, and where teams discover that their error handling assumptions from Level 2 are not sufficient for the execution structure Level 3 introduces.
