· 7 min read ·

Confused Deputies and Ambient Authority: The Frame AI Agent Security Has Been Missing

Source: openai

The Confused Deputy

In 1988, Norm Hardy wrote a short paper describing what he called the confused deputy problem. A “deputy” is a program that holds authority delegated from a principal. A “confused deputy” is one that can be tricked into exercising that authority on behalf of a different, unauthorized principal, because it cannot distinguish who is actually making the request.

Hardy’s original example was a compiler. The compiler held file-system permission to write billing records, because that was part of its legitimate job. A user who could not directly write billing records could trick the compiler into writing whatever they wanted, simply by pointing the compiler’s output parameter at the billing file path. The compiler had the authority. The attacker did not. The compiler could not tell the difference between a legitimate compilation request and an exploitation of its privilege.

This is structurally what happens in prompt injection attacks against LLM agents. The agent holds delegated authority: it can read email, browse the web, write files, call APIs, send messages on the user’s behalf. An attacker plants instructions in content the agent will retrieve during a legitimate task. The agent reads those instructions and, because it cannot reliably distinguish instructions from its operator from text embedded in a retrieved document, exercises its delegated authority on behalf of the attacker.

The decades of security research that followed Hardy’s paper found a principled answer to this class of problem. Understanding that answer is more useful than any specific list of prompt injection mitigations.

Ambient Authority vs. Capability-Based Security

The root cause of confused deputy vulnerabilities is ambient authority. A deputy operating with ambient authority carries all its permissions into every operation, regardless of the specific task it is currently performing. Hardy’s compiler always had permission to write billing files, whether it was compiling a trivial program or performing work that actually produced billing output. The authority was ambient — always present, automatically applied.

The principled alternative is capability-based security. Under a capability model, authority is attached to specific objects, not to the deputy globally. To perform an operation, you must hold a capability for it, a bearer credential granting access to that specific resource for that specific purpose. The compiler would only hold a capability to write billing records when the specific compilation task had been authorized to produce billing output, and that capability would come from the principal who initiated that specific operation.

LLM agents almost universally operate with ambient authority. When an agent is deployed with access to email, web browsing, file-system access, and external API calls, all of those capabilities are present for every task the agent runs. An injected instruction telling the agent to exfiltrate data via outbound HTTP succeeds not because the model fails to recognize the attack, but because the HTTP tool is already present in the agent’s tool registry. The attack does not need to grant authority. It only needs to redirect authority the agent already holds.

OpenAI’s guidance on designing agents to resist prompt injection correctly identifies capability restriction as a primary defense: an agent that cannot send email cannot be injected into sending phishing messages, regardless of what instructions appear in retrieved content. The structural guarantee does not depend on the model correctly classifying the injection as malicious. It depends on the capability being absent.

CSRF and the Browser’s Partial Resolution

The web encountered the same problem through Cross-Site Request Forgery. A browser that automatically attaches session cookies to every request for a given origin operates with ambient authority. An attacker controlling any third-party page can cause the browser to submit authenticated requests to the user’s bank, because the browser carries the credential into every request automatically, regardless of which page actually initiated the request.

The fixes for CSRF — SameSite cookie attributes, CSRF tokens, Origin header validation — all move toward a capability model. A CSRF token is a bearer credential proving that a specific request originated from the specific page the user actually loaded. Requests without the token lack the capability and are rejected. The browser’s ambient authority is constrained by requiring proof of specific authorization for state-changing operations.

For agents, this parallel points toward a design approach that most agent frameworks have not yet adopted: per-task capability scoping rather than session-wide tool access. In a strict capability model, each task the agent performs would carry only the specific capabilities required for that task, derived from the principal who authorized it:

{
  "task": "summarize-email-thread",
  "granted_capabilities": {
    "read_email": {"scope": "thread:abc123", "expires": "2026-03-11T22:00:00Z"},
    "write_output": {"scope": "response-only"}
  }
}

An agent executing this task cannot make outbound HTTP requests, write to the filesystem, or send email — not because a policy says “do not do that,” but because those capabilities are not present in the current task context. A successful injection instructing the model to exfiltrate data via HTTP fails at the tool-call layer, not the model layer. No amount of clever phrasing makes the absent tool callable.

Most production agent frameworks, including the tool-calling APIs exposed by current LLMs, do not support per-task capability scoping natively. The tool registry is configured at session initialization and remains constant throughout. Building genuine per-task capability scoping requires wrapping the tool layer externally to enforce scoping before tool calls reach the model. It is more engineering overhead than deployment-time minimization, but it is a different category of guarantee.

Policy vs. Capability: Two Different Threat Classes

OpenAI’s instruction hierarchy work trains models to treat content from different sources with different levels of trust. Instructions in the system prompt outrank user messages; user messages outrank content retrieved from external sources. A model trained on the synthetic data from their IH-Challenge approach is more likely to recognize and resist injected instructions embedded in retrieved content, because the training signal teaches it that low-privilege content attempting to override high-privilege instructions is suspicious.

This is genuinely valuable, and it addresses attacks that pure capability restriction cannot reach. Some injection attacks do not try to trigger immediate tool calls. They try to manipulate the model’s visible output to socially engineer the human supervisor, or they gradually shift the model’s behavior across multiple turns, or they attempt to write into long-term memory stores to persist the attack beyond the current session. Capability restrictions alone do not address these because no restricted tool call is involved.

But instruction hierarchy is a policy-layer defense. It requires the model to correctly classify every injection attempt across adversarial pressure and novel phrasings. The attacker only needs to find one path past that classifier. Perez and Ribeiro’s early work (2022) established that complete prevention through training alone is unlikely; the same flexibility that makes models useful for following nuanced instructions makes them susceptible to instructions embedded in content.

Capability restriction does not require the model to classify anything. An attempt to invoke an absent tool fails deterministically, regardless of how cleverly the injection is phrased. The attack surface is the tool registry itself, not the model’s inference-time judgment.

The architecture these two properties suggest is clear: use capability restrictions as the primary defense against tool-based exploitation, covering the attacks where deterministic failure is achievable, and use instruction hierarchy training and input marking techniques like Microsoft’s Spotlighting to handle the residual cases where the attack surface cannot be closed structurally. Not just “apply both mitigations,” but understanding which threat class each defense actually covers and sizing your investment accordingly.

What the Confused Deputy Frame Adds

Most writing on prompt injection frames the problem as data-versus-instruction confusion: the model cannot reliably distinguish content it should process from instructions it should follow. This framing is accurate, and it points toward solutions focused on the model’s ability to make that distinction at inference time — instruction hierarchy training, input delimiters, the dual-LLM quarantine pattern that Simon Willison described in 2023.

The confused deputy frame points at a different set of interventions. The confused deputy problem was not resolved by training deputies to better classify incoming requests as legitimate or illegitimate. It was resolved by structural changes to how authority is granted: moving from ambient authority, where the deputy always holds its full permission set, to capability-based authority, where the deputy holds only the capabilities specifically granted for the current operation by the appropriate principal.

Applied to agents, the critical question becomes not only “how do we train the model to resist injected instructions” but “how do we structure the agent’s authority so that a successful injection causes limited damage even when the model is fooled.” The answer involves the same principles that resolved Hardy’s original problem: minimize ambient authority, scope capabilities to specific tasks and specific principals, treat each operation as requiring fresh authorization rather than inheriting global session-level permissions.

Human-in-the-loop checkpoints for irreversible actions fit this frame directly. They are not primarily training-dependent defenses. A confirmation step before sending email, writing to persistent storage, or making financial transactions inserts a principal back into the authorization chain at the moment of execution. The human’s approval is the capability the agent needs to proceed. Without it, even a fully compromised model cannot complete the action.

The OWASP LLM Top 10 lists prompt injection as LLM01, framing it as the top risk for LLM-integrated applications. The practical mitigations listed there — privilege separation, input validation, minimal permission scoping — are recognizable as applications of capability discipline to a new substrate. Norm Hardy described the problem in 1988. The AI agent security field is working toward the same set of answers, through the same reasoning, arriving at similar conclusions with language adapted for language models rather than compilers. The underlying structure has not changed.

Was this interesting?