· 6 min read ·

Model-Native Execution and Why the Agents SDK Redesign Matters

Source: openai

The history of agent frameworks is largely a history of wrappers. A model sits at the center, and developers build scaffolding around it to manage context, dispatch tool calls, chain outputs into inputs, and persist state between invocations. LangChain popularized this pattern in 2022, and the ecosystem spent the next two years adding layers on top of layers. The resulting systems worked, often well, but they also accumulated a kind of architectural debt: the framework’s opinions about how agents should work frequently collided with what the underlying model had actually been trained to do.

OpenAI’s latest Agents SDK update moves in a different direction. The two headline changes, native sandbox execution and a model-native harness, are worth taking seriously not because they’re novel in isolation but because of what they imply about where the right abstractions should sit.

Where the SDK Came From

OpenAI shipped the openai-agents Python package in March 2025, positioning it as the production successor to their experimental Swarm framework from late 2024. Swarm was explicitly a minimal reference implementation, not a production library. It demonstrated the coordination patterns OpenAI cared about, principally handoffs between agents and lightweight context passing, without trying to be a complete framework.

The Agents SDK kept those patterns but added the infrastructure for real deployments: structured tracing, input and output guardrails, built-in tools like WebSearchTool and FileSearchTool, and a Runner abstraction that manages the agentic loop. The core primitives stayed simple.

from agents import Agent, Runner

agent = Agent(
    name="assistant",
    instructions="You help with code questions.",
    tools=[WebSearchTool()]
)

result = await Runner.run(agent, "How does Python's GIL work?")

The loop itself is straightforward: send a message, collect tool calls from the response, execute them, append the results, repeat until the model stops requesting tools. What the Swarm paper called “routines and handoffs” became first-class SDK concepts.

The Problem with External Tool Execution

The original tool execution model in most agent frameworks treats tools as opaque external functions. The framework calls them, collects results, and injects those results back into the conversation as text. This works for simple tools, lookups, API calls, quick computations, but it creates problems for anything stateful or long-running.

Code execution is the clearest example. If an agent writes a Python script and runs it via a subprocess call, nothing prevents that script from writing to the file system, spawning network connections, or touching environment variables that belong to the host process. The agent’s code and the host system share a trust boundary, which means every tool call that executes arbitrary code is a potential security issue. In practice, developers either accept this risk or build their own sandboxing, which means reinventing isolation primitives for every new project.

Native sandbox execution addresses this at the framework level. Rather than leaving developers to manage isolation themselves, the SDK ships with a sandboxed execution environment where code runs by default. The agent can write files, execute scripts, and invoke tools within a contained environment without exposing the host system to arbitrary side effects. This is architecturally similar to what Anthropic’s computer use implementation does with containerized desktops, though the mechanism differs. OpenAI’s sandbox appears to target code execution specifically rather than full desktop environments.

The practical consequence for developers building on top of this is significant. You can let agents write and run code as part of their reasoning loop without auditing every possible execution path for safety issues. The trust boundary is drawn by the framework rather than by you.

What Model-Native Actually Means

The phrase “model-native harness” is the more interesting architectural claim. To understand it, consider how most frameworks handle the execution loop.

In a typical wrapper framework, the loop is implemented entirely in application code. The framework decides when to call the model, which tool results to include in context, how to truncate history when the context window fills, and when the task is complete. The model is treated as a stateless function: input goes in, output comes out, and the framework manages everything else.

This creates a subtle mismatch. Models trained with specific tool-use patterns have expectations about how their context will be structured. If the framework’s context management differs from the training distribution, tool call accuracy and reliability degrade. This is one reason why naive implementations of multi-step agents often perform worse than expected: the scaffolding fights the model’s internal assumptions.

A model-native harness flips this. Instead of the framework imposing its own execution model on the API, the harness is designed around what the model was trained to do. The execution environment reflects the model’s actual capabilities rather than a generic agent pattern layered on top of them. For OpenAI’s models specifically, this likely means tighter alignment between how the harness manages context and how the models were fine-tuned for agentic use.

This is a more principled approach than most frameworks have taken, though Anthropic has pushed in a similar direction by exposing tool use as a first-class API capability rather than a simulated pattern. The difference is that OpenAI is embedding this alignment into the SDK itself rather than just the API contract.

Long-Running Agents and State Management

The update’s emphasis on long-running agents addresses a real deployment problem. Most agent tasks in production are not single-turn interactions. A coding agent might need to read multiple files, run tests, interpret failures, and revise code over the span of minutes. A research agent might query sources, synthesize findings, and generate a structured report across dozens of tool calls.

Managing state across this kind of extended execution requires more than a conversation list. Files need to persist between tool calls. Partial results need to survive context window truncation. If the agent is interrupted, whether by a timeout, an error, or a deliberate pause for human review, its state needs to be resumable.

Building this infrastructure on top of the basic completion API is tedious and error-prone. Each framework has invented its own approach: LangChain has memory backends, AutoGen has its own conversation management, and various production systems use Redis or SQLite to externalize state. The result is a fragmented landscape where interoperability is poor and switching costs are high.

Building state management into the SDK itself, particularly in conjunction with sandbox execution, creates a more coherent foundation. The sandbox can maintain a persistent file system across the agent’s lifetime. The harness can checkpoint conversation state at meaningful boundaries. Developers don’t have to bolt these capabilities on after the fact.

The Comparison to Other Frameworks

It’s worth being concrete about where the Agents SDK sits relative to alternatives.

LangChain remains the most widely used framework, but its abstraction surface is enormous and its execution model is framework-owned rather than model-native. It supports many models and tools precisely because it doesn’t assume anything about how any particular model works. That breadth is also its constraint.

AutoGen focuses on multi-agent coordination, with sophisticated patterns for human-in-the-loop workflows and code execution. Its code execution has always included Docker-based sandboxing as an option, which gives it a head start on isolation, but its orchestration model is also framework-owned.

CrewAI and similar frameworks are higher-level still, optimizing for ease of use at the cost of low-level control.

The Agents SDK occupies a different position: it’s tightly coupled to OpenAI’s models and infrastructure, which limits portability but enables the model-native alignment that more general frameworks cannot achieve. If you’re building on OpenAI’s models and care about reliability in long-running tasks, this tighter coupling is a reasonable trade.

What This Changes in Practice

For developers already using the Agents SDK, the sandbox and harness updates don’t require rewriting existing agents. The SDK’s existing Agent and Runner primitives remain the primary interface. What changes is what’s available inside the execution environment and how reliably the harness manages the agent’s lifecycle.

For teams evaluating frameworks, the update makes the Agents SDK a more defensible choice for production deployments involving code execution. The security properties are clearer, the state management is more coherent, and the alignment between the framework and the underlying models is tighter than most alternatives offer.

The broader pattern here is worth noting. The agent framework landscape has been converging toward a recognition that the execution environment is not a separate concern from the model. Where you draw the trust boundary, how you manage state, and how the harness structures context all affect what the model can reliably do. OpenAI building these decisions into the SDK rather than leaving them to application developers is a reasonable response to what the ecosystem has learned since 2022.

Was this interesting?