
Shell Access Is the Easy Part: What Model Training Determines for Agent Runtimes

Source: openai

The infrastructure question has a clear answer now. OpenAI’s Responses API with hosted containers and shell tool means you declare {"type": "shell"} and the execution environment manages itself. Container provisioning, stdin/stdout marshaling, session state, cleanup: all of it happens inside the API. That removes a significant chunk of engineering work for agent builders.

But infrastructure is the tractable part of the problem. The harder question, which gets less attention in discussions of agent runtimes, is whether the model using those tools is actually capable of doing useful work with them. The answer varies significantly depending on which model you use, how it was trained, and how you structure the interaction.

The ReAct Pattern and Why It Exists

The standard framework for understanding how models use tools comes from Yao et al.’s 2022 ReAct paper. The name is a portmanteau of “reasoning” and “acting.” Instead of generating tool calls directly, the model interleaves explicit reasoning traces with tool invocations: it writes a thought about what it needs to do, executes a shell command, observes the result, writes another thought, and continues until it can produce a final answer.

In a shell context, this looks like:

Thought: I need to find which test file covers the payment module.
Action: find . -name "*.py" -path "*/tests/*" | xargs grep -l "payment"
Observation: tests/test_payment_gateway.py

Thought: Let me run those tests and see what is failing.
Action: python -m pytest tests/test_payment_gateway.py -v
Observation: FAILED test_retry_logic - AssertionError: Expected 3 retries, got 1

The explicit reasoning step before each action serves two purposes. It forces the model to articulate what it is trying to accomplish before committing to a specific command, which catches errors where the model knows a command’s syntax but not what it does in context. It also gives the model a reference for interpreting observations: the thought establishes the expected outcome, making deviations easier to recognize.

This pattern is not enforced by the Responses API. It is something you need to either instruct in the system prompt or rely on fine-tuning to produce automatically. Base models will often skip straight to action generation, which is faster but loses the self-correction benefit.
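Since the API does not enforce the pattern, the loop has to live in your client. A minimal sketch of a ReAct-style step driven client-side, with the model turn stubbed out; `run_react_step` and the Thought/Action line format are conventions you would instruct in the prompt, not features of the Responses API:

```python
import subprocess

def run_react_step(response_text: str) -> str:
    """Parse the last Action: line from a model turn, execute it, and
    return an Observation string to feed into the next turn."""
    action = None
    for line in response_text.splitlines():
        if line.startswith("Action:"):
            action = line[len("Action:"):].strip()
    if action is None:
        return "Observation: (no action found)"
    # Report exit code plus output; the observation becomes context
    # for the model's next Thought.
    proc = subprocess.run(action, shell=True, capture_output=True,
                          text=True, timeout=30)
    out = (proc.stdout or proc.stderr).strip()
    return f"Observation: exit={proc.returncode} {out}"

turn = "Thought: find the payment test file.\nAction: echo tests/test_payment_gateway.py"
print(run_react_step(turn))
```

The loop would alternate this with model calls until the model emits a final answer instead of an Action line.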

The Gap Between Base Models and Trained Agents

Base GPT-4o can invoke shell tools. It will also, under certain conditions, chain commands in ways that are syntactically valid but semantically wrong, fail to interpret tool output errors correctly, and generate plausible-looking shell commands that produce unexpected side effects.

codex-1, the model underlying OpenAI’s Codex product, is an o3 variant specifically fine-tuned for software engineering tasks. The fine-tuning matters for several distinct capabilities:

Error interpretation: A shell command that exits non-zero does not always mean the same thing. A test runner exiting 1 means tests failed; git clone failing because the destination directory already exists means something different from a network failure. Models trained on software engineering tasks have seen enough of these patterns to interpret them correctly. Base models sometimes treat non-zero exit codes as generic failures and take the wrong recovery action.

Idempotency awareness: Agents often need to check whether a step already completed before repeating it. pip install -r requirements.txt on an already-configured environment is harmless. git stash on a clean working tree is not always harmless. Fine-tuned models handle this more reliably.

Termination decisions: An agent that can run an unbounded number of shell commands needs to decide when it is done. Base models in agentic settings tend to either terminate too early, before verifying their output is correct, or generate unnecessary additional steps, running the same verification multiple times. This is primarily a training problem, not a prompting problem.
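The error-interpretation point can be made concrete. A hedged sketch of the kind of mapping a trained model has internalized; the table here is illustrative and far from exhaustive:

```python
def interpret_exit(command: str, code: int) -> str:
    """Map (command, exit code) to a recovery hint. Illustrative only:
    trained models internalize far richer versions of this table."""
    if code == 0:
        return "ok"
    if "pytest" in command:
        # pytest exit codes: 1 = tests failed, 2 = interrupted,
        # 3 = internal error, 4 = usage error, 5 = no tests collected
        return {1: "tests failed", 5: "no tests collected"}.get(code, "pytest error")
    if command.split()[0] == "grep":
        # grep: 1 = no matches (often not an error at all), 2 = real error
        return "no matches" if code == 1 else "grep error"
    return "generic failure"

print(interpret_exit("python -m pytest tests/", 1))
print(interpret_exit("grep -r foo src/", 1))
```

The interesting failures are the ones this table gets wrong for a specific tool, which is exactly where base models stumble.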

SWE-bench Verified gives a concrete measure of the gap. The benchmark presents real GitHub issues and asks agents to produce patches that fix them. Early agents in 2024 scored around 13-15%. codex-1 scores approximately 72% on the same benchmark. That gap is not explained by the container environment: previous agent evaluations used shell execution as well. It is primarily explained by the model’s ability to reason coherently through multi-step software tasks.

Reasoning Models Change the Pattern

o3 and related models use extended inference-time reasoning that is not surfaced to the caller. In agentic settings, this changes how the model handles situations where its approach has gone wrong.

A standard model that has taken five shell steps in the wrong direction will often continue because the context contains evidence supporting it: the commands ran, even if they produced nothing useful. A reasoning model is more likely to recognize that the current context does not contain expected evidence and revise its approach. This is not guaranteed, but the failure mode is different and often more recoverable.

For the Responses API, this means gpt-4o and o3 are not interchangeable for shell-using agents. For tasks with clear, bounded scope (write a function and run the tests), gpt-4o performs adequately and is faster and cheaper. For tasks requiring sustained reasoning across many steps with uncertain intermediate states (trace a performance regression, diagnose a flaky test), the reasoning model’s ability to revise its own plan mid-task is a meaningful advantage.
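One way to act on that distinction is a routing heuristic in the client. The keywords and step threshold below are assumptions for illustration, not anything the API provides:

```python
def pick_model(task: str, max_expected_steps: int) -> str:
    """Route bounded tasks to the cheaper model and open-ended diagnosis
    to the reasoning model. Keywords and threshold are illustrative."""
    open_ended = any(k in task.lower()
                     for k in ("diagnose", "trace", "flaky", "regression"))
    return "o3" if open_ended or max_expected_steps > 10 else "gpt-4o"

print(pick_model("Write a slugify function and run its tests", 4))
print(pick_model("Trace the performance regression in the parser", 20))
```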

Failure Modes You Will Actually Hit

Command over-engineering: Models trained primarily on text sometimes generate elaborate command chains where simple commands would do. find . -name "*.py" | xargs grep -l "pattern" | head -5 versus grep -rl "pattern" --include="*.py" | head -5. Both are correct, but the first suggests first-principles composition rather than idiomatic shell knowledge. This is harmless until the same pattern gets applied to file manipulation.

Silent failure handling: Tools that fail quietly are a recurring problem. cp source dest returns 0 even if the destination is a directory and the copy landed somewhere unexpected, so checking $? is not enough; the model has to verify the postcondition. Models that chain commands with && will stop if any step fails but will not investigate why. Whether the model checks return codes and examines what actually happened is highly model-dependent.

Scope creep: Given a task to fix a specific failing test, a model with shell access may read configuration files, check git history, install additional packages, and modify files unrelated to the failure. OpenAI’s own guidance on designing agents for production recommends preferring reversible actions and minimizing footprint, but this is enforced through prompting and fine-tuning, not through structural constraints on the tool.

Hallucinated command behavior: Models sometimes invoke flags that do not exist, call tools not installed in the container, or assume directory structure that has not been established. These errors are recoverable when the model reads the error output and adjusts. They are problematic when the model interprets missing output as success.

What You Can Actually Control

When you declare the shell tool in the Responses API, you have limited direct influence over how the model uses it. The constraints you care about need to be in the system prompt:

from openai import OpenAI

client = OpenAI()
response = client.responses.create(
    model="codex-mini-latest",
    instructions="""
        You are a code repair agent.
        Only modify files within the src/ and tests/ directories.
        Before taking any action, state your plan.
        After modifying any file, re-run the relevant tests to verify.
        Do not use: rm -rf, git push, curl to external hosts.
        """,
    input="Fix the failing retry logic test in tests/test_payment_gateway.py",
    tools=[{"type": "shell", "container": {"type": "auto"}}]
)

None of this is enforced structurally. The model follows these instructions because it was trained to follow system prompt instructions, not because the API prevents violations. For sensitive applications, wrapping each shell_call event in a validator before execution, or routing specific command patterns to a human review step, provides actual structural enforcement rather than just instructed behavior.
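A minimal version of that validator, assuming you intercept each proposed command string before forwarding it to the container. The blocked patterns mirror the system prompt above and are a sketch, not an exhaustive policy:

```python
import re

# Patterns the system prompt forbids; enforced here structurally
# rather than by instruction. Illustrative, not exhaustive.
BLOCKED_PATTERNS = [
    r"\brm\s+-rf\b",          # destructive deletes
    r"\bgit\s+push\b",        # no writes to remotes
    r"\bcurl\b.*https?://",   # no calls to external hosts
]

def validate_command(command: str) -> tuple[bool, str]:
    """Reject a proposed shell command before execution instead of
    relying on the model to follow its instructions."""
    for pat in BLOCKED_PATTERNS:
        if re.search(pat, command):
            return False, f"blocked: matches {pat}"
    return True, "ok"

print(validate_command("python -m pytest tests/ -v"))
print(validate_command("rm -rf build/"))
```

Commands that fail validation can be returned to the model as an error observation, or escalated to human review, without ever touching the container.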

The Evaluation Problem

SWE-bench Verified measures one specific capability: fixing isolated software defects in well-maintained open-source repositories. It does not measure whether models handle ambiguous tasks well, maintain consistent state across long sessions, or recognize when a task is out of scope.

Building reliable shell-using agents requires evaluation beyond published benchmarks. That means defining realistic task templates for your specific domain, measuring the fraction of tasks where the model’s output is correct and safe (not just plausible-looking), and tracking how failure modes distribute across task types. The Microsoft Research spotlighting paper and OpenAI’s instruction hierarchy work provide useful frameworks for thinking about evaluation of instruction-following under adversarial conditions, which is adjacent to evaluating shell-using agents under realistic workloads.
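Tracking "correct and safe" per task type can be as simple as a record per run and a grouped pass rate. A hypothetical harness sketch; TaskResult and the category names are made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    category: str   # e.g. "bounded_fix", "diagnosis"
    correct: bool   # output verified against ground truth, not eyeballed
    safe: bool      # no out-of-scope side effects observed

def pass_rate_by_category(results):
    """Fraction of tasks that were both correct and safe, per category."""
    totals = {}
    for r in results:
        ok, n = totals.get(r.category, (0, 0))
        totals[r.category] = (ok + (1 if r.correct and r.safe else 0), n + 1)
    return {c: ok / n for c, (ok, n) in totals.items()}

sample = [
    TaskResult("t1", "bounded_fix", True, True),
    TaskResult("t2", "bounded_fix", True, False),
    TaskResult("t3", "diagnosis", False, True),
]
print(pass_rate_by_category(sample))
```

Splitting correctness from safety matters: a patch that fixes the test while also pushing to a remote should count against the model, and an aggregate score hides that.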

The infrastructure to run agents safely is one problem. Knowing whether they are working correctly is another, and it is the one that does not get easier just because OpenAI is managing the container.
