Most discussions of coding agents focus on the loop: call the LLM, get a tool call, execute it, repeat. Simon Willison’s guide to agentic engineering patterns lays out this structure clearly. But the loop is a mechanism. The more useful question is which tool calls in that loop actually move the needle on output quality.
The answer, consistently across SWE-bench results, open source implementations, and practical experience, is code execution. Not file reading, not codebase search, not even file editing. The capability that changes the quality curve is running tests and observing their output.
What Changes When an Agent Can Execute Code
A code-generating model that cannot run its own output operates in a closed world. It writes code that looks right, follows training patterns, and passes a syntactic read. But it has no way to verify any of it. Every file it writes is a prediction, and there is nothing in the loop to correct a wrong one.
Code execution breaks that closure. The agent writes a change, runs the test suite, reads the failure output, and adjusts. This is not sophisticated behavior; it is the same tight loop that developers use every day. The model is running a simple empirical process: make a prediction, test it against ground truth, update based on evidence.
The SWE-bench benchmark evaluates agents on real GitHub issues from popular Python projects, and the data has consistently backed this up. The original SWE-agent paper from Princeton NLP, which put shell access at the center of its design, reported solving 12.5% of issues, well above earlier file-modification-only approaches. Current top entries on the leaderboard push above 30%, with the improvements coming from better scaffolding around the execution loop rather than from qualitative changes in model reasoning.
The Cycle Step by Step
A typical edit-run-observe-fix cycle looks like this:
1. Read the failing test to understand what behavior is expected
2. Read the relevant implementation files
3. Run the failing test to see the exact error and traceback
4. Edit the implementation
5. Run the test again to see whether the edit fixed it
6. If still failing, read the new error, adjust, iterate
7. Once passing, run the full suite to check for regressions
Each item is a tool call. Steps 3, 5, and 7 are doing something the read steps cannot: they generate ground truth. A failing assertion is unambiguous. A traceback points to a specific line. The model does not have to infer correctness from static analysis; it receives explicit binary feedback.
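The cycle above can be sketched as a small driver loop. This is a minimal sketch, not any particular agent's implementation: `propose_edit` is a hypothetical callback standing in for the model call, and `apply_edit` for the file write.

```python
import subprocess

def run_test(cmd: list[str], cwd: str) -> tuple[bool, str]:
    # Steps 3/5/7: execution is the ground-truth check. The exit code is
    # the binary pass/fail signal; the captured output is the traceback.
    result = subprocess.run(cmd, capture_output=True, text=True, cwd=cwd)
    return result.returncode == 0, result.stdout + result.stderr

def fix_cycle(propose_edit, apply_edit, failing_test_cmd, full_suite_cmd,
              cwd=".", max_iters=5):
    for _ in range(max_iters):
        passed, output = run_test(failing_test_cmd, cwd)
        if passed:
            # Step 7: targeted test passes, so check for regressions.
            return run_test(full_suite_cmd, cwd)
        # Steps 4-6: feed the failure output back in and apply the next edit.
        apply_edit(propose_edit(output))
    return False, "did not converge within the iteration budget"
```

The loop never inspects the code itself for correctness; every decision hinges on the observed pass/fail signal, which is the point of the cycle.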
The signal is also compact. A test failure typically produces 20 to 100 lines of output. Compared to reading a source file with thousands of lines to reason about statically, test output is extremely token-efficient per unit of information. One failing test run tells the model more about what is wrong than reading the entire module could.
Scaffolding Requirements
Exposing a bash tool is necessary but not sufficient. Several things need to be in place for the cycle to work reliably.
Execution isolation. The agent should not run tests against production systems or fire real network requests. E2B provides managed microVMs with per-session isolation and is the most common managed option in production agent deployments. OpenHands uses configurable Docker-based sandboxes. Both give you a clean execution environment that can be discarded after each session without side effects.
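Where a managed sandbox is not available, a throwaway Docker container approximates the same guarantees. A minimal sketch, not either product's API: `--rm` discards the container after the run, and `--network=none` prevents real network requests from leaving the sandbox.

```python
import subprocess

def sandbox_command(repo_root: str, test_cmd: str,
                    image: str = "python:3.12-slim") -> list[str]:
    # Build a docker invocation: mount the repo into the container,
    # run the tests inside it, block the network, discard on exit.
    return [
        "docker", "run", "--rm", "--network=none",
        "-v", f"{repo_root}:/workspace", "-w", "/workspace",
        image, "sh", "-c", test_cmd,
    ]

def run_in_sandbox(repo_root: str, test_cmd: str) -> str:
    result = subprocess.run(sandbox_command(repo_root, test_cmd),
                            capture_output=True, text=True, timeout=300)
    return result.stdout + result.stderr
```

Mounting the repo read-write means the agent's edits are visible to the test run, while everything the tests do to the rest of the filesystem dies with the container.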
Output truncation. A test suite that generates 50,000 characters of output on failure will consume context budget before the agent can do anything with it. Effective scaffolding truncates to the relevant lines: test names, failure messages, and immediate tracebacks. Aider strips ANSI escape codes and applies a configurable line limit before injecting test output into the conversation, which keeps each iteration’s token cost bounded.
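A sketch of that kind of truncation (not Aider's actual code): strip ANSI escape sequences, then keep only the tail of the output, which is where pytest places its failure summary and short tracebacks.

```python
import re

# Matches ANSI escape sequences such as color codes (\x1b[31m ... \x1b[0m).
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")

def compact_test_output(raw: str, max_lines: int = 60) -> str:
    # Drop terminal color codes, then keep the last max_lines lines;
    # pytest puts the failure summary at the end of its output.
    clean = ANSI_ESCAPE.sub("", raw)
    lines = clean.splitlines()
    if len(lines) <= max_lines:
        return clean
    dropped = len(lines) - max_lines
    return "\n".join([f"[... {dropped} earlier lines truncated ...]"] + lines[-max_lines:])
```

Truncating from the front rather than the back is deliberate: the most recent failure and its traceback land at the end of pytest's output, and that is the part the model needs.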
Targeted re-running. Rerunning the full suite after every edit is wasteful when only one test is failing. Agents that run the targeted failing test first, then the full suite as a final check, use significantly fewer tokens per iteration. Pytest’s -k flag for filtering by test name and -x for stopping on first failure are straightforward mechanisms here:
```python
import subprocess

def run_tests(repo_root: str, target_test: str | None = None) -> str:
    # Run the targeted test when one is known, otherwise the full suite.
    # -x stops at the first failure; --tb=short and -q keep output compact.
    if target_test:
        cmd = f"python -m pytest -k '{target_test}' -x --tb=short -q"
    else:
        cmd = "python -m pytest --tb=short -q"
    result = subprocess.run(
        cmd, shell=True, capture_output=True, text=True,
        cwd=repo_root, timeout=120,
    )
    output = result.stdout + result.stderr
    # Truncate from the end to keep the most recent failure
    return output[-4000:] if len(output) > 4000 else output
```
Stateless environment. Each test run should produce the same result given the same codebase state. Tests that depend on external services, specific filesystem state, or timing produce noisy signal. The scaffolding should either mock those dependencies before the agent starts, or flag them so the agent knows to discount that test’s output when reasoning about whether its edits are correct.
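One way to enforce the no-external-services part before the agent starts, sketched here as a plain context manager rather than any particular framework's fixture: patch the socket layer so a test that reaches for a real service fails loudly and deterministically instead of returning environment-dependent data.

```python
import socket
from contextlib import contextmanager

@contextmanager
def no_network():
    # Replace socket.socket.connect for the duration of the block, so any
    # real connection attempt raises instead of producing noisy signal.
    original = socket.socket.connect
    def guard(self, *args, **kwargs):
        raise RuntimeError("test attempted a real network connection")
    socket.socket.connect = guard
    try:
        yield
    finally:
        socket.socket.connect = original
```

Wrapping the agent's test runs in `no_network()` turns a hidden dependency on an external service into an explicit, repeatable failure the agent can reason about.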
Where the Cycle Breaks Down
The edit-run-observe-fix cycle assumes the tests encode correct behavior. When they do not, the failure modes are predictable.
Low test coverage means the agent makes changes, sees tests pass, and returns code that is wrong in ways no test catches. This is not a model reasoning failure; it is the tests failing to constrain the solution space. The agent is working with the feedback it has, and incomplete tests are incomplete feedback.
Flaky tests cause the cycle to loop without converging. The agent sees a failure, makes a change, sees a pass, makes a further change, sees a failure again, and concludes the first fix was wrong. False signal compounds across iterations because the agent has no reliable way to distinguish a flaky test from a genuine regression without running the same test multiple times, which multiplies token cost.
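One mitigation, at exactly the extra-run cost described above: before trusting a failure, rerun the same test without changing any code and compare outcomes. A hypothetical sketch:

```python
import subprocess

def classify_failure(test_cmd: list[str], cwd: str, reruns: int = 3) -> str:
    # Rerun the identical test against identical code. A genuine
    # regression fails every run; mixed outcomes mean the test is flaky
    # and the agent should discount its signal when judging an edit.
    outcomes = [
        subprocess.run(test_cmd, capture_output=True, cwd=cwd).returncode == 0
        for _ in range(reruns)
    ]
    if all(outcomes):
        return "passing"
    if not any(outcomes):
        return "genuine failure"
    return "flaky"
```

Spending two or three extra runs once, when a failure first appears, is usually cheaper than letting a false signal compound across many edit iterations.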
Expensive test environments slow down iteration and inflate context usage. An agent waiting for database migrations or external service startups before each test run will exhaust its context budget before converging on a fix, because each iteration takes longer and the accumulated tool output grows faster than useful signal.
None of these are agent-specific problems. They are exactly the issues that make maintaining test suites difficult in ordinary development. The agent surfaces them faster because it iterates more aggressively than a developer would, and does not get frustrated and stop.
What This Means in Practice
If you are integrating a coding agent into a real workflow, the execution environment is where investment pays off most directly. The agent’s reasoning capability and loop structure matter, but they matter in service of the edit-run-observe-fix cycle, and that cycle is only as good as the signal it receives.
A clean, isolated execution environment, compact test output, targeted re-running, and deterministic test behavior are not agent-specific requirements. They are foundational software engineering hygiene that agents expose, require, and benefit from more visibly than human developers do, because the agent will exercise the test suite at a rate no human iteration pace would reach.
Simon’s guide makes the point that the tool set shapes how an agent approaches problems. Give an agent a bash tool and a working test suite, and it approaches code as something to be verified, not just written. Take away execution access and it falls back to static reasoning. The difference in output quality between those two configurations is substantial, and it is not primarily a model capability gap. It is a feedback loop gap.