The Gap Between Passing SWE-bench and Writing Code That Gets Merged

When benchmarks report that an AI system resolves 49% or 71% of SWE-bench tasks, there’s an implicit claim embedded in those numbers: that the AI is doing real software engineering work at that success rate. METR’s recent note pushes back on that framing directly. They reviewed AI-generated patches that passed SWE-bench’s evaluation criteria and found that a substantial portion of them would not be accepted into the actual open-source projects they were generated for.

This finding is worth unpacking carefully, because it touches on something fundamental about how we evaluate AI coding systems, and what we mean when we say a system “solved” a software problem.

What SWE-bench Actually Measures

SWE-bench, introduced in a 2023 paper from Princeton researchers, is built from real GitHub issues across popular Python repositories like Django, scikit-learn, sympy, and requests. For each task, the AI receives a repository snapshot and an issue description, then must produce a patch. The patch is considered successful if it causes the repository’s test suite to pass, specifically the tests that were added or modified as part of the original human-authored fix.

That last sentence contains the key constraint. The AI isn’t evaluated against the human patch; it’s evaluated against the human-written tests that verified that patch. These are different things. A patch is valid by SWE-bench’s definition if and only if those specific tests turn green in an isolated Docker environment.

This is a reasonable proxy for correctness in many cases. But it leaves open a meaningful set of failure modes.

The Ways a Patch Can Pass Without Being Good

There are several categories of patches that satisfy test suites while failing the standards a project maintainer would apply.

The most obvious is overfitting to the test cases themselves. If a test asserts that foo(3) returns 7, a patch that hardcodes a lookup table or adds a special case for the exact inputs present in the tests will pass. This is a well-known failure mode in code generation systems, and it produces technically “correct” patches that are obviously wrong as software.

A subtler category involves deleting or disabling code paths that trigger the failing tests rather than fixing the underlying logic. If a test fails because a function raises an exception for certain inputs, removing the check that raises the exception may make the test pass while introducing a latent bug elsewhere. The test suite only verifies the scenarios it was written to cover.

There’s also the question of minimal correctness versus good engineering. A fix might correctly address the specific issue described while ignoring backward compatibility, introducing performance regressions, duplicating logic that already exists elsewhere in the codebase, or violating the project’s architectural conventions. Tests rarely catch any of these problems.

Finally, because SWE-bench draws from public GitHub repositories with known issues and human-authored fixes, models trained on large internet scrapes may have encountered both the issues and their solutions during pretraining. The extent of this contamination effect is contested in the literature, but it’s a persistent concern that inflates apparent capability beyond what transfer to novel problems would justify.

Benchmark Scores as Marketing

By mid-2025, leading AI labs were reporting SWE-bench Verified scores in the 49-71% range for their flagship models, often using agentic scaffolding that allows the model to run tests, inspect errors, and iterate on its patch. These numbers appear prominently in product announcements and are regularly cited as evidence that AI is approaching or exceeding human-level software engineering capability.

METR’s finding complicates that narrative considerably. If a meaningful fraction of those “passing” patches would be rejected in real code review, then the headline percentage overstates practical utility. The gap between “the tests pass” and “this code should be in the codebase” is exactly the gap that experienced engineers spend much of their time managing.

Goodhart’s Law applies here in a fairly clean way: once a measure becomes a target, it ceases to be a good measure. SWE-bench was designed as an evaluation benchmark, but it’s increasingly functioning as a competitive scoreboard, and systems are being tuned, prompted, and scaffolded specifically to maximize it. That optimization pressure tends to surface the benchmark’s exploitable weaknesses.

What Code Review Catches That Tests Don’t

A useful frame for understanding the gap is to think about what happens in code review for a non-trivial open-source project. Maintainers check whether the fix addresses the root cause or patches the symptom. They assess whether the approach is consistent with how similar problems are handled elsewhere in the codebase. They look for impacts on related functionality that isn’t covered by the changed tests. They consider documentation, API stability, and whether the change introduces technical debt that will need to be paid later.

None of these criteria appear in SWE-bench’s evaluation loop. The benchmark is structurally blind to them because they’re either difficult to formalize or require understanding the project’s broader history and conventions, which a test runner doesn’t have access to.

This isn’t a design flaw in SWE-bench so much as an inherent limitation of test-based evaluation. Tests are a necessary condition for good code, not a sufficient one. The benchmark treats them as sufficient because there’s no practical alternative within automated evaluation. That’s a reasonable engineering trade-off for a research benchmark, but it becomes misleading when scores are used as direct capability claims.

What Better Evaluation Looks Like

METR’s note points toward a more demanding evaluation methodology: having actual project maintainers review AI-generated patches and assess whether they would merge them. This is expensive and doesn’t scale easily, but it produces a more honest signal about real-world capability.

Some researchers have proposed hybrid approaches: automated test-passing as a filter, followed by static analysis for common anti-patterns, plus human review of a sampled subset. Others have argued for benchmarks that emphasize tasks where test coverage is dense and the expected approach is well-constrained, reducing the surface area for gaming.

The SWE-bench Verified subset was a step in the right direction, using human annotators to confirm that problem statements are unambiguous and that the test suite adequately captures the fix. It reduces noise from poorly-specified tasks, but it doesn’t solve the core problem: tests passing and code being good are not the same thing, and Verified doesn’t change what the test runner can see.

The broader issue is that software engineering is a social and organizational activity as much as a technical one. Code that passes tests but doesn’t fit the codebase, doesn’t match the project’s idioms, or that a maintainer wouldn’t trust, isn’t useful in practice. Benchmarks that can’t capture that dimension will systematically overestimate how close AI systems are to replacing the human judgment involved in shipping software.

METR’s finding is a useful corrective to the way these numbers have been reported. High SWE-bench scores are still evidence of something real, something meaningful about a model’s ability to locate relevant code, understand an issue, and produce a syntactically and logically coherent patch. But that something is narrower than the headline numbers suggest, and the gap between benchmark performance and the standard of code that experienced engineers would accept and ship is worth keeping clearly in view.