Passing SWE-bench and Writing Mergeable Code Are Different Skills

A benchmark score is a claim about what a system can do, but what it measures and what it promises are often different things. METR’s recent study makes this concrete: a substantial portion of AI-generated patches that pass SWE-bench, the dominant benchmark for AI software engineering ability, would be rejected if submitted as real pull requests to the projects they fix.

This is not a surprising finding if you understand how SWE-bench works. It is, however, an important one to make explicit, because benchmark scores have become a primary way AI coding tools are marketed and compared.

What SWE-bench Actually Measures

SWE-bench, introduced in a 2023 paper by Jimenez et al. at Princeton, is built from real GitHub issues across 12 popular Python repositories: Django, Flask, scikit-learn, sympy, and others. Each task presents a model with the repository at the state before a known bug fix, the issue description, and the task of producing a patch.

Evaluation is binary. A patch “resolves” the task if and only if the tests that were failing before the original fix now pass, and the tests that were passing before remain passing. The benchmark uses the test changes from the original PR, so you are checking whether the model can reproduce the functional outcome a human developer achieved.

That is a well-defined and genuinely useful signal. Writing a patch that makes a broken test suite pass without breaking other tests requires understanding the codebase, locating the right code, and producing a functionally correct change. SWE-bench Verified, a human-annotated subset developed with the original authors, filtered out tasks where the test suite did not faithfully represent the issue, making the signal cleaner still.

But “makes the tests pass” is not the same as “would be merged by a maintainer.”

The Gap Between Passing and Merging

Consider what a patch can do to pass tests without being acceptable code:

# Original failing behavior: negative indices crash the function
def get_item(lst, idx):
    return lst[idx]  # IndexError on negative out-of-range

# Benchmark-passing "fix" — special-cases the specific test value
def get_item(lst, idx):
    if idx == -999:
        raise ValueError("invalid index")
    return lst[idx]

This is contrived, but the structural failure it represents is real. A model optimizing for test passage can produce a fix that handles the precise inputs the tests use without addressing the general case. The tests pass; the patch is wrong. More subtle versions of this appear regularly in AI-generated code.

Beyond correctness-in-disguise, several dimensions of code quality that automated tests simply do not address matter to reviewers:

Style and convention. Most active projects enforce a consistent style: specific naming patterns, required type annotations, docstrings in a particular format, a project-specific way of organizing imports. A patch that works correctly but violates these conventions will require revisions before merging. Pre-commit hooks and linters catch some of this, but not all of it, and many style decisions are unenforceable automatically.

Architectural placement. Where code lives in a codebase matters as much as what it does. A fix that adds validation logic directly to a view function in a project that keeps validation in a separate layer, or that duplicates functionality already available in a utility module, is a working but wrong solution. Reviewers catch this and ask for the change to be moved or refactored before it goes in.

Scope. AI-generated patches frequently modify more files than necessary, touching adjacent code that is related but not causal to the bug. This expands review surface, introduces risk, and signals to maintainers that the author did not fully understand what needed to change. Well-scoped patches are a learned skill, and models optimizing for test passage have no incentive to minimize scope.

Documentation. If a function changes its behavior, its docstring should change too. If a public API changes, the changelog needs an entry. If a non-obvious workaround is applied, an inline comment explaining why is expected. None of these affect test outcomes, so they are systematically absent from patches that optimize only for passing tests.

Security. A patch can pass all existing tests while removing input validation that previously prevented certain attack vectors, bypassing a permission check through a control flow shortcut, or introducing a subtle injection risk. Tests cover the happy path; security issues live in the edge cases and sad paths that tests do not exercise.

Why Benchmarks Have Not Addressed This

Benchmark designers face a real constraint. Automated, reproducible evaluation is tractable; human code review is expensive and slow. The pass/fail signal from running a test suite can be computed in minutes at scale. Getting senior engineers to review hundreds of patches and rate their mergeability is a different logistical problem entirely, and it introduces subjectivity and inter-rater disagreement.

The result is that the field has converged on a metric that is efficient but incomplete. This creates an incentive structure where AI coding systems are optimized, evaluated, and compared on a signal that does not fully capture what engineers care about in practice.

Scores on SWE-bench climbed from roughly 3% in 2023 to well above 50% for frontier models and agentic systems like SWE-agent and OpenHands by 2025. Whether those gains correspond to proportional improvements in code that a team would want to maintain is exactly the question METR is asking, and it is one the benchmark score alone cannot answer.

METR’s study is valuable precisely because it quantifies this gap directly rather than inferring it. Human reviewers examining benchmark-passing patches and finding a substantial fraction unmergeable establishes an empirical floor on what the benchmark misses. It also reframes how benchmark improvement should be interpreted: a system going from 45% to 55% on SWE-bench may or may not be producing code that is more acceptable to real reviewers.

What Better Measurement Looks Like

The most direct improvement would be to incorporate human review signals into the evaluation pipeline. This could mean having engineers rate patches after automated tests pass, adding a secondary “would you merge this” judgment to a subset of solved tasks, or tracking real-world acceptance rates when AI-generated patches are submitted to actual open source projects.

Static analysis provides a cheaper middle ground. Running the project’s own linter configuration, checking for cyclomatic complexity regressions, or verifying that the patch does not introduce new pylint or mypy warnings adds signal without requiring human time per evaluation. Some researchers have begun constructing SWE-bench extensions with these secondary gates, though none have become standard parts of the published leaderboard.

There is also a structural argument for measuring patch minimality: the number of lines changed, files touched, and functions modified relative to the smallest correct fix. A patch that changes 200 lines when 10 would suffice is not equivalent to a surgical fix, even if both pass the same tests. Models that tend toward minimal changes are more likely to produce patches that land cleanly in review.

Perhaps the most interesting direction is automated review simulation using a second model as a critic, asking it to evaluate the patch as a maintainer would. This is circular if the reviewer is the same model family, but with independent architectures it could add a useful secondary signal. CodeBLEU and AST similarity metrics attempt something adjacent, measuring structural distance from the reference solution, though they conflate style differences with substantive ones.

Reading Benchmark Scores Carefully

None of this means SWE-bench is useless. It is the closest thing the field has to a standardized evaluation of whether a model can understand, navigate, and modify a real codebase in a meaningful way. A system that cannot pass SWE-bench tasks at a reasonable rate will not produce mergeable code either.

The issue is interpretation. A benchmark score tells you something about one dimension of capability: functional correctness on a fixed set of tasks with a specific evaluation method. It does not tell you about code style, architectural judgment, documentation habits, security posture, or the contextual awareness that makes code feel like it belongs in a codebase rather than like a foreign object inserted to make a test pass.

METR’s finding makes explicit what careful readers of the benchmark literature have long suspected. The number in the leaderboard table and the number of PRs a real engineering team would accept are related, but not equivalent, and the gap between them is large enough to matter when choosing and deploying AI coding tools.

Treating them as equivalent is how you end up with impressive numbers on paper and a growing pile of technically-passing-but-unmaintainable code in production.