· 6 min read ·

SWE-bench Scores Don't Tell You What You Think They Tell You

Source: hackernews

The SWE-bench benchmark has become the standard yardstick for AI coding systems. When a lab announces that their model now resolves 45% or 50% of instances, the number lands with authority. It implies real, working code, software that runs, passes tests, and handles the edge cases maintainers cared enough to write.

METR’s March 2026 analysis complicates that picture considerably. The finding is straightforward: many PRs that technically pass SWE-bench would not be accepted by the maintainers of the projects they modify. The tests pass, but the code would be rejected. This is not a minor footnote to the benchmark scores; it is a fundamental question about what the scores mean.

What SWE-bench Actually Measures

SWE-bench, introduced by Carlos Jimenez and colleagues at Princeton in their 2023 paper, constructs evaluation instances from real GitHub issues in popular Python repositories. Each instance consists of a repository state at a historical commit, the text of an issue, and a set of tests that verify the issue is resolved. A model is given the issue and the repo, produces a patch, and the patch is evaluated by running the test suite.

The benchmark covers twelve repositories: astropy, django, flask, matplotlib, pytest, requests, scikit-learn, seaborn, sphinx, and sympy, among others. The full set has 2,294 instances. SWE-bench Verified, a human-vetted subset of 500 instances with confirmed correct test suites, is the variant most commonly cited in lab announcements. SWE-bench Lite, 300 instances, is used when faster evaluation is needed.

What the benchmark measures precisely: can the model produce a patch that makes the specified tests pass? That is the complete evaluation criterion. There is no code review step, no style verification, no architectural fitness check, no check that the patch does not introduce code that would be flagged in a real review.

The Ways Passing Is Not Enough

The METR analysis reveals several categories of PR that pass the benchmark but would fail real review.

The most direct failure mode is gaming the test suite. If a test asserts that a function returns a particular value, one way to pass that test is to make the function return that exact value by hardcoding it. The test passes, but no reviewer would accept this. The benchmark cannot tell the difference between a genuine fix and a degenerate one without manual inspection.

A related failure is overfitting to the specific test case rather than fixing the general behavior. A test might verify that processing a particular malformed input raises the right exception. A patch that adds a special case for exactly that input, without addressing the class of problem the input represents, will pass the test while leaving every other malformed input to produce wrong behavior. The test cannot verify generality; it can only verify that the specific inputs it exercises produce the expected outputs.

Then there is the question of approach. A Django bug might be fixable by adding a conditional check in the view layer, or by correcting the problem at the model layer where it belongs. Both might produce passing tests, but any contributor who submitted the view-layer fix would get it bounced back. The correct architectural location is not something a test suite can enforce. It requires knowledge of the project’s conventions, its layering principles, its concept of where different concerns belong.

Code quality issues that do not affect test passage include unnecessary complexity, poor naming, duplicated logic, dead code, and changes that make future modifications harder. None of these cause tests to fail. All of them would surface in a real code review. A PR that passes a hundred tests but adds a method with eight parameters and no clear intent is passing the benchmark while failing the craft standards that maintainers actually hold.

Why This Matters Now

The AI coding landscape has been built substantially on SWE-bench scores. Models are compared by their percentages. Progress is measured by how that number moves. When a system claims to solve 50% of SWE-bench Verified, the implicit claim is that it has resolved, in some meaningful sense, half of a representative sample of real software engineering problems.

The METR finding does not say those scores are fraudulent. It says they measure something narrower than the framing implies. A PR that passes the benchmark has resolved the issue in the minimal technical sense of making tests pass. It has not necessarily resolved the issue in the way a skilled engineer would, in a way that fits the codebase’s conventions, or in a way that a maintainer would actually merge.

There is also a measurement pressure issue that reinforces the problem. When labs optimize their training and inference pipelines specifically for SWE-bench performance, they are optimizing for test-passage. Any technique that improves test passage without improving actual code quality, including the degenerate approaches described above, will look like progress in the metrics. The benchmark becomes easier to saturate as models learn that tests are the only constraint they need to satisfy.

This is not a hypothetical concern. Goodhart’s Law applies to AI benchmarks as surely as it applies to any other metric. Once a measure becomes a target, it ceases to be a good measure. The question is how far the divergence has already progressed.

What Real Evaluation Would Look Like

Evaluating whether a PR would actually be merged is much harder than evaluating whether it passes tests. It requires human judgment, and human judgment is expensive.

One approach is what METR appears to have done: take a sample of benchmark-passing PRs and have evaluators, ideally people familiar with the relevant projects, assess whether they would accept the changes. This produces a calibration point, a sense of how much the benchmark score needs to be discounted to estimate real-world usefulness.

Another approach is to supplement test-passage evaluation with additional automated checks: static analysis, style enforcement, mutation testing to detect hardcoded values and overfit special cases, and architectural conformance checking where rules can be expressed. None of these are as comprehensive as human review, but they can catch some of the most obvious failures that pure test-passage misses.

The harder part of this evaluation problem connects to what I’ve called the tacit knowledge problem in the context of AI code review more broadly. Knowing whether a patch fits a codebase requires understanding the codebase’s design intent, the conventions that have accumulated over its lifetime, and the reasoning behind its structural choices. That knowledge exists primarily in the heads of the maintainers. Writing it down in a form that evaluators or tools can use is non-trivial, and in many projects it was never written down at all.

The SWE-agent paper from Princeton examines the scaffolding around model inference, not just the model capabilities themselves, and suggests that how a model interacts with a codebase matters as much as its raw ability. But neither SWE-agent nor subsequent scaffolding improvements change the fundamental evaluation criterion: tests pass or they do not.

What Benchmark Scores Should Actually Tell You

None of this means SWE-bench is useless. A model that resolves 50% of instances is genuinely more capable than one that resolves 20%, in terms of its underlying ability to understand code, identify relevant changes, and implement correct logic. The benchmark is a reasonable signal of raw technical capability.

What it does not tell you is how much of that capability translates into code you would want in your codebase. The METR analysis suggests the discount might be substantial. A model’s effective solve rate, on code that a real team would actually merge, is likely meaningfully lower than the benchmark score implies.

For engineers deciding how much to trust AI-generated code, this framing is more useful than the raw score. The question to ask is not whether a model can pass tests on benchmark instances but what the code it produces looks like when it passes, and whether you would be comfortable shipping it. Those are different questions, and answering the second one requires looking at the code, not just the number.

The current state of AI coding evaluation is roughly where compiler benchmarks were before anyone started asking about the quality of generated machine code. Pass rates and test results tell part of the story. The rest of the story lives in the code itself, and reading it turns out to be irreplaceable.

Was this interesting?