Code Review Captures Things That Tests Cannot, and That's Why the Gap Exists

A recent note from METR makes a pointed observation: many patches that pass SWE-bench would not be merged by the maintainers of the projects those patches target. The finding has prompted a lot of commentary about benchmark validity, but the more interesting question is why the gap exists at all, and what it reveals about what code review is actually doing.

SWE-bench is not a fuzzy or hand-wavy evaluation. It takes real GitHub issues from popular Python repositories, applies a candidate patch using git apply, runs the repository’s existing test suite inside a Docker container, and records whether the relevant tests pass without breaking others. The methodology is precise and the scope is honest. The benchmark measures exactly what it says it measures: does this patch make these specific tests pass.

That precision is also the limit. Tests verify behavior at the specific points a test author anticipated and encoded. They say nothing about whether the code is the right approach, whether it fits the project’s architectural direction, whether it creates maintenance burden, or whether it solves the stated problem without leaving the underlying cause intact. Code review checks all of those things, and maintainers weight them heavily.

What Reviewers Are Actually Checking

When a maintainer rejects a patch that passes tests, they are usually not disputing correctness at the tested boundary. They are asking a different set of questions.

Does this fix address the root cause or patch a symptom? A bug often has a narrow surface manifestation and a wider structural origin. A patch that eliminates the symptom without touching the structure passes tests but leaves related bugs open and signals to reviewers that the submitter did not fully understand the code. In a well-maintained project, maintainers prefer a larger, slower fix to a targeted workaround that obscures the problem.

Does this approach fit where the project is going? Repository history contains decisions, but it does not contain the decisions that are in progress. A maintainer might be midway through a refactor that would make the patched area obsolete, or might have already decided to replace a subsystem. A patch that deepens investment in code that is slated for removal is technically correct and practically counterproductive. No benchmark has access to that context.

Is this the right abstraction? A common failure mode is solving a specific case when the project’s convention calls for a general solution, or adding a general solution when the project favors keeping things concrete. Neither is universally wrong; both are matters of fit with the existing codebase’s style and philosophy. Tests do not capture this dimension at all.

Does the code match implicit norms the test suite does not enforce? Every active project accumulates conventions around naming, function size, error handling patterns, and documentation that are not enforced by linting or tests. These conventions exist because consistency reduces cognitive load for the people who maintain the code. A patch that violates them in small ways is a source of friction, and experienced maintainers notice.

The Tacit Knowledge Problem

A substantial fraction of what maintainers know is not in the repository. It lives in rejected PRs, in design discussions that happened in issue comments two years ago, in Slack threads, and in the accumulated intuitions of people who have spent months in a codebase. When a maintainer says a patch takes the wrong approach, they are often drawing on context that cannot be extracted by reading the current tree.

This is not a solvable problem for a benchmark. A benchmark can only observe what is in the repository at evaluation time. It cannot observe that the project lead considers a particular pattern an antipattern, or that a dependency the patch adds is one the team is actively trying to remove. The METR findings are partly a direct consequence of this: an AI agent working from the repository state alone is missing the same context a new external contributor would be missing, but without the ability to ask clarifying questions before submitting.

Specific Failure Modes That Fool Tests

It helps to be concrete about the mechanisms. A patch can pass a test suite by manipulating the tested behavior without addressing the broader problem. Consider a function that computes a value incorrectly under specific input conditions. A patch might special-case those conditions with a hardcoded correction that makes the test pass. The behavior at the test boundary is now correct; the function is still wrong for untested inputs, and the special case will puzzle anyone reading the code later.

A more subtle failure mode is scope creep in the wrong direction. A bug report describes problem X. An AI patch fixes X and also silently changes behavior Y, which has no associated test. The test suite passes, Y’s regression is invisible, and the next person to touch Y encounters a broken invariant with no explanation in the commit history.

Patches can also fail at the interface design level. If a function’s signature or return contract is changed to fix a bug, callers that were not part of the test harness may now behave differently. This is a category of problem that integration tests and type checkers reduce but do not eliminate, and it is exactly the kind of thing a reviewer who knows the codebase catches on first read.

What This Means for How AI Tools Get Deployed

The METR finding is specifically damaging to the “automated PR” workflow, where an AI agent opens a pull request without human involvement in the drafting stage. That workflow depends on the agent producing code that is not just correct but mergeable, and mergeability requires context the agent does not have.

For assisted workflows, the finding is less damaging and more clarifying. When a developer uses an AI to draft a solution and then reviews and revises it before submitting, the human reviewer is supplying exactly the missing context: project direction, maintainer preferences, tacit conventions, architectural fit. The output of that process is more likely to be mergeable because a person with the right context was in the loop.

This is a practical distinction worth making when teams evaluate AI coding tools. A tool that improves the speed and quality of human-reviewed code is a different product from a tool that autonomously commits and opens PRs. The benchmark performance of these tools does not differentiate between those use cases, but the METR evidence suggests the outcomes will diverge significantly.

The Evaluation Gap Reflects a Real Difficulty

None of this is an argument that SWE-bench is broken or that AI coding tools are less capable than advertised. It is an argument that measuring coding ability is harder than running tests, and that the field has largely avoided confronting that difficulty by treating test passage as a proxy for the full problem.

Building a benchmark that captures mergeability would require ongoing human review of generated patches, coordination with project maintainers, and longitudinal data about whether merged patches held up over time. That is expensive and slow. Test passage is cheap and fast. The asymmetry explains why the field converged on the easier metric.

The SWE-bench paper was explicit about its scope when it was published. The inflation of its results into general claims about software engineering capability happened in the gap between what the benchmark measures and how it gets discussed in announcements and leaderboards. METR’s note is a useful intervention, but the underlying issue is interpretive: a precisely scoped benchmark is being used to make broader claims, and those claims are now being tested against what practitioners actually care about when they merge code.