
The Institutional Knowledge Gap That SWE-bench Can't Close

Source: hackernews

METR’s analysis of SWE-bench-passing patches found that many would be rejected in real code review. The community discussion has focused on what this means for benchmark interpretation, which is correct but incomplete. The finding also points at something more specific: the twelve repositories that make up SWE-bench have documented standards for what constitutes acceptable code, and those standards are substantially richer than “the tests pass.”

These are not informal standards. Django’s contributing documentation runs to several pages covering code style, documentation requirements, and the principle that a patch is more than just code. New features require documentation. Bug fixes touching user-visible behavior require entries in the changelog. API additions go through a deprecation cycle. Non-trivial behavioral changes should be preceded by discussion on the django-developers mailing list. A patch that makes the test suite green while skipping any of this is not complete, regardless of CI status.

scikit-learn’s contributor guide is similar in spirit and adds performance requirements: changes to computationally intensive code should include benchmarks. The project requires NumPy-style docstrings, consistency with the existing estimator API design, and adherence to what the guide calls “the scikit-learn philosophy,” a set of design principles about API consistency, minimal user configuration, and backward compatibility that experienced contributors internalize over time.

These standards are the mechanisms by which a project maintains coherence across years, hundreds of contributors, and a codebase that thousands of engineers depend on. A patch that violates them is not just stylistically wrong; it makes the codebase harder to maintain, harder to document, and harder to learn for future contributors.

What Tests Cannot Transmit

The test suite captures what the code must do. It does not capture what kind of code the project wants to be.

This distinction matters because the same functional behavior can be implemented in many ways, and mature projects have preferences among those ways. A Django bug might be fixable in the view layer or in the model layer. Both approaches can pass the same test. One of them is consistent with Django’s architectural philosophy; one is not. Tests cannot adjudicate this. A maintainer familiar with the project’s design can.

Similarly, when a scikit-learn estimator raises an exception, there is an established pattern for which exception type to use, what the message should contain, and at what point in the call chain to raise it. A patch that raises ValueError where TypeError is expected, or that raises at the wrong layer, can pass the relevant test while violating the project’s API contract. The test might not check exception types at all.
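A minimal sketch of that blind spot. The function names here are invented stand-ins, not scikit-learn code; the point is that a test which only asserts "some error is raised" stays green whether the patch raises the exception type the project's API contract expects or not.

```python
def set_param(value):
    """Invented stand-in for a parameter validator in a library API."""
    if not isinstance(value, int):
        # A reviewer would likely ask for TypeError on a wrong-type
        # argument; the loose test below cannot tell the difference.
        raise ValueError(f"expected int, got {type(value).__name__}")
    return value

def raises_any(fn, *args):
    """Mimics a loose test assertion: passes if *any* exception is raised."""
    try:
        fn(*args)
    except Exception:
        return True
    return False

# Green either way -- whether set_param raises ValueError or TypeError:
assert raises_any(set_param, "3")
assert set_param(3) == 3
```

Only a test that pins the exact exception class (or a reviewer who knows the convention) catches the mismatch.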

The same issue applies to naming. When Django’s ORM has an established convention for how query methods are named and documented, a new method with a technically correct implementation but inconsistent naming requires revision. When SymPy has a pattern for how symbolic simplification functions are organized, a fix that introduces a new function at the wrong level of the module hierarchy will be redirected. When pytest’s plugin API has specific protocols for hook implementations, a patch that approximates those protocols without following them precisely will be caught in review.

Routine code review in any mature project surfaces these things constantly. What SWE-bench measures is whether a model can identify the code that needs to change and produce a change that satisfies the associated tests. What it does not measure is whether the model understands the norms of the community that maintains that code.

The Institutional Knowledge Problem

Software engineering is a practice embedded in institutions. The knowledge required to contribute well to Django is not purely technical; it includes understanding what Django is trying to be, what decisions have been made for what reasons, and which directions are off-limits regardless of technical merit. This knowledge is partly documented in contributing guides and partly transmitted through participation: reading review comments on others’ PRs, observing which patches get merged and which get revised, developing an intuition for what the project values.

The SWE-bench paper describes its evaluation as measuring whether models can “resolve GitHub issues” in real repositories. This is a valid and useful capability. What it does not claim is that passing this evaluation is equivalent to understanding the communities and conventions those repositories represent.

When METR applied reviewer judgment to benchmark-passing patches, they were applying precisely this institutional knowledge: does this patch fit here? Would we be comfortable maintaining it? Those questions cannot be answered by the evaluation harness, because the harness has no representation of community standards. It knows whether the tests pass. It does not know whether the approach is right.

This creates a structural gap unlikely to close simply because models get better at passing tests. A model optimizing for SWE-bench receives feedback when tests pass and when they fail. It receives no feedback about whether the approach is consistent with the project’s architecture, whether documentation was updated appropriately, or whether the fix addresses the right layer of the stack. These are things you learn by being embedded in the community, reading review comments, and observing what gets accepted and what gets revised.

Goodhartian pressure compounds this. As Goodhart’s Law predicts, making test passage the optimization target creates pressure toward models that are specifically good at test passage, not necessarily good at the broader thing test passage was meant to proxy. The history of NLP benchmarks illustrates this: GLUE saturated in two years, SuperGLUE faster, BIG-bench faster still. Each benchmark was designed to be harder to game than the last; each was gamed. SWE-bench was designed around real codebases specifically to resist saturation. The test oracle at its core creates the same optimization surface.

What the Gap Looks Like in Practice

Consider the actual submission surface. Suppose a model produces a patch for a Django issue that makes the relevant tests pass. The patch modifies a method in django/db/models/query.py to handle a specific edge case, but does so by adding a conditional branch at the call site rather than fixing the underlying utility method that three other callers also use incorrectly. The tests pass. Django’s CI is green.

A Django core developer reviewing the patch recognizes that the fix papers over a symptom rather than addressing the cause. They request that the utility method be fixed instead, and that the other callers be updated at the same time. This is standard review feedback in a project of Django’s maturity. None of it is encoded in the test suite.
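The scenario above can be reconstructed in miniature. Everything here is hypothetical, not Django internals: a shared utility has a bug, and a patch can either repair the utility or special-case the one call site the failing test exercises. Both patches make the test green; only one fixes the other callers.

```python
def chunk_size(total, parts):
    """Shared utility used by several callers (hypothetical)."""
    return total // parts            # bug: silently drops the remainder

# Symptom fix: a conditional at the call site the failing test exercises.
def paginate(total, parts):
    size = chunk_size(total, parts)
    if total % parts:                # papers over the bug for this caller only
        size += 1
    return size

# Root fix: repair the utility itself, so every caller is corrected.
def chunk_size_fixed(total, parts):
    return -(-total // parts)        # ceiling division

# The failing test checks paginate(); either patch makes it pass:
assert paginate(10, 3) == 4
assert chunk_size_fixed(10, 3) == 4
# But under the symptom fix, other callers of chunk_size() stay wrong:
assert chunk_size(10, 3) == 3
```

A test oracle scores both patches identically; distinguishing them requires the reviewer's knowledge that three other callers depend on the same utility.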

Or consider documentation. If the patch changes the behavior of a public queryset method, Django expects a corresponding update to the documentation in docs/ref/models/querysets.txt. The test suite does not check whether documentation was updated. A reviewer does. A patch missing this update gets a comment requesting it before merge.

These scenarios are not contrived. They reflect the ordinary criteria applied in the ordinary course of reviewing contributions to these projects. The gap METR measured is the aggregate effect of these ordinary criteria being unmet.

What Better Evaluation Requires

Automating a closer approximation of institutional knowledge is hard, but some steps are tractable. Each of the SWE-bench repositories has its own linter configuration; running the candidate patch through that configuration before scoring would catch a class of convention violations. Checking whether the changed code is consistent with the exception types used in adjacent code is a pattern-matching problem that static analysis can partially address. Verifying that documentation files are updated when public API behavior changes is checkable automatically.
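One of the checks above, the documentation one, is simple enough to sketch. The path conventions and the "public API" heuristic here are illustrative assumptions, not part of SWE-bench or any project's tooling: given the list of files a patch touches, flag it when source changed but no documentation did.

```python
def touches_docs(changed_files):
    """True if any changed path lives under the docs tree (assumed layout)."""
    return any(p.startswith("docs/") for p in changed_files)

def touches_public_api(changed_files):
    """Crude heuristic: any non-test Python file counts as public surface."""
    return any(
        p.endswith(".py") and "/tests/" not in p and not p.startswith("tests/")
        for p in changed_files
    )

def review_flags(changed_files):
    """Return a list of automatable review complaints for a patch."""
    flags = []
    if touches_public_api(changed_files) and not touches_docs(changed_files):
        flags.append("source changed but no docs updated")
    return flags

assert review_flags(["django/db/models/query.py"]) == [
    "source changed but no docs updated"
]
assert review_flags(
    ["django/db/models/query.py", "docs/ref/models/querysets.txt"]
) == []
```

A real version would need per-project path configuration and a notion of which modules are actually public, but even this crude form catches the class of omission described above.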

None of this substitutes for human review, but it eliminates some of the largest gaps between the automated oracle and what reviewers actually care about. METR’s own evaluation work favors longer-horizon, task-completion evaluations over single-patch generation; such tasks have the useful property that gaming strategies become harder as complexity increases.

For teams using AI coding tools, the practical implication is straightforward: run model output through your actual review process on a sample before trusting benchmark numbers as a proxy for production-readiness. The fraction that survives your review is the number that matters. It will be lower than the headline benchmark score; how much lower depends on how tightly your codebase enforces its own conventions.

SWE-bench scores correlate with real coding capability, and a model that cannot pass SWE-bench at a reasonable rate will not produce useful patches in practice. But “passes tests in a Docker container” and “fits into a project’s institutional context” are different criteria, and the METR finding is the empirical measure of how different they currently are.
