Benchmark Rot: Why SWE-bench Verified Couldn't Survive Its Own Success
Source: openai
In late February 2026, OpenAI published a post explaining why they no longer report SWE-bench Verified scores. The short version: the benchmark is contaminated and the tests are flawed. The longer explanation is more interesting, because this pattern has played out before and will keep playing out unless the field changes how it thinks about evaluation design.
What SWE-bench Actually Tests
SWE-bench was introduced by Carlos Jimenez and colleagues at Princeton in a paper that appeared at ICLR 2024. The benchmark consists of 2,294 tasks drawn from 12 popular Python repositories: Django, Flask, scikit-learn, requests, sympy, and others. Each task presents a GitHub issue, and the model must produce a unified diff patch against the codebase at the specific commit predating the original fix. The patch counts as a success when the tests tied to that fix pass once it is applied, without breaking tests that already passed.
The evaluation is hands-off by design. The harness runs the model’s patch inside a Docker container with the repository checked out at the right state, executes the tests, and records pass or fail. This made SWE-bench feel rigorous compared to benchmarks that relied on string-matching or simple unit tests written specifically for the evaluation.
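The pass-or-fail decision at the end of that pipeline can be sketched in a few lines. This is an illustration rather than the actual harness: the function name `is_resolved` and the example test names are hypothetical, and it assumes each task lists the tests the fix should turn from failing to passing along with the tests that must keep passing.

```python
from typing import Dict, Iterable


def is_resolved(
    test_outcomes: Dict[str, bool],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """Return True only if every required test passed after the patch.

    test_outcomes maps a test identifier to True (passed) or False (failed).
    A required test that is missing from test_outcomes counts as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_outcomes.get(name, False) for name in required)


# Hypothetical outcomes collected after running the tests in the container.
outcomes = {"test_fix": True, "test_existing": True, "test_other": False}
print(is_resolved(outcomes, ["test_fix"], ["test_existing"]))  # True
print(is_resolved(outcomes, ["test_fix"], ["test_other"]))     # False
```

The result is binary per task, which is what makes the headline number a simple percentage of resolved tasks.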
SWE-bench Verified came later as a curated subset of roughly 500 tasks that human annotators had checked to confirm: the task is actually solvable with a reasonable patch, the test suite verifies the right behavior, and the problem statement is unambiguous. The motivation was sound. The original SWE-bench contained tasks that turned out to be unsolvable, had underspecified requirements, or had tests that didn’t actually validate the core fix. Verified was supposed to be the cleaner, more reliable version.
For a while, it served that purpose. Early 2024 performance from top agents sat around 10-20%. The benchmark was hard enough to meaningfully separate models and scaffolding strategies. That headroom disappeared quickly.
How the Contamination Happened
The core problem is that SWE-bench tasks are drawn from public GitHub history. The issues, the discussion threads, the pull requests that resolved them, and the final patches are all publicly available and have been for years. Every major web crawl used to build large-scale training corpora includes GitHub. The Stack, RedPajama, The Pile, and the proprietary datasets used by the major labs all contain substantial GitHub content.
A model trained on data with a cutoff anywhere in 2023 or 2024 has, with high probability, seen the text of the GitHub issues used in SWE-bench Verified. It has likely seen the associated PR descriptions. In many cases, it has seen the actual patch. When the same model is then evaluated on SWE-bench, it is not necessarily reasoning through the problem from first principles. It may be recognizing a pattern it encountered during training and reproducing a solution that resembles the one it saw.
This is different from deliberate benchmark overfitting. No one necessarily trained a model specifically on SWE-bench tasks. The contamination is structural: the benchmark draws from a data source that is already in most training corpora, so the overlap is unavoidable.
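One common way to probe for this kind of overlap is to measure how many word n-grams from a benchmark item also appear in a sample of training text. The sketch below is a simplified illustration, not a description of any check OpenAI ran: the function names, the default window of 8 words, and the whitespace tokenization are all assumptions.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(benchmark_text: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus."""
    bench = word_ngrams(benchmark_text, n)
    if not bench:
        return 0.0  # item shorter than n words: nothing to compare
    corpus = word_ngrams(corpus_text, n)
    return len(bench & corpus) / len(bench)


# Hypothetical issue text that appears verbatim inside a crawled page.
issue = "count returns the wrong value after union with an empty queryset"
crawl = "bug report: " + issue + " (reported on the tracker)"
print(overlap_fraction(issue, crawl))  # 1.0
```

A high fraction only shows that the text was available to the model, not that the model memorized it, so a check like this is a screening step rather than proof.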
The second issue OpenAI identified is flawed tests within the Verified subset. Some tests passed for the wrong reasons: they checked surface behavior that could be satisfied by a superficially plausible patch without actually fixing the underlying issue. Some were flaky across runs. A model that learns to generate patches that look syntactically similar to the original resolution, without deeply understanding the code, will score well on these tests.
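Flakiness, at least, is cheap to screen for: run the same test several times against the same patch and see whether the outcome changes. The following is a minimal sketch under that assumption. The name `classify_test`, the default of five repeats, and the labels it returns are hypothetical choices for illustration.

```python
from typing import Callable


def classify_test(run_test: Callable[[], bool], repeats: int = 5) -> str:
    """Run a test several times and label it by the consistency of its outcome.

    run_test is any zero-argument callable returning True on pass.
    """
    outcomes = [run_test() for _ in range(repeats)]
    if all(outcomes):
        return "stable-pass"
    if not any(outcomes):
        return "stable-fail"
    return "flaky"


# A scripted stand-in for a real test runner, failing once in five runs.
scripted = iter([True, False, True, True, True])
print(classify_test(lambda: next(scripted)))  # flaky
```

Repetition catches nondeterminism only. A test that consistently passes for the wrong reason needs a different check, such as confirming that it fails against a deliberately incorrect patch.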
The combination of training leakage and weak tests produces a benchmark that reliably rewards pattern recognition over genuine problem solving. By the time scores were clearing 50% and climbing toward 70%, the signal had degraded badly.
The Same Lifecycle, Again
HumanEval, introduced by OpenAI in 2021, contained 164 hand-written Python programming problems covering standard algorithmic and string-processing tasks. It was the standard code generation benchmark for roughly two years. By 2024, frontier models were scoring above 90% and the benchmark could no longer tell them apart.
MBPP (Mostly Basic Programming Problems) followed the same arc. GSM8K, a grade-school math benchmark released in 2021, was approaching saturation within three years. MMLU saw growing concern about test set contamination from researchers who found significant overlap between benchmark questions and content in common training datasets.
The lifecycle is consistent. A benchmark is created and earns trust through rigorous design. It becomes the standard metric, cited in papers and press releases. Scores on that benchmark begin to appear in marketing copy. The community optimizes for it. Training data adjacent to the benchmark gets more attention. Scores rise faster than underlying capability. The benchmark stops measuring what it was designed to measure.
LiveCodeBench tried to break this cycle with a rolling window approach, continuously ingesting new competitive programming problems published after a specified date. Because part of the benchmark is always in the future relative to any model’s training cutoff, hard contamination is structurally harder. The tradeoff is that you lose stable comparisons over time, since the benchmark content itself changes.
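The rolling-window idea reduces to a date filter: score a model only on problems published after its training cutoff. The sketch below illustrates that filter and is not LiveCodeBench’s implementation. The `Problem` record, its field names, and the example dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Problem:
    problem_id: str
    published: date


def problems_after_cutoff(problems: List[Problem], training_cutoff: date) -> List[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.published > training_cutoff]


pool = [
    Problem("p1", date(2023, 11, 5)),
    Problem("p2", date(2024, 3, 17)),
    Problem("p3", date(2024, 8, 2)),
]
eligible = problems_after_cutoff(pool, training_cutoff=date(2024, 1, 1))
print([p.problem_id for p in eligible])  # ['p2', 'p3']
```

Two models with different cutoffs end up scored on different subsets, which is exactly the loss of stable comparison described above.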
What SWE-bench Pro Does Differently
SWE-bench Pro, which OpenAI now recommends, is designed with contamination resistance as an explicit goal. The tasks are drawn from more recent repository activity and skew toward higher complexity: multi-file changes, longer reasoning chains, less dependency on recognizable fix patterns.
The more important design decision is that SWE-bench Pro is built to maintain a difficulty ceiling well above current model capabilities. One of the failure modes of SWE-bench Verified was that the score distribution compressed near the top. When multiple frontier models are all clearing 60-70% of a benchmark, the benchmark tells you very little about which model is actually better.
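A rough calculation shows how little room a compressed score range leaves. The sketch below treats each of roughly 500 tasks as an independent pass/fail trial and uses the normal approximation to the binomial distribution. Both are simplifying assumptions introduced here, and `margin_of_error` is a hypothetical helper.

```python
import math


def margin_of_error(pass_rate: float, n_tasks: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a pass rate measured on n_tasks tasks."""
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


for rate in (0.60, 0.65, 0.70):
    moe = margin_of_error(rate, 500)
    print(f"{rate:.0%} on 500 tasks: +/- {moe * 100:.1f} points")
```

Under these assumptions the margin is about four percentage points in this range, so a gap of one or two points between models is hard to distinguish from noise. A paired comparison on the same tasks can be tighter than this unpaired estimate.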
Pro’s stricter test verification also addresses the flawed-test problem. If the tests are good, a model that pattern-matches without actually solving the issue will fail more consistently.
These are meaningful improvements. But they don’t address the structural problem: SWE-bench Pro will also become contaminated eventually. If it becomes the standard evaluation for coding agents, its task distribution will appear in future training data. The GitHub issues it draws from will age into training corpora. The community will document the patterns. Continued exposure to the benchmark through evaluation pipelines will create implicit pressure to optimize for it. The timeline may be longer than SWE-bench Verified’s, but the endpoint is the same.
The Incentive Structure Is the Problem
Benchmark contamination is accelerated by the fact that benchmark scores have become marketing material. The release posts for major models routinely lead with SWE-bench numbers. A one-point improvement on a coding benchmark is worth a paragraph in a launch announcement.
This creates an incentive structure where labs are motivated to maximize benchmark scores independent of whether those scores reflect genuine capability improvements. Even without deliberate manipulation, the pressure to score well influences which training data gets prioritized, which fine-tuning tasks get chosen, and which evaluation results get published versus buried.
The research community is aware of this. The people who built SWE-bench Verified and SWE-bench Pro understand benchmark contamination deeply. But the deployment of benchmark scores in press releases follows different incentives than the design of the benchmarks themselves.
What would help is treating benchmark scores the way security researchers treat CVSS scores: as one input in a larger assessment, always accompanied by methodology disclosure, never as the summary headline. That means publishing training data cutoffs alongside benchmark results, disclosing what contamination checks were run, and acknowledging when a score reflects a benchmark that has known limitations.
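One way to make that disclosure habitual is to treat a score as incomplete until its context is attached. The record below is a hypothetical sketch, not an existing reporting standard. The field names and the `missing_disclosures` helper are assumptions chosen to mirror the three items listed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BenchmarkReport:
    benchmark: str
    score: float
    training_cutoff: Optional[str] = None  # e.g. "2024-06", as stated by the lab
    contamination_checks: List[str] = field(default_factory=list)
    known_limitations: List[str] = field(default_factory=list)


def missing_disclosures(report: BenchmarkReport) -> List[str]:
    """Name the disclosure fields that were left empty."""
    missing = []
    if not report.training_cutoff:
        missing.append("training_cutoff")
    if not report.contamination_checks:
        missing.append("contamination_checks")
    if not report.known_limitations:
        missing.append("known_limitations")
    return missing


# Placeholder values for illustration only, not real results.
bare = BenchmarkReport(benchmark="example-bench", score=0.5)
print(missing_disclosures(bare))
# ['training_cutoff', 'contamination_checks', 'known_limitations']
```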
What This Means If You’re Building, Not Competing
For most developers building on top of coding agents rather than competing on leaderboards, the contamination debate is mostly background noise. The practical question is whether a model solves problems in your codebase. If that codebase is private, it is uncontaminated by construction, since your code is not in anyone’s training data.
The most honest evaluation of a coding agent for your specific use case is a small benchmark drawn from real closed issues in your own repository. Record the model’s success rate over time, use consistent prompting and scaffolding, and track it as you upgrade models. This does not scale to a leaderboard, but it measures the thing that actually matters for your use case.
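The bookkeeping for such a private benchmark can be very small. The sketch below assumes you already have some way to decide whether an agent solved a given issue, for example by running the tests from the original fix. The model labels, issue ids, and function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One record per attempt: (model_version, issue_id, solved)
Run = Tuple[str, str, bool]


def success_rate_by_model(runs: Iterable[Run]) -> Dict[str, float]:
    """Compute the fraction of attempted issues each model version solved."""
    attempted: Dict[str, int] = defaultdict(int)
    solved: Dict[str, int] = defaultdict(int)
    for model_version, _issue_id, was_solved in runs:
        attempted[model_version] += 1
        if was_solved:
            solved[model_version] += 1
    return {model: solved[model] / attempted[model] for model in attempted}


runs = [
    ("model-a", "issue-101", True),
    ("model-a", "issue-102", False),
    ("model-b", "issue-101", True),
    ("model-b", "issue-102", True),
]
print(success_rate_by_model(runs))  # {'model-a': 0.5, 'model-b': 1.0}
```

Keeping the issue set, prompts, and scaffolding fixed between runs is what makes the rates comparable from one model upgrade to the next.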
OpenAI’s decision to stop reporting SWE-bench Verified results is worth taking seriously as a gesture of epistemic honesty. It is easier to keep citing a benchmark than to acknowledge it has become misleading. The fact that they did it publicly, with an explanation, sets a useful precedent. Whether SWE-bench Pro escapes the same fate depends on whether it can maintain the separation between its task distribution and the training data of future models, and on whether the competitive pressure to optimize for popular benchmarks can be kept from overwhelming better evaluation practices. The history of this problem suggests both are hard to sustain.