
When the Benchmark Becomes the Training Data

Source: openai

For much of 2024 and into 2025, SWE-bench Verified was the number that mattered in AI-assisted software engineering. Every major lab published scores against it. Press releases led with it. Leaderboard positions shifted meaningfully with each new model drop. It was, by most accounts, a genuinely good benchmark: real GitHub issues, real repositories, real test suites. Then OpenAI announced in February 2026 that they would no longer evaluate against it, citing training contamination and fundamentally flawed test cases. The recommendation: move to SWE-bench Pro.

This isn’t just a benchmark retirement story. It’s a clean case study in a failure mode that shows up every time the AI field converges on a single evaluation, and it’s worth understanding the mechanism rather than just the outcome.

What SWE-bench Verified Actually Tests

SWE-bench, introduced by researchers at Princeton in late 2023, takes a specific approach to evaluating code generation: rather than asking models to write functions from scratch or solve algorithmic puzzles, it presents real bug reports and feature requests drawn from popular open-source Python repositories, then checks whether the model’s proposed patch causes the associated test suite to pass. The repositories include Django, scikit-learn, sympy, Flask, and about eight others, all with substantial test coverage and complex interdependencies.
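The pass/fail check described above can be sketched in a few lines. The split into tests that must flip from failing to passing and tests that must keep passing mirrors the FAIL_TO_PASS and PASS_TO_PASS fields of the published SWE-bench dataset; the function names, status strings, and test identifiers here are illustrative assumptions, not the official harness.

```python
from typing import Dict, Iterable


def is_resolved(test_status: Dict[str, str],
                fail_to_pass: Iterable[str],
                pass_to_pass: Iterable[str]) -> bool:
    """Return True if a patch fixes the target tests without breaking others.

    test_status maps a test identifier to "PASSED" or "FAILED" as observed
    after the candidate patch was applied. A test missing from the map is
    treated as failed.
    """
    def passed(name: str) -> bool:
        return test_status.get(name) == "PASSED"

    fixes_issue = all(passed(t) for t in fail_to_pass)      # bug is fixed
    no_regressions = all(passed(t) for t in pass_to_pass)   # nothing else broke
    return fixes_issue and no_regressions


def resolve_rate(outcomes: Iterable[bool]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty run."""
    results = list(outcomes)
    return sum(results) / len(results) if results else 0.0


# Hypothetical task: one test reproduces the bug, one guards old behavior.
status = {"test_bug_repro": "PASSED", "test_existing": "PASSED"}
print(is_resolved(status, ["test_bug_repro"], ["test_existing"]))  # True
```

A headline score is then just `resolve_rate` over all tasks, which is why weak or flaky tests feed directly into the reported number.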

The original benchmark contained 2,294 tasks. SWE-bench Verified, a collaboration between OpenAI and the original SWE-bench authors released in 2024, filtered that down to 500 manually reviewed tasks. Human annotators went through each problem, confirmed that the issue description was unambiguous, verified that the reference solution actually fixed the described bug, and checked that the tests correctly distinguished a correct fix from a wrong one. The goal was to remove noise: tasks where the reference patch was wrong, where the tests were broken, or where the problem description was too vague to solve without additional context.

At launch, the verified subset was a clear improvement: its scores were more interpretable than scores on the raw benchmark. A model that resolved 40% of verified tasks was doing something meaningfully harder than one that resolved 20%.

The Two Problems

OpenAI’s analysis identifies two distinct failure modes, and it’s worth separating them because they have different causes and different remedies.

The first is training leakage. The repositories in SWE-bench are among the most widely forked and discussed codebases on GitHub. Django and scikit-learn have been in the training corpora of essentially every large language model trained on public code. More critically, many of the GitHub issues in the benchmark, along with their associated pull requests and commit messages, are also present in training data. A model doesn’t need to have been explicitly fine-tuned on the benchmark to have effectively memorized the solutions; it may have seen the PR that fixed the bug during pretraining, as part of a diff scraped from GitHub’s event stream.

This is a harder problem than it looks. The standard approach to preventing contamination is to set a training cutoff before the benchmark was created. SWE-bench was released in late 2023, so a model with a training cutoff of, say, mid-2023 shouldn’t have seen the benchmark itself. But it may well have seen the underlying issues and patches, which predate the benchmark by months or years. The benchmark is constructed from historical data, which means the contamination is baked into the source material, not just into knowledge of the benchmark’s existence.
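A small sketch makes the gap between the two checks concrete. It assumes each task records the date its reference fix became public; the task identifiers, dates, and the exact release date standing in for "late 2023" are hypothetical placeholders, not benchmark data.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Task:
    task_id: str
    fix_merged: date  # when the reference fix became public (assumed field)


# Placeholder for "late 2023"; the exact day is an assumption.
BENCHMARK_RELEASE = date(2023, 10, 1)


def naive_is_clean(training_cutoff: date) -> bool:
    """Benchmark-level check: did training stop before the benchmark existed?"""
    return training_cutoff < BENCHMARK_RELEASE


def exposed_tasks(tasks: List[Task], training_cutoff: date) -> List[Task]:
    """Task-level check: which fixes were already public at the cutoff?"""
    return [t for t in tasks if t.fix_merged <= training_cutoff]


tasks = [
    Task("hypothetical-repo-101", date(2019, 3, 14)),
    Task("hypothetical-repo-202", date(2022, 11, 2)),
    Task("hypothetical-repo-303", date(2023, 8, 20)),
]
cutoff = date(2023, 6, 30)  # the "mid-2023" model from the text

print(naive_is_clean(cutoff))                             # True
print([t.task_id for t in exposed_tasks(tasks, cutoff)])  # first two tasks
```

The benchmark-level check reports the model as clean, while the task-level check shows that two of the three fixes were already public when training stopped.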

The second problem is test quality. Even the verified subset contains tasks where the tests do not correctly constrain the solution space. A test suite written to catch one specific kind of regression may accept a patch that makes the test pass without actually fixing the underlying issue. Conversely, some tests rely on environment details, timing assumptions, or external state, which makes them unreliable indicators of correctness. When a frontier model scores 50% on SWE-bench Verified, it’s genuinely unclear how much of that is correct problem-solving and how much is test suite exploitation.

Why This Pattern Is Familiar

The NLP community went through this cycle with GLUE and then SuperGLUE. GLUE was introduced in 2018 as a multi-task benchmark for natural language understanding. Within two years, models were exceeding human performance on it. The response was SuperGLUE, a harder suite. Models saturated that within another year or two, and the community moved on to BIG-bench and then to ever-larger collections of diverse tasks.

The underlying dynamic is consistent: a benchmark drives model development, which drives training optimization toward the benchmark, which erodes its validity as a neutral measurement. The better a benchmark is at signaling what capable models should be able to do, the faster it gets targeted by training pipelines. Static benchmarks have a natural half-life that shortens as compute and optimization pressure increase.

SWE-bench Verified compressed this cycle unusually fast because the source material was already public. Leakage didn’t require deliberate cheating; it was built into the benchmark’s construction methodology.

What SWE-bench Pro Changes

SWE-bench Pro addresses both failure modes, though the contamination problem is harder to fully solve.

On the contamination front, SWE-bench Pro draws from more recent issues, specifically from after the training cutoffs of current frontier models. This reduces, though doesn’t eliminate, the risk that a model has already seen the solution in pretraining. It also pulls from a broader set of repositories, reducing the concentration on the handful of extremely high-traffic codebases where leakage is most likely.

On the test quality front, SWE-bench Pro invests more heavily in validation. The verification process examines not just whether the reference solution passes the tests, but whether the tests themselves are discriminative: does a clearly wrong patch fail them? This catches a class of underspecified tests that made it through the original SWE-bench Verified review.
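The idea of a discriminative test can be illustrated with a toy audit: run each test against the reference fix and against patches known to be wrong, and flag any test that accepts a wrong patch. Everything below (the de-duplication task, the wrong patches, the `audit` helper) is a hypothetical illustration of the principle, not SWE-bench Pro’s actual validation pipeline.

```python
from typing import Callable, Dict, List

Impl = Callable[[List[int]], List[int]]
Test = Callable[[Impl], bool]


def reference_dedupe(xs: List[int]) -> List[int]:
    """Reference fix: drop duplicates, keeping first-seen order."""
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def wrong_sorted_dedupe(xs: List[int]) -> List[int]:
    """Clearly wrong patch: removes duplicates but loses the original order."""
    return sorted(set(xs))


def wrong_identity(xs: List[int]) -> List[int]:
    """Clearly wrong patch: does nothing."""
    return list(xs)


def weak_test(impl: Impl) -> bool:
    # Input is already sorted, so losing order goes unnoticed.
    return impl([1, 1, 2, 3, 3]) == [1, 2, 3]


def strong_test(impl: Impl) -> bool:
    # Unsorted input pins down the ordering requirement.
    return impl([3, 1, 3, 2, 1]) == [3, 1, 2]


def audit(test: Test, reference: Impl, wrong: Dict[str, Impl]) -> Dict[str, object]:
    """Report whether the test passes the reference and which wrong patches it accepts."""
    return {
        "passes_reference": test(reference),
        "accepted_wrong": [name for name, impl in wrong.items() if test(impl)],
    }


wrong_patches = {"sorted_dedupe": wrong_sorted_dedupe, "identity": wrong_identity}
print(audit(weak_test, reference_dedupe, wrong_patches))
print(audit(strong_test, reference_dedupe, wrong_patches))
```

Both tests pass the reference solution, so a review that only checks the reference would keep both. Only the audit against wrong patches reveals that the weak test also accepts the order-destroying patch.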

The benchmark is also substantially harder in terms of task complexity, involving longer issue descriptions, more intricate dependency chains, and problems that require understanding cross-module interactions rather than isolated function-level fixes. Current frontier models score significantly lower on SWE-bench Pro than on SWE-bench Verified, which is the point: a benchmark that leaves meaningful headroom is more useful for tracking progress over time.

The Deeper Problem With Static Evaluation

Retiring SWE-bench Verified is the right call, but it doesn’t resolve the fundamental tension between static benchmarks and rapidly improving models. Any benchmark built from historical data will face contamination pressure. Any benchmark that becomes widely adopted will attract optimization pressure. The two forces compound: wide adoption spreads the tasks and their solutions further into public data, which makes targeted training easier, which accelerates saturation.

One response is continuous benchmark renewal: keep generating new tasks from recently filed issues, never publish the test set publicly, and run evaluations only through a submission interface that returns scores without exposing inputs. Time-windowed benchmarks such as LiveCodeBench apply the renewal half of this idea to code generation, and withholding test inputs is essentially what competitive programming judges have done for decades. The trade-off is that it becomes harder for the community to inspect failures and harder for researchers to understand what capability a score actually reflects.

Another response is to move toward evaluation frameworks that are harder to saturate by construction: multi-step tasks with underspecified requirements, interactive debugging sessions where the model must respond to feedback, or tasks that require integrating information across a large, unfamiliar codebase without any retrieval shortcuts. These better reflect what a software engineer actually does, but they’re also harder to score reliably and more expensive to run at scale.

SWE-bench Pro is a pragmatic middle ground. It buys time by moving the goalposts, and it improves measurement validity by tightening the test quality bar. Whether it remains useful for two years or four is largely a function of how quickly training pipelines adapt to its distribution, which is itself a function of how widely the task set gets published and discussed.

OpenAI’s decision to document their reasoning publicly, rather than quietly dropping the metric, is useful for the field. The analysis of specific failure modes in the flawed test cases and the mechanism of training leakage gives other labs and benchmark maintainers concrete things to check for. The recommendation to move to SWE-bench Pro is a reasonable consensus position for the current moment, even if SWE-bench Pro will eventually face the same pressures.

The pattern is predictable enough that it probably makes sense for the field to treat any single static benchmark as a time-limited instrument rather than a durable standard. Track multiple evaluations. Weight recently introduced ones more heavily. Treat saturation as expected, not as a failure of the benchmark designers. The alternative is treating each benchmark retirement as a scandal, which misunderstands what benchmarks are for.
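One way to make "weight recently introduced ones more heavily" concrete is an exponentially decaying weight on benchmark age. This is only a sketch of one possible scheme: the 12-month half-life, the benchmark names, and the scores are made-up assumptions, not measured values or a recommendation from the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Evaluation:
    name: str
    score: float        # fraction of tasks resolved, in [0, 1]
    age_months: float   # months since the benchmark was introduced


def recency_weight(age_months: float, half_life_months: float = 12.0) -> float:
    """Weight halves every `half_life_months` (an assumed tuning knob)."""
    return 0.5 ** (age_months / half_life_months)


def weighted_score(evals: List[Evaluation], half_life_months: float = 12.0) -> float:
    """Recency-weighted mean of scores across several benchmarks."""
    if not evals:
        raise ValueError("need at least one evaluation")
    weights = [recency_weight(e.age_months, half_life_months) for e in evals]
    return sum(w * e.score for w, e in zip(weights, evals)) / sum(weights)


# Hypothetical numbers: a saturating older benchmark and a fresh, harder one.
evals = [
    Evaluation("older-benchmark", score=0.80, age_months=12),
    Evaluation("newer-benchmark", score=0.20, age_months=0),
]
print(round(weighted_score(evals), 3))  # 0.4
```

With these inputs the older benchmark carries half the weight of the newer one, so the aggregate sits closer to the newer, less contaminated score than a plain average of 0.5 would.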
