There is a particular kind of result that should make anyone who cares about AI evaluation deeply uncomfortable, and the result described in this post is one of them.
The author topped an AI leaderboard without changing a single model weight. No fine-tuning. No new training data. Just a careful study of LLM internals, an understanding of which components drive which behaviors, and inference-time interventions to push the scores up.
The framing here is “LLM neuroanatomy”: the idea that transformer models, like brains, have functionally distinct regions. Specific attention heads, layers, and circuits are responsible for specific behaviors. This isn’t a new observation. Mechanistic interpretability researchers have been mapping these structures for a few years now. What makes this work interesting is applying that knowledge not just to understand a model, but to exploit it, specifically by improving benchmark scores without any of the actual improvements benchmarks are supposed to measure.
What “Without Changing a Single Weight” Actually Means
At inference time, you can do quite a lot. Activation steering, where you add learned or hand-crafted vectors directly to the model’s residual stream, can shift behavior in consistent, targetable ways. You can boost attention heads associated with careful reasoning, suppress ones linked to overconfident or sycophantic responses, or nudge representations toward the kinds of outputs that score well on a particular eval.
None of this requires touching the weights. The model is frozen. What changes is how its activations are shaped during the forward pass.
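To make the mechanism concrete, here is a minimal sketch of activation steering with a PyTorch forward hook. The single linear layer stands in for a transformer block; on a real model you would hook a specific decoder block so the vector is added to the residual stream. The steering vector and the strength `alpha` are illustrative placeholders, not values from the original work.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16

# Toy stand-in for a transformer block; on a real model this would be
# something like model.transformer.h[layer].
block = nn.Linear(d_model, d_model)

# Hypothetical steering vector, e.g. derived from contrastive prompt pairs.
steering_vector = torch.randn(d_model)
alpha = 4.0  # steering strength, a tunable hyperparameter

def steer(module, inputs, output):
    # Returning a value from a forward hook replaces the block's output,
    # shifting activations without touching any weights.
    return output + alpha * steering_vector

handle = block.register_forward_hook(steer)
x = torch.randn(1, d_model)
steered = block(x)
handle.remove()
unsteered = block(x)

# The weights are frozen; only the forward-pass activations differ.
print(torch.allclose(steered, unsteered + alpha * steering_vector))  # True
```

Removing the hook restores the original behavior exactly, which is what makes this kind of intervention so hard to detect from published weights alone.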
This is mechanistic interpretability applied backwards: instead of asking “what is this component doing?”, you ask “which component do I need to adjust to get this output?”. The answer, apparently, is enough to climb a leaderboard.
The Leaderboard Problem
Benchmarks are already under suspicion. Training data contamination, overfitting to eval formats, and selective reporting are known issues. But those problems are at least conceptually addressable. You can hold out test sets, rotate benchmarks, and audit training data.
Inference-time manipulation is harder to detect and harder to rule out. If a model ships with activation patches or steered representations baked into its inference pipeline, the published weights are, technically, not lying. But the scores are not measuring what they claim to measure.
This is worth sitting with. The model that “performs best” on a benchmark may be the one with the most carefully tuned inference-time interventions, not the one with the best underlying capabilities. And because those interventions can be proprietary, undisclosed, or simply novel, there is no reliable way to audit for them.
Why This Matters for Anyone Building on LLMs
If you are choosing a base model for an application, leaderboard scores are probably already low-signal. This research suggests they may be even lower-signal than commonly assumed. A model that tops a reasoning benchmark through activation steering on that specific eval format may generalize poorly to your actual use case.
The more useful takeaway from work like this is the underlying map: if specific components of the model are responsible for specific behaviors, then understanding those components gives you real leverage. Not for gaming leaderboards, but for building more reliable systems. You can test whether the components associated with careful, calibrated responses are active when your model is deployed on inputs that matter. You can ablate or steer toward behaviors you actually want.
The technique is dual-use in a familiar way. The same knowledge that lets you inflate a benchmark score also lets you build a more honest eval of what your model is actually doing.