
Evaluation as Infrastructure: Revisiting Community Evals and the Trust Problem in Benchmarks

Source: huggingface

For most of the past five years, who controlled evaluation infrastructure largely determined what the community understood about model capability. Hugging Face’s Open LLM Leaderboard ran from 2023 to March 2025, evaluated more than 13,000 models, and became the default signal for practitioners comparing open-weight releases. Its retirement prompted a clear question: what do you replace it with, and who owns the replacement?

The answer Hugging Face published in February 2026 doesn’t replace the leaderboard. It replaces the concept of the leaderboard as a distinct application with evaluation results as a property of model repositories. Benchmark scores live in .eval_results/ directories inside model repos as versioned YAML files. Benchmark definitions live in eval.yaml files inside dataset repos. The Hub aggregates them and renders leaderboards automatically. No centralized evaluation queue, no team that runs your model for you, no waiting.
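The announcement’s examples show result files, not benchmark definitions, so the following eval.yaml sketch is a hypothetical reconstruction: every field name below is an assumption for illustration, not the confirmed schema.

```yaml
# Hypothetical eval.yaml in a benchmark's dataset repo.
# All field names here are illustrative assumptions, not a published spec.
tasks:
  - id: default          # the task_id that result entries reference
    metric: accuracy
    num_fewshot: 5
    framework: lighteval # framework expected to run the task
```

Whatever the exact schema, the design point is that the definition lives next to the data it evaluates, versioned by the same git history.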

This is a real architectural shift, but understanding what it changes requires starting with why centralized evaluation kept breaking down.

The Reproducibility Gap

The most persistent evaluation problem wasn’t model authors gaming leaderboards, though that happened. It was that scores for the same model on the same benchmark, produced by different teams using different tooling, were not comparable, and nothing in the published record told you this.

The MMLU benchmark became the canonical example. A 65B LLaMA model achieves 63.6% when evaluated using the method from the original Hendrycks et al. paper, which compares the model’s probability assigned to the letter tokens A, B, C, and D. The same model achieves 48.8% when evaluated using EleutherAI’s Language Model Evaluation Harness v1, which computes log-likelihoods over the full option text. A fifteen-point difference on identical weights, from a single choice about which tokens to score over.
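The divergence is mechanical, not subtle. A toy sketch (with made-up log-probabilities, not real model outputs) shows how the two scoring rules, plus the length-normalization variant, can disagree on the same model state:

```python
# Made-up log-probabilities from a hypothetical model for one MMLU question.
# Letter-token scoring (original Hendrycks et al. method): compare the
# probability the model assigns to the bare tokens "A", "B", "C", "D".
letter_logprobs = {"A": -1.2, "B": -0.7, "C": -2.5, "D": -3.0}

# Full-text scoring (harness-style): sum log-likelihoods over each complete
# option string. Longer options accumulate more negative log-likelihood,
# which is why length normalization is a further methodological fork.
option_logprobs = {
    "A": [-0.5, -0.4, -0.3],
    "B": [-0.9, -0.8, -0.7, -0.9],  # longest option, lowest total
    "C": [-1.0, -0.6],
    "D": [-1.5, -1.1, -0.9],
}

letter_pick = max(letter_logprobs, key=letter_logprobs.get)
fulltext_pick = max(option_logprobs, key=lambda k: sum(option_logprobs[k]))
normalized_pick = max(
    option_logprobs,
    key=lambda k: sum(option_logprobs[k]) / len(option_logprobs[k]),
)

print(letter_pick, fulltext_pick, normalized_pick)  # → B A A
```

Three defensible scoring rules, two different answers on the same hypothetical logits. Scale that over thousands of questions and a fifteen-point gap stops being surprising.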

The sources of divergence compound quickly. Whether the prompt includes a topic-line header before the question matters. Whether choices are formatted as “A. option” or “Choices: A. option” matters. Whether length normalization is applied to the log-likelihoods matters. None of this was documented alongside scores on model cards or leaderboards because there was no standard schema for it. You couldn’t look at two MMLU numbers and know whether they were measuring the same thing.

Community Evals creates that schema. Every submitted result includes the evaluation framework, a link to the source (paper, model card, or evaluation trace), a dataset revision hash, and a free-text notes field. A minimal result entry looks like:

- dataset:
    id: TIGER-Lab/MMLU-Pro
    task_id: default
    revision: 2c1e4a7
  value: 72.1
  date: "2026-02-10"
  source:
    url: https://huggingface.co/papers/2406.01574
    name: MMLU-Pro paper
  notes: "5-shot, chain-of-thought, greedy decoding"

The configuration that produced a score is now adjacent to the score, with a git history behind it. This doesn’t prevent inconsistent methodology, but it makes the inconsistency visible and attributable.
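Because the schema is small, tooling can check it mechanically. A minimal validator sketch, where the required-field list is inferred from the example entry above rather than from a published specification:

```python
# Result entry mirroring the YAML example, expressed as a Python dict.
result = {
    "dataset": {"id": "TIGER-Lab/MMLU-Pro", "task_id": "default", "revision": "2c1e4a7"},
    "value": 72.1,
    "date": "2026-02-10",
    "source": {"url": "https://huggingface.co/papers/2406.01574", "name": "MMLU-Pro paper"},
    "notes": "5-shot, chain-of-thought, greedy decoding",
}

def missing_fields(entry: dict) -> list[str]:
    """Return the provenance fields a result entry lacks.

    The field list is inferred from the announcement's example entry;
    it is not an official schema.
    """
    required = ["dataset", "value", "date", "source"]
    missing = [f for f in required if f not in entry]
    if "dataset" in entry and "revision" not in entry["dataset"]:
        missing.append("dataset.revision")  # needed for contamination tracking
    return missing

print(missing_fields(result))            # fully specified entry → []
print(missing_fields({"value": 88.0}))   # a bare score with no provenance
```

A bare score fails the check; a score with its configuration attached passes. That is the whole shift in one function.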

What the Benchmark Registry Signals

The benchmarks in the initial Community Evals registry are a deliberate statement about where the saturation line has moved. MMLU scores have exceeded ninety percent for frontier models; the benchmark no longer differentiates them. MMLU-Pro extends the format with ten-choice questions that require chain-of-thought reasoning, which reduces even strong models’ accuracy by sixteen to thirty-three percentage points compared to their MMLU scores. GPQA Diamond is the expert-validated 198-question subset of GPQA’s 448 graduate-level science questions, on which skilled non-experts reach only about thirty-four percent accuracy even with unrestricted web access. Humanity’s Last Exam spans 2,500 multimodal questions across mathematics, the sciences, and the humanities at a difficulty level where current frontier models are still well below human expert performance.

The harder benchmarks buy time before saturation, but the underlying dynamic is the same as it was for the original suite. A benchmark that serves as a selection criterion will be optimized toward. Access controls on HLE’s dataset are specifically intended to slow this process by making contamination harder. Whether access controls slow contamination or just displace it to less visible channels is an open question, and LightEval’s benchmark documentation acknowledges this tension explicitly.

What Community Evals adds to the contamination problem is narrow but concrete: dataset revision hashes in result records mean you can track which version of a benchmark produced a score. When a benchmark is updated to address contamination, you can see whether a model’s score came from the clean version or the version that was likely present during training.
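With revision hashes in every entry, separating pre- and post-cleanup scores becomes a one-line filter. A sketch, assuming a list of entries shaped like the example above, with `9f3b2d1` standing in for a hypothetical post-decontamination revision:

```python
# Hypothetical result entries for one model on one benchmark across
# two dataset revisions. Hashes and values are illustrative.
results = [
    {"dataset": {"id": "some/benchmark", "revision": "2c1e4a7"}, "value": 72.1},
    {"dataset": {"id": "some/benchmark", "revision": "9f3b2d1"}, "value": 64.3},
]

CLEAN_REVISIONS = {"9f3b2d1"}  # revisions published after decontamination

clean = [r["value"] for r in results if r["dataset"]["revision"] in CLEAN_REVISIONS]
suspect = [r["value"] for r in results if r["dataset"]["revision"] not in CLEAN_REVISIONS]

print(clean, suspect)  # → [64.3] [72.1]
```

The gap between the two lists is itself informative: a score that drops sharply on the cleaned revision is circumstantial evidence that the old version leaked into training.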

The Trust Tier

The system’s most interesting architectural detail is its badge structure. A result with a valid verifyToken field was produced by running the evaluation in Hugging Face Jobs using Inspect AI, the evaluation framework maintained by the UK AI Security Institute. The token is a cryptographic proof of this; the Hub marks these results “verified.”

A result submitted as an open pull request gets a “community” badge, meaning it’s visible but the model author hasn’t endorsed it. Merged results without a verify token sit in between.

This three-tier structure is more honest than the binary present-or-absent model that most leaderboards use. It makes explicit that a score appearing on a leaderboard means different things depending on how it arrived there.
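The three tiers reduce to two observable facts about a result: whether it carries a valid verify token, and whether its pull request was merged. A classification sketch (the tier names follow the article’s description; token validation itself is the Hub’s job and is stubbed out here):

```python
def badge(has_valid_verify_token: bool, pr_merged: bool) -> str:
    """Map a result's provenance to its leaderboard badge.

    Tier names are descriptive labels from this article, not
    confirmed Hub identifiers.
    """
    if has_valid_verify_token:
        return "verified"   # produced via HF Jobs running Inspect AI
    if pr_merged:
        return "merged"     # author-endorsed, methodology self-reported
    return "community"      # visible, but not endorsed by the model author

print(badge(True, True), badge(False, True), badge(False, False))
```

Note the ordering: a valid token dominates, because it attests to how the score was produced, while a merge only attests to who accepted it.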

Two complications in the verified tier are worth noting. The security guarantee depends on HF Jobs being the exclusive path to generate a valid token, which is a centralized dependency inside an otherwise distributed system. And the HLE benchmark configuration uses model_graded_fact scoring with openai/o3-mini as the judge model, meaning a fully verified evaluation on the system’s most challenging benchmark requires an active OpenAI API subscription. For infrastructure built on open-source principles, this is a real constraint on who can participate in the verified tier.

What Distributing Control Changes

The argument that made centralized evaluation look necessary was quality control: if anyone can report scores, you get inconsistent methodology and undetectable manipulation. This argument was accurate about the problem and wrong about the solution. Centralized evaluation doesn’t eliminate inconsistent methodology; it makes the methodology invisible and attributes trust to the team running it rather than to the process itself.

The Open LLM Leaderboard had its own methodological failures. The DROP benchmark was dropped entirely because a normalization bug in answer matching systematically truncated numeric answers at the decimal point, and fixing it would have required rerunning years of evaluations; rather than rerun them, the team removed the benchmark. Under Community Evals’ architecture, that decision would be visible in git history and subject to community challenge; under the old model, it was a unilateral call.

SWE-bench Verified and Terminal-bench 2.0 joining the registry alongside GPQA and HLE reflects this shift in scope. Coding and shell-execution benchmarks have clear failure modes that are hard to game without producing models that are actually useful at those tasks. The community-registration mechanism, which requires submitting an eval.yaml and applying through the OpenEvals discussion board, opens the benchmark registry to practitioners who study specific capability domains rather than restricting it to what the Hugging Face team chooses to run.

Model authors can still close community-submitted score PRs, hiding results they don’t want attributed to their models. The mechanism is the same for rejecting a misconfigured evaluation and for suppressing an unflattering result. This tension isn’t addressed in the announcement. The git history records that a PR was closed, but not why, so the audit trail has a gap exactly where the most contested cases would appear.

What Hasn’t Moved

The gap between leaderboard scores and real-world utility remains what it was. A model that leads MMLU-Pro and GPQA will still produce incorrect code for non-trivial tasks and hallucinate citations under routine conditions. Community Evals is infrastructure for measuring performance on specific tasks under specific configurations, not for measuring whether a model is useful in production. That gap predates centralized leaderboards and will outlast this architecture.

What has shifted is the baseline conditions for evaluation trust. The configuration that produces a score can now be documented, linked, versioned, and challenged. The system makes evaluation a practice with traceable methodology rather than a number appearing from a black box. Whether the community uses the infrastructure this way depends on whether practitioners find it worth their time to document methodology rather than just report numbers. That’s a social question the YAML schema can’t resolve, but the schema at least stops making documentation impossible.
