Benchmark Scores Without Provenance Are Noise. Community Evals Builds the Audit Trail.
Source: huggingface
The saturation problem in LLM benchmarks is well-documented at this point. Original MMLU, a 57-subject test designed to measure broad academic knowledge, now sees frontier models scoring above 91%. GSM8K, once a meaningful signal for mathematical reasoning, sits above 94%. HumanEval has been effectively conquered. The usual response to saturation is to design harder benchmarks, which is why MMLU-Pro exists: 12,032 questions across 14 disciplines with 10 answer choices instead of 4, cutting the random-guessing floor from 25% to 10%. GPT-4o dropped from 88.7% on original MMLU to 72.55% on MMLU-Pro. That differential reveals real difficulty. But top models are now hitting 87 to 88% on MMLU-Pro too, so the cycle continues.
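The arithmetic behind those figures is easy to check. The sketch below computes the random-guessing floor for 4 and 10 answer choices and the GPT-4o drop quoted above. The chance-corrected rescaling in `above_chance` is an illustrative assumption, not a statistic that either benchmark reports.

```python
def chance_floor(num_choices: int) -> float:
    """Expected accuracy (in percent) from uniform random guessing."""
    return 100.0 / num_choices


def above_chance(score: float, num_choices: int) -> float:
    """Rescale a percent score so chance maps to 0 and a perfect score to 100.

    Illustrative assumption: neither benchmark reports this statistic.
    """
    floor = chance_floor(num_choices)
    return 100.0 * (score - floor) / (100.0 - floor)


mmlu_floor = chance_floor(4)           # 25.0
mmlu_pro_floor = chance_floor(10)      # 10.0
gpt4o_drop = round(88.7 - 72.55, 2)    # 16.15 percentage points
```

Comparing `above_chance(88.7, 4)` with `above_chance(72.55, 10)` is one way to see that the drop is not just an artifact of the lower guessing floor.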
Hugging Face’s Community Evals initiative, announced February 4, 2026, addresses a problem distinct from benchmark saturation: the opacity around how scores get reported and what methodology produced them. A few weeks after the announcement, the structural argument for it has become clearer.
The Inconsistency Problem
The saturation problem gets most of the attention, but the inconsistency problem is what makes benchmark scores difficult to use in practice. Consider what determines a model’s score on MMLU: whether you use 0-shot or 5-shot prompting, how you format the answer choices in the prompt, whether you use log-probability scoring or generation-based scoring, what template you use for the system prompt, and how you handle normalization across response formats. None of these have universal standards. A score from a model’s release paper, a score from the Open LLM Leaderboard, and a score from a third-party evaluation platform can all be methodologically defensible and still disagree by several points.
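One way to picture the problem is to treat each of those choices as a field in a record and ask where two evaluations differ. The sketch below is a minimal illustration. The `EvalSetup` class, its field names, and the example values are all hypothetical and do not correspond to any leaderboard's actual configuration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvalSetup:
    """The methodological choices listed above; field names are hypothetical."""
    num_shots: int           # 0-shot vs 5-shot prompting
    choice_format: str       # how answer choices are rendered in the prompt
    scoring: str             # "logprob" or "generation"
    system_template: str     # identifier for the system prompt template
    normalization: str       # how responses are normalized before scoring


def setup_differences(a: EvalSetup, b: EvalSetup) -> list[str]:
    """Return the names of the fields on which two setups disagree."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Made-up example values, for illustration only.
paper = EvalSetup(5, "A/B/C/D", "logprob", "none", "exact")
leaderboard = EvalSetup(0, "A/B/C/D", "generation", "none", "exact")
diff = setup_differences(paper, leaderboard)  # ["num_shots", "scoring"]
```

Two scores are only directly comparable when `setup_differences` returns an empty list, which is exactly the information that tends to be missing from a reported number.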
This creates a situation where comparing two models based on reported scores requires knowing not just the scores but the exact evaluation setup behind each one. That information is frequently not published alongside the number.
What Community Evals Builds
The Community Evals system is a decentralized evaluation reporting layer built into Hugging Face’s model hub. Evaluation results for any model live in a .eval_results/ directory within the model’s repository, stored as YAML files:
model-repo/
├── README.md
└── .eval_results/
    ├── mmlu-pro.yaml
    └── gpqa.yaml
Each result file records the score, the methodology, the source (a paper, a platform, or an Inspect AI evaluation log), and metadata about the submission. Results submitted by the model’s own authors appear directly on the model card. Results submitted by anyone else go through a pull request, visible to the model author and the community, and appear as community contributions regardless of whether the author approves them. The author can close or hide a result PR, but that action is logged in the Git history.
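Given that layout, finding which benchmarks a repository reports on is a directory listing. The sketch below assumes a local checkout of a model repository. The helper name `list_eval_results` is hypothetical, and the code only enumerates files; it does not parse them, since the result schema is not reproduced here.

```python
from pathlib import Path


def list_eval_results(repo_root: str) -> list[str]:
    """Return benchmark names for YAML files found under .eval_results/."""
    results_dir = Path(repo_root) / ".eval_results"
    if not results_dir.is_dir():
        return []  # repository has no reported results
    # File stem ("mmlu-pro.yaml" -> "mmlu-pro") is used as the benchmark name.
    return sorted(p.stem for p in results_dir.glob("*.yaml"))
```

For the example tree above, `list_eval_results("model-repo")` would return the stems of the two YAML files in sorted order.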
This is the accountability mechanism. Every result carries a timestamp and a submitter. If a result appears shortly after a model release with no linked methodology, that fact is visible. If a result is modified after a paper is published, that is visible too.
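The two patterns just described can be expressed as simple checks over timestamps. This is a minimal sketch under stated assumptions: `ResultRecord` is a hypothetical in-memory view rather than the actual file schema, the seven-day window is an arbitrary choice, and the example values are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ResultRecord:
    """Hypothetical view of one submitted result (not the real schema)."""
    benchmark: str
    score: float
    submitter: str
    submitted_at: datetime
    last_modified_at: datetime
    source_url: Optional[str] = None  # paper, platform, or evaluation log


def audit_flags(record: ResultRecord,
                model_released_at: datetime,
                paper_published_at: Optional[datetime] = None,
                window: timedelta = timedelta(days=7)) -> list[str]:
    """Return human-readable flags for the two patterns described above."""
    flags = []
    delay = record.submitted_at - model_released_at
    if timedelta(0) <= delay <= window and not record.source_url:
        flags.append("submitted shortly after release with no linked methodology")
    if paper_published_at and record.last_modified_at > paper_published_at:
        flags.append("modified after paper publication")
    return flags


# Made-up example: a result posted one day after release, with no source link.
release = datetime(2026, 2, 10)
unlinked = ResultRecord("mmlu-pro", 71.0, "someone",
                        datetime(2026, 2, 11), datetime(2026, 2, 11))
flags = audit_flags(unlinked, release)
```

A flag is a prompt for a human reviewer to look closer, not evidence of wrongdoing.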
Benchmarks register by adding an eval.yaml to their dataset repository on the Hub, following the Inspect AI format developed by the UK’s AI Safety Institute (since renamed the AI Security Institute). Inspect AI is an open-source evaluation framework built around reproducibility: it specifies tasks, solvers, scorers, and logging in enough detail that anyone with the same spec file can re-run the evaluation and verify the result. When a submitted score links to an Inspect evaluation log, the score is traceable to a concrete reproducible specification, not just a prose description of methodology.
The Benchmark Choices
Community Evals launched with three registered benchmarks: MMLU-Pro, GPQA, and HLE.
GPQA (Graduate-Level Google-Proof Q&A) consists of 448 multiple-choice questions in biology, physics, and chemistry, written by domain experts with the explicit goal of resisting web search. The “Google-proof” in the name signals the design goal: questions that require genuine expert reasoning rather than retrieval of facts that appear verbatim online.
HLE (Humanity’s Last Exam), from the Center for AI Safety and Scale AI, contains 2,500 questions across dozens of academic subjects, crowdsourced from PhD researchers working at the frontier of their fields. It was designed with the explicit goal of being the benchmark that frontier models cannot quickly saturate. HLE embeds a canary string in the dataset so that questions surfacing in training data can be detected, addressing contamination more directly than most benchmarks attempt. Frontier models were reportedly scoring in single-digit percentages when HLE launched, though those scores will rise.
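The canary idea reduces to a substring search: if the marker shows up in a training corpus, the benchmark text probably did too. The sketch below illustrates the check with a placeholder value. `EXAMPLE_CANARY` is not HLE's actual canary string, and the function names are hypothetical.

```python
def contains_canary(document: str, canary: str) -> bool:
    """True if the canary marker appears anywhere in the document."""
    return canary in document


def flag_contaminated(corpus: list[str], canary: str) -> list[int]:
    """Return indices of corpus documents that contain the canary."""
    return [i for i, doc in enumerate(corpus) if contains_canary(doc, canary)]


# Placeholder value for illustration only; not the real HLE canary.
EXAMPLE_CANARY = "EXAMPLE-CANARY-0000"

corpus = [
    "ordinary web page text",
    "benchmark dump ... EXAMPLE-CANARY-0000 ... question text",
]
hits = flag_contaminated(corpus, EXAMPLE_CANARY)  # [1]
```

The check only catches copies that kept the marker, so a clean scan is weak evidence rather than proof that the questions are absent.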
MMLU-Pro sits between the saturated original and the extreme difficulty of HLE. With 10-choice questions and reasoning-intensive problems, it is harder to game through random guessing and rewards chain-of-thought reasoning: GPT-4o’s score is roughly 19 percentage points lower without CoT prompting than with it. That sensitivity to reasoning approach makes MMLU-Pro a more informative benchmark than its predecessor.
Starting with harder benchmarks is the right call. Building a provenance-aware reporting system around benchmarks where scores are already bunched above 91% would produce transparent records of indistinguishable results.
What This Solves and What It Does Not
Community Evals does not solve contamination. If a model was trained on benchmark questions that appeared in its pretraining corpus, transparent reporting of scores on that benchmark still reflects a contaminated result. The provenance trail helps in one narrow way: you can compare a result’s submission timestamp against a model’s training cutoff as weak corroborating evidence. Contamination detection remains an active research problem that provenance metadata does not close.
It does not solve the benchmark arms race either. The cycle of saturation followed by harder benchmark release will continue regardless of how transparently results are reported. HLE is described as the final closed-ended academic benchmark of its kind, but that framing will likely look optimistic within a few years given the pace of frontier model improvement.
What changes is the accountability structure around evaluation reporting. The Open LLM Leaderboard is a well-run centralized system, but it is one team making all the methodological decisions, and when those decisions encode assumptions about prompt formats, scoring methods, or evaluation infrastructure, those assumptions are invisible in the published scores. Community Evals makes the methodology explicit for each submitted result by requiring a source link and encouraging Inspect AI evaluation logs. The score carries provenance rather than floating free of context.
The Governance Parallel
The structural argument here parallels what open source demonstrated about software development. Open source did not succeed by producing automatically better code. It succeeded partly because the visibility of the code changed how accountability worked: bugs and design decisions could be critiqued by people outside the team that wrote the code, and the review process was itself a public artifact.
LLM evaluation has been running on a proprietary model of development, where the methodology is described in prose but the process is not inspectable. Community Evals moves evaluation reporting toward something closer to public infrastructure: the Git history is the audit trail, the PR is the review mechanism, and the Inspect AI log is the reproducible specification.
Whether this works in practice depends on whether the community actually reviews submissions. PR-based evaluation workflows will attract researchers who care about reproducibility, but may not attract every lab interested in reporting a favorable number. Open source sustained peer scrutiny through the incentive of software maintenance; evaluation peer review will need to develop its own incentive structure to generate consistent oversight rather than sporadic attention.
The initiative is in beta, building in the open with community feedback solicited via OpenEvals discussions, which is the right posture for infrastructure at this stage. The benchmark coverage will expand as more dataset repositories register eval.yaml specs, and the system’s value grows with that coverage. For now, it is the most direct attempt to address the evaluation opacity problem with the right technical primitives: decentralized provenance, reproducible specifications, and a transparent submission history that the community can actually inspect.