
The Score Beneath the Score: What Hugging Face's Community Evals Actually Changes

Source: huggingface

Back in February, Hugging Face launched Community Evals, a system that lets anyone submit benchmark scores for any model on the Hub via a pull request. It's worth revisiting now that the system has had time to settle, because the headline description — “community-driven leaderboards” — undersells what the infrastructure is actually doing and oversells what it fixes.

The problem it’s responding to is real. MMLU scores are above 91%. GSM8K is effectively saturated. Models that top every leaderboard still make obvious errors on code that runs in production. And the same model often has three or four different scores floating around: in the paper, on the model card, on the Open LLM Leaderboard, and on whatever platform the releasing team chose to showcase. None of these scores were produced with the same prompt format, temperature, or few-shot count, and nothing in any of those places told you that.

Community Evals addresses the last problem directly, and largely ignores the first two. The post is honest about this. It explicitly says the system won’t solve benchmark saturation, won’t close the gap between leaderboard scores and real-world utility, and won’t stop training on test sets. What it does instead is make the evaluation configuration part of the permanent record.

How the Plumbing Works

Results are stored as YAML files inside a model’s repository under .eval_results/. A minimal entry looks like this:

- dataset:
    id: cais/hle
    task_id: default
    revision: a3c98f7
  value: 20.90
  verifyToken: "eyJ..."
  date: "2026-01-15"
  source:
    url: https://huggingface.co/spaces/cais/hle-leaderboard
    name: HLE Leaderboard
  notes: "no-tools"

The notes field is free text, which matters more than it might appear. When two reports for the same model on the same benchmark differ, the difference is almost always in the configuration: chain-of-thought enabled or not, tool use allowed or not, prompt template used. That context used to live nowhere. Now it lives next to the number.
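To see why the notes field earns its keep, here is a small sketch of how a consumer of these files might group reported scores. The entries mirror the YAML schema above after parsing; the grouping function and the sample numbers (other than the 20.90 from the example) are illustrative, not part of the Hub's tooling.

```python
from collections import defaultdict

def group_by_config(entries):
    """Bucket reported scores by (dataset id, notes) so that runs with
    different configurations are never averaged together."""
    buckets = defaultdict(list)
    for entry in entries:
        key = (entry["dataset"]["id"], entry.get("notes", ""))
        buckets[key].append(entry["value"])
    return dict(buckets)

# Three hypothetical reports for the same model on the same benchmark.
reports = [
    {"dataset": {"id": "cais/hle"}, "value": 20.90, "notes": "no-tools"},
    {"dataset": {"id": "cais/hle"}, "value": 24.10, "notes": "tools"},
    {"dataset": {"id": "cais/hle"}, "value": 20.70, "notes": "no-tools"},
]

grouped = group_by_config(reports)
# Two "no-tools" runs land in one bucket; the "tools" run lands in another.
```

Without the notes field, those three numbers would look like one noisy benchmark; with it, they resolve into two distinct experiments.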

On the benchmark side, a dataset repo registers itself with an eval.yaml file that describes the task structure, solvers, and scorers in terms of Inspect AI, the evaluation framework developed by the UK AI Security Institute. The Hub then auto-aggregates any model results that reference that dataset ID and renders a leaderboard in the dataset card. No central team needs to be involved after registration.
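The aggregation step the post describes is simple enough to sketch. This is not Hub code, just an illustration of the contract: given per-model result entries shaped like the YAML above, collect every score that references a registered dataset ID and rank models by it. All repo and model names here are made up.

```python
def build_leaderboard(dataset_id, repos):
    """repos: {model_name: [entry, ...]} where entries follow the
    .eval_results schema. Returns (model, score) pairs, best first."""
    rows = []
    for model, entries in repos.items():
        for entry in entries:
            if entry["dataset"]["id"] == dataset_id:
                rows.append((model, entry["value"]))
    return sorted(rows, key=lambda r: r[1], reverse=True)

# Hypothetical model repos, each carrying its own eval results.
repos = {
    "org/model-a": [{"dataset": {"id": "cais/hle"}, "value": 20.9}],
    "org/model-b": [{"dataset": {"id": "cais/hle"}, "value": 24.1},
                    {"dataset": {"id": "openai/gsm8k"}, "value": 95.0}],
}

board = build_leaderboard("cais/hle", repos)
# Only the entries referencing "cais/hle" appear, ranked by score.
```

The point of the design is that this join needs no central curation: results live in model repos, the task definition lives in the dataset repo, and the dataset ID is the foreign key between them.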

The canonical command for running a registered benchmark against a model looks like:

inspect eval hle.py --model hf-inference-providers/meta-llama/Llama-3.3-70B-Instruct:fastest

The :fastest suffix is a provider selection policy. You can also use :cheapest or :preferred, or specify a provider directly. This matters because of a finding that’s easy to overlook.

Provider Variance Is a Real Signal

Running the same model weights through nine different inference providers on an identical task produced scores ranging from 0.80 to 0.84. That is a four-point spread on identical weights, driven by hardware differences, batching strategies, and implementation details across providers. It's not noise; it's a real source of variance that was previously invisible.

Most published benchmark scores don’t tell you which inference provider was used. With Community Evals, the source URL and notes field create at least the possibility of tracking this down. Whether people will fill those fields in consistently is a different question, but the schema supports it.

This is probably the most underappreciated finding in the initial rollout. Benchmark comparisons assume that running a model is a deterministic operation, and it isn’t.
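A back-of-the-envelope check makes the scale of the problem concrete. The nine provider scores below are invented, chosen only to sit in the 0.80–0.84 band the rollout reported; the threshold is arbitrary, a sketch of how you might decide that the provider name belongs in the notes field.

```python
def score_spread(scores):
    """Max-min spread across providers for identical model weights."""
    return max(scores) - min(scores)

# Hypothetical per-provider scores for the same weights on the same task.
provider_scores = {
    "provider-a": 0.80, "provider-b": 0.81, "provider-c": 0.84,
    "provider-d": 0.82, "provider-e": 0.83, "provider-f": 0.80,
    "provider-g": 0.81, "provider-h": 0.82, "provider-i": 0.84,
}

spread = score_spread(list(provider_scores.values()))
# 0.04 on identical weights is larger than the gap separating many
# adjacent models on a leaderboard.
report_provider = spread >= 0.02  # illustrative threshold, not a standard
```

A spread that would reorder a leaderboard if it came from different models here comes from serving the same model two different ways.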

The Trust Hierarchy

Not all scores in the system carry the same weight, and the badge system makes that explicit. The verifyToken field is a cryptographic proof that the evaluation ran in HF Jobs infrastructure using Inspect AI. Results with a valid token get a “verified” badge. Results submitted via a pull request that hasn’t been merged to the model’s main branch get a “community” badge. Merged results without a verify token sit somewhere in between.

This three-tier structure — verified, merged, community — is more honest than the binary present/absent distinction of traditional leaderboards. A verified score was produced under known conditions on known infrastructure. A community score was produced by someone who cared enough to submit it, but you’re trusting the submitter.
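The tier assignment, as described, reduces to a couple of conditionals. This sketch stubs out token validation as a truthiness check, since the real validation happens on Hugging Face's side and its details aren't public; the function name and merged flag are assumptions for illustration.

```python
def badge(entry, merged):
    """Assign the badge tier described in the post.

    entry:  a dict following the .eval_results schema
    merged: whether the submitting PR was merged to the model's main branch
    """
    if entry.get("verifyToken"):   # stands in for real HF-side validation
        return "verified"
    return "merged" if merged else "community"

tier = badge({"verifyToken": "eyJ..."}, merged=True)   # "verified"
tier2 = badge({}, merged=True)                          # "merged"
tier3 = badge({}, merged=False)                         # "community"
```

The ordering matters: a valid token trumps merge status, because it attests to the run itself rather than to the model author's approval of the number.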

The trust hierarchy only works if the verify token system holds up. Its security depends on HF Jobs being the exclusive path to generate a valid token, and on that path remaining closed to manipulation. For now that seems like a reasonable assumption, but it’s worth noting that the security guarantee is centralized even when the submission process isn’t.

What Doesn’t Get Fixed

Model authors can close a score PR. This hides the community-submitted result from the model’s leaderboard display. There are legitimate reasons to do this — a score submitted with the wrong configuration, or for a model revision that has since been replaced. But the mechanism is the same for suppressing an unflattering result from a competitor or a harsh benchmark run you’d rather not publicize. The post doesn’t address this tension.

Data contamination gets acknowledged and set aside. The YAML schema includes a dataset revision hash, which is useful for reproducibility but does nothing to verify whether that dataset was in the training data. The system makes contamination traceable in the sense that you can see exactly which dataset version was used, but it can’t tell you whether that version was seen during training.

There’s also a dependency buried in the HLE benchmark configuration that’s worth naming: the scorer uses model_graded_fact with openai/o3-mini as the judge model. Humanity’s Last Exam, one of the flagship frontier benchmarks in the official registry, requires a call to OpenAI’s API to produce a verified score. For an otherwise open system where reproducibility is a stated goal, that’s a real constraint on who can run a fully verified evaluation.

Lighteval, Hugging Face’s own evaluation library, now supports Inspect AI as a backend and provides access to over a thousand pre-built tasks without writing any code. The combination of that library and the Community Evals infrastructure means the barrier to submitting a reproducible, framework-native evaluation is genuinely low. The barrier to submitting a verified evaluation for a benchmark with a proprietary judge dependency is not.

What Actually Changed

The benchmarks registered so far cover a wider scope than the initial four. GSM8K and MMLU-Pro are present for historical continuity. SWE-bench Verified and SWE-bench Pro represent real software engineering tasks. Terminal-bench 2.0 covers shell and agent behavior. The registry is growing because the registration mechanism is just a PR to a dataset repo plus an application to the OpenEvals discussion board.

The more significant change is epistemological. Previously, a leaderboard score was a number with a provenance story that you had to track down separately, if you could find it at all. Now the provenance — evaluation framework, dataset version, solver chain, scorer, notes about configuration — lives adjacent to the number in a standardized schema with a git history behind it. That’s the version-controlled audit trail the field has been missing.

Community Evals doesn’t end benchmark gaming. A model author who wants to cherry-pick favorable configurations can still do that, and the notes field is free text rather than a constrained vocabulary, so comparisons across submissions require judgment. But it changes the cost of gaming. The game is now played in public, with the moves recorded. That’s different from playing it in a black box and publishing only the final score.
