
Benchmark Scores Are a Function of Implementation: What Community Evals Is Actually Fixing


In early 2025, HuggingFace’s Open LLM Leaderboard v2 was running MATH-Hard evaluations with a SymPy-based answer comparison system. The comparator worked well for simple symbolic expressions, but it could not handle equations with multiple variables, intervals, matrices, or rounding differences. As a result, it marked correct answers wrong for certain model families: Qwen scores were being undercounted by more than 50%, and DeepSeek scores by nearly 67%. The fix, when it came, was three lines of code that swapped in the Math-Verify library. It caused a 4.66-point average score increase across all models and reshuffled rankings significantly.
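The failure mode is easy to reproduce in miniature. The sketch below is illustrative, not the leaderboard's actual code: a naive comparator that checks string equality marks "1/2" and "0.5" as different answers, while a comparator that parses both numerically (here via Python's `fractions` module) agrees they are the same. Math-Verify handles far more than this (intervals, matrices, multi-variable equations), but the shape of the bug is the same.

```python
from fractions import Fraction

def naive_compare(pred: str, gold: str) -> bool:
    # Naive comparator: exact string match after whitespace stripping.
    # Comparators like this mark mathematically equivalent answers wrong.
    return pred.strip() == gold.strip()

def robust_compare(pred: str, gold: str, tol: float = 1e-9) -> bool:
    # Try to interpret both answers numerically before falling back
    # to string comparison. Fraction() accepts "1/2" and "0.5" alike.
    def to_number(s: str):
        try:
            return Fraction(s.strip())
        except (ValueError, ZeroDivisionError):
            return None
    p, g = to_number(pred), to_number(gold)
    if p is not None and g is not None:
        return abs(float(p) - float(g)) <= tol
    return pred.strip() == gold.strip()

# "1/2" and "0.5" are the same answer, but only the robust comparator agrees.
print(naive_compare("1/2", "0.5"))   # False
print(robust_compare("1/2", "0.5"))  # True
```

A comparator that fails on one answer format punishes every model family whose output style happens to favor that format, which is exactly how the undercounting became family-specific.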

This was not a benchmark contamination problem. It was not a case of models training on test sets. It was a quietly wrong implementation, sitting inside a centralized evaluation system, distorting rankings for months. No community member could have caught it by inspection because the evaluation code was not part of the benchmark submission process. The canonical score for a model was whatever the central system produced.

That specific failure mode is what HuggingFace’s Community Evals initiative, published in early February 2026, is designed to prevent. Looking back at it now, the architecture is worth examining closely, because the approach is more technically interesting than its surface description suggests.

The Single-Authority Problem in Centralized Leaderboards

The MATH-Hard case is not unique. The DROP benchmark had to be removed from the leaderboard after two implementation bugs were found. A whitespace normalization issue caused numbers followed by a newline character to fail the floating-point cast, so correct answers were marked wrong. A separate problem with the stop token configuration, which used "." as a stop token, caused floating-point answers to be truncated at the decimal point: a model producing 12.25 would have its answer cut short and fail comparison. Models that produced longer, correct answers were penalized proportionally more than weaker models that produced shorter outputs.
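Both DROP bugs can be sketched in a few lines. These are simplified stand-ins for the harness's actual parsing and generation code, but they show the mechanics: a strict pattern match that never strips whitespace rejects a trailing newline, and a "." stop token removes the fractional part of any decimal answer before comparison.

```python
import re

NUMBER = re.compile(r"\d+(\.\d+)?")

def parse_number(raw: str):
    # Bug 1 (sketch): validating the raw string against a strict pattern
    # without stripping whitespace rejects "12\n", so a correct answer
    # never reaches the numeric comparison and is marked wrong.
    return float(raw) if NUMBER.fullmatch(raw) else None

def generate_with_stop(full_output: str, stop_tokens: list[str]) -> str:
    # Bug 2 (sketch): truncate generation at the first stop token, the
    # way a generation harness would. With "." as a stop token, the
    # decimal part of a numeric answer never survives.
    cut = len(full_output)
    for tok in stop_tokens:
        idx = full_output.find(tok)
        if idx != -1:
            cut = min(cut, idx)
    return full_output[:cut]

print(parse_number("12\n"))                            # None: correct answer rejected
print(generate_with_stop("12.25", stop_tokens=["."]))  # "12": truncated at the decimal
```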

The MMLU benchmark, which HuggingFace’s own analysis documented in detail, produces scores that differ by up to 30 percentage points depending on which implementation is used. Three production implementations (HELM, the EleutherAI Harness as of January 2023, and the original paper’s implementation) differ in prompt format (the presence or absence of a topic line, a “Question:” prefix, a “Choices:” label), in whitespace handling, and in scoring method (single-letter log-probability versus full-sequence log-probability versus text generation). The same model, the same benchmark, the same hardware, and you get numbers that look like different models entirely.
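The prompt-format divergence alone is enough to change scores, because the model conditions on a different token sequence under each convention. The sketch below uses simplified stand-ins for the three conventions, not the exact templates of HELM, the Harness, or the original paper, and the "astronomy" topic line is a placeholder.

```python
def build_prompt(question: str, choices: list[str], style: str) -> str:
    # Simplified stand-ins for three prompt conventions; illustrative
    # only, not the exact templates used by any production harness.
    letters = "ABCD"
    opts = "\n".join(f"{l}. {c}" for l, c in zip(letters, choices))
    if style == "bare":          # question + choices only
        return f"{question}\n{opts}\nAnswer:"
    if style == "prefixed":      # "Question:" prefix and "Choices:" label
        return f"Question: {question}\nChoices:\n{opts}\nAnswer:"
    if style == "topic":         # leading topic line (subject is a placeholder)
        return (f"The following is a multiple choice question about astronomy.\n"
                f"{question}\n{opts}\nAnswer:")
    raise ValueError(style)

q = "Which planet is largest?"
c = ["Mars", "Jupiter", "Venus", "Mercury"]
prompts = {s: build_prompt(q, c, s) for s in ("bare", "prefixed", "topic")}
# All three strings differ, so the model conditions on different token
# sequences and can assign different log-probabilities to each choice.
print(len(set(prompts.values())))  # 3
```

Add the scoring-method split on top (single-letter versus full-sequence log-probability versus free generation) and a 30-point spread stops being surprising.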

The centralized leaderboard model concentrates implementation authority in one place. When that implementation has a bug, every number the leaderboard has ever published for that benchmark is wrong. There is no independent verification mechanism, no way for external parties to submit an alternative measurement, and no audit trail showing what changed and when. The system operates correctly until it doesn’t, and when it doesn’t, there is no way to know how long it has been wrong.

What Community Evals Changes Architecturally

Community Evals is built on three layers that connect through the HuggingFace Hub’s Git infrastructure.

Benchmarks are dataset repositories that register themselves by adding an eval.yaml file in the Inspect AI format, the evaluation specification system from the UK AI Safety Institute. This YAML file defines the evaluation protocol canonically: prompts, scoring methods, and evaluation configuration. When a benchmark registers via this format, any result submitted against it can be independently reproduced from the specification. The benchmark repository then aggregates submitted results and displays a leaderboard in its dataset card.

Models store their evaluation results in an .eval_results/ directory within their model repository, as YAML files. These files include source attribution, linking the result to a paper, model card, third-party platform, or reproducible Inspect eval logs. Both the model author and any community member can contribute results. Community-submitted results appear immediately with a “community” tag; they do not require merge approval to display. The full Git history provides an audit trail: who submitted what, when, and whether anything was changed.
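To make the shape of these files concrete, a result file under .eval_results/ might look something like the following. This is a hypothetical sketch: the field names are illustrative assumptions, not the actual schema, and the URL is a placeholder.

```yaml
# Hypothetical sketch of a result file under .eval_results/ in a model
# repo. Field names are illustrative assumptions, not the real schema.
benchmark: TIGER-Lab/MMLU-Pro        # dataset repo the result targets
metric: accuracy
value: 0.713
source:
  type: inspect_eval_logs            # e.g. paper, model card, or logs
  url: https://example.com/eval-logs # placeholder link to execution logs
submitted_by: community-member       # community results display immediately
```

Because each such file lands as a commit, the Git history itself is the provenance record: a disputed number can be traced to a specific submission, submitter, and source link.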

The connection between these two layers is the Inspect AI format. HuggingFace’s own evaluation library, lighteval, now supports Inspect AI as a backend, making the two interoperable. Any benchmark registered in eval.yaml format can be run through lighteval, and results from that run can be traced back to the canonical specification. Reproduced results earn a verified badge.

Currently live benchmarks include MMLU-Pro from TIGER-Lab, GPQA from Idavidrein, and HLE from the Center for AI Safety. The system is in beta, but the Hub already exposes all scores via API, enabling custom dashboards without requiring permission from HuggingFace.
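A dashboard built on that API only needs the repo file listing and the convention that results live under .eval_results/. The helper below is a sketch: the filenames are made-up placeholders, and in practice the listing would come from the Hub API via huggingface_hub's HfApi().list_repo_files().

```python
def eval_result_files(repo_files: list[str]) -> list[str]:
    # Filter a model repo's file listing down to the YAML result files
    # under .eval_results/. The listing can come from the Hub API,
    # e.g. huggingface_hub.HfApi().list_repo_files(repo_id), so a
    # custom dashboard needs no special permission from HuggingFace.
    return [
        f for f in repo_files
        if f.startswith(".eval_results/") and f.endswith((".yaml", ".yml"))
    ]

# Placeholder listing standing in for a real repo's files:
listing = ["README.md", "config.json", ".eval_results/mmlu_pro.yaml"]
print(eval_result_files(listing))  # ['.eval_results/mmlu_pro.yaml']
```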

Why the Specification Format Is the Key Decision

The choice to build around Inspect AI rather than a proprietary format is the most consequential architectural decision in this system. A benchmark registered in Inspect AI format is, by definition, independently runnable by anyone with access to the framework. The evaluation is not a black box; it is a declarative specification that produces deterministic results when executed correctly. If a community member disagrees with a submitted score, they can run the eval.yaml themselves and submit a PR with their result, linking to their execution logs.

This is what makes the verified badge meaningful. It is not a claim that a central authority checked the numbers; it is a claim that the result was reproduced by running the canonical specification independently. The bug in the MATH-Hard evaluator would have been detectable much earlier under this model, because any external party running the same spec against the same benchmark would see different numbers from the central system and have a mechanism to report it.

The Inspect AI framework itself comes from the UK AI Safety Institute, which gives it a degree of institutional neutrality that a HuggingFace-proprietary format would not have. Using a third-party standard also means that benchmarks maintained by academic labs, companies, or independent researchers can register without adapting to a format designed around HuggingFace’s internal tooling.

What This Does Not Solve

Community Evals does not address training data contamination. If a model was trained on MMLU-Pro test examples, its high score under this system is just as inflated as it would be under any other. The Community Evals article is explicit about this: the system “won’t stop training on test sets.” The transparency it offers is about implementation correctness and provenance, not whether the benchmark itself is a valid out-of-sample measurement.

Benchmark saturation is also unaddressed by infrastructure changes. Top models now score above 91% on MMLU and above 94% on GSM8K, and HumanEval is effectively maxed out. The current shortlist of MMLU-Pro, GPQA, and HLE represents harder targets, but the saturation cycle will eventually run through these too. That is a problem of benchmark selection, not evaluation infrastructure.

The system also gives model authors veto power over community-submitted results; they can close PRs and hide scores they dispute. This is probably the right default for avoiding abuse, but it creates a mechanism for suppressing inconvenient community measurements. The audit trail at least makes such suppression visible in the Git history, so the action itself becomes part of the public record.

The Comparison to Centralized Systems

LMSYS Chatbot Arena takes a different approach: human preference voting, pairwise comparisons, and Elo ratings. It captures something that static benchmarks cannot, namely which model users actually prefer in open-ended conversation, but it is slow, expensive, and subject to vote manipulation from motivated communities. Community Evals is complementary, targeting the reproducible, automated layer of evaluation that Arena does not cover.

HELM from Stanford is perhaps the closest prior art in philosophy: a centralized system that evaluates models across multiple metrics, including accuracy, calibration, robustness, and fairness. HELM standardizes evaluation scenarios, but it remains a single organization deciding what gets evaluated and how. Community Evals distributes that authority. Any researcher can register a benchmark and any user can contribute results.

The broader field of BigCodeBench, GAIA, and other specialized leaderboards has been moving toward harder, more realistic tasks precisely because simple benchmarks saturate and get gamed. Community Evals is not a replacement for that work; it is infrastructure that could host any of those evaluations and make their results independently verifiable.

The Audit Trail as a First-Class Feature

The infrastructure HuggingFace has built here is essentially applying version control and peer review to model evaluation, the same mechanisms that make open source software trustworthy despite being produced by distributed, uncoordinated contributors. A leaderboard backed by Git history, canonical specifications, and community-submitted reproductions is auditable in a way that a centralized system with opaque implementations cannot be.

The MATH-Hard bug and the DROP removal both took a long time to surface precisely because there was no external verification path. Someone using the leaderboard numbers to make decisions, whether comparing models for deployment, deciding which architecture to pursue in a research project, or assessing a vendor’s claims, had no way to know whether the numbers they were reading reflected actual model capability or a SymPy edge case. Community Evals does not eliminate that uncertainty, but it creates a structure in which the uncertainty can be surfaced, challenged, and resolved in public.

Whether the community adopts it at sufficient scale to realize that potential depends on whether benchmark authors register their eval.yaml files and whether practitioners actually submit results rather than just reading them. The design is sound; adoption is the remaining variable.
