The leaderboard says LLaMA-65B scores 63.7% on MMLU. The EleutherAI evaluation harness says 48.8%. Both numbers come from running the same model on the same dataset. Neither party made an arithmetic error. The gap comes entirely from implementation details: whether you score by comparing log-probabilities of letter tokens (A, B, C, D) or full answer strings, whether you include a “Question:” prefix in the prompt, whether you normalize by token count. These choices, none of which appear in a typical benchmark citation, shift the reported score by 15 points and change how models rank against each other.
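The divergence between these scoring conventions is easy to demonstrate. The sketch below uses fabricated numbers (every log-probability, answer string, and helper function is invented for illustration) to show how letter-token scoring, full-string scoring, and per-token length normalization can pick different answers for the same item:

```python
# Fabricated numbers for one multiple-choice item: every log-probability and
# answer string below is invented purely to illustrate the scoring conventions.
letter_logprobs = {"A": -2.1, "B": -1.3, "C": -2.8, "D": -3.0}

# (answer text, total logprob of the full answer string, token count)
full_logprobs = {
    "A": ("the mitochondria", -7.5, 3),
    "B": ("ribosomes", -4.1, 2),
    "C": ("the endoplasmic reticulum", -9.0, 6),
    "D": ("lysosomes", -4.4, 2),
}

def pick_by_letter(lp: dict) -> str:
    # Convention 1: compare log-probs of the letter tokens A/B/C/D.
    return max(lp, key=lp.get)

def pick_by_full(lp: dict, normalize: bool = False) -> str:
    # Conventions 2 and 3: compare full answer strings, raw or per-token.
    def score(k):
        _, total, n_tokens = lp[k]
        return total / n_tokens if normalize else total
    return max(lp, key=score)

print(pick_by_letter(letter_logprobs))              # B
print(pick_by_full(full_logprobs))                  # B
print(pick_by_full(full_logprobs, normalize=True))  # C
```

Here the letter and raw-sum conventions agree while length normalization flips the pick — exactly the kind of divergence that never appears in a one-line benchmark citation.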
This is the concrete problem that HuggingFace’s Community Evals, published on February 4, 2026, responds to, as part of a broader rethinking of how evaluation infrastructure should work. Looking back at it now, the announcement reads as an honest diagnosis of a failure mode the ML community has been accumulating for years.
The Implementation Problem Is Not Theoretical
The MMLU variance was documented in detail by HuggingFace’s evaluation team in a 2023 investigation. They ran the same models through three different implementations: HELM (Stanford CRFM), the EleutherAI harness, and the original UC Berkeley implementation. Falcon-40B scored 57.1% in HELM, 52.7% in the harness, and 55.8% in the original. LLaMA-30B scored 58.3%, 45.7%, and 58.4% respectively. The harness, which uses a different prompt format and a different scoring approach, produced scores 12 to 13 points lower than the other two implementations on some models while producing comparable scores on others.
The conclusion from that investigation was unambiguous: “Evaluations are strongly tied to their implementations — down to minute details such as prompts and tokenization. The mere indication of ‘MMLU results’ gives you little to no information about how you can compare these numbers to others.”
Benchmark scores, in this framing, are implementation artifacts with a measurement attached. When a model card reports an MMLU score and a leaderboard reports a different one for the same model, both may be correct relative to their respective implementations. There is no canonical answer, and there is currently no infrastructure that makes this visible to anyone reading a model comparison.
The DROP benchmark case makes the same point from a different angle. When HuggingFace ran DROP on the Open LLM Leaderboard, the score distribution was bimodal: a small cluster of models with expected scores, and roughly 90% of models stuck below an F1 of 10. The anomaly was flagged and investigated by the community. The root cause was that a period (".") had been configured as the generation stop token, which caused decimal answers like 12.25 to be truncated to 12 before generation could finish. Any evaluation instance requiring a floating-point answer failed automatically. Models that closely followed few-shot formatting were penalized because their formatted output triggered the stop token mid-continuation. DROP was removed from the leaderboard pending a corrected implementation. Without community scrutiny of the score distribution, that systematic failure would have been published as legitimate performance data.
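The failure mechanism is mechanical enough to reproduce in a few lines. This is a hedged sketch, not leaderboard code: truncate_at_stop is a made-up helper mimicking what any decoding loop does when it hits a configured stop sequence.

```python
# Made-up helper mimicking a decoding loop's stop-sequence handling;
# not the actual leaderboard or harness implementation.
def truncate_at_stop(generated: str, stop: str) -> str:
    idx = generated.find(stop)
    return generated if idx == -1 else generated[:idx]

# With "." as the stop token, any decimal answer is cut mid-number:
print(truncate_at_stop("12.25", "."))    # 12
print(truncate_at_stop("12.25", "\n"))   # 12.25  (a newline stop is harmless here)
```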
What Community Evals Actually Builds
Community Evals is an accountability layer, not a new benchmark. It does not run evaluations; it provides infrastructure for publishing, attributing, and discussing evaluation results in a way that makes methodology a first-class artifact.
The design is Git-native and deliberately decentralized. Benchmark scores live in .eval_results/*.yaml files in model repositories on the HuggingFace Hub. Benchmarks register themselves by adding an eval.yaml file to a dataset repository. Any HuggingFace user can submit evaluation results for any model by opening a Pull Request. Results submitted via PR appear immediately on the model page, tagged as “community” submissions, without waiting for the PR to merge. Model authors can close PRs they dispute, but they cannot silently erase submitted results.
The score storage format captures the metadata that is almost always missing from reported numbers:
```yaml
- dataset:
    id: cais/hle
    task_id: default
    revision: <dataset-git-hash>
  value: 20.90
  verifyToken: <cryptographic-proof>
  date: "2025-01-15"
  source:
    url: https://huggingface.co/spaces/SaylorTwift/smollm3-mmlu-pro
    name: Eval traces
    user: SaylorTwift
  notes: "no-tools"
```
The revision field pins the dataset version. The source field links to evaluation logs. The verifyToken is a cryptographic proof generated when an evaluation runs through HuggingFace Jobs using Inspect AI, distinguishing those results from self-reported scores.
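As a sketch of how a consumer of these files might use those fields (describe_result is an invented helper and the validation logic is my own illustration, not HuggingFace tooling; the dict stands in for one parsed .eval_results entry):

```python
# My own illustration, not HuggingFace tooling. `entry` is a hand-written
# dict standing in for one parsed .eval_results YAML entry;
# `describe_result` is an invented helper.
def describe_result(entry: dict) -> str:
    ds = entry["dataset"]
    # The pinned revision is what makes two scores on the "same" dataset comparable.
    line = f"{ds['id']}@{ds['revision'][:8]}: {entry['value']}"
    if entry.get("verifyToken"):
        return line + " (verified run)"
    return line + " (self-reported / community)"

entry = {
    "dataset": {"id": "cais/hle", "task_id": "default",
                "revision": "abc123def4567890"},
    "value": 20.90,
    "verifyToken": None,
    "date": "2025-01-15",
}
print(describe_result(entry))  # cais/hle@abc123de: 20.9 (self-reported / community)
```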
On the benchmark side, the eval.yaml format captures the full evaluation methodology, including the solver pipeline and the scoring approach:
```yaml
name: Humanity's Last Exam
evaluation_framework: "inspect-ai"
tasks:
  - id: hle
    field_spec:
      input: question
      input_image: image
      target: answer
    solvers:
      - name: system_message
        args:
          template: |
            Your response should be in the following format:
            Explanation: {your explanation}
            Answer: {your chosen answer}
            Confidence: {0-100%}
      - name: generate
    scorers:
      - name: model_graded_fact
        args:
          model: openai/o3-mini
```
The solver pipeline specifies the full inference chain. The scorer specifies how correctness is determined and which judge model is used. When you read a Community Eval score, you know the methodology that produced it, not just the number.
The Trust Hierarchy
Community Evals does not treat all scores as equivalent. There are four badge types, arranged by verifiability. “Verified” results have a valid verifyToken, meaning the evaluation ran in HuggingFace Jobs using Inspect AI with cryptographic proof of provenance. “Community” results were submitted via open PR and have not been merged to the main branch. “Leaderboard” results link to a registered benchmark’s aggregated view. “Source” results link to external evaluation logs or citations.
This is a meaningful distinction. A verified score is tamper-evident and auditable. A community PR score is an assertion from a named user, visible to the model owner, disputable via PR comments, and traceable through git history. A self-reported number from a model card is neither.
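One plausible reading of the hierarchy is simple precedence logic. This is my own sketch of the four tiers described above, not the Hub's implementation; the field names and the precedence order are assumptions.

```python
# Illustrative only: my own sketch of the four badge tiers, not the Hub's
# implementation. Field names are invented; precedence order is assumed.
def badge(result: dict) -> str:
    if result.get("verifyToken"):        # cryptographic proof from a HF Jobs run
        return "verified"
    if result.get("leaderboard_url"):    # registered benchmark's aggregated view
        return "leaderboard"
    if result.get("submitted_via_pr"):   # open community PR, not yet merged
        return "community"
    return "source"                      # external logs or citation only

print(badge({"verifyToken": "proof"}))    # verified
print(badge({"submitted_via_pr": True}))  # community
```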
Inspect AI, the canonical evaluation framework for the verified tier, was built by the UK AI Security Institute and separates concerns that most ad-hoc evaluation code conflates: tasks define what is evaluated, solvers define the inference pipeline (including self-critique loops, tool use, and multi-turn prompting), and scorers define correctness criteria. HuggingFace’s lighteval library now supports Inspect AI as a backend, giving the ecosystem a shared infrastructure layer.
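The separation Inspect AI enforces can be illustrated with a stripped-down pipeline in plain Python. This is not Inspect AI's actual API — just the shape of the task/solver/scorer split it describes, with invented helper names throughout.

```python
# Not Inspect AI's actual API: a minimal plain-Python illustration of the
# task / solver / scorer separation. All names here are invented.
from typing import Callable

Solver = Callable[[dict], dict]        # transforms the evaluation state
Scorer = Callable[[dict, str], bool]   # judges the final state against a target

def system_message(template: str) -> Solver:
    def solve(state: dict) -> dict:
        state["messages"] = [("system", template)] + state.get("messages", [])
        return state
    return solve

def generate(model: Callable[[list], str]) -> Solver:
    def solve(state: dict) -> dict:
        state["output"] = model(state["messages"])
        return state
    return solve

def exact_match(state: dict, target: str) -> bool:
    return state["output"].strip() == target

def run_task(sample: dict, solvers: list, scorer: Scorer) -> bool:
    state = {"messages": [("user", sample["input"])]}
    for solver in solvers:   # the solver chain is the full inference pipeline
        state = solver(state)
    return scorer(state, sample["target"])

# A toy "model" that always answers "4", regardless of the prompt:
toy_model = lambda messages: "4"
sample = {"input": "What is 2 + 2?", "target": "4"}
print(run_task(sample, [system_message("Answer briefly."), generate(toy_model)],
               exact_match))  # True
```

Because the scorer is a separate, named component, swapping exact-match for a model-graded judge changes one argument rather than the whole evaluation script — which is the property that makes methodology declarable in eval.yaml.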
The Inspect AI integration also surfaced a source of variance that had been largely invisible: inference provider differences. The same model running on different inference providers produces different accuracy scores on identical tasks, ranging measurably across providers like Groq, Fireworks, Cerebras, and SambaNova. Different providers use different hardware, quantization configurations, and batching strategies. Community Evals captures this through evaluation metadata, making provider-specific variance at least attributable.
What This Does Not Solve
Benchmark contamination is arguably the larger problem, and Community Evals does not address it. When training data overlaps with test sets, benchmark scores measure memorization rather than generalization. The system records when evaluations were run, which provides some temporal context, but it does not verify training data composition. Solving that requires third-party auditing of training corpora and held-out test sets with restricted access, neither of which is within the scope of this project.
The LLM-as-judge bias problem also remains. GPT-4, when used as a judge, systematically favors verbose outputs over concise correct ones and favors outputs stylistically similar to ChatGPT’s. HuggingFace documented this directly: in pairwise evaluation with GPT-4 as the judge, human-written responses received an Elo rating of roughly 940 against Vicuna-13B’s 1148. In some cases, correct concise answers lost to incorrect verbose responses. Community Evals requires scorers to be specified explicitly, including which judge model is used, so the bias is at least visible. But visibility is not a fix.
Benchmark saturation is also unaddressed by this system. MMLU top scores are above 91%, GSM8K above 94%, HumanEval similarly compressed toward the ceiling. When everything clusters there, ranking differences become noise. Community Evals provides infrastructure for submitting scores on newer, harder benchmarks like Humanity’s Last Exam, but it does not itself generate those benchmarks.
Why Infrastructure Matters Here
The current state of AI evaluation has a specific failure mode: the same benchmark name does not guarantee methodological comparability, scores from different sources are published without attribution, and disputes about methodology happen in informal channels with no persistent record. This creates a situation where published benchmark comparisons cannot be trusted at face value, but there is also no established mechanism for surfacing and resolving discrepancies.
Community Evals addresses this by making evaluation methodology a first-class artifact, stored alongside the score and version-controlled in git. The PR-based submission workflow means disputes happen in the open with a permanent record. The cryptographic verification tier creates a meaningful distinction between self-assertion and auditable measurement. The eval.yaml format captures the information that makes scores comparable or reveals that they are not.
The DROP benchmark bug is the clearest illustration of why this matters. The bimodal score distribution was visible in aggregate data, got flagged by someone paying attention, got investigated, and led to a corrected implementation. That sequence requires a community that can see the data, a mechanism to discuss anomalies, and infrastructure that tracks changes. Community Evals is building that infrastructure, and the history of the HuggingFace evaluation ecosystem provides a concrete record of why opacity in this domain causes real harm to how the community understands model capabilities.