
The Unsigned Binary Problem in AI Benchmarks

Source: huggingface

A few years ago, if you wanted to install software without a package manager, you’d download a binary and either trust the source or run a manual checksum. The checksum was optional and rarely provided. You had no way to verify the artifact matched its claimed provenance, whether the same binary appeared on the official site and the mirror you downloaded from, or whether it had been modified between publication and download. You had the artifact but not the provenance behind it.
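The manual checksum process is simple to sketch. A minimal version, assuming the author published a SHA-256 digest alongside the download (file paths here are hypothetical):

```python
# Manual checksum verification: compare a downloaded binary's SHA-256
# digest against the value the author published. This only proves the
# file matches the published digest -- it says nothing about whether the
# digest itself came from a trustworthy source.
import hashlib


def sha256_of(path: str) -> str:
    """Hash a file in chunks so large binaries don't load into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(path: str, published_digest: str) -> bool:
    return sha256_of(path) == published_digest.lower()
```

Note the limitation stated in the comment: a matching digest rules out corruption in transit, not a compromised source. That gap is exactly what signing infrastructure was built to close.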

Benchmark scores have the same structure. A model card shows 71.4% on MMLU-Pro. You have no way to verify the conditions under which that was produced, whether the same conditions were used for the competing model you’re comparing it against, or whether anyone other than the model authors ran it at all. You have the number but not the methodology.

Hugging Face’s Community Evals, launched in February 2026, is worth examining through this lens because its architecture mirrors how the software ecosystem addressed the unsigned binary problem: by building a trust chain from artifact back to methodology, making the chain auditable, and distributing the authority to verify.

The Unsigned Binary Problem in Benchmarks

The MMLU benchmark provides the clearest documentation of why methodology matters. The same 65B LLaMA model achieves 63.6% via the original paper’s scoring method and 48.8% via EleutherAI’s LM Evaluation Harness, a 15-point gap from a single methodological difference: whether the model is scored over single letter tokens (A, B, C, D) or over full option text. Hugging Face’s analysis at the time enumerated the sources of divergence: prompt format, presence or absence of topic-line headers, whitespace handling, few-shot example count, normalization method.
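The letter-token versus full-text divergence can be made concrete with a toy example. The log-probabilities below are invented for illustration; in a real harness they would come from the model:

```python
# Toy illustration of the two MMLU scoring conventions. Under one
# convention the model is scored over the single letter tokens; under
# the other, over each full option text. Values are invented.

# Convention 1: log-probability of each letter token.
letter_logprobs = {"A": -2.1, "B": -0.9, "C": -1.7, "D": -3.0}

# Convention 2: log-probability of each full option string. Longer
# options accumulate more negative log-probability unless the harness
# length-normalizes, which is itself a methodological choice.
option_logprobs = {"A": -14.2, "B": -19.8, "C": -11.5, "D": -22.0}

pred_by_letter = max(letter_logprobs, key=letter_logprobs.get)  # "B"
pred_by_text = max(option_logprobs, key=option_logprobs.get)    # "C"

# Same model, same question, different "answer" -- and therefore a
# different accuracy number -- depending on the convention.
assert pred_by_letter != pred_by_text
```

Multiply a disagreement like this across thousands of questions and the 15-point gap stops being surprising.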

None of this was documented alongside published scores. If you wanted to compare two models that reported MMLU numbers from different evaluation systems, you were comparing implementations, not models. The benchmark was functioning like an unsigned binary: a result with claimed provenance but no chain of custody.

The DROP benchmark makes the failure mode more concrete. The Open LLM Leaderboard removed it after community scrutiny uncovered bugs in the evaluation code: a whitespace normalization error caused correct floating-point answers to fail matching, and a stop-token bug truncated decimal answers mid-number. Models that gave more complete answers were penalized because longer output was more likely to hit the stop-token condition. The leaderboard was inverting quality, and no external party could detect it because the evaluation code was not part of the published record.
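The stop-token failure is easy to reconstruct in miniature. The real harness code differed, but the mechanics were as follows: a stop condition that fires on `.` cuts decimal answers mid-number.

```python
# Toy reconstruction of the DROP stop-token bug. Not the actual harness
# code -- just the failure mechanics it exhibited.

def truncate_at_stop(text: str, stop: str = ".") -> str:
    """Cut generation at the first occurrence of the stop string."""
    return text.split(stop, 1)[0]


assert truncate_at_stop("12") == "12"     # terse integer answer survives
assert truncate_at_stop("12.5") == "12"   # decimal answer truncated mid-number

# Longer, more complete output is *more* likely to contain the stop
# token, so the bug penalized exactly the models giving fuller answers.
assert truncate_at_stop("The total is 12.5") == "The total is 12"
```

A truncated "12" then fails exact-match against the gold answer "12.5", inverting quality just as the leaderboard investigation found.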

What Community Evals Adds to the Chain

Community Evals adds three components that function as the equivalent of signed packages, public build specs, and audit logs.

Benchmark datasets register an eval.yaml that defines evaluation methodology in Inspect AI format. This is the build spec: a machine-readable description of the evaluation pipeline, including prompt templates, scoring method, and dataset configuration. The HLE eval.yaml shows what this looks like in practice:

name: Humanity's Last Exam
evaluation_framework: "inspect-ai"

tasks:
  - id: hle
    field_spec:
      input: question
      input_image: image
      target: answer
    solvers:
      - name: system_message
        args:
          template: |
            Your response should be in the following format:
            Explanation: {your explanation}
            Answer: {your chosen answer}
      - name: generate
    scorers:
      - name: model_graded_fact
        args:
          model: openai/o3-mini

The evaluation configuration is no longer locked inside a central system. It lives in the benchmark’s dataset repo, alongside the data, versioned in git.
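Because the spec is machine-readable, a consumer can validate it before running anything. The dictionary below mirrors the HLE example above; the validation logic is a sketch of ours, not part of any Hugging Face tooling:

```python
# Sketch: a parsed eval.yaml is just structured data, so a consumer can
# check it carries the fields described above before executing anything.
# The validate() function is illustrative, not an HF API.
spec = {
    "name": "Humanity's Last Exam",
    "evaluation_framework": "inspect-ai",
    "tasks": [
        {
            "id": "hle",
            "field_spec": {"input": "question", "target": "answer"},
            "solvers": [{"name": "system_message"}, {"name": "generate"}],
            "scorers": [
                {"name": "model_graded_fact",
                 "args": {"model": "openai/o3-mini"}}
            ],
        }
    ],
}


def validate(spec: dict) -> list:
    """Return a list of missing-field errors; empty means usable."""
    errors = []
    for key in ("name", "evaluation_framework", "tasks"):
        if key not in spec:
            errors.append(f"missing top-level field: {key}")
    for task in spec.get("tasks", []):
        for key in ("id", "field_spec", "solvers", "scorers"):
            if key not in task:
                errors.append(f"task missing field: {key}")
    return errors


assert validate(spec) == []
```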

Model repos store results in .eval_results/*.yaml files with a verifyToken field. That token is a cryptographic proof that the evaluation ran in HF Jobs using Inspect AI. A result with a valid token gets a “verified” badge: a claim with a chain of custody behind it rather than a bare assertion.
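The actual verifyToken scheme is not documented in this article, but the concept is familiar from message authentication. As an illustration only, an HMAC-style token over the result payload would bind a score to the environment that produced it, assuming only the trusted runner holds the signing key:

```python
# Illustration of the *concept* behind a verification token, not the
# real verifyToken scheme. If only the trusted runner (HF Jobs, in the
# article's description) holds the key, a valid token proves the result
# was produced in that environment and has not been altered since.
import hashlib
import hmac
import json

SIGNING_KEY = b"held-only-by-the-trusted-runner"  # hypothetical


def sign_result(result: dict) -> str:
    payload = json.dumps(result, sort_keys=True).encode()
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()


def verify_result(result: dict, token: str) -> bool:
    return hmac.compare_digest(sign_result(result), token)


result = {"benchmark": "hle", "model": "example/model", "score": 0.204}
token = sign_result(result)
assert verify_result(result, token)
assert not verify_result({**result, "score": 0.95}, token)  # tampering fails
```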

Community members can submit scores via PR to any model repo. Those results appear with a “community” badge without requiring model author approval. Model authors can close disputed PRs, but the git history records that the PR existed and was closed. This is the audit log: the record of what was submitted, when, and what happened to it.

What the Trust Hierarchy Implies

The three-tier system (verified, self-reported, and community) is more honest than most evaluation setups because it makes trust assumptions explicit rather than hiding them.

A verified score was produced by running the canonical eval.yaml in a known compute environment. If you dispute it, you can run the same spec yourself and submit a PR with your result. The methodology is auditable; disagreement is resolvable by reference to the specification.

A self-reported score is what model authors publish without verification. The score could be accurate and reproducible, or it could reflect an undisclosed configuration advantage. Without a verifyToken, you’re trusting the submitter.

A community score is a third-party measurement. Someone other than the model author ran the evaluation and submitted the result. The model author can close the PR, but the action remains visible in pull request history.

One constraint worth naming: the verified tier has a dependency that is not obvious at first glance. The HLE benchmark configuration uses OpenAI’s o3-mini as the judge model for scoring. A fully verified evaluation on one of the system’s flagship benchmarks requires an active OpenAI API subscription. For infrastructure built on open-source principles, this creates a practical ceiling on participation in the verified tier.

What Doesn’t Transfer from the Analogy

The package signing analogy breaks down at training data contamination. A package signature verifies that an artifact matches its claimed source; it says nothing about whether the source is trustworthy in the first place. Similarly, a verifyToken proves the evaluation ran correctly, not that the model wasn’t trained on the test set. Community Evals is explicit about this: the system won’t stop training on test sets. The transparency it provides is about implementation correctness and provenance, not about whether the benchmark is a valid out-of-sample measurement.

The benchmark-to-production gap also remains unchanged. Models that lead MMLU-Pro and GPQA Diamond still produce incorrect code for non-trivial tasks and fail on multi-step reasoning that matters in production. Community Evals is metadata infrastructure for comparing evaluation results under known conditions, not a ground-truth signal about real-world capability.

The Starting Benchmarks

The initial benchmark shortlist is chosen to stay ahead of the saturation curve. MMLU-Pro uses 10-choice questions and requires chain-of-thought reasoning, producing scores roughly 15 percentage points lower than MMLU for the same models. GPQA Diamond uses PhD-authored questions where non-experts score around 34% even with web access. Humanity’s Last Exam includes a canary string embedded in the dataset to help filter it from training data, a concrete attempt at contamination resistance. Current frontier models score around 20% on HLE.
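Canary-string filtering is mechanically simple on the training-data side, which is part of its appeal. A sketch, using a placeholder value rather than the real HLE canary:

```python
# Sketch of canary-string filtering during training-data preparation.
# The canary value below is a placeholder, not the real HLE string.
# The scheme only works if data pipelines actually run this check.
CANARY = "BENCHMARK-CANARY-placeholder-do-not-train"


def keep_for_training(doc: str) -> bool:
    """Drop any document containing the benchmark's canary string."""
    return CANARY not in doc


corpus = ["ordinary web text", f"leaked eval item {CANARY} answer: B"]
filtered = [d for d in corpus if keep_for_training(d)]
assert filtered == ["ordinary web text"]
```

The limitation is the same one the article names: filtering is voluntary, and a paraphrased test item carries no canary at all.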

These benchmarks produce more spread across model scores, which matters for discrimination. A benchmark where every major model clusters above 90% tells you nothing about relative capability, and the Community Evals infrastructure makes these harder targets the first-class options for verified comparison.

The Cultural Question

Technical infrastructure for provenance only works if practitioners use it. Package managers succeeded because the friction of not using them exceeded the friction of adoption: version conflicts, untrusted downloads, no reproducibility. Community Evals is betting that the same shift will happen for benchmark evaluation.

The system is in beta, with the benchmark allow-list still manually curated by the Hugging Face team. Lighteval, Hugging Face’s evaluation library, supports Inspect AI and makes running a verified evaluation straightforward. The harder adoption challenge is social: model providers benefit from being able to report favorable numbers without disclosing methodology, and transparent evaluation removes that flexibility. Whether the community’s demand for provenance becomes strong enough to make opacity costly is a question the infrastructure alone cannot answer, but early discussion suggests genuine interest in making it work.

The unsigned binary problem in software took years to address, and the solution required both technical infrastructure and a shift in what practitioners expected from software authors. The same dynamic is beginning to play out here. The infrastructure exists; the cultural expectations are still forming.
