Bayesian Inference and Decision Audits Reveal Unreliability in Frontier AI Evaluation Archives

A new arXiv paper by Yanan Long applies Bayesian inference and decision audits to public archives of frontier AI evaluations. It argues that reading these archives as terminal leaderboards can mislead, because the underlying record is a selective time series shaped by reporting rules and missingness. The study examines archives including LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, and tau-bench, and finds that a candidate selection-aware frontier model fails synthetic recovery and uncertainty calibration. The proposed archive-and-adjudication protocol reconstructs evaluation histories and falsifies unsupported claims.

iGEN Editorial

June 16, 2026

Public AI evaluations are often read as terminal leaderboards, but according to a new arXiv paper by Yanan Long, the underlying evidence is a selective time series shaped by reporting rules, benchmark revisions, and missingness. The paper applies Bayesian inference and decision audits to show that the same terminal snapshot can support inconsistent conclusions: in one constructed example, estimates of the time needed to approach the performance ceiling differ by roughly a factor of three.

Public AI Evaluation Archives Studied

The paper examines several public archives of frontier AI evaluations. LiveBench and Open LLM Leaderboard v2 serve as the primary longitudinal record. LMArena provides a preference stress test, while GAIA and tau-bench contribute limited agentic pilots. The paper frames these archives as a Bayesian inference problem. Its motivating case is a single constructed, terminal-only example over 1,000 systems that, under a fixed reporting convention, is compatible with two different pre-terminal histories.
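The idea that one terminal snapshot can hide different histories can be illustrated with a toy sketch. The code below is not taken from the paper: the function names, system names, dates, and scores are all hypothetical, invented only to show how collapsing a dated record into a leaderboard discards timing information.

```python
from datetime import date

def terminal_leaderboard(history):
    """Collapse dated records to the best score per system, dropping dates."""
    best = {}
    for _day, system, score in history:
        best[system] = max(score, best.get(system, float("-inf")))
    return dict(sorted(best.items()))

def days_to_reach(history, threshold):
    """Days from the first record until any system first scores >= threshold."""
    ordered = sorted(history)
    start = ordered[0][0]
    for day, _system, score in ordered:
        if score >= threshold:
            return (day - start).days
    return None  # threshold never reached

# Hypothetical toy histories: same systems and scores, different dates.
history_a = [
    (date(2025, 1, 1), "sys-1", 0.60),
    (date(2025, 2, 1), "sys-2", 0.90),
    (date(2025, 6, 1), "sys-3", 0.80),
]
history_b = [
    (date(2025, 1, 1), "sys-1", 0.60),
    (date(2025, 3, 1), "sys-3", 0.80),
    (date(2025, 6, 1), "sys-2", 0.90),
]

# The terminal view is identical for both histories.
print(terminal_leaderboard(history_a) == terminal_leaderboard(history_b))  # True
# The timing of the 0.90 result is not.
print(days_to_reach(history_a, 0.90))  # 31
print(days_to_reach(history_b, 0.90))  # 151
```

A reader who sees only the final leaderboard cannot tell these two histories apart, which is why any timing claim needs the dated archive rather than the snapshot.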

Bayesian Inference Findings

Under the same terminal-tail model, the two pre-terminal histories yield estimated times of 23.03 and 75.13 to reach within 0.05 of the performance ceiling. This roughly threefold discrepancy (75.13 / 23.03 ≈ 3.26) shows how sensitive evaluation timelines are to the pre-terminal history, which a terminal leaderboard does not record. In synthetic posterior comparisons, action-facing diagnostics differ across observation regimes, meaning that policy or procurement decisions based on these archives could change depending on how the data are interpreted.

Metric	History A	History B
Time to reach within 0.05 of ceiling	23.03	75.13
Number of systems	1,000	1,000
Terminal-tail model	Same	Same
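The arithmetic behind the "factor of three" can be checked directly. The sketch below also includes a simple closed-form timing rule, but that part is an assumption made for illustration: the article does not specify the paper's terminal-tail model, so the exponential gap decay, the function `time_to_within`, and its parameter values are hypothetical.

```python
import math

# Figures quoted in the table above (the source does not state the time unit).
t_history_a = 23.03
t_history_b = 75.13
ratio = t_history_b / t_history_a
print(f"{ratio:.2f}")  # 3.26

# ASSUMPTION (not from the paper): the gap to the ceiling decays as
# g0 * exp(-rate * t). Solving g0 * exp(-rate * t) = eps for t gives
# t = ln(g0 / eps) / rate.
def time_to_within(g0, rate, eps=0.05):
    """Time for the gap to shrink from g0 to eps under exponential decay."""
    if g0 <= eps:
        return 0.0  # already within eps of the ceiling
    return math.log(g0 / eps) / rate

# Hypothetical parameters: same starting gap, two different decay rates.
print(f"{time_to_within(0.5, 0.20):.2f}")  # 11.51
print(f"{time_to_within(0.5, 0.05):.2f}")  # 46.05
```

Under this assumed rule the estimated time scales inversely with the decay rate, so a history that implies a slower approach to the ceiling stretches the timeline even when the terminal scores are the same.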

The paper reports that the candidate selection-aware frontier model fails synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration. Correspondingly, fixed audit gates reject its stronger claims, so those claims should not be used to support decisions.

Synthetic Diagnostics and Audit Gates

The decision audit methodology uses synthetic posterior comparisons to evaluate how well different models capture the true underlying performance distribution. The candidate selection-aware frontier model, which incorporates a selection bias correction, fails all four diagnostics: it fails synthetic recovery, fails to predict held-out archive entries, does not transfer to preference-based data from LMArena, and produces poorly calibrated uncertainty intervals. These failures are detected by fixed audit gates, which formally reject the model's more confident claims.
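To make the notion of a fixed audit gate concrete, here is a minimal sketch of one possible calibration check. It is not the paper's implementation: the function names, the 90% nominal level, the 0.05 tolerance, and the toy intervals are all assumptions chosen for illustration. The gate compares how often true values fall inside a model's stated intervals against the coverage those intervals claim.

```python
def interval_coverage(intervals, truths):
    """Fraction of true values that fall inside their (low, high) interval."""
    hits = sum(low <= y <= high for (low, high), y in zip(intervals, truths))
    return hits / len(truths)

def calibration_gate(intervals, truths, nominal=0.90, tolerance=0.05):
    """Pass only if empirical coverage is within `tolerance` of `nominal`.

    Both thresholds are hypothetical and are meant to be fixed before
    any results are inspected.
    """
    coverage = interval_coverage(intervals, truths)
    return abs(coverage - nominal) <= tolerance, coverage

# Hypothetical toy data: ten nominal 90% intervals and ten observed values.
intervals = [(0.4, 0.6)] * 10
truths = [0.45, 0.50, 0.55, 0.58, 0.41, 0.59, 0.70, 0.30, 0.65, 0.20]

passed, coverage = calibration_gate(intervals, truths)
print(f"{coverage:.2f}")  # 0.60
print(passed)             # False
```

Only six of the ten toy values land inside their intervals, far below the claimed 90%, so the gate rejects. Fixing the threshold in advance is what keeps a gate like this from being tuned after the fact to let a favored model through.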

Proposed Archive-and-Adjudication Protocol

To address these issues, the paper proposes an archive-and-adjudication protocol that reconstructs public evaluation histories, isolates a verified timing boundary, and falsifies unsupported frontier claims. The protocol provides a systematic way to audit AI evaluation archives, making the inference process transparent and reproducible. For CTOs and technology leaders evaluating frontier AI models for enterprise deployment, the practical lesson is to look for rigorous methodology and independent validation behind benchmark scores rather than take terminal leaderboards at face value.
