Public AI evaluations are often read as terminal leaderboards, but according to a new arXiv paper by Yanan Long, the underlying evidence is a selective time series shaped by reporting rules, benchmark revisions, and missing data. The paper applies Bayesian inference and decision audits to show that terminal-only evidence can support inconsistent conclusions: in a constructed example, two estimates of the time needed to approach a performance ceiling differ by more than a factor of three.
Public AI Evaluation Archives Studied
The paper examines several public archives that serve as the primary longitudinal record for frontier AI evaluations. LiveBench and Open LLM Leaderboard v2 are the main data sources. LMArena provides a preference stress test, while GAIA and tau-bench contribute limited agentic pilots. The paper treats reading these archives as a Bayesian inference problem and illustrates the difficulty with a constructed example: under a fixed reporting convention, a single terminal-only record over 1,000 systems is compatible with two different pre-terminal histories.
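The difference between a terminal leaderboard and a dated history can be made concrete with a minimal sketch. The records, system names, and both helper functions below are hypothetical and are not taken from the paper or from any of the archives; they only show how two different histories can collapse to the same final ranking.

```python
from datetime import date

# Hypothetical archive records: (date reported, system name, score).
# Both histories contain the same systems with the same final scores;
# only the timing of the reports differs.
history_a = [
    (date(2024, 1, 1), "sys-1", 0.62),
    (date(2024, 6, 1), "sys-2", 0.70),
    (date(2025, 1, 1), "sys-3", 0.78),
]
history_b = [
    (date(2024, 10, 1), "sys-1", 0.62),
    (date(2024, 11, 1), "sys-2", 0.70),
    (date(2025, 1, 1), "sys-3", 0.78),
]


def terminal_leaderboard(records):
    """Rank systems by score, discarding when each result was reported."""
    rows = [(name, score) for _, name, score in records]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def frontier_series(records):
    """Running best score over time, i.e. the dated frontier history."""
    best = float("-inf")
    series = []
    for when, _, score in sorted(records, key=lambda row: row[0]):
        if score > best:
            best = score
            series.append((when, best))
    return series


same_board = terminal_leaderboard(history_a) == terminal_leaderboard(history_b)
same_history = frontier_series(history_a) == frontier_series(history_b)
print(same_board, same_history)
```

In this toy setup the leaderboards are identical while the frontier series are not, which is the kind of ambiguity a terminal-only view cannot resolve.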
Bayesian Inference Findings
Under the same terminal-tail model, the two pre-terminal histories yield estimated times of 23.03 and 75.13 to reach within 0.05 of the performance ceiling, a ratio of about 3.3. Because the reporting convention and the tail model are held fixed, the discrepancy comes from the pre-terminal history, which a terminal leaderboard does not record. In synthetic posterior comparisons, action-facing diagnostics also differ across observation regimes, so a policy or procurement decision based on these archives can change depending on which observations are available and how they are interpreted.
| Metric | History A | History B |
|---|---|---|
| Time to reach within 0.05 of ceiling | 23.03 | 75.13 |
| Number of systems | 1,000 | 1,000 |
| Terminal-tail model | Same | Same |
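
How a shared terminal state can still lead to very different timing estimates is easy to see in a toy calculation. The sketch below assumes an exponential approach to the ceiling, gap(t) = gap_now * exp(-rate * t), which is an illustrative choice and not necessarily the paper's terminal-tail model. The gap and the two rates are hypothetical values, and the outputs are not meant to reproduce the 23.03 and 75.13 figures.

```python
import math


def time_to_within(gap_now, rate, eps=0.05):
    """Time until an exponentially shrinking gap to the ceiling falls to eps.

    Assumes gap(t) = gap_now * exp(-rate * t); this functional form is an
    assumption made for illustration only.
    """
    if gap_now <= eps:
        return 0.0
    return math.log(gap_now / eps) / rate


# Hypothetical values: both histories end at the same gap to the ceiling,
# but the earlier observations imply different rates of progress.
gap_now = 0.40
rate_fast = 0.20  # rate implied by a history with rapid recent gains
rate_slow = 0.05  # rate implied by a history with slow recent gains

t_fast = time_to_within(gap_now, rate_fast)
t_slow = time_to_within(gap_now, rate_slow)
print(round(t_fast, 2), round(t_slow, 2), round(t_slow / t_fast, 2))
```

Under this assumed form the time estimate scales inversely with the rate, so any uncertainty about the history that determines the rate passes directly into the timing estimate.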
The paper reports that the candidate selection-aware frontier model fails synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration. Correspondingly, fixed audit gates reject its stronger claims, which means those claims should not be used to support decisions.
Synthetic Diagnostics and Audit Gates
The decision audit uses synthetic posterior comparisons to test how well a model captures the underlying performance distribution when the truth is known. The candidate selection-aware frontier model, which incorporates a selection bias correction, fails all four diagnostics: it does not recover the known structure of synthetic data, does not predict held-out archive entries, does not transfer to preference-based data from LMArena, and produces poorly calibrated uncertainty intervals. Because the audit gates are fixed, each failure translates directly into a formal rejection of the model's more confident claims.
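The idea of a fixed audit gate can be sketched as a set of thresholds that are written down before any results are inspected. Everything in the example below is hypothetical: the gate names, the thresholds, and the diagnostic values are placeholders for illustration, not figures or criteria from the paper.

```python
def interval_coverage(lower, upper, truth):
    """Fraction of true values that fall inside their predicted intervals."""
    hits = sum(lo <= t <= hi for lo, hi, t in zip(lower, upper, truth))
    return hits / len(truth)


# Hypothetical gates: name -> (kind, threshold).
# "max" gates pass when the value is at most the threshold,
# "min" gates pass when the value is at least the threshold.
GATES = {
    "synthetic_recovery_error": ("max", 0.10),
    "archive_prediction_error": ("max", 0.10),
    "preference_transfer_corr": ("min", 0.50),
    "coverage_gap_90": ("max", 0.05),
}


def run_gates(diagnostics, gates=GATES):
    """Apply each fixed gate to its diagnostic and report pass or fail."""
    results = {}
    for name, (kind, threshold) in gates.items():
        value = diagnostics[name]
        results[name] = value <= threshold if kind == "max" else value >= threshold
    return results


def stronger_claims_allowed(results):
    """Stronger claims survive only if every gate passes."""
    return all(results.values())


# Toy calibration check: nominal 90% intervals that cover half the truths.
coverage = interval_coverage([0, 0, 0, 0], [1, 1, 1, 1], [0.5, 1.5, 0.2, 2.0])

diagnostics = {  # hypothetical diagnostic values
    "synthetic_recovery_error": 0.25,
    "archive_prediction_error": 0.18,
    "preference_transfer_corr": 0.20,
    "coverage_gap_90": abs(coverage - 0.90),
}
results = run_gates(diagnostics)
print(results, stronger_claims_allowed(results))
```

Keeping the thresholds in a single constant that is defined before the diagnostics are computed is one simple way to make it visible if a gate is later adjusted to fit the results.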
Proposed Archive-and-Adjudication Protocol
To address these issues, the paper proposes an archive-and-adjudication protocol that reconstructs public evaluation histories, isolates a verified timing boundary, and falsifies unsupported frontier claims. The protocol offers a systematic way to audit AI evaluation archives and to make the inference process transparent and reproducible. For CTOs and technology leaders evaluating frontier AI models for enterprise deployment, the practical lesson is to look for rigorous methodology and independent validation behind benchmark scores rather than taking terminal leaderboards at face value.