Artificial Intelligence #bayesian inference #decision audits
Bayesian Inference and Decision Audits Reveal Unreliability in Frontier AI Evaluation Archives
A new arXiv paper by Yanan Long applies Bayesian inference and decision audits to public archives of frontier AI evaluations. It finds that interpreting only the terminal state of a leaderboard can be misleading because of selective time series, reporting rules, and missingness. The study examines archives including LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, and tau-bench, and reports that a candidate selection-aware frontier model fails both synthetic recovery and uncertainty calibration. The proposed archive-and-adjudication protocol reconstructs evaluation histories and falsifies unsupported claims.
Jun 16, 2026 1 source
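
To make the concern about terminal leaderboard readings concrete, here is a minimal Python sketch of how selection and missingness alone can inflate a leaderboard. It is not the paper's model or protocol. Everything in it is an assumption made for illustration: the function names, the Gaussian noise model, the threshold-based reporting rule, and all numeric values. Every simulated model has the same true skill, so any apparent gap comes purely from noise plus selection.

```python
import random
import statistics


def simulate_archive(n_models=50, true_skill=60.0, noise_sd=2.0, seed=0):
    """Draw one noisy benchmark score per model (hypothetical setup).

    All models share the same true skill, so differences between
    scores are evaluation noise only.
    """
    rng = random.Random(seed)
    return [rng.gauss(true_skill, noise_sd) for _ in range(n_models)]


def terminal_leader(scores):
    """Score of the top entry, i.e. what a terminal leaderboard shows."""
    return max(scores)


def reported_only(scores, threshold):
    """Illustrative reporting rule: only scores at or above the
    threshold are published; the rest are missing from the archive."""
    return [s for s in scores if s >= threshold]


TRUE_SKILL = 60.0
scores = simulate_archive(true_skill=TRUE_SKILL)

leader = terminal_leader(scores)
published = reported_only(scores, threshold=TRUE_SKILL)

full_mean = statistics.mean(scores)
published_mean = statistics.mean(published)

print(f"true skill of every model: {TRUE_SKILL:.2f}")
print(f"terminal leader score:     {leader:.2f}")
print(f"mean of all runs:          {full_mean:.2f}")
print(f"mean of published runs:    {published_mean:.2f}")
```

In this toy setup the top entry sits above the shared true skill, and the mean of the published runs sits above the mean of all runs, even though no model is actually better than another. An audit that reconstructs the full history, including unreported runs, is what would expose that gap.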