
Topic

public archives

1 story

Bayesian Inference and Decision Audits Reveal Unreliability in Frontier AI Evaluation Archives

Artificial Intelligence #bayesian inference #decision audits

A new arXiv paper by Yanan Long applies Bayesian inference and decision audits to public archives of frontier AI evaluations, revealing that terminal leaderboard interpretations can be misleading due to selective time series, reporting rules, and missingness. The study examines archives including LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, and tau-bench, and finds that a candidate selection-aware frontier model fails synthetic recovery and uncertainty calibration. The proposed archive-and-adjudication protocol reconstructs histories and falsifies unsupported claims.

Jun 16, 2026 1 source