Diversity Collapse in RLVR Explained by Overtraining in New Study

A new arXiv paper by Yuan et al. (2026) explains diversity collapse in reinforcement learning with verifiable rewards (RLVR) as a symptom of overtraining. The study shows that once a problem's contribution to the reasoning boundary saturates, further updates concentrate probability mass on successful trajectories, degrading high-k Pass@k. The authors propose Bayesian Boundary Gating (BBG) to redirect optimization and improve average Pass@k across multiple benchmarks.

iGEN Editorial

June 17, 2026

A new arXiv preprint (2606.15455) by Yuan, Suqin, Chen, Jinkun, Zheng, Jiyang, Muyang, Feng, Lei, Wang, Dadong, Xiang, Tao, Liu, Tongliang, and An Bo argues that diversity collapse in reinforcement learning with verifiable rewards (RLVR) can be understood as a form of overtraining. According to the paper, once a problem's contribution to the model’s reasoning boundary saturates, further updates no longer expand what the model can solve. Instead, they concentrate probability mass on trajectories favored by on-policy sampling. The result is improved Pass@1 but degraded Pass@k at high k, the pattern known as diversity collapse.

The Overtraining Lens

The authors formalize diversity collapse through the concept of overtraining: when a problem's contribution to the reference metric effectively saturates, further updates become counterproductive for boundary expansion. Under a standard setup with few rollouts per problem, even a single observed success places a problem in a nearly saturated regime for high-k Pass@k. Consequently, most updates in standard RLVR constitute overtraining from the perspective of the reasoning boundary.
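The saturation argument can be checked with a little arithmetic. The sketch below is an illustration, not the paper's code: it assumes 8 rollouts per problem (a made-up figure), estimates the per-sample success rate as p = successes / rollouts, and approximates Pass@k for independent samples as 1 - (1 - p)^k. The helper name plug_in_pass_at_k is hypothetical.

```python
def plug_in_pass_at_k(successes: int, rollouts: int, k: int) -> float:
    """Approximate Pass@k as 1 - (1 - p)^k with p = successes / rollouts."""
    p = successes / rollouts
    return 1.0 - (1.0 - p) ** k


# Assumed setup: 8 rollouts per problem, exactly one observed success.
for k in (1, 8, 64, 256):
    print(k, plug_in_pass_at_k(successes=1, rollouts=8, k=k))
```

Under these assumptions, one success in eight rollouts gives a Pass@1 estimate of 0.125, while the Pass@256 estimate is already indistinguishable from 1 for practical purposes. Further updates on that problem have almost nothing left to add to the high-k metric.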

New Reasoning Gains Despite Aggregate Decline

The paper offers a nuanced reading of whether RLVR can expand reasoning beyond the base model. Because RLVR is structurally biased against high-k Pass@k, its aggregate decline does not by itself mean that no new reasoning gains occurred. Observationally, a non-trivial fraction of initially unsolvable problems become solvable during standard RLVR training. Interventionally, restricting updates to problems with zero observed success lifts Pass@256 above the base model on difficult benchmarks.

Bayesian Boundary Gating (BBG)

Building on these findings, the authors propose Bayesian Boundary Gating (BBG), which redirects optimization away from overtraining by estimating each problem's marginal contribution to the reasoning boundary. Across multiple reasoning benchmarks, BBG improves average Pass@k across a wide range of k. The method represents a principled way to maintain diversity in model outputs while still improving overall reasoning performance.

Metric	Standard RLVR	BBG
Pass@1	Improves	Comparable or better
Pass@k (high k)	Degrades	Improves
Boundary expansion	Limited	Enhanced
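This article does not give BBG's actual formulation, so the following is only a hypothetical illustration of what a Bayesian estimate of a problem's remaining contribution could look like. It assumes a Beta(1, 1) prior over the per-sample success rate p, updates it with observed rollout outcomes, and uses the posterior expectation E[(1 - p)^k] = B(a, b + k) / B(a, b) as the remaining Pass@k headroom. The function names, the prior, and the threshold are all assumptions made for this sketch.

```python
import math


def expected_all_fail(successes: int, rollouts: int, k: int,
                      prior_a: float = 1.0, prior_b: float = 1.0) -> float:
    """E[(1 - p)^k] under a Beta posterior over the per-sample success rate p."""
    a = prior_a + successes
    b = prior_b + (rollouts - successes)
    # B(a, b + k) / B(a, b), evaluated in log space to avoid overflow.
    log_ratio = (math.lgamma(b + k) + math.lgamma(a + b)
                 - math.lgamma(b) - math.lgamma(a + b + k))
    return math.exp(log_ratio)


def should_update(successes: int, rollouts: int,
                  k: int = 256, threshold: float = 0.01) -> bool:
    """Hypothetical gate: keep training a problem only while headroom remains."""
    return expected_all_fail(successes, rollouts, k) > threshold


# Assumed setup: 8 rollouts per problem, gating on Pass@256 headroom.
for c in (0, 1, 4):
    print(c, expected_all_fail(c, 8, 256), should_update(c, 8))
```

With these assumed settings, a problem with no observed successes keeps an expected headroom of about 0.034 and passes the gate, while a single success drops it to about 0.001 and closes the gate. That mirrors the article's point that one success is enough to put a problem near saturation for high-k Pass@k.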

Implications for Practitioners

While the study focuses on language model reasoning, the concept of overtraining and the BBG intervention may have parallels in other domains where reinforcement learning is applied. The authors note that the structural bias against high-k metrics must be accounted for when evaluating RLVR-based systems. For organisations deploying such models, monitoring Pass@k distributions beyond top-1 accuracy could reveal hidden diversity collapse.
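For monitoring, one common choice is the unbiased Pass@k estimator 1 - C(n - c, k) / C(n, k), computed from n samples per problem of which c are correct. The sketch below averages it over problems for several values of k. The per-problem counts for the two checkpoints are invented solely to show the pattern and are not results from the paper; the function names are likewise hypothetical.

```python
from math import comb
from typing import Dict, Iterable, List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples with c correct (requires k <= n)."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    # Chance that a random size-k subset of the n samples contains a success.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_curve(correct_counts: List[int], n: int,
                    ks: Iterable[int] = (1, 8, 64, 256)) -> Dict[int, float]:
    """Average Pass@k over all problems, for each requested k."""
    return {k: sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
            for k in ks}


# Made-up correct counts out of n = 256 samples for four problems.
base_counts = [0, 2, 5, 40]      # hypothetical base model
tuned_counts = [0, 0, 0, 250]    # hypothetical checkpoint after RLVR

base_curve = pass_at_k_curve(base_counts, n=256)
tuned_curve = pass_at_k_curve(tuned_counts, n=256)
for k in base_curve:
    print(k, base_curve[k], tuned_curve[k])
```

In this invented example the tuned checkpoint wins at k = 1 but loses at k = 256, which is the signature a top-1 accuracy dashboard would miss.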

What to watch: Future research may test BBG in broader RL settings and assess its scalability to larger models and more complex reasoning tasks.
