A new arXiv preprint (2606.15455) by Yuan, Suqin, Chen, Jinkun, Zheng, Jiyang, Muyang, Feng, Lei, Wang, Dadong, Xiang, Tao, Liu, Tongliang, and An Bo argues that diversity collapse in reinforcement learning with verifiable rewards (RLVR) can be understood as a form of overtraining. According to the paper, once a problem's contribution to the model’s reasoning boundary saturates, further updates no longer expand what the model can solve but instead concentrate probability mass on trajectories favored by on-policy sampling. The result is improved Pass@1 but degraded high-k Pass@k, a phenomenon the authors call diversity collapse.
The Overtraining Lens
The authors formalize diversity collapse through the concept of overtraining: when a problem's contribution to the reference metric effectively saturates, further updates become counterproductive for boundary expansion. Under a standard setup with few rollouts per problem, even a single observed success places a problem in a nearly saturated regime for high-k Pass@k. Consequently, most updates in standard RLVR constitute overtraining from the perspective of the reasoning boundary.
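To see why a single observed success can already sit near saturation, a rough back-of-the-envelope calculation helps. The sketch below is illustrative and not taken from the paper: it assumes 8 rollouts per problem (a hypothetical number), uses the naive estimate of the per-sample success rate, and treats samples as independent. The function name `pass_at_k_from_rate` is made up for this example.

```python
def pass_at_k_from_rate(p: float, k: int) -> float:
    """Pass@k for a problem whose per-sample success probability is p,
    assuming the k samples are drawn independently."""
    return 1.0 - (1.0 - p) ** k


# Hypothetical setup: 8 rollouts on one problem, exactly one success,
# so the naive per-sample success estimate is 1/8.
p_hat = 1 / 8

for k in (1, 8, 64, 256):
    print(f"Pass@{k}: {pass_at_k_from_rate(p_hat, k):.6f}")
```

Under these assumptions Pass@1 is only 0.125, yet Pass@64 is already above 0.99, so pushing the success rate higher on this problem can add almost nothing to high-k Pass@k.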
New Reasoning Gains Despite Aggregate Decline
The paper offers a nuanced reading of whether RLVR can expand reasoning beyond the base model. Because RLVR is structurally biased against high-k Pass@k, its aggregate decline does not by itself mean that no new reasoning gains occurred. Observationally, a non-trivial fraction of initially unsolvable problems become solvable during standard RLVR training. Interventionally, restricting updates to problems with zero observed success lifts Pass@256 above the base model on difficult benchmarks.
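The intervention described above, updating only on problems with zero observed success, can be pictured as a simple batch filter. This is a minimal sketch under stated assumptions, not the authors' code: rewards are assumed to be binary (1 for a verified success, 0 otherwise), and `select_unsolved` and the problem ids are hypothetical names.

```python
from typing import Dict, List


def select_unsolved(rollout_rewards: Dict[str, List[int]]) -> List[str]:
    """Return ids of problems whose sampled rollouts contain no success.

    rollout_rewards maps a problem id to the binary verifier rewards
    of its rollouts in the current batch (assumed format).
    """
    return [pid for pid, rewards in rollout_rewards.items() if sum(rewards) == 0]


# Hypothetical batch with four rollouts per problem.
batch = {
    "p1": [0, 0, 0, 0],  # never solved: kept for the update
    "p2": [0, 1, 0, 0],  # solved once: skipped
    "p3": [1, 1, 1, 1],  # always solved: skipped
}
print(select_unsolved(batch))  # ['p1']
```

How such problems would then produce a learning signal (all their sampled rewards are zero) is not described in this summary, so the sketch stops at the selection step.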
Bayesian Boundary Gating (BBG)
Building on these findings, the authors propose Bayesian Boundary Gating (BBG), which redirects optimization away from overtraining by estimating each problem's marginal contribution to the reasoning boundary. Across multiple reasoning benchmarks, BBG improves average Pass@k across a wide range of k. The method represents a principled way to maintain diversity in model outputs while still improving overall reasoning performance.
| Metric | Standard RLVR | BBG |
|---|---|---|
| Pass@1 | Improves | Comparable or better |
| Pass@k (high k) | Degrades | Improves |
| Boundary expansion | Limited | Enhanced |
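The summary does not spell out how BBG estimates a problem's marginal contribution, so the following is only one way the general idea could look, not the paper's algorithm. It assumes a Beta prior on each problem's per-sample success rate and uses the posterior expectation of (1 - p)^k, the chance that all k samples miss, as a stand-in for the remaining headroom at Pass@k. The function name, the uniform prior, and the choice k = 256 are all assumptions made for illustration.

```python
import math


def expected_miss_at_k(successes: int, rollouts: int, k: int,
                       prior_a: float = 1.0, prior_b: float = 1.0) -> float:
    """Posterior mean of (1 - p)^k under a Beta prior on the success rate p.

    With posterior Beta(a, b), E[(1 - p)^k] = B(a, b + k) / B(a, b),
    computed in log space for numerical stability.
    """
    a = prior_a + successes
    b = prior_b + (rollouts - successes)
    log_ratio = (math.lgamma(b + k) + math.lgamma(a + b)
                 - math.lgamma(b) - math.lgamma(a + b + k))
    return math.exp(log_ratio)


# Hypothetical gate: weight each problem's update by its remaining headroom.
for successes in (0, 1, 4):
    weight = expected_miss_at_k(successes, rollouts=8, k=256)
    print(f"{successes}/8 successes -> headroom at Pass@256: {weight:.6f}")
```

In this toy version the headroom drops sharply after the first observed success, which matches the qualitative picture of near-saturation described earlier. The absolute values depend heavily on the prior, so only the relative ordering should be read from it.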
Implications for Practitioners
While the study focuses on language model reasoning, the concept of overtraining and the BBG intervention may have parallels in other domains where reinforcement learning is applied. The authors note that the structural bias against high-k metrics must be accounted for when evaluating RLVR-based systems. For organisations deploying such models, monitoring Pass@k distributions beyond top-1 accuracy could reveal hidden diversity collapse.
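For monitoring, one option is to track a Pass@k curve per checkpoint rather than a single accuracy number. The sketch below uses the widely used unbiased estimator 1 - C(n-c, k)/C(n, k) for n samples with c correct; whether the paper uses this exact estimator is not stated in this summary. The checkpoint data is invented purely to illustrate the pattern.

```python
from math import comb
from typing import Dict, Iterable, List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples of which c are correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_curve(results: List[Tuple[int, int]],
                    ks: Iterable[int]) -> Dict[int, float]:
    """Average Pass@k over problems; results holds (n, c) per problem."""
    return {k: sum(pass_at_k(n, c, k) for n, c in results) / len(results)
            for k in ks}


# Invented example: (samples, successes) for four problems per checkpoint.
base_model = [(8, 2), (8, 2), (8, 0), (8, 1)]
after_rl = [(8, 8), (8, 8), (8, 0), (8, 0)]

print("base:    ", pass_at_k_curve(base_model, ks=(1, 8)))
print("after RL:", pass_at_k_curve(after_rl, ks=(1, 8)))
```

In this made-up case Pass@1 rises from about 0.16 to 0.5 while Pass@8 falls from 0.75 to 0.5, which is the signature that top-1 accuracy alone would hide.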
What to watch: Future research may test BBG in broader RL settings and assess its scalability to larger models and more complex reasoning tasks.