The rapid growth in AI compute capacity is pushing against a fundamental limit: the supply of high-quality text data for training large language models. According to a new paper on arXiv, AI labs are approaching a "data ceiling": compute is growing faster than new text is being written, which pushes training into a data-constrained, compute-abundant regime. In this setting, standard autoregressive (AR) pretraining, the dominant approach for models like GPT, overfits severely when trained for multiple epochs on the same fixed corpus.
The Data-Constrained Regime
Standard AR pretraining, which predicts the next token in a sequence, reaches its minimum validation loss early in multi-epoch training, after which validation loss steadily worsens, the researchers report. To address this, the team (Michael K. Chen, Xikun Zhang, and Zhen Wang) investigated data augmentation as a regularizer. Their goal was to enable productive training for hundreds of epochs on the same data without overfitting.
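The overfitting pattern described above can be read directly off a per-epoch validation curve: the epoch with the lowest validation loss marks the point after which further passes over the data hurt. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code. The function name `overfitting_onset` is hypothetical, and the numbers in `toy_curve` are invented for illustration and are not results from the paper.

```python
def overfitting_onset(val_losses):
    """Return (epoch, loss) at the minimum of a per-epoch validation curve.

    Epochs are 1-indexed; ties resolve to the earliest epoch.
    """
    if not val_losses:
        raise ValueError("need at least one validation loss")
    best = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best + 1, val_losses[best]


# Made-up numbers for illustration only; they are not results from the paper.
toy_curve = [3.9, 3.4, 3.2, 3.3, 3.6, 4.0]
epoch, loss = overfitting_onset(toy_curve)
print(epoch, loss)  # prints: 3 3.2
```

Under this framing, a regularizer helps if it moves the returned epoch later, lowers the returned loss, or both.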
Three Augmentation Categories
The study introduces three orthogonal categories of data augmentation for AR pretraining:
- Token-level noise: Masking tokens (replacing them with a special [MASK] token) or replacing them with random tokens.
- Sequence permutations: Techniques such as right-to-left prediction (predicting tokens in reverse order) and Fill-in-the-Middle (predicting a span taken from the middle of a sequence, given the text on both sides of it).
- Target offset prediction: Instead of predicting the next token (x_{t+1}), the model predicts tokens at a fixed offset i > 1 (x_{t+i}).
Each category is designed to increase the diversity of training signals while preserving the underlying linguistic structure.
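The three categories can be sketched as small transformations on a list of token ids. This is an illustrative sketch under stated assumptions, not the paper's released implementation: the special-token ids, the function names, the noise probability `p`, and the prefix-suffix-middle layout used for Fill-in-the-Middle (a common formulation of that technique) are all assumptions. Whether the prediction targets stay clean when the inputs are noised is a training choice this sketch does not settle.

```python
import random

# Hypothetical special-token ids; a real tokenizer defines its own.
MASK_ID, PRE_ID, SUF_ID, MID_ID = 0, 1, 2, 3
N_SPECIAL = 4  # ids below this are reserved for special tokens


def token_noise(tokens, vocab_size, p=0.15, mode="mask", rng=random):
    """Token-level noise: corrupt each token independently with probability p."""
    noisy = []
    for tok in tokens:
        if rng.random() < p:
            if mode == "mask":
                noisy.append(MASK_ID)
            else:
                # Random replacement; may occasionally redraw the original id.
                noisy.append(rng.randrange(N_SPECIAL, vocab_size))
        else:
            noisy.append(tok)
    return noisy


def right_to_left(tokens):
    """Sequence permutation: reverse the sequence so prediction runs right to left."""
    return tokens[::-1]


def fill_in_the_middle(tokens, start, end):
    """Sequence permutation: move tokens[start:end] to the end, after sentinels."""
    prefix, middle, suffix = tokens[:start], tokens[start:end], tokens[end:]
    return [PRE_ID] + prefix + [SUF_ID] + suffix + [MID_ID] + middle


def offset_pairs(tokens, offset=1):
    """Target offset: pair the token at position t with the target at t + offset."""
    if offset < 1:
        raise ValueError("offset must be at least 1")
    return list(zip(tokens[:-offset], tokens[offset:]))


tokens = [11, 12, 13, 14, 15, 16]
print(right_to_left(tokens))             # [16, 15, 14, 13, 12, 11]
print(fill_in_the_middle(tokens, 2, 4))  # [1, 11, 12, 2, 15, 16, 3, 13, 14]
print(offset_pairs(tokens, offset=2))    # [(11, 13), (12, 14), (13, 15), (14, 16)]
```

With `offset=1`, `offset_pairs` reduces to ordinary next-token prediction, which makes the offset variant easy to compare against the baseline.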
Experimental Findings
Through systematic ablations, the researchers found that individual augmentations delay the onset of overfitting and achieve lower validation loss compared to a baseline with no augmentation. Among single methods, random token replacement yielded the best minimum validation loss. Crucially, combining augmentations across categories further lowered the minimum validation loss, suggesting synergistic effects.
| Augmentation method | Category | Reported effect |
|---|---|---|
| No augmentation (baseline) | — | Overfits early, high validation loss |
| Token masking | Token-level noise | Delays overfitting, lower loss |
| Random token replacement | Token-level noise | Best minimum loss among individual methods |
| Right-to-left prediction | Sequence permutation | Improves over baseline |
| Fill-in-the-Middle | Sequence permutation | Improves over baseline |
| Target offset prediction | Offset prediction | Improves over baseline |
| Combined augmentations | Across categories | Further lowers minimum validation loss |
The authors conclude that data augmentations mitigate AR pretraining's data inefficiency and offer a promising solution for the data-constrained regime. They have released all code and data.