Artificial Intelligence #data augmentation #language model
Data Augmentations Offer Path to Efficient Language Model Pretraining Under Data Constraints
As AI labs approach a data ceiling, where compute capacity grows faster than the supply of new high-quality text, researchers propose data augmentations that make multi-epoch training on a fixed corpus productive. Three categories (token-level noise, sequence permutations, and target offset prediction) are shown to delay overfitting and lower validation loss relative to standard autoregressive pretraining. Among individual methods, random token replacement reached the lowest minimum loss, and combining augmentations improved results further.
Jun 16, 2026 1 source
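
To make the three augmentation categories named in the summary concrete, here is a minimal Python sketch of one plausible form of each. The summary gives no implementation details, so everything here is an assumption made for illustration: the function names, the 10% replacement rate, the block-wise shuffle, and the choice to corrupt only the inputs while leaving targets clean. It should not be read as the researchers' actual method.

```python
import random


def replace_random_tokens(tokens, vocab_size, rate, rng):
    """Token-level noise (assumed form): with probability `rate`, swap a
    token for a uniformly random vocabulary id."""
    return [rng.randrange(vocab_size) if rng.random() < rate else t
            for t in tokens]


def permute_blocks(tokens, block_size, rng):
    """Sequence permutation (assumed form): split the sequence into
    fixed-size blocks and shuffle the order of the blocks."""
    blocks = [tokens[i:i + block_size]
              for i in range(0, len(tokens), block_size)]
    rng.shuffle(blocks)
    return [t for block in blocks for t in block]


def offset_pairs(tokens, offset):
    """Target offset prediction (assumed form): the target for position i
    is the token `offset` steps ahead. offset=1 is ordinary next-token
    prediction."""
    if offset < 1:
        raise ValueError("offset must be at least 1")
    return tokens[:-offset], tokens[offset:]


def augmented_example(tokens, vocab_size, rng, rate=0.1, offset=1):
    """Hypothetical helper: build one (inputs, targets) pair with noisy
    inputs and clean targets, so each epoch sees a different corruption
    of the same underlying text."""
    clean_inputs, targets = offset_pairs(tokens, offset)
    noisy_inputs = replace_random_tokens(clean_inputs, vocab_size, rate, rng)
    return noisy_inputs, targets


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = list(range(12))  # stand-in for a tokenized document
    for epoch in range(3):
        inputs, targets = augmented_example(tokens, vocab_size=100, rng=rng)
        print(epoch, inputs, targets)
```

Because the corruption is resampled on every call, repeated passes over a fixed corpus present the model with different inputs each time, which is the intuition behind using augmentation to delay overfitting in multi-epoch training.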