Artificial Intelligence #spokes #diverse pretraining
Spokes Optimizes Diverse Pretraining Data Selection for LLMs, Boosting Performance
Researchers introduce Spokes, a method that directly optimizes diversity when selecting pretraining data for large language models. Using a probabilistic framework based on the G-Vendi score and exponentiated gradient descent, Spokes selects significantly more diverse subsets and improves downstream performance by up to 1.5 points over random sampling.
Jun 16, 2026
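
To make the two ingredients named in the summary more concrete, here is a minimal sketch of exponentiated gradient ascent on sampling probabilities with a Vendi-style diversity objective. It is not the Spokes implementation. The summary does not define the G-Vendi score, so the sketch assumes a plain Vendi-style score (the exponential of the entropy of the eigenvalues of a weighted feature matrix) over generic unit-norm feature vectors. The function names, step size, and toy data are all hypothetical.

```python
import numpy as np

def vendi_score(X, p):
    """Vendi-style diversity of rows of X (assumed unit-norm) under weights p:
    exp of the entropy of the eigenvalues of C = sum_i p_i x_i x_i^T."""
    C = (X * p[:, None]).T @ X
    lam = np.clip(np.linalg.eigvalsh(C), 1e-12, None)
    lam = lam / lam.sum()
    return float(np.exp(-np.sum(lam * np.log(lam))))

def entropy_gradient(X, p):
    """Gradient of the eigenvalue entropy with respect to each weight p_i.
    For unit-norm rows it equals -x_i^T log(C) x_i up to a constant that is
    the same for every i, which the normalization step below cancels."""
    C = (X * p[:, None]).T @ X
    lam, V = np.linalg.eigh(C)
    log_C = (V * np.log(np.clip(lam, 1e-12, None))) @ V.T
    return -np.einsum("ij,jk,ik->i", X, log_C, X)

def exponentiated_gradient(X, steps=100, lr=0.5):
    """Multiplicative-weights ascent that keeps p on the probability simplex."""
    n = X.shape[0]
    p = np.full(n, 1.0 / n)  # start from uniform (random-sampling) weights
    for _ in range(steps):
        g = entropy_gradient(X, p)
        p = p * np.exp(lr * (g - g.max()))  # shift by max for numerical stability
        p /= p.sum()
    return p

if __name__ == "__main__":
    # Toy data (hypothetical): eight copies of one direction, two rare ones.
    X = np.array([[1, 0, 0]] * 8 + [[0, 1, 0], [0, 0, 1]], dtype=float)
    uniform = np.full(len(X), 1.0 / len(X))
    p = exponentiated_gradient(X)
    print("uniform score:  ", vendi_score(X, uniform))
    print("optimized score:", vendi_score(X, p))
```

On this toy set, uniform weights over-represent the repeated direction, while the optimized weights shift probability mass toward the two rare directions, so the score moves toward 3, the number of distinct directions. The multiplicative update keeps every weight positive and the weights summing to one without an explicit projection step, which is the usual reason to choose exponentiated gradient for optimization over a probability distribution.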