Selecting the right pretraining data is critical for large language model (LLM) performance, but diversity is a set-level property that current selection approaches struggle to optimize directly. A new method called Spokes, detailed in a paper on arXiv, tackles this challenge with a probabilistic diversification framework built on the G-Vendi score and optimized via exponentiated gradient descent. The approach produces subsets that are substantially more diverse than random sampling, and the paper reports that it consistently outperforms existing methods on the FineWeb and DCLM datasets.
The Diversity Challenge in Data Selection
Diversity is known to improve performance under fixed data budgets by reducing redundancy and repetition. However, as the authors note in the arXiv paper, optimizing for diversity is “inherently challenging, as it is a set-level property that depends on interactions between data points.” Existing approaches rely on proxies or approximations, which often fail to ensure sufficiently diverse subsets. Spokes addresses this by optimizing diversity directly through a principled probabilistic framework.
Spokes: Direct Optimization via G-Vendi Score
Spokes computes the G-Vendi score on candidate subsets and uses exponentiated gradient descent to iteratively update selection weights. The reported diversity gain is substantial: on a 500k-sample subset, the G-Vendi score rises by 489 relative to random sampling. Because diversity is optimized directly rather than through a proxy, Spokes can select data that covers a broader range of patterns, which in turn improves the model's ability to generalize.
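To make the idea of updating selection weights concrete, here is a minimal sketch and not the paper's implementation. It assumes each data point is represented by a unit-norm feature vector and uses the standard Vendi score (the exponential of the entropy of the eigenvalues of the weighted feature covariance) as a stand-in for G-Vendi, whose exact feature construction this article does not describe. The function names, step size, and iteration count are all hypothetical choices for illustration.

```python
import numpy as np


def vendi_score(features, weights):
    """Vendi-style diversity of a weighted set of unit-norm feature rows.

    Assumption: this is the generic Vendi score, used here as a stand-in
    for the G-Vendi score named in the article.
    """
    # Weighted covariance; its trace is 1 when rows are unit-norm
    # and the weights sum to 1.
    cov = (features * weights[:, None]).T @ features
    eig = np.linalg.eigvalsh(cov)
    eig = eig[eig > 1e-12]  # drop numerically zero eigenvalues
    return float(np.exp(-np.sum(eig * np.log(eig))))


def entropy_gradient(features, weights, eps=1e-12):
    """Gradient of the eigenvalue entropy with respect to each weight."""
    cov = (features * weights[:, None]).T @ features
    eigval, eigvec = np.linalg.eigh(cov)
    log_eig = np.log(np.clip(eigval, eps, None))
    proj = features @ eigvec  # coordinates of each row in the eigenbasis
    # d/dw_i of -tr(C log C) is -x_i^T (log C) x_i - 1 for unit-norm x_i.
    return -(proj ** 2) @ log_eig - 1.0


def exponentiated_gradient(features, steps=200, lr=0.5):
    """Multiplicative-weights ascent on the entropy over the simplex."""
    n = features.shape[0]
    w = np.full(n, 1.0 / n)  # start from uniform (random-sampling) weights
    for _ in range(steps):
        g = entropy_gradient(features, w)
        w = w * np.exp(lr * (g - g.max()))  # shift by max for stability
        w /= w.sum()  # renormalize so the weights stay a distribution
    return w


def sample_subset(weights, k, seed=0):
    """Draw k distinct indices with probability proportional to weights."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(weights), size=k, replace=False, p=weights)


# Toy data: eight copies of one direction plus two rarer directions.
X = np.vstack([np.tile([1.0, 0.0, 0.0], (8, 1)),
               [[0.0, 1.0, 0.0]],
               [[0.0, 0.0, 1.0]]])
uniform = np.full(10, 0.1)
learned = exponentiated_gradient(X)
print(vendi_score(X, uniform), vendi_score(X, learned))
```

On this toy input the learned weights shift probability mass away from the eight redundant points and toward the two rare ones, so the weighted score is higher than under uniform weights. A real pipeline would need a scalable approximation, since an eigendecomposition per step is only practical for small feature dimensions.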
Benchmark Results: Consistent Gains Over Baselines
On the FineWeb and DCLM datasets, Spokes consistently outperforms random sampling and other baselines including semantic deduplication and quality filtering. The table below summarizes the average downstream performance improvements:
| Method | Gain vs. random sampling (DCLM) | Gain vs. random sampling (FineWeb) |
|---|---|---|
| Spokes (diversity-only) | +0.4 points | +0.5 points |
| Spokes (quality + diversity) | +1.5 points | +1.4 points |
According to the arXiv paper, “jointly optimizing for both quality and diversity yields the strongest results,” with Spokes outperforming “all baselines, including semantic deduplication and quality filtering.”
Implications for Enterprise AI Training
For enterprise technology leaders building or fine-tuning LLMs, efficient data selection directly reduces compute costs and improves model capabilities. The Spokes method offers a way to automatically select diverse, high-quality pretraining data without relying on manual curation or simple heuristics. With average downstream gains of up to +1.5 points over random sampling on DCLM, Spokes demonstrates that investing in smarter data selection can yield measurable improvements in model accuracy. While the paper focuses on pretraining, similar diversity optimization could extend to fine-tuning data, making Spokes a potential tool for enterprise AI teams seeking better performance from limited data budgets.
The approach is methodologically sound and builds on existing diversity metrics, making it accessible to teams with machine learning expertise. As LLMs become central to enterprise applications like supply chain analytics and trade documentation, techniques that enhance training efficiency will be increasingly valuable.