Spokes Optimizes Diverse Pretraining Data Selection for LLMs, Boosting Performance

Researchers introduce Spokes, a method that directly optimizes diversity in pretraining data selection for large language models. Using a probabilistic framework based on the G-Vendi score and exponentiated gradient descent, Spokes selects significantly more diverse subsets and improves downstream performance by up to 1.5 points over random sampling.

iGEN Editorial

June 16, 2026

Selecting the right pretraining data is critical for large language model (LLM) performance, but diversity is a set-level property that current approaches struggle to optimize directly. A new method called Spokes, detailed in a paper on arXiv, tackles this challenge with a probabilistic diversification framework based on the G-Vendi score and optimized via exponentiated gradient descent. The approach produces subsets that are substantially more diverse than random sampling and consistently outperforms existing methods on the FineWeb and DCLM datasets.

The Diversity Challenge in Data Selection

Diversity is known to improve performance under fixed data budgets by reducing redundancy and repetition. However, as the authors note in the arXiv paper, optimizing for diversity is “inherently challenging, as it is a set-level property that depends on interactions between data points.” Existing approaches rely on proxies or approximations, which often fail to ensure sufficiently diverse subsets. Spokes addresses this by directly optimizing diversity through a principled probabilistic framework.

Spokes: Direct Optimization via G-Vendi Score

Spokes computes the G-Vendi score on candidate subsets and uses exponentiated gradient descent to iteratively update selection weights. The method yields a substantial gain in diversity: on a 500k-sample subset, the G-Vendi score is 489 higher than with random sampling. This direct optimization allows Spokes to select data that captures a broader range of patterns, improving the model's ability to generalize.
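To make the idea concrete, here is a minimal sketch of exponentiated-gradient updates on selection weights, written as ascent because the goal is to maximize diversity. It is not the paper's implementation. It assumes a Vendi-style score (the exponential of the Shannon entropy of the eigenvalues of a weighted covariance matrix) computed over unit-norm per-example feature vectors; the article does not give the exact G-Vendi definition or the features Spokes uses. The names `vendi_score`, `entropy_grad`, `exponentiated_gradient`, the matrix `X`, and the step size `lr` are all hypothetical.

```python
import numpy as np

def vendi_score(X, w, eps=1e-12):
    """Vendi-style diversity: exp(entropy) of the weighted covariance spectrum.

    X: (n, d) array with unit-norm rows (assumed features, one row per example).
    w: (n,) selection weights on the probability simplex.
    """
    C = (X * w[:, None]).T @ X                  # d x d, trace 1 for unit-norm rows
    lam = np.clip(np.linalg.eigvalsh(C), eps, None)
    lam = lam / lam.sum()
    return float(np.exp(-np.sum(lam * np.log(lam))))

def entropy_grad(X, w, eps=1e-12):
    """Gradient of the eigenvalue entropy H = -tr(C log C) w.r.t. each weight."""
    C = (X * w[:, None]).T @ X
    lam, V = np.linalg.eigh(C)
    log_lam = np.log(np.clip(lam, eps, None))
    proj = X @ V                                # (x_i . v_k) for every i, k
    # dH/dw_i = -x_i^T log(C) x_i - 1 for unit-norm x_i
    return -(proj ** 2) @ log_lam - 1.0

def exponentiated_gradient(X, steps=200, lr=0.5):
    """Multiplicative weight updates that stay on the simplex by construction."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                     # start from uniform (random-like) weights
    for _ in range(steps):
        g = entropy_grad(X, w)
        w = w * np.exp(lr * (g - g.max()))      # subtract max for numerical stability
        w = w / w.sum()                         # renormalize onto the simplex
    return w

# Toy data: eight duplicates of one direction plus two rare directions.
X_toy = np.array([[1.0, 0.0, 0.0]] * 8 + [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
w_uniform = np.full(len(X_toy), 1.0 / len(X_toy))
w_opt = exponentiated_gradient(X_toy)
print("uniform score:  ", vendi_score(X_toy, w_uniform))
print("optimized score:", vendi_score(X_toy, w_opt))
```

On this toy input the optimized weights shift mass away from the eight duplicates and toward the two rare directions, so the score rises toward 3, the number of distinct directions. Turning such weights into an actual subset (for example by sampling in proportion to them) is a further assumption not described in the article.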

Benchmark Results: Consistent Gains Over Baselines

On the FineWeb and DCLM datasets, Spokes consistently outperforms random sampling and other baselines including semantic deduplication and quality filtering. The table below summarizes the average downstream performance improvements:

Method	Gain vs. random sampling (DCLM)	Gain vs. random sampling (FineWeb)
Spokes (diversity-only)	+0.4 points	+0.5 points
Spokes (quality + diversity)	+1.5 points	+1.4 points

According to the arXiv paper, “jointly optimizing for both quality and diversity yields the strongest results,” with Spokes outperforming “all baselines, including semantic deduplication and quality filtering.”

Implications for Enterprise AI Training

For enterprise technology leaders building or fine-tuning LLMs, efficient data selection directly reduces compute costs and improves model capabilities. The Spokes method offers a way to automatically select diverse, high-quality pretraining data without manual curation or simple heuristics. By achieving significant gains on standard benchmarks (up to +1.5 points on DCLM), Spokes demonstrates that investing in smarter data selection can yield measurable improvements in model accuracy. While the paper focuses on pretraining, similar diversity optimization could extend to fine-tuning data, making Spokes a potential tool for enterprise AI teams seeking better performance from limited data budgets.

The approach is methodologically sound and builds on existing diversity metrics, making it accessible to teams with machine learning expertise. As LLMs become central to enterprise applications like supply chain analytics and trade documentation, techniques that enhance training efficiency will be increasingly valuable.
