
Spokes Optimizes Diverse Pretraining Data Selection for LLMs, Boosting Performance

Researchers introduce Spokes, a method that directly optimizes diversity in pretraining data selection for large language models. Using a probabilistic framework based on the G-Vendi score and exponentiated gradient descent, Spokes selects substantially more diverse subsets and improves downstream performance by up to 1.5 points over random sampling.

iGEN Editorial
June 16, 2026

Selecting the right pretraining data is critical for large language model (LLM) performance, but diversity is a set-level property that current approaches struggle to optimize directly. A new method called Spokes, detailed in a paper on arXiv, tackles this challenge with a probabilistic diversification framework based on the G-Vendi score and optimized via exponentiated gradient descent. The approach produces subsets that are substantially more diverse than random sampling and consistently outperforms existing methods on the FineWeb and DCLM datasets.

The Diversity Challenge in Data Selection

Diversity is known to improve performance under fixed data budgets by reducing redundancy and repetition. However, as the authors note in the arXiv paper, optimizing for diversity is “inherently challenging, as it is a set-level property that depends on interactions between data points.” Existing approaches rely on proxies or approximations, which often fail to ensure sufficiently diverse subsets. Spokes addresses this by directly optimizing diversity through a principled probabilistic framework.
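
To make the "set-level" point concrete, here is a minimal Python sketch. It assumes the plain Vendi score (the exponential of the Shannon entropy of the eigenvalues of the normalized similarity matrix) on toy unit vectors. That is a stand-in, not the paper's G-Vendi score, whose exact definition is not given in this article. The function name and the toy vectors are illustrative only.

```python
import numpy as np

def vendi_score(X):
    """Plain Vendi score: exp of the Shannon entropy of the eigenvalues of K / n."""
    X = np.asarray(X, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize each row
    K = X @ X.T                                        # cosine-similarity kernel
    lam = np.linalg.eigvalsh(K / len(X))
    lam = lam[lam > 1e-12]                             # drop numerically zero eigenvalues
    return float(np.exp(-np.sum(lam * np.log(lam))))

# Three toy "documents" represented as orthogonal unit vectors (hypothetical data).
e1, e2, e3 = np.eye(3)

# The same point e1 changes the score differently depending on the set it joins.
novel = vendi_score([e2, e3, e1]) - vendi_score([e2, e3])
redundant = vendi_score([e1, e2, e1]) - vendi_score([e1, e2])

print(f"e1 added to {{e2, e3}}: {novel:+.3f}")
print(f"e1 added to {{e1, e2}}: {redundant:+.3f}")
```

In this toy setting the same point raises the score when it is new to the set and lowers it when it duplicates an existing member, so no fixed per-point score can capture its value.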

Spokes: Direct Optimization via G-Vendi Score

Spokes computes the G-Vendi score on candidate subsets and uses exponentiated gradient descent to iteratively update selection weights. The gain in diversity is substantial: on a 500k-sample subset, the G-Vendi score is 489 higher than with random sampling. This direct optimization allows Spokes to select data that captures a broader range of patterns, improving the model's ability to generalize.
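
The general idea of updating selection weights with exponentiated gradient steps can be sketched as follows. This is a rough illustration under stated assumptions, not the Spokes implementation: it uses a weighted Vendi-style score on generic unit-norm feature vectors (the article does not specify how G-Vendi features are built), and the function names, step size, and toy data are all hypothetical. The sketch maximizes the entropy behind the score, which is equivalent to exponentiated gradient descent on its negative.

```python
import numpy as np

def weighted_vendi(X, p, eps=1e-12):
    """exp(entropy) of the eigenvalues of the weighted covariance sum_i p_i x_i x_i^T."""
    C = (X * p[:, None]).T @ X
    lam = np.linalg.eigvalsh(C)
    lam = lam[lam > eps]
    return float(np.exp(-np.sum(lam * np.log(lam))))

def entropy_gradient(X, p, eps=1e-12):
    """Gradient of the entropy -tr(C log C) with respect to each weight p_i."""
    C = (X * p[:, None]).T @ X
    lam, V = np.linalg.eigh(C)
    log_lam = np.log(np.clip(lam, eps, None))
    log_C = (V * log_lam) @ V.T                  # matrix logarithm via eigendecomposition
    return -np.einsum("ij,jk,ik->i", X, log_C, X) - 1.0

def exponentiated_gradient_ascent(X, steps=200, lr=0.5):
    """Multiplicative weight updates that keep p on the probability simplex."""
    p = np.full(X.shape[0], 1.0 / X.shape[0])    # start from uniform weights
    for _ in range(steps):
        logits = np.log(p) + lr * entropy_gradient(X, p)
        logits -= logits.max()                   # stabilize the exponentials
        p = np.exp(logits)
        p /= p.sum()                             # renormalize onto the simplex
    return p

# Toy pool (hypothetical): eight copies of one direction, two rare directions.
X = np.array([[1, 0, 0]] * 8 + [[0, 1, 0], [0, 0, 1]], dtype=float)
uniform = np.full(len(X), 1.0 / len(X))
p = exponentiated_gradient_ascent(X)

print("uniform weights  :", round(weighted_vendi(X, uniform), 3))
print("optimized weights:", round(weighted_vendi(X, p), 3))
```

On this toy pool the optimized weights shift probability mass from the duplicated direction toward the rare ones. A subset could then be drawn by sampling indices according to `p`, though how Spokes turns weights into a final selection is not described in this article.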

Benchmark Results: Consistent Gains Over Baselines

On the FineWeb and DCLM datasets, Spokes consistently outperforms random sampling and other baselines including semantic deduplication and quality filtering. The table below summarizes the average downstream performance improvements:

Method | Gain vs. random sampling (DCLM) | Gain vs. random sampling (FineWeb)
Spokes (diversity-only) | +0.4 points | +0.5 points
Spokes (quality + diversity) | +1.5 points | +1.4 points

According to the arXiv paper, “jointly optimizing for both quality and diversity yields the strongest results,” with Spokes outperforming “all baselines, including semantic deduplication and quality filtering.”

Implications for Enterprise AI Training

For enterprise technology leaders building or fine-tuning LLMs, efficient data selection directly reduces compute costs and improves model capabilities. The Spokes method offers a way to automatically select diverse, high-quality pretraining data without manual curation or simple heuristics. By achieving significant gains on standard benchmarks (up to +1.5 points on DCLM), Spokes demonstrates that investing in smarter data selection can yield measurable improvements in model accuracy. While the paper focuses on pretraining, similar diversity optimization could extend to fine-tuning data, making Spokes a potential tool for enterprise AI teams seeking better performance from limited data budgets.

The approach is methodologically sound and builds on existing diversity metrics, making it accessible to teams with machine learning expertise. As LLMs become central to enterprise applications like supply chain analytics and trade documentation, techniques that enhance training efficiency will be increasingly valuable.

