training data

3 stories

Artificial Intelligence #ai #artificial intelligence

Beijing Accuses US AI Firms of Using Chinese Models for Training

China's commerce ministry accused US artificial intelligence firms of using Chinese models to train their own AI systems through a process called distillation. The accusation follows a threat by US Treasury Secretary Scott Bessent to impose sanctions on China over alleged technology theft. China defended distillation as a widely used industry practice and vowed to take all necessary measures to safeguard its interests.

Jul 28, 2026 1 source

Technology

Artificial Intelligence #spokes #diverse pretraining

Spokes Optimizes Diverse Pretraining Data Selection for LLMs, Boosting Performance

Researchers introduce Spokes, a method that directly optimizes diversity in pretraining data selection for large language models. Using a probabilistic framework based on the G-Vendi score and exponentiated gradient descent, Spokes selects significantly more diverse subsets and improves downstream performance by up to 1.5 points over random sampling.

Jun 16, 2026 1 source

Technology

Artificial Intelligence #ai #training data

It’s a Race to Capture Real-World AI Training Data in India’s Unregulated Market

Home-services startup Pronto is piloting in-home video recordings to train physical AI, spotlighting a fast-growing but loosely regulated industry in India. Startups like Neocambrian AI and Humyn Labs collect first-person video from kitchens, factories, and warehouses to train robots and world models for robotics OEMs and defence firms. The practice raises privacy and consent concerns, and some factories have paused pilots after backlash.

Jun 15, 2026 1 source