Topic
training data
Spokes Optimizes Diverse Pretraining Data Selection for LLMs, Boosting Performance
Researchers introduce Spokes, a method that directly optimizes diversity in pretraining data selection for large language models. Using a probabilistic framework based on the G-Vendi score and exponentiated gradient descent, Spokes selects significantly more diverse subsets and improves downstream performance by up to 1.5 points over random sampling.
It’s a Race to Capture Real-World AI Training Data in India’s Unregulated Market
Home-services startup Pronto pilots in-home video recordings to train physical AI, spotlighting a fast-growing but loosely regulated industry in India. Startups like Neocambrian AI and Humyn Labs collect first-person video from kitchens, factories, and warehouses to train robots and world models, catering to robotics OEMs and defence firms. The practice raises privacy and consent concerns, with some factories pausing pilots after backlash.