Artificial Intelligence #ac-odm #actor-critic
AC-ODM: Actor-Critic Online Data Mixing for Sample-Efficient LLM Pretraining – A New Reinforcement Learning Approach
Researchers introduce AC-ODM, an actor-critic online data mixing method that treats data composition as a reinforcement learning problem. On Pythia-1B, it reaches optimal perplexity in up to 66% fewer training steps, improves MMLU accuracy by 27.5% in relative terms, and attains 2.23× higher HumanEval pass@1, while adding only 0.4% to per-step wall-clock time and 2% to memory use. The method supports both proxy and non-proxy modes for flexible deployment.
Jun 16, 2026