AC-ODM: Actor-Critic Online Data Mixing for Sample-Efficient LLM Pretraining

AC-ODM: Actor-Critic Online Data Mixing for Sample-Efficient LLM Pretraining – A New Reinforcement Learning Approach

Researchers introduce AC-ODM, an actor-critic online data mixing method that treats data composition as a reinforcement learning problem. On Pythia-1B, it reportedly reaches optimal validation perplexity in up to 66% fewer training steps, improves MMLU accuracy by 27.5% in relative terms, and delivers 2.23× higher HumanEval pass@1, with only a 0.4% per-step wall-clock increase and 2% additional memory overhead. The method supports proxy and non-proxy modes for flexible deployment.

iGEN Editorial

June 16, 2026

Optimizing the composition of pretraining data is a critical yet computationally expensive task for large language models (LLMs). Dynamic mixing strategies adapt data proportions during training, but they often give up either sample efficiency or computational efficiency to do so. A new method called Actor-Critic Online Data Mixing (AC-ODM), introduced by researchers Jing Ma, Chenhao Dang, and Mingjie Liao in a paper posted to arXiv on May 29, 2025, approaches this problem from a reinforcement learning perspective and reports significant gains in convergence speed and downstream accuracy with minimal overhead.

Reinforcement Learning for Data Mixing

According to the arXiv paper, AC-ODM formulates data mixing as a reinforcement learning problem with a parameterized policy. The authors theoretically prove that this policy acts as a dynamic linear surrogate that maximizes the constructive interference of gradients, thereby aligning training dynamics with optimal generalization. The method supports two operational modes:

Proxy mode: A policy learned on a small model is transferred to a larger target model, suitable for fixed, pre-prepared corpora.
Non-proxy mode: Direct end-to-end training from scratch without prior knowledge, offering structural flexibility.

This duality addresses a key limitation of prior methods: the inability to reconcile computational efficiency with sample efficiency and flexibility for diverse data sources.
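
To make the reinforcement learning framing concrete, here is a minimal Python sketch of an actor-critic loop over data domains. It is an illustration under stated assumptions, not the algorithm from the paper: the class name MixingActorCritic, the softmax policy with one logit per domain, the scalar running-average critic, the learning rates, and the simulated_reward function (a stand-in for whatever training signal the real method uses) are all hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class MixingActorCritic:
    """Toy actor-critic over K data domains (hypothetical, for illustration)."""

    def __init__(self, num_domains, actor_lr=0.1, critic_lr=0.1, seed=0):
        self.logits = [0.0] * num_domains  # actor: one logit per domain
        self.value = 0.0                   # critic: scalar reward baseline
        self.actor_lr = actor_lr
        self.critic_lr = critic_lr
        self.rng = random.Random(seed)

    def mixture(self):
        """Current sampling proportions over domains."""
        return softmax(self.logits)

    def sample_domain(self):
        """Pick the domain that supplies the next training batch."""
        probs = self.mixture()
        return self.rng.choices(range(len(probs)), weights=probs, k=1)[0]

    def update(self, domain, reward):
        """One policy-gradient step using the critic as a baseline."""
        advantage = reward - self.value
        probs = self.mixture()
        for i in range(len(self.logits)):
            # Gradient of log softmax w.r.t. logit i: 1[i == domain] - p_i
            grad_log = (1.0 if i == domain else 0.0) - probs[i]
            self.logits[i] += self.actor_lr * advantage * grad_log
        # Move the critic's estimate toward the observed reward.
        self.value += self.critic_lr * advantage
        return advantage


def simulated_reward(domain, rng):
    """ASSUMPTION: fake reward; a real run would derive it from training."""
    means = [0.1, 0.5, 0.9]  # made-up usefulness of each toy domain
    return means[domain] + rng.gauss(0.0, 0.05)


agent = MixingActorCritic(num_domains=3, seed=0)
env_rng = random.Random(1)
for _ in range(2000):
    d = agent.sample_domain()
    agent.update(d, simulated_reward(d, env_rng))

final_mixture = agent.mixture()
print([round(p, 3) for p in final_mixture])
```

In this toy setup the mixture drifts toward the domain with the highest simulated reward. In an actual pretraining run the reward would have to come from the model being trained, which is the part the paper specifies and this sketch does not.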

Performance Benchmarks

Empirical results on the Pythia-1B model demonstrate AC-ODM's effectiveness. The following table summarizes key comparisons against competitive baselines:

Metric	AC-ODM vs. Baselines	Details
Training steps to optimal validation perplexity	Up to 66% fewer steps	Reaches optimal perplexity faster than all baselines
MMLU accuracy	27.5% relative improvement	Outperforms prior dynamic mixing methods
HumanEval pass@1	2.23× higher	Code generation task benchmark
Per-step wall-clock increase	0.4%	Virtually negligible overhead
Additional memory overhead	2%	Minimal extra resource consumption

The paper reports that these gains come with “virtually negligible (0.4%) per-step wall-clock increase and only 2% additional memory overhead,” making AC-ODM practical for real-world deployment.
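
A quick back-of-the-envelope calculation shows how the step reduction and the per-step overhead combine. The Python sketch below uses only the figures reported above; the function names are hypothetical, and it assumes the best case (the full 66% step reduction) and that per-step cost is otherwise identical to the baseline.

```python
# Figures reported for AC-ODM on Pythia-1B (best case, "up to").
STEP_REDUCTION = 0.66       # fraction of training steps saved
PER_STEP_OVERHEAD = 0.004   # extra wall-clock time per step
MMLU_RELATIVE_GAIN = 0.275  # relative (not absolute) accuracy improvement


def relative_wall_clock(step_reduction, per_step_overhead):
    """Wall-clock time as a fraction of the baseline's time.

    ASSUMPTION: baseline and AC-ODM steps cost the same apart from the
    stated per-step overhead.
    """
    return (1.0 - step_reduction) * (1.0 + per_step_overhead)


def relative_gain_to_multiplier(relative_gain):
    """Convert a relative improvement into a multiplier on the baseline."""
    return 1.0 + relative_gain


ratio = relative_wall_clock(STEP_REDUCTION, PER_STEP_OVERHEAD)
mmlu_multiplier = relative_gain_to_multiplier(MMLU_RELATIVE_GAIN)

print(f"time vs. baseline: {ratio:.4f}")
print(f"net time saved:    {1.0 - ratio:.2%}")
print(f"MMLU multiplier:   {mmlu_multiplier:.3f}x")
```

Under these assumptions the overhead barely dents the saving: the run takes roughly 0.341 of the baseline's wall-clock time. The second helper is a reminder that a 27.5% relative MMLU improvement means about 1.275 times the baseline's accuracy, not a gain of 27.5 percentage points.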

Architectural Flexibility and Practical Impact

AC-ODM's two operational modes allow it to adapt to different training scenarios. The proxy mode is particularly valuable for organizations that have already curated large corpora and want to transfer a learned mixing policy to a larger model without retraining from scratch. The non-proxy mode, on the other hand, is ideal for end-to-end training on novel data distributions. Both modes maintain the theoretical guarantee of constructive gradient interference, which the authors identify as the core driver of sample efficiency.
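
One common way to read "constructive interference of gradients" is that the gradient from one data domain points in a similar direction to the gradients from the others, so a step on that domain also helps the rest. The Python sketch below illustrates that reading with inner products on toy vectors. The function names and the toy gradients are hypothetical, and the paper's exact formulation may differ.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def interference_scores(domain_grads):
    """For each domain i, sum the inner products <g_i, g_j> over j != i.

    A positive score means domain i's gradient tends to agree with the
    others (constructive); a negative score means it tends to conflict.
    """
    scores = []
    for i, g_i in enumerate(domain_grads):
        score = sum(
            dot(g_i, g_j) for j, g_j in enumerate(domain_grads) if j != i
        )
        scores.append(score)
    return scores


# ASSUMPTION: made-up 2-D gradients for three toy domains.
toy_grads = [
    [1.0, 0.0],
    [0.8, 0.6],
    [-1.0, 0.2],
]

scores = interference_scores(toy_grads)
best_domain = max(range(len(scores)), key=lambda i: scores[i])
print(scores, best_domain)
```

In this toy example the second domain has the most agreeable gradient, so a mixing policy that favors constructive interference would upweight it. Real gradients have millions of dimensions, so a practical method needs a cheaper signal than explicit pairwise products.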

Implications for Enterprise AI

For enterprise technology leaders, the primary takeaway is that AC-ODM offers a way to reduce the computational cost of LLM pretraining while also improving model quality. If the reported reduction of up to 66% in training steps holds at larger scale, it would translate into lower cloud compute expenses and faster time-to-market for custom LLMs, since the per-step overhead is small. The 27.5% relative MMLU improvement and 2.23× HumanEval gain indicate that the method does more than accelerate training: it also produces more capable models. The reported results are limited to the Pythia-1B model, but the reinforcement learning framework does not depend on a specific architecture, which suggests it could apply to other transformer-based models.
