Reinforcement learning (RL) with massively parallel simulations has become a standard way to develop robust, deployable policies, but most existing methods still rely on simple Gaussian policy parameterizations. Diffusion models offer a more expressive policy class, yet are typically designed for offline or off-policy training. New research asks whether diffusion policies can be trained effectively in the massively parallel, on-policy regime, and proposes a method called Trust-Region Diffusion Policies (TruDi) to do so.
The Challenge of Massively Parallel On-Policy RL
In on-policy reinforcement learning, the policy is updated using only data collected by the current policy. When combined with massively parallel simulations, where thousands of environments run simultaneously, the data distribution changes quickly across updates. This makes stable training with complex policy classes like diffusion models particularly difficult. Standard diffusion-based RL methods avoid the problem by using offline or off-policy training, which reuses past data and decouples data collection from policy updates. On-policy methods are typically less sample-efficient than off-policy ones, but parallel simulators make fresh data cheap and training fast in wall-clock terms, which is why they are widely used in robotics and simulation-based control.
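The data-collection pattern described above can be sketched in a few lines of NumPy. This is a toy illustration only: the linear dynamics, the reward, the linear-Gaussian policy, and the function name `collect_on_policy_batch` are all assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def collect_on_policy_batch(policy_weights, num_envs, horizon, rng, noise_std=0.1):
    """Roll out the current policy in `num_envs` toy environments in lockstep."""
    state_dim = policy_weights.shape[0]
    states = rng.standard_normal((num_envs, state_dim))
    all_states, all_actions, all_rewards = [], [], []
    for _ in range(horizon):
        # Every environment acts with the same, current policy parameters.
        noise = noise_std * rng.standard_normal((num_envs, state_dim))
        actions = states @ policy_weights + noise
        next_states = states + 0.1 * actions         # toy linear dynamics (assumption)
        rewards = -np.sum(next_states ** 2, axis=1)  # toy reward: stay near the origin
        all_states.append(states)
        all_actions.append(actions)
        all_rewards.append(rewards)
        states = next_states
    # Shapes: (horizon, num_envs, state_dim) for states/actions, (horizon, num_envs) for rewards.
    return np.stack(all_states), np.stack(all_actions), np.stack(all_rewards)

rng = np.random.default_rng(0)
weights = -0.5 * np.eye(2)
states, actions, rewards = collect_on_policy_batch(weights, num_envs=4096, horizon=8, rng=rng)
print(states.shape, rewards.shape)  # (8, 4096, 2) (8, 4096)
```

A single call yields `num_envs * horizon` transitions, all generated by the current parameters. In a strictly on-policy loop this batch is used for one round of updates and then discarded, so each update both consumes a large amount of fresh data and shifts the distribution of the next batch.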
Introducing TruDi: Trust-Region Diffusion Policies
To address the stability challenge, the researchers introduce TruDi, which adds a trust-region optimization rule enforcing a Kullback-Leibler (KL) divergence constraint over the entire diffusion trajectory. This constraint keeps the updated policy from deviating too far from the previous one, preventing the instability that often plagues on-policy training with complex models. With it, diffusion policies, which produce an action by iteratively denoising a sample of random noise, can for the first time be trained on-policy in massively parallel simulations.
'TruDi addresses this by integrating a trust-region optimization rule to enforce a KL-divergence constraint over the entire diffusion trajectory.'
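One way to make a KL constraint over a whole diffusion trajectory concrete is the sketch below. It assumes each denoising step is Gaussian with a fixed per-step standard deviation shared by the old and new policies; under that assumption the KL between the two denoising chains is the sum, over steps, of the squared difference in predicted means divided by twice the step variance. The linear denoiser, the function names, and the backtracking rule are hypothetical choices for illustration and should not be read as TruDi's actual update.

```python
import numpy as np

def denoise_mean(weights, action, state, step):
    """Hypothetical linear denoiser: predicted mean of the next, less noisy action."""
    step_col = np.full((action.shape[0], 1), float(step))
    features = np.concatenate([action, state, step_col], axis=1)
    return features @ weights

def trajectory_kl(old_w, new_w, states, sigmas, rng):
    """Monte Carlo estimate of KL(old chain || new chain), sampling from the old policy."""
    batch, act_dim = states.shape[0], old_w.shape[1]
    action = rng.standard_normal((batch, act_dim))  # initial noise, same law for both policies
    kl = np.zeros(batch)
    for step in range(len(sigmas) - 1, -1, -1):
        mu_old = denoise_mean(old_w, action, states, step)
        mu_new = denoise_mean(new_w, action, states, step)
        # KL between two Gaussians with equal covariance sigma^2 * I.
        kl += np.sum((mu_old - mu_new) ** 2, axis=1) / (2.0 * sigmas[step] ** 2)
        # Continue the chain under the old policy.
        action = mu_old + sigmas[step] * rng.standard_normal((batch, act_dim))
    return float(kl.mean())

def trust_region_step(old_w, proposed_w, states, sigmas, max_kl, seed=0):
    """Shrink the proposed update by backtracking until the trajectory KL fits the budget."""
    alpha = 1.0
    for _ in range(30):
        candidate = old_w + alpha * (proposed_w - old_w)
        kl = trajectory_kl(old_w, candidate, states, sigmas, np.random.default_rng(seed))
        if kl <= max_kl:
            return candidate, kl
        alpha *= 0.5
    return old_w, 0.0  # fall back to no update if nothing fits

rng = np.random.default_rng(1)
states = rng.standard_normal((64, 3))            # 64 parallel states, 3-D each
sigmas = np.array([0.1, 0.2, 0.4])               # per-step noise levels (assumption)
old_w = 0.1 * rng.standard_normal((2 + 3 + 1, 2))
proposed_w = old_w + 0.5 * rng.standard_normal(old_w.shape)
new_w, kl = trust_region_step(old_w, proposed_w, states, sigmas, max_kl=0.01)
print(f"accepted trajectory KL: {kl:.5f}")
```

Backtracking is used here only because it is simple to verify. The point of the sketch is that the constraint is measured over every denoising step rather than on the final action distribution, whose density is generally not available in closed form for a diffusion policy.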
Empirical Results Across 73 Tasks
To validate TruDi, the researchers evaluated it on four massively parallel RL benchmarks comprising 73 tasks in total. The tasks span standard locomotion and manipulation problems as well as more complex humanoid control. The paper reports that TruDi 'consistently outperforms or is on-par with strong baselines on standard tasks and achieves clear gains on more challenging humanoid control tasks,' establishing a strong new baseline for massively parallel on-policy RL.
| Aspect | Description |
|---|---|
| Methods compared | TruDi vs. strong baselines (e.g., Gaussian policies) |
| Number of benchmarks | 4 massively parallel RL benchmarks |
| Number of tasks | 73 tasks total |
| Performance on standard tasks | Consistently outperforms or on par |
| Performance on humanoid tasks | Clear gains |
Implications for Enterprise AI and Robotics
For technology leaders evaluating AI for automation, this research points to a path toward more expressive and capable policies for complex control tasks. The current work is limited to simulation, but the ability to train diffusion policies in massively parallel settings could eventually translate to more dexterous robot control in warehouse, manufacturing, or logistics environments. The key innovation, a trust-region constraint over diffusion trajectories, may also be adaptable to other domains where stable on-policy training of expressive models is needed. As enterprises seek to automate increasingly complex physical tasks, advances like TruDi represent a step toward more robust and performant AI systems. The method leaves room for further research and could be combined with existing RL frameworks to extend what autonomous systems can achieve in the physical world.