Trust-Region Diffusion Policies Enable Expressive AI for Complex Control Tasks

Researchers introduce Trust-Region Diffusion Policies (TruDi), a method that enables diffusion models to be used in massively parallel on-policy reinforcement learning. By enforcing a KL-divergence constraint over the entire diffusion trajectory, TruDi achieves stable training and outperforms strong baselines across 73 diverse tasks, showing particular gains on challenging humanoid control problems.

iGEN Editorial

June 16, 2026


Reinforcement learning (RL) with massively parallel simulations has become a standard way to develop robust, deployable policies, but most existing methods still rely on simple Gaussian policy parameterizations. Diffusion models offer a more expressive policy class, yet are typically designed for offline or off-policy training. New research asks whether diffusion policies can be trained effectively in the massively parallel, on-policy regime, and proposes a method for doing so called Trust-Region Diffusion Policies (TruDi).

The Challenge of Massively Parallel On-Policy RL

In on-policy reinforcement learning, the policy is updated using only data collected by the current policy. When combined with massively parallel simulation, in which thousands of environments run simultaneously, each update consumes a large fresh batch and the data distribution shifts quickly from one update to the next. This makes stable training with complex policy classes like diffusion models particularly difficult. Standard diffusion-based RL methods avoid the problem by training offline or off-policy, which reuses past data and decouples data collection from policy updates. On-policy methods are typically less sample-efficient than off-policy ones, but when simulation is cheap and parallel they train quickly in wall-clock time, which is one reason they are widely used in robotics and simulation-based control.
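To make the data flow concrete, here is a minimal sketch of on-policy collection across many environments stepped in lockstep. It is an illustration only, not code from the research: the toy dynamics, the reward, and the names `collect_rollout` and `gaussian_policy` are hypothetical, and a real setup would use a GPU-based simulator rather than NumPy arrays.

```python
import numpy as np

def gaussian_policy(obs, rng, act_dim=2, std=0.3):
    """Hypothetical Gaussian policy: a squashed mean plus fixed-scale noise."""
    mean = np.tanh(obs[:, :act_dim])
    return mean + std * rng.standard_normal(mean.shape)

def collect_rollout(policy, n_envs=4096, horizon=16, obs_dim=4, act_dim=2, seed=0):
    """Step n_envs toy environments together, acting with the current policy only."""
    rng = np.random.default_rng(seed)
    obs = rng.standard_normal((n_envs, obs_dim))
    batch = {"obs": [], "act": [], "rew": []}
    for _ in range(horizon):
        act = policy(obs, rng)                 # one batched call covers every environment
        rew = -np.sum(act ** 2, axis=1)        # toy reward: penalize large actions
        batch["obs"].append(obs)
        batch["act"].append(act)
        batch["rew"].append(rew)
        nxt = 0.9 * obs                        # toy dynamics: decay the state ...
        nxt[:, :act_dim] += 0.1 * act          # ... and nudge it with the action
        obs = nxt
    # Stack to arrays shaped (horizon, n_envs, ...)
    return {k: np.stack(v) for k, v in batch.items()}

batch = collect_rollout(gaussian_policy)
print({k: v.shape for k, v in batch.items()})
```

In an on-policy loop this batch would be used for one round of updates and then discarded, so the next batch always comes from the updated policy. That is the source of the rapid distribution shift described above.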

Introducing TruDi: Trust-Region Diffusion Policies

The researchers introduce TruDi to address this stability challenge. TruDi integrates a trust-region optimization rule that enforces a Kullback-Leibler (KL) divergence constraint over the entire diffusion trajectory, that is, over every denoising step rather than only the final action. The constraint keeps the updated policy from deviating too far from the previous one, preventing the instability that often plagues on-policy training with complex models. With it, diffusion policies, which iteratively denoise random noise into a sample from a target distribution, can be trained on-policy in massively parallel simulations.

'TruDi addresses this by integrating a trust-region optimization rule to enforce a KL-divergence constraint over the entire diffusion trajectory.'
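The idea of constraining the whole denoising chain can be illustrated with a small numerical sketch. It is not the paper's implementation. It assumes both policies run a Markov denoising chain from the same standard-normal starting noise, with Gaussian transitions that share a fixed standard deviation at each step; under those assumptions the KL divergence between the two chains is the sum of per-step Gaussian KL terms. The function names, the toy linear denoisers, and the threshold `epsilon` are all hypothetical.

```python
import numpy as np

def trajectory_kl(mean_new, mean_old, sigmas, n_samples=4096, dim=2, seed=0):
    """Monte Carlo estimate of KL(new chain || old chain) over a full denoising trajectory.

    mean_new, mean_old: callables (x, k) -> mean of the Gaussian transition at step k.
    sigmas: per-step standard deviations, shared by both chains (an assumption).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, dim))   # both chains start from the same N(0, I)
    total = np.zeros(n_samples)
    for k, sigma in enumerate(sigmas):
        mu_new = mean_new(x, k)
        mu_old = mean_old(x, k)
        # Closed-form KL between two Gaussians with equal covariance sigma^2 * I
        total += np.sum((mu_new - mu_old) ** 2, axis=1) / (2.0 * sigma ** 2)
        # Advance along the NEW chain, since the expectation is under the new policy
        x = mu_new + sigma * rng.standard_normal(x.shape)
    return float(total.mean())

def within_trust_region(kl, epsilon):
    """Accept an update only if the trajectory-level KL stays under the bound."""
    return kl <= epsilon

# Toy denoisers: the "new" policy shifts every step's mean by a small constant.
def old_mean(x, k):
    return 0.9 * x

def new_mean(x, k):
    return 0.9 * x + 0.05

sigmas = [0.5, 0.3, 0.1]
kl = trajectory_kl(new_mean, old_mean, sigmas)
print(kl, within_trust_region(kl, epsilon=0.5))
```

Note how the same small shift in the mean costs far more KL at the low-noise steps, because each term is divided by that step's variance. A bound placed only on the final action would not see these per-step contributions. How TruDi actually enforces its bound (for example by projection or by a penalty) is not described in this article.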

Empirical Results Across 73 Tasks

To validate TruDi, the researchers evaluated it on four massively parallel RL benchmarks comprising a total of 73 tasks. The tasks include standard locomotion and manipulation problems as well as more complex humanoid control. The paper notes that TruDi 'consistently outperforms or is on-par with strong baselines on standard tasks and achieves clear gains on more challenging humanoid control tasks,' which the authors present as establishing a strong new baseline for massively parallel on-policy RL.

Aspect	Description
Methods compared	TruDi vs strong baselines (Gaussian policies, etc.)
Number of benchmarks	4 massively parallel RL benchmarks
Number of tasks	73 tasks total
Performance on standard tasks	Consistently outperforms or on par
Performance on humanoid tasks	Clear gains

Implications for Enterprise AI and Robotics

For technology leaders evaluating AI for automation, this research demonstrates a path to more expressive and capable policies for complex control tasks. The current work is in simulation, but the ability to train diffusion policies in massively parallel settings could translate to more dexterous robot control in warehouse, manufacturing, or logistics environments. The key innovation, a trust-region constraint over diffusion trajectories, may also be adapted to other domains where stable on-policy training of expressive models is needed. As enterprises seek to automate increasingly complex physical tasks, advances like TruDi represent a step toward more robust and performant AI systems, and the approach could be combined with existing RL frameworks in future work.

