A key limitation of behavior cloning is that it learns to mimic expert demonstrations without directly optimizing for task success. Flow Matching (FM), a powerful technique for behavior cloning in multimodal action spaces, suffers from this same shortcoming. A new research paper introduces FlowMPC, a framework that augments an FM policy with a learned world model to enable test-time planning, boosting performance on manipulation tasks.
The work, posted on arXiv by researchers Hamel and Chandon, builds on the TD-MPC2 model-based reinforcement learning algorithm. It investigates whether a learned world model can improve FM policies by enabling Model Predictive Path Integral (MPPI) planning over candidate action sequences proposed by the policy.
The Challenge with Flow Matching
Flow Matching is an imitation learning method that learns to generate actions by transforming a simple noise distribution into the distribution of expert actions. According to the paper, it has been effective for behavior cloning in complex, multimodal action spaces. However, because FM policies are not trained to maximize expected return, there is room to improve their performance at test time.
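To make the noise-to-action idea concrete, here is a minimal NumPy sketch of flow matching. It is not the paper's implementation. It assumes the commonly used straight-line path between a noise sample and an expert action, and it omits conditioning on observations. The names `cfm_training_pair`, `sample_action`, and `velocity_fn` are hypothetical and chosen for illustration only.

```python
import numpy as np

def cfm_training_pair(noise, expert_action, t):
    """Build one regression example for flow matching.

    Assumes a linear path x_t = (1 - t) * noise + t * expert_action,
    whose velocity along the path is (expert_action - noise).
    A network would be trained to predict that velocity from (x_t, t).
    """
    t = np.asarray(t, dtype=float)[..., None]  # broadcast over action dims
    x_t = (1.0 - t) * noise + t * expert_action
    target_velocity = expert_action - noise
    return x_t, target_velocity

def sample_action(velocity_fn, noise, num_steps=10):
    """Generate actions by Euler-integrating a velocity field from t=0 to t=1."""
    x = np.array(noise, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Usage with a hand-written velocity field standing in for a trained network.
# This field pushes every sample toward a single fixed "expert" action.
expert = np.array([0.3, -0.7])
stand_in_field = lambda x, t: (expert - x) / (1.0 - t)
noise = np.random.default_rng(0).standard_normal((4, 2))
actions = sample_action(stand_in_field, noise)
```

Nothing in this procedure refers to reward or task success. The velocity field is fit purely to reproduce expert actions, which is the gap that test-time planning aims to address.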
Introducing FlowMPC
FlowMPC combines an imitation-learned FM policy with a learned world model. The world model acts as a simulator, predicting the outcomes of potential action sequences. At test time, the FM policy proposes candidate action trajectories, and MPPI uses the world model to evaluate and select the best sequence. This approach allows the system to plan ahead without modifying the FM training objective.
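The planning step described above can be sketched roughly as follows. This is an illustrative simplification, not the authors' code. It performs a single MPPI-style weighting pass over candidates, whereas full MPPI typically iterates and resamples. The callables `dynamics`, `reward`, and `value` are hypothetical stand-ins for the components of a learned world model, and the discount `gamma` and `temperature` values are assumptions.

```python
import numpy as np

def plan_with_world_model(z0, candidates, dynamics, reward, value,
                          gamma=0.99, temperature=0.5):
    """Score policy-proposed action sequences by imagined return.

    z0:         (D,) initial (latent) state
    candidates: (N, H, A) action sequences, e.g. sampled from an FM policy
    dynamics:   (z, a) -> next z, batched over N   [hypothetical]
    reward:     (z, a) -> (N,) predicted rewards   [hypothetical]
    value:      (z)    -> (N,) terminal values     [hypothetical]
    """
    num_candidates, horizon, _ = candidates.shape
    z = np.repeat(z0[None, :], num_candidates, axis=0)
    returns = np.zeros(num_candidates)
    discount = 1.0
    for t in range(horizon):
        a = candidates[:, t]
        returns += discount * reward(z, a)  # imagined reward at step t
        z = dynamics(z, a)                  # imagined next state
        discount *= gamma
    returns += discount * value(z)          # bootstrap beyond the horizon

    # Softmax weights over returns; subtracting the max keeps exp() stable.
    weights = np.exp((returns - returns.max()) / temperature)
    weights /= weights.sum()

    # Return-weighted average of the candidates gives the planned sequence.
    plan = np.einsum("n,nha->ha", weights, candidates)
    return plan, weights, returns
```

In a receding-horizon loop, only the first action of `plan` would be executed before replanning from the next observation. Because the candidates come from the imitation policy, the search stays close to expert-like behavior while the world model supplies the return signal that the policy was never trained on.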
The framework builds directly on TD-MPC2 (Hansen et al., 2024), a state-of-the-art model-based reinforcement learning method. The authors note that the world model is used only during inference, leaving the FM training procedure unchanged.
Results on Manipulation Benchmarks
The researchers evaluated FlowMPC on two tasks from the ManiSkill manipulation benchmark (Tao et al., 2025): PickCube and PickSingleYCB. Across both tasks, adding the world model improved performance over the FM policy alone. The gains were especially clear in end-of-episode success rates, indicating that planning helps the policy complete tasks more reliably.
| Task | FM Policy Only | FlowMPC (FM + World Model) |
|---|---|---|
| PickCube | Lower success | Higher success (clear gains) |
| PickSingleYCB | Lower success | Higher success (clear gains) |

Note: Exact numerical results are not provided in the paper; performance improvement is reported qualitatively.
Implications for AI-Powered Systems
While the experiments focus on simulated robot manipulation, the underlying approach of augmenting imitation-learned policies with model-based planning has broader relevance. For enterprise systems that rely on behavior cloning, such as automated assembly or logistics handling, FlowMPC suggests that a world model can improve performance without retraining the policy, though the evidence so far is limited to two simulated tasks. Because the framework builds on TD-MPC2 and MPPI, it should be compatible with existing model-based reinforcement learning tools.
According to the paper, these results suggest that world-model-based planning can effectively complement flow-based imitation policies. The ability to improve policy performance at test time could reduce the need for extensive retraining when environments change—a valuable property for deployment in dynamic real-world settings.
The paper is available on arXiv under a Creative Commons license, with code and data expected to be released through associated links.