Adapting a vision-language-action (VLA) model to a new task typically requires collecting task-specific teleoperated demonstrations and fine-tuning the model on them. This process is costly in both data collection and compute. A new paper on arXiv, titled "Retrieve, Don't Retrain: Extending Vision Language Action Models to New Tasks at Test Time," proposes replacing this per-task fine-tuning with retrieval, so that adding a task no longer requires a parameter update.
The Retrieval Approach
The authors (Jeongeun Park, Juhan, Taekyung Kim, Sungjoon Choi, Dongyoon Han, and Sangdoo Yun) introduce a retrieval-augmented policy that is trained once on paired demonstrations from two embodiments: the target embodiment, which supplies the query side, and a cheaper embodiment, such as human-hand video, which supplies the pool side. After training, the policy is frozen. New tasks are added at deployment by appending pool-side demonstrations to a retrieval pool. The frozen policy conditions on retrieved trajectories at every control step, so new tasks are absorbed by indexing data rather than by updating parameters.
"Fine-tuning is needed only to take on a new, unseen embodiment, not for each new task."
This distinction is crucial: enterprises deploying robotic or automation systems can incrementally add new tasks without retraining their models, as long as the embodiment (robot hardware) remains the same. Only when the physical robot changes is fine-tuning required.
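The "index data instead of updating parameters" idea can be illustrated with a small sketch. This is not the authors' implementation: the class `RetrievalPool`, the functions `control_step` and `frozen_policy`, the use of cosine similarity, and the toy 3-dimensional embeddings are all assumptions made for illustration. The paper's actual encoders, retrieval metric, and policy interface are not described in this article.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Frame = Tuple[str, int, int]  # (task name, demonstration id, time step)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class RetrievalPool:
    """Hypothetical in-memory index of pool-side demonstration frames."""

    def __init__(self) -> None:
        self.keys: List[Sequence[float]] = []  # one embedding per stored frame
        self.frames: List[Frame] = []          # metadata aligned with self.keys

    def add_demonstration(self, task: str, demo_id: int,
                          embeddings: Sequence[Sequence[float]]) -> None:
        """Append one demonstration. No model parameter is touched."""
        for step, emb in enumerate(embeddings):
            self.keys.append(emb)
            self.frames.append((task, demo_id, step))

    def retrieve(self, query: Sequence[float], k: int = 1) -> List[Frame]:
        """Return metadata for the k stored frames most similar to the query."""
        ranked = sorted(range(len(self.keys)),
                        key=lambda i: cosine(query, self.keys[i]),
                        reverse=True)
        return [self.frames[i] for i in ranked[:k]]


def frozen_policy(obs_embedding: Sequence[float],
                  retrieved: List[Frame]) -> Optional[str]:
    # Placeholder for the frozen network (assumption): it only reports
    # which task its retrieved conditioning input came from.
    return retrieved[0][0] if retrieved else None


def control_step(pool: RetrievalPool,
                 policy: Callable[[Sequence[float], List[Frame]], Optional[str]],
                 obs_embedding: Sequence[float], k: int = 1) -> Optional[str]:
    """One control step: retrieve first, then call the unchanged policy."""
    return policy(obs_embedding, pool.retrieve(obs_embedding, k))


pool = RetrievalPool()
pool.add_demonstration("push", 0, [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
# Adding a new task is an append to the index; frozen_policy is untouched.
pool.add_demonstration("stack", 0, [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]])
print(control_step(pool, frozen_policy, [0.0, 0.0, 1.0]))  # prints "stack"
```

In this toy setup, the only thing that changes when the "stack" task is introduced is the contents of the pool. A real system would presumably replace the linear scan with an approximate nearest-neighbor index and the placeholder with the trained network, but the division of labor would be the same.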
Cosmos Policy and World-Action Models
The paper shows that the benefit of retrieval is not tied to one backbone: it improves standard VLA policies as well. The effect is especially pronounced in Cosmos Policy, a video-generation-based world-action model (WAM). In this setting, retrieval supplies coarse task progression, while the WAM's future-image objective provides an additional visual consistency signal that strengthens the retrieval-conditioned actions. According to the authors, this combination yields more robust task execution.
Benchmark Results
The method was evaluated on several robotic benchmarks:
| Environment | Task | Outcome |
|---|---|---|
| PushT | Cross-embodiment generalization to unseen goal angles | Retrieval provides a reusable high-level motion prior |
| RoboTwin 2.0 | Unseen tasks, compared against cross-embodiment baselines | Outperforms baselines |
| Real robot | Demonstration on a physical system | Successful transfer |
Specific numerical results are not reproduced in this summary, but the reported findings indicate that retrieval-augmented VLA policies offer a practical path to task extensibility without expensive retraining.
Implications for Enterprise Automation
For technology leaders evaluating AI investments in robotics, this research suggests a way to reduce the total cost of ownership for AI-powered automation. Instead of retraining a model for each new product or process variation, a single frozen policy can handle new tasks once their demonstrations are added to a retrieval pool. This reduces data collection costs, because new tasks need pool-side demonstrations (such as human-hand video) rather than fresh teleoperated robot demonstrations, and it reduces compute costs, because there is no per-task fine-tuning. The approach is particularly valuable where tasks change frequently, such as warehouse picking, assembly line reconfiguration, or logistics sorting.
- Cost reduction: Eliminates per-task fine-tuning, saving GPU hours and engineering effort.
- Faster deployment: New tasks can be added by indexing data, not by retraining models.
- Scalability: The retrieval pool can grow as new demonstrations are added, while the neural network itself stays fixed.
The authors note that fine-tuning is still needed for novel embodiments, but once a robot platform is established, adding tasks becomes a data management exercise rather than a model retraining cycle.
Looking Forward
As vision-language-action models become more prevalent in industrial robotics, techniques that decouple task expansion from model retraining will be critical for widespread adoption. The "Retrieve, Don't Retrain" paradigm offers a concrete method to achieve this, backed by experiments in simulation and on real hardware. Enterprise teams exploring AI for automation should monitor this line of research for integration into their own deployment pipelines.