Enterprise robotics teams investing in generalist manipulation policies lack a systematic way to diagnose real-world generalization failures. A new benchmark, ATOM-Bench, targets this gap by evaluating atomic skills separately from compositional generalization, according to a paper published on arXiv by Wu, Zenan; Wei, Bingqing; Liu, Lu; He, Zheqi; Wang, Xi; Jiakang, Zehui; Yao, Guocai; Zheng, Jing-Shu; Yang, Yongtao; and colleagues.
ATOM-Bench is a real-world benchmark for manipulation policies. It factorizes tabletop manipulation into motor atoms and instruction atoms, and contains 30 atomic tasks and 24 held-out compositional tasks across paired single-arm and dual-arm robot tracks. The researchers collected 3,000 human demonstrations for atomic fine-tuning and have released both the demonstration data and the evaluation rollout data to support reproducible real-world evaluation.
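One way to picture the factorization is as a small data model, sketched here in Python. The class names, the atom and task names (`grasp_small`, `count`, `pick_two_red_cubes`, and so on), and the coverage check are illustrative assumptions, not the benchmark's released schema.

```python
from dataclasses import dataclass
from enum import Enum


class AtomKind(Enum):
    MOTOR = "motor"              # physical skill, e.g. a precise grasp
    INSTRUCTION = "instruction"  # language-grounding skill, e.g. counting


@dataclass(frozen=True)
class Atom:
    name: str
    kind: AtomKind


@dataclass(frozen=True)
class Task:
    name: str
    atoms: frozenset   # Atom objects this task exercises
    held_out: bool     # True for compositional tasks never seen in fine-tuning


def uncovered_atoms(task, training_tasks):
    """Return atoms used by `task` that no training task exercises."""
    seen = set()
    for t in training_tasks:
        seen |= t.atoms
    return set(task.atoms) - seen


# Hypothetical atoms and tasks, for illustration only.
grasp = Atom("grasp_small", AtomKind.MOTOR)
count = Atom("count", AtomKind.INSTRUCTION)
color = Atom("color_filter", AtomKind.INSTRUCTION)

atomic_tasks = [
    Task("grasp_small_cube", frozenset({grasp}), held_out=False),
    Task("pick_n_objects", frozenset({count}), held_out=False),
    Task("pick_by_color", frozenset({color}), held_out=False),
]
composed = Task("pick_two_red_cubes", frozenset({grasp, count, color}), held_out=True)

# A well-formed compositional task recombines only atoms seen in training.
print(uncovered_atoms(composed, atomic_tasks))  # prints set()
```

Under this framing, a held-out compositional task introduces no new atoms; what is new is only the combination.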
Benchmark Design and Methodology
Policies are fine-tuned on atomic tasks and evaluated on both atomic skill acquisition and held-out compositional tasks. To distinguish failure sources, the team introduced two metrics:
- Atomic Score (AS) – measures how well a policy has acquired each atomic skill, exposing weak ones
- Compositional Failure Share (CFS) – quantifies failures caused by limited compositional reuse
Through 2,700 physical rollouts on five representative manipulation policies, the benchmark provides a diagnostic testbed for understanding whether failures arise from weak motor execution, poor instruction grounding, or limited compositional reuse.
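The paper's exact formulas for AS and CFS are not reproduced here, so the following Python sketch uses assumed definitions chosen only to illustrate the diagnostic idea: AS as a task's success rate across its rollouts, and CFS as the fraction of failed compositional rollouts whose constituent atomic tasks all cleared a mastery threshold. The record layout, the function names, the task names, and the 0.8 threshold are all hypothetical.

```python
from collections import defaultdict


def atomic_scores(rollouts):
    """Map each atomic task name to its success rate over recorded rollouts.

    `rollouts` is a list of (task_name, success) pairs; an assumed layout.
    """
    wins, totals = defaultdict(int), defaultdict(int)
    for task, success in rollouts:
        totals[task] += 1
        wins[task] += int(success)
    return {task: wins[task] / totals[task] for task in totals}


def compositional_failure_share(comp_rollouts, composition, scores, threshold=0.8):
    """Assumed CFS: share of failed compositional rollouts in which every
    constituent atomic task scored at or above `threshold`.

    Failures involving a weak atom are attributed to that atom, not to
    composition. Returns None when there are no failures to attribute.
    """
    failures = [task for task, success in comp_rollouts if not success]
    if not failures:
        return None
    compositional = sum(
        all(scores[atom] >= threshold for atom in composition[task])
        for task in failures
    )
    return compositional / len(failures)


# Hypothetical data: two atomic tasks and two compositional tasks.
atomic = (
    [("grasp", True)] * 9 + [("grasp", False)]
    + [("count", True)] * 4 + [("count", False)] * 6
)
composition = {"grasp_twice": ["grasp"], "grasp_n": ["grasp", "count"]}
comp = [
    ("grasp_twice", False), ("grasp_twice", True),
    ("grasp_n", False), ("grasp_n", False),
]

scores = atomic_scores(atomic)
print(scores)  # {'grasp': 0.9, 'count': 0.4}
print(compositional_failure_share(comp, composition, scores))
```

In this toy data, one of the three compositional failures involves only a well-learned atom, so it is counted as a composition failure; the other two are traced back to the weak `count` skill.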
Key Findings
The evaluation revealed that current policies can acquire simple instruction-grounding skills, but struggle with:
- Fine-grained motor atoms
- Counting
- Logical filtering
More critically, strong atomic performance does not reliably transfer to held-out compositional tasks. This finding has direct implications for enterprise automation: a robot that can pick and place individual objects may still fail when asked to perform a sequence requiring reasoning (e.g., "pick two small red cubes and place them in the left bin after filtering out any blue ones").
Implications for Enterprise Robotics
For technology leaders evaluating robotic solutions for warehouse or factory-floor automation, ATOM-Bench offers a structured way to probe generalization boundaries. The benchmark is not a product but a diagnostic tool that can be used to compare generalist manipulation policies from different vendors on the same task suite. Because the demonstration and rollout data are released, evaluations can be reproduced, which can inform procurement decisions.
| Metric | Purpose |
|---|---|
| Atomic Score (AS) | Measures mastery of individual motor or instruction skills |
| Compositional Failure Share (CFS) | Identifies failures due to inability to recombine skills |

| Task Type | Number of Tasks |
|---|---|
| Atomic | 30 |
| Compositional (held-out) | 24 |
Data and Reproducibility
The benchmark includes paired single-arm and dual-arm robot tracks, and all data (3,000 human demonstrations plus evaluation rollout data) are publicly released to support reproducible real-world evaluation. This allows enterprise teams to run their own policy evaluations against the same task suite.
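The reported totals permit a quick consistency check, sketched below in Python. The per-pair and per-task figures hold only under the assumption that rollouts and demonstrations are spread evenly, which is not stated in the reported figures.

```python
# Figures reported for ATOM-Bench.
atomic_tasks = 30
compositional_tasks = 24
policies = 5
total_rollouts = 2_700
total_demonstrations = 3_000

total_tasks = atomic_tasks + compositional_tasks   # 54
policy_task_pairs = total_tasks * policies         # 270

# Assumption: rollouts are split evenly across every policy-task pair.
rollouts_per_pair = total_rollouts / policy_task_pairs

# Assumption: demonstrations are split evenly across atomic tasks.
demos_per_atomic_task = total_demonstrations / atomic_tasks

print(rollouts_per_pair)      # 10.0
print(demos_per_atomic_task)  # 100.0
```

A team planning its own evaluation on the same task suite could use the same arithmetic to estimate how many physical trials each additional policy would require.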
While ATOM-Bench is a research contribution, its structure directly addresses the gap that enterprise users face: how to trust that a policy will generalize from training to novel task configurations. The findings suggest that current state-of-the-art policies still require improvement in motor precision and compositional reasoning before they can be reliably deployed in dynamic environments.