ATOM-Bench: New Benchmark Evaluates Atomic Skills and Compositional Generalization in Robotic Manipulation Policies

Researchers introduce ATOM-Bench, a real-world benchmark that factorizes tabletop manipulation into atomic skills and compositional tasks. It includes 30 atomic tasks and 24 held-out compositional tasks across single-arm and dual-arm tracks, with 3,000 human demonstrations. Through 2,700 physical rollouts, the team found that current policies struggle with fine-grained motor skills, counting, and logical filtering, and that strong atomic performance does not guarantee compositional transfer.

iGEN Editorial

June 16, 2026

Enterprise robotics teams investing in generalist manipulation policies lack a systematic way to diagnose real-world generalization failures. A new benchmark, ATOM-Bench, addresses this gap by separating the evaluation of atomic skills from the evaluation of compositional generalization, according to a paper published on arXiv by Wu, Zenan; Wei, Bingqing; Liu, Lu; He, Zheqi; Wang, Xi; Jiakang, Zehui; Yao, Guocai; Zheng, Jing-Shu; Yang, Yongtao; and colleagues.

ATOM-Bench factorizes tabletop manipulation into motor atoms and instruction atoms, and contains 30 atomic tasks and 24 held-out compositional tasks across paired single-arm and dual-arm robot tracks. The researchers collected 3,000 human demonstrations for atomic fine-tuning and have released both the demonstration data and the evaluation rollout data to support reproducible real-world evaluation.

Benchmark Design and Methodology

Policies are fine-tuned on atomic tasks and evaluated on both atomic skill acquisition and held-out compositional tasks. To distinguish failure sources, the team introduced two metrics:

Atomic Score (AS) – quantifies weak atomic skills
Compositional Failure Share (CFS) – quantifies failures caused by limited compositional reuse

Through 2,700 physical rollouts on five representative manipulation policies, the benchmark provides a diagnostic testbed for understanding whether failures arise from weak motor execution, poor instruction grounding, or limited compositional reuse.
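The article does not give formulas for the two metrics, so the following is only a minimal sketch of how such scores could be tallied from rollout logs. It assumes that Atomic Score is the mean per-task success rate on atomic tasks and that Compositional Failure Share is the fraction of failed compositional rollouts attributed to recombination rather than to a weak atom. The `Rollout` record, its `compositional_failure` flag, and the task and policy names are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    policy: str
    task: str
    kind: str  # "atomic" or "compositional"
    success: bool
    # Hypothetical annotation: True when a failure is attributed to
    # recombining skills rather than to a weak individual atom.
    compositional_failure: bool = False


def atomic_score(rollouts, policy):
    """Mean per-task success rate on atomic tasks (assumed definition)."""
    per_task = defaultdict(list)
    for r in rollouts:
        if r.policy == policy and r.kind == "atomic":
            per_task[r.task].append(r.success)
    if not per_task:
        return None
    rates = [sum(v) / len(v) for v in per_task.values()]
    return sum(rates) / len(rates)


def compositional_failure_share(rollouts, policy):
    """Share of failed compositional rollouts blamed on recombination (assumed definition)."""
    failures = [
        r for r in rollouts
        if r.policy == policy and r.kind == "compositional" and not r.success
    ]
    if not failures:
        return None
    return sum(r.compositional_failure for r in failures) / len(failures)


# Toy log with made-up outcomes, only to exercise the two functions.
demo = [
    Rollout("policy_a", "pick_cube", "atomic", True),
    Rollout("policy_a", "pick_cube", "atomic", False),
    Rollout("policy_a", "count_two", "atomic", True),
    Rollout("policy_a", "pick_two_red", "compositional", False, compositional_failure=True),
    Rollout("policy_a", "pick_two_red", "compositional", False),
    Rollout("policy_a", "pick_two_red", "compositional", True),
]
print(atomic_score(demo, "policy_a"))                 # 0.75
print(compositional_failure_share(demo, "policy_a"))  # 0.5
```

Keeping the two numbers separate is the point of the design: a low first score points at weak individual skills, while a high second score points at trouble recombining skills that were already learned.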

Key Findings

The evaluation revealed that current policies can acquire simple instruction-grounding skills, but struggle with:

Fine-grained motor atoms
Counting
Logical filtering

More critically, strong atomic performance does not reliably transfer to held-out compositional tasks. This finding has direct implications for enterprise automation: a robot that can pick and place individual objects may still fail when asked to perform a sequence requiring reasoning (e.g., "pick two small red cubes and place them in the left bin after filtering out any blue ones").
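To make the example instruction concrete, here is a symbolic sketch of the atoms it combines: logical filtering, counting, and a pick-and-place motor step. This is an illustration of what the instruction demands, not a description of how the evaluated policies work internally. The names `Obj`, `filter_atom`, `count_atom`, and `plan` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    name: str
    color: str
    size: str


def filter_atom(objects, predicate):
    """Logical filtering: keep only the objects that satisfy the predicate."""
    return [o for o in objects if predicate(o)]


def count_atom(objects, n):
    """Counting: select exactly n objects, or fail if too few are available."""
    if len(objects) < n:
        raise ValueError(f"need {n} objects, found {len(objects)}")
    return objects[:n]


def plan(objects):
    """Compose the atoms for: two small red cubes into the left bin, ignoring blue ones."""
    candidates = filter_atom(objects, lambda o: o.color != "blue")
    candidates = filter_atom(candidates, lambda o: o.color == "red" and o.size == "small")
    chosen = count_atom(candidates, 2)
    return [("pick_and_place", o.name, "left_bin") for o in chosen]


scene = [
    Obj("cube_1", "red", "small"),
    Obj("cube_2", "blue", "small"),
    Obj("cube_3", "red", "large"),
    Obj("cube_4", "red", "small"),
]
print(plan(scene))
# [('pick_and_place', 'cube_1', 'left_bin'), ('pick_and_place', 'cube_4', 'left_bin')]
```

Each step is trivial on its own; the compositional task fails if any one of them is dropped or applied in the wrong scope, which is the kind of failure the held-out tasks are meant to expose.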

Implications for Enterprise Robotics

For technology leaders evaluating robotic solutions for warehouse or factory floor automation, ATOM-Bench offers a structured way to probe generalization boundaries. The benchmark is not a product but a diagnostic tool that can be used to compare generalist manipulation policies from different vendors. The released demonstration data and rollout data enable reproducible evaluation, which can inform procurement decisions.

Metric	Purpose
Atomic Score (AS)	Measures mastery of individual motor or instruction skills
Compositional Failure Share (CFS)	Identifies failures due to inability to recombine skills

Task Type	Number of Tasks
Atomic	30
Compositional (held-out)	24
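As a rough sanity check on the reported totals, the arithmetic below works out what an even split would look like. The article does not state how rollouts or demonstrations were distributed across tasks, policies, or robot tracks, so the even split is purely an assumption.

```python
ATOMIC_TASKS = 30
COMPOSITIONAL_TASKS = 24
POLICIES = 5
ROLLOUTS = 2_700
DEMONSTRATIONS = 3_000

# Assumption: every policy is rolled out equally often on every task,
# and demonstrations are spread evenly over the atomic tasks.
total_tasks = ATOMIC_TASKS + COMPOSITIONAL_TASKS
rollouts_per_policy_task = ROLLOUTS / (POLICIES * total_tasks)
demos_per_atomic_task = DEMONSTRATIONS / ATOMIC_TASKS

print(total_tasks)               # 54
print(rollouts_per_policy_task)  # 10.0
print(demos_per_atomic_task)     # 100.0
```

Under that assumption, the totals correspond to 10 rollouts per policy per task and 100 demonstrations per atomic task, which gives teams a sense of the physical evaluation budget a comparable study would need.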

Data and Reproducibility

The benchmark includes paired single-arm and dual-arm robot tracks, and all data (3,000 human demonstrations plus evaluation rollout data) are publicly released to support reproducible real-world evaluation. This allows enterprise teams to run their own policy evaluations against the same task suite.

While ATOM-Bench is a research contribution, its structure directly addresses the gap that enterprise users face: how to trust that a policy will generalize from training to novel task configurations. The findings suggest that current state-of-the-art policies still require improvement in motor precision and compositional reasoning before they can be reliably deployed in dynamic environments.
