Multimodal large language models (MLLMs) are increasingly expected to perceive and act in the physical world, but existing benchmarks fail to assess their interactive spatial reasoning in complex, real-world scenarios. To address this gap, Gao and colleagues introduced SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents.
The benchmark integrates eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, according to the paper titled "SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks." It features 760 human-annotated tasks spanning diverse domains including household routines, travel, and social collaboration. Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs.
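To make the idea of a shared, simulator-agnostic protocol with a text-based action interface concrete, here is a minimal Python sketch. It is not SpatialWorld's actual API: the names `SpatialBackend`, `CorridorBackend`, `parse_action`, and `run_episode`, and the action verbs, are all hypothetical, and a plain string stands in for the egocentric image an agent would really receive.

```python
from typing import Protocol


class SpatialBackend(Protocol):
    """Hypothetical interface every simulator adapter would implement."""

    def reset(self, task_id: str) -> str: ...

    def step(self, action: str) -> str: ...


def parse_action(text: str) -> tuple[str, list[str]]:
    """Split a text action such as 'move_forward 2' into a verb and its arguments."""
    parts = text.strip().split()
    if not parts:
        raise ValueError("empty action")
    return parts[0].lower(), parts[1:]


class CorridorBackend:
    """Toy stand-in for one simulator: an agent walking along a 1-D corridor."""

    def __init__(self, length: int = 5) -> None:
        self.length = length
        self.position = 0

    def reset(self, task_id: str) -> str:
        self.position = 0
        return self._observe()

    def step(self, action: str) -> str:
        verb, args = parse_action(action)
        steps = int(args[0]) if args else 1
        if verb == "move_forward":
            # Clamp at the far wall of the corridor.
            self.position = min(self.length - 1, self.position + steps)
        elif verb == "move_backward":
            # Clamp at the starting wall.
            self.position = max(0, self.position - steps)
        else:
            raise ValueError(f"unknown action verb: {verb!r}")
        return self._observe()

    def _observe(self) -> str:
        # Placeholder observation; a real backend would return an image.
        return f"cell {self.position} of {self.length}"


def run_episode(backend: SpatialBackend, task_id: str, actions: list[str]) -> list[str]:
    """Drive any backend through the same loop, using only text actions."""
    observations = [backend.reset(task_id)]
    for action in actions:
        observations.append(backend.step(action))
    return observations
```

Because `run_episode` depends only on `reset` and `step`, a second backend could be swapped in without changing the agent-facing loop, which is the property a simulator-agnostic protocol is meant to provide.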
Each task includes three components for reliable evaluation: a human-validated initial state, a reference trajectory, and a terminal-state verifier. This design ensures consistent and reproducible assessment across different agents and simulation backends.
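The three task components can be pictured as a single record. The sketch below is an illustrative assumption, not the benchmark's data format: the `SpatialTask` fields, the dictionary-based state, and the toy `move <object> <location>` action are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Toy world state: maps each object name to its current location.
State = dict[str, str]


@dataclass
class SpatialTask:
    """Hypothetical task record holding the three components described above."""

    task_id: str
    instruction: str
    initial_state: State                 # human-validated starting configuration
    reference_trajectory: tuple[str, ...]  # one known-good action sequence
    verifier: Callable[[State], bool]    # checks the terminal state only


def apply_action(state: State, action: str) -> State:
    """Toy transition: 'move <object> <location>' relocates a single object."""
    verb, obj, location = action.split()
    if verb != "move" or obj not in state:
        raise ValueError(f"unsupported action: {action!r}")
    return {**state, obj: location}


def rollout(task: SpatialTask, actions) -> bool:
    """Replay actions from the initial state and verify the final state."""
    state = dict(task.initial_state)
    for action in actions:
        state = apply_action(state, action)
    return task.verifier(state)


task = SpatialTask(
    task_id="demo-001",
    instruction="Put the mug on the shelf.",
    initial_state={"mug": "table", "book": "sofa"},
    reference_trajectory=("move mug shelf",),
    verifier=lambda state: state.get("mug") == "shelf",
)
```

In this sketch, replaying `task.reference_trajectory` passes the verifier, and so does a longer detour that still ends with the mug on the shelf. Checking only the terminal state means an agent is not penalised for taking a different route than the reference, while the reference trajectory remains available as evidence that the task is solvable.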
Evaluation Results: Low Success Rates Across Advanced Models
The researchers evaluated 15 advanced agents on SpatialWorld. The results reveal that robust spatial task solving remains highly challenging. The following table summarises the performance of the best closed-source and open-source models:
| Model | Task Success Rate (TSR) | Notes |
|---|---|---|
| GPT-5 | 17.4% | Strongest closed-source model |
| Qwen-3.5 | 14.1% | Leading open-source model |
According to the paper, GPT-5 achieved an average task success rate of only 17.4%, while the leading open-source model, Qwen-3.5, reached 14.1%. These low scores underscore the difficulty of interactive spatial reasoning for current MLLMs.
Further analysis by the researchers exposed a clear mismatch between task success and execution efficiency, as well as substantial domain-specific performance variations. The authors attribute these bottlenecks to deficits in active exploration and long-horizon planning, positioning SpatialWorld as a rigorous testbed for future spatial agents.
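The gap between success and efficiency is easier to see with both quantities computed side by side. The sketch below uses made-up episode data, and the efficiency measure (reference trajectory length divided by the agent's step count, capped at 1 and averaged over successful episodes) is an assumption chosen for illustration; the paper's exact efficiency metric is not described here.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """Outcome of one agent run on one task (hypothetical record)."""

    succeeded: bool
    agent_steps: int
    reference_steps: int


def task_success_rate(results: list[EpisodeResult]) -> float:
    """Fraction of episodes whose terminal state passed verification."""
    if not results:
        raise ValueError("no episodes to score")
    return sum(r.succeeded for r in results) / len(results)


def mean_step_efficiency(results: list[EpisodeResult]) -> float:
    """Assumed metric: mean of reference_steps / agent_steps over successes, capped at 1."""
    successes = [r for r in results if r.succeeded]
    if not successes:
        return 0.0
    ratios = [min(1.0, r.reference_steps / max(r.agent_steps, 1)) for r in successes]
    return sum(ratios) / len(ratios)


# Invented data: two successes (one of them wasteful) and two failures.
results = [
    EpisodeResult(succeeded=True, agent_steps=20, reference_steps=10),
    EpisodeResult(succeeded=True, agent_steps=10, reference_steps=10),
    EpisodeResult(succeeded=False, agent_steps=30, reference_steps=10),
    EpisodeResult(succeeded=False, agent_steps=30, reference_steps=12),
]

tsr = task_success_rate(results)            # 2 of 4 episodes succeed
efficiency = mean_step_efficiency(results)  # mean of 0.5 and 1.0
```

Two agents with the same success rate can differ widely on the second number, which is one way a mismatch between task success and execution efficiency could show up in practice.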
Implications for Enterprise AI and Automation
While SpatialWorld focuses on everyday domains such as household routines and travel, the underlying capabilities (spatial reasoning under partial observability, active exploration, and long-horizon planning) are directly relevant to enterprise applications in robotics, autonomous navigation, and augmented reality for industrial workflows. For technology leaders evaluating AI agents for supply chain or logistics tasks, the benchmark highlights a critical gap: even the most advanced models struggle to solve tasks that require sequential, space-aware decision-making.
For example, an AI agent tasked with locating a package in a cluttered warehouse, navigating around obstacles, and verifying shipment details would need similar interactive spatial reasoning. The low success rates on SpatialWorld suggest that current MLLMs are not yet ready for such unsupervised operation.
Benchmark Design and Rigor
SpatialWorld was built to overcome limitations of prior benchmarks that rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines. By integrating multiple simulation backends under a common protocol, it provides a simulator-agnostic evaluation framework. All 760 tasks are human-annotated and come with verified ground truth, enabling consistent comparison across models and future iterations.
The paper is available on arXiv under a Creative Commons license, and the authors have released code and data for the community. This openness will allow researchers and practitioners to reproduce results and develop improved spatial reasoning models.
Conclusion
SpatialWorld sets a demanding new benchmark for interactive spatial reasoning: the strongest model evaluated, GPT-5, reaches an average task success rate of only 17.4%. For enterprise decision-makers, this underscores the need for continued investment in fundamental AI research before deploying agents in spatially complex, real-world environments.