SpatialWorld Benchmark Reveals Multimodal Agents Struggle with Interactive Spatial Reasoning

Researchers introduced SpatialWorld, a benchmark for evaluating the interactive spatial understanding of multimodal agents in real-world tasks. Of the 15 advanced agents tested, the strongest model (GPT-5) achieved a task success rate of only 17.4%, highlighting challenges in active exploration and long-horizon planning.

iGEN Editorial

June 16, 2026


Multimodal large language models (MLLMs) are increasingly expected to perceive and act in the physical world, but existing benchmarks fail to assess their interactive spatial reasoning in complex, real-world scenarios. To address this gap, a research team led by Gao introduced SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents.

The benchmark integrates eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, according to the paper titled "SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks." It features 760 human-annotated tasks spanning diverse domains including household routines, travel, and social collaboration. Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs.
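To make the idea of a simulator-agnostic protocol concrete, here is a minimal Python sketch. It is not the benchmark's actual API: the names `SimulatorBackend`, `Observation`, `ToyCorridor`, and `run_episode`, as well as the action strings, are hypothetical, and a real backend would return egocentric images rather than text descriptions. The sketch only illustrates how a single agent loop could drive any backend that exposes the same text-action interface.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class Observation:
    """Egocentric view handed to the agent (hypothetical; text stands in for an image)."""
    description: str


class SimulatorBackend(Protocol):
    """Hypothetical shared interface that each backend adapter would implement."""

    def reset(self) -> Observation: ...
    def step(self, action: str) -> Observation: ...
    def done(self) -> bool: ...


class ToyCorridor:
    """Toy stand-in backend: a 1-D corridor where only the current cell is visible."""

    def __init__(self, length: int = 4) -> None:
        self.length = length
        self.position = 0

    def reset(self) -> Observation:
        self.position = 0
        return self._observe()

    def step(self, action: str) -> Observation:
        # Actions arrive as plain text; unknown actions leave the state unchanged.
        if action == "move forward" and self.position < self.length - 1:
            self.position += 1
        elif action == "move backward" and self.position > 0:
            self.position -= 1
        return self._observe()

    def done(self) -> bool:
        return self.position == self.length - 1

    def _observe(self) -> Observation:
        at_goal = self.position == self.length - 1
        return Observation("goal visible" if at_goal else "corridor continues")


def run_episode(
    backend: SimulatorBackend,
    policy: Callable[[Observation], str],
    max_steps: int = 20,
) -> tuple[bool, int]:
    """Run one episode; the loop never touches backend-specific details."""
    obs = backend.reset()
    steps = 0
    while not backend.done() and steps < max_steps:
        obs = backend.step(policy(obs))
        steps += 1
    return backend.done(), steps


# A trivial policy that always moves forward, ignoring what it sees.
success, steps = run_episode(ToyCorridor(), lambda obs: "move forward")
```

Because `run_episode` depends only on the three protocol methods, swapping `ToyCorridor` for another adapter would require no change to the agent loop, which is the property a shared protocol is meant to provide.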

Each task includes three components for reliable evaluation: a human-validated initial state, a reference trajectory, and a terminal-state verifier. This design ensures consistent and reproducible assessment across different agents and simulation backends.
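The three components can be pictured as a small task record. The sketch below is an illustration under stated assumptions, not the benchmark's data format: the `Task` fields, the dictionary-based state, and the toy `apply_action` transition function are all hypothetical. It shows why a terminal-state verifier judges the outcome rather than the path: any action sequence that ends in a satisfying state passes, while the reference trajectory serves as one known-good solution.

```python
from dataclasses import dataclass
from typing import Any, Callable

State = dict[str, Any]


@dataclass
class Task:
    """Hypothetical task record with the three components named in the article."""
    instruction: str
    initial_state: State                # human-validated starting configuration
    reference_trajectory: list[str]     # one known-good action sequence
    verifier: Callable[[State], bool]   # checks only the terminal state


def apply_action(state: State, action: str) -> State:
    """Toy transition function standing in for a simulator step."""
    new_state = dict(state)
    if action == "open drawer":
        new_state["drawer_open"] = True
    elif action == "pick up key" and new_state.get("drawer_open"):
        new_state["holding"] = "key"
    return new_state


def evaluate(task: Task, actions: list[str]) -> bool:
    """Replay actions from the initial state and verify the final state."""
    state = dict(task.initial_state)
    for action in actions:
        state = apply_action(state, action)
    return task.verifier(state)


task = Task(
    instruction="Retrieve the key from the drawer.",
    initial_state={"drawer_open": False, "holding": None},
    reference_trajectory=["open drawer", "pick up key"],
    verifier=lambda s: s.get("holding") == "key",
)

# The reference trajectory should always pass its own verifier.
reference_ok = evaluate(task, task.reference_trajectory)
# Skipping the prerequisite step leaves the key in the closed drawer.
skipped_step_ok = evaluate(task, ["pick up key"])
```

Replaying the reference trajectory through the verifier is also a cheap sanity check on the task definition itself.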

Evaluation Results: Low Success Rates Across Advanced Models

The researchers evaluated 15 advanced agents on SpatialWorld and found that robust spatial task solving remains highly challenging. The following table summarises the performance of the strongest closed-source and open-source models:

Model	Task Success Rate (TSR)	Notes
GPT-5	17.4%	Strongest closed-source model
Qwen-3.5	14.1%	Leading open-source model

According to the paper, GPT-5 achieved an average task success rate of only 17.4%, while the leading open-source model, Qwen-3.5, reached 14.1%. These low scores underscore the difficulty of interactive spatial reasoning for current MLLMs.

Further analysis by the researchers exposed a clear mismatch between task success and execution efficiency, as well as substantial domain-specific performance variations. The authors attribute these bottlenecks to deficits in active exploration and long-horizon planning, positioning SpatialWorld as a rigorous testbed for future spatial agents.
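The gap between success and efficiency is easier to see with numbers. The article does not say how the paper measures execution efficiency, so the sketch below assumes one plausible definition: for successful episodes, the ratio of reference-trajectory length to the number of steps the agent actually took, capped at 1. The `EpisodeResult` type, both function names, and the sample data are hypothetical and purely illustrative, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """Hypothetical per-task outcome record."""
    success: bool
    agent_steps: int
    reference_steps: int


def task_success_rate(results: list[EpisodeResult]) -> float:
    """Fraction of tasks whose terminal state passed verification."""
    if not results:
        return 0.0
    return sum(r.success for r in results) / len(results)


def mean_step_efficiency(results: list[EpisodeResult]) -> float:
    """Assumed metric: reference steps / agent steps, averaged over successes only."""
    ratios = [
        min(1.0, r.reference_steps / r.agent_steps)
        for r in results
        if r.success and r.agent_steps > 0
    ]
    return sum(ratios) / len(ratios) if ratios else 0.0


# Made-up outcomes for four tasks (illustrative only).
results = [
    EpisodeResult(success=True, agent_steps=20, reference_steps=10),
    EpisodeResult(success=True, agent_steps=10, reference_steps=10),
    EpisodeResult(success=False, agent_steps=50, reference_steps=12),
    EpisodeResult(success=False, agent_steps=50, reference_steps=8),
]

tsr = task_success_rate(results)            # 2 of 4 tasks solved
efficiency = mean_step_efficiency(results)  # mean of 0.5 and 1.0
```

Under this definition an agent can solve a task while taking twice as many steps as the reference, which is the kind of mismatch a success-only metric would hide.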

Implications for Enterprise AI and Automation

While SpatialWorld focuses on consumer domains like household routines and travel, the underlying capabilities—spatial reasoning under partial observability, active exploration, and long-horizon planning—are directly relevant to enterprise applications in robotics, autonomous navigation, and augmented reality for industrial workflows. For technology leaders evaluating AI agents for supply chain or logistics tasks, the benchmark highlights a critical gap: even the most advanced models struggle to solve tasks that require sequential, space-aware decision-making.

For example, an AI agent tasked with locating a package in a cluttered warehouse, navigating around obstacles, and verifying shipment details would need similar interactive spatial reasoning. The low success rates on SpatialWorld suggest that current MLLMs are not yet ready for such unsupervised operation.

Benchmark Design and Rigor

SpatialWorld was built to overcome limitations of prior benchmarks that rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines. By integrating multiple simulation backends under a common protocol, it provides a simulator-agnostic evaluation framework. All 760 tasks are human-annotated and come with verified ground truth, enabling consistent comparison across models and future iterations.

The paper is available on arXiv under a Creative Commons license, and the authors have released code and data for the community. This openness will allow researchers and practitioners to reproduce results and develop improved spatial reasoning models.

Conclusion

SpatialWorld sets a challenging new benchmark for interactive spatial reasoning, with top models like GPT-5 achieving only 17.4% success. For enterprise decision-makers, this underscores the need for continued investment in fundamental AI research before deploying agents in spatially complex, real-world environments.

