Vision-Language-Action (VLA) models have achieved strong performance on standard manipulation benchmarks, but most evaluations assume that task-relevant objects are fully visible. According to the paper "LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination," this assumption often fails in realistic settings, where occlusion makes manipulation partially observable. The authors introduce LIBERO-Occ, an occlusion-oriented extension of the LIBERO benchmark, and Viewpoint Imagination (VIM), a method that generates a complementary view from an occluded primary observation and conditions action prediction on both observed and imagined evidence. Their experiments show that state-of-the-art VLAs suffer substantial performance degradation under occlusion, and that VIM improves robustness across task suites, occlusion types, and severity levels without requiring additional cameras at deployment time.
The Occlusion Challenge for Vision-Language-Action Models
VLA models integrate visual perception, language understanding, and action generation for robotic manipulation. Standard benchmarks typically present scenes where task-relevant objects are fully visible, a condition that often fails to hold in real-world deployments. The paper identifies scene-induced occlusion as a fundamental challenge for VLA models. In settings such as cluttered bins, shelves, or industrial environments, objects may be partially hidden by other items or by the robot's own gripper. The authors report that state-of-the-art VLAs experience substantial performance degradation when occlusion is present, underscoring the need for robust perception-completion mechanisms.
LIBERO-Occ: A Benchmark for Scene-Induced Occlusion
To systematically evaluate VLA models under occlusion, the researchers created LIBERO-Occ, an occlusion-oriented extension of the existing LIBERO benchmark. This new benchmark introduces various occlusion types and severity levels across multiple manipulation task suites. The paper states that LIBERO-Occ is designed to assess how well VLAs handle partially observable conditions. The benchmark and corresponding code are publicly available, enabling the research community to test and compare occlusion-robust methods.
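To make the evaluation structure concrete, here is a minimal sketch of how per-episode outcomes could be aggregated by task suite, occlusion type, and severity level. It is not the LIBERO-Occ code or API. The `EpisodeResult` record, the `"none"` baseline label, severity `0` for the unoccluded case, and both function names are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

Cell = Tuple[str, str, int]  # (suite, occlusion_type, severity)


@dataclass(frozen=True)
class EpisodeResult:
    """One evaluation rollout. All labels here are hypothetical placeholders."""
    suite: str           # name of a task suite
    occlusion_type: str  # placeholder label; "none" marks the unoccluded baseline
    severity: int        # placeholder ordinal level; 0 is used for the baseline
    success: bool


def success_rates(results: Iterable[EpisodeResult]) -> Dict[Cell, float]:
    """Return the success rate for each (suite, occlusion_type, severity) cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [successes, episodes]
    for r in results:
        cell = counts[(r.suite, r.occlusion_type, r.severity)]
        cell[0] += int(r.success)
        cell[1] += 1
    return {key: wins / total for key, (wins, total) in counts.items()}


def degradation(rates: Dict[Cell, float], suite: str,
                occlusion_type: str, severity: int) -> float:
    """Absolute drop in success rate relative to the unoccluded baseline cell."""
    return rates[(suite, "none", 0)] - rates[(suite, occlusion_type, severity)]


# Toy records only; these are not results from the paper.
toy = [
    EpisodeResult("suite_a", "none", 0, True),
    EpisodeResult("suite_a", "none", 0, True),
    EpisodeResult("suite_a", "clutter", 2, True),
    EpisodeResult("suite_a", "clutter", 2, False),
]
toy_rates = success_rates(toy)
toy_drop = degradation(toy_rates, "suite_a", "clutter", 2)
```

Reporting one number per cell, rather than a single overall average, is what lets a comparison show where a model degrades and where a method such as VIM helps.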
Viewpoint Imagination (VIM): Technical Overview
The proposed method, Viewpoint Imagination (VIM), addresses occlusion by generating a complementary view from the occluded primary observation. VIM then conditions action prediction on both the observed and the imagined evidence, giving the model a more complete picture of the scene. According to the authors, this approach improves robustness across task suites, occlusion types, and severity levels. Importantly, VIM does not require additional cameras at deployment time, which suggests it could be applied to existing robotic systems without adding camera hardware. The paper suggests that viewpoint imagination is a promising mechanism for perception completion in partially observable manipulation.
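The data flow described above can be sketched as a small wrapper: imagine a second view from the primary image, encode both views, and pass the combined features to an action head. This is an illustrative sketch, not the authors' implementation. The class name, the three callables, the toy stand-ins, and the choice to fuse features by concatenation are all assumptions, since the summary does not describe the actual architecture.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Image = List[List[float]]  # toy grayscale image: rows of pixel intensities


@dataclass
class ViewpointImaginationPolicy:
    """Hypothetical wrapper: imagine a complementary view, then act on both."""
    imagine_view: Callable[[Image], Image]      # stand-in for a view generator
    encode: Callable[[Image], List[float]]      # stand-in for a visual encoder
    predict_action: Callable[[Sequence[float], str], List[float]]  # stand-in head

    def act(self, primary: Image, instruction: str) -> List[float]:
        # The second view is generated from the primary image, so no
        # additional camera is read at deployment time.
        imagined = self.imagine_view(primary)
        # Assumed fusion: concatenate observed and imagined features.
        features = self.encode(primary) + self.encode(imagined)
        return self.predict_action(features, instruction)


# Toy stand-ins so the sketch runs; a real system would use learned models.
def mirror_view(img: Image) -> Image:
    """Placeholder 'imagination': horizontally mirror the image."""
    return [list(reversed(row)) for row in img]


def mean_rows(img: Image) -> List[float]:
    """Placeholder encoder: one feature per row (the row mean)."""
    return [sum(row) / len(row) for row in img]


def toy_head(features: Sequence[float], instruction: str) -> List[float]:
    """Placeholder action head: a 1-D 'action' equal to the feature mean."""
    return [sum(features) / len(features)]


policy = ViewpointImaginationPolicy(mirror_view, mean_rows, toy_head)
action = policy.act([[0.0, 1.0], [1.0, 1.0]], "pick up the bowl")
```

Keeping the generator, encoder, and action head as separate callables makes it easy to swap any one of them while leaving the rest of the pipeline unchanged.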
Implications for Robotics in Logistics and Supply Chain
Although the experiments are conducted on manipulation benchmarks rather than in warehouses, the ideas behind LIBERO-Occ and VIM are potentially relevant to robotics in logistics and supply chain environments. Occlusion is common in warehouse automation, for example when a robotic arm picks items from cluttered bins or when a mobile robot navigates tightly packed shelves. The ability to generate a complementary view without extra cameras could improve the reliability of automated picking, packing, and sorting operations. The research provides a foundation for developing VLA models that are more resilient to the imperfect viewing conditions typical of industrial settings.
| Component | Description |
|---|---|
| LIBERO-Occ | Occlusion-oriented extension of LIBERO benchmark for evaluating VLA models under scene-induced occlusion |
| Viewpoint Imagination (VIM) | Generates complementary view from occluded primary observation; conditions action prediction on observed and imagined evidence |
| Key Result | State-of-the-art VLAs suffer substantial degradation under occlusion; VIM improves robustness without additional cameras |
The paper's authors (Li, Taishan; Zhang, Jiwen; Wang, Siyuan; Huang, Xuanjing; and Wei, Zhongyu) have publicly released the benchmark and code.