New Benchmark and Method Address Occlusion in Vision-Language-Action Models for Robotics

Researchers introduced LIBERO-Occ, an occlusion-oriented benchmark for Vision-Language-Action (VLA) models, and proposed Viewpoint Imagination (VIM), a method that generates a complementary view from an occluded primary observation to condition action prediction. Experiments show that state-of-the-art VLAs suffer substantial performance degradation under occlusion, and VIM improves robustness across task suites, occlusion types, and severity levels without requiring additional cameras at deployment.

iGEN Editorial

June 16, 2026

Vision-Language-Action (VLA) models have achieved strong performance on standard manipulation benchmarks, but most evaluations assume that task-relevant objects are fully visible. According to the paper "LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination," this assumption often fails in realistic settings, where occlusion makes manipulation partially observable. The authors introduce LIBERO-Occ, an occlusion-oriented extension of the LIBERO benchmark, and Viewpoint Imagination (VIM), a method that generates a complementary view from an occluded primary observation and conditions action prediction on both observed and imagined evidence. In the reported experiments, state-of-the-art VLAs degrade substantially under occlusion, while VIM improves robustness across task suites, occlusion types, and severity levels without requiring additional cameras at deployment time.

The Occlusion Challenge for Vision-Language-Action Models

VLA models integrate visual perception, language understanding, and action generation for robotic manipulation. Standard benchmarks typically present scenes where task-relevant objects are fully visible, a condition that rarely holds in real-world deployments. The paper identifies scene-induced occlusion as a fundamental challenge for VLA models. In settings such as cluttered bins, shelves, or industrial environments, objects may be partially hidden by other items or by the robot's own gripper. The authors report that state-of-the-art VLAs experience substantial performance degradation when occlusion is present, underscoring the need for robust perception-completion mechanisms.

LIBERO-Occ: A Benchmark for Scene-Induced Occlusion

To systematically evaluate VLA models under occlusion, the researchers created LIBERO-Occ, an occlusion-oriented extension of the existing LIBERO benchmark. It varies occlusion type and severity level across multiple manipulation task suites, and the paper states that it is designed to assess how well VLAs handle partially observable conditions. The benchmark and corresponding code are publicly available, enabling the research community to test and compare occlusion-robust methods.
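To make the evaluation setup concrete, the sketch below shows one way results could be broken down by occlusion type and severity level. It is an illustration only, not the benchmark's own code: the function name `success_rates`, the condition labels `type_a` and `type_b`, the severity numbering, and the episode outcomes are all made-up placeholders, since the article does not list the actual occlusion categories or report per-condition numbers.

```python
from collections import defaultdict


def success_rates(episodes):
    """Group episode outcomes by (occlusion_type, severity) and return success rates.

    `episodes` is an iterable of (occlusion_type, severity, success) tuples.
    This layout is a hypothetical convention, not the LIBERO-Occ data format.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [successes, episode count]
    for occlusion_type, severity, success in episodes:
        counts = totals[(occlusion_type, severity)]
        counts[0] += int(success)
        counts[1] += 1
    return {key: wins / n for key, (wins, n) in totals.items()}


# Made-up outcomes for illustration; these are not results from the paper.
episodes = [
    ("type_a", 1, True),
    ("type_a", 1, False),
    ("type_a", 2, False),
    ("type_b", 1, True),
]
rates = success_rates(episodes)
for (occlusion_type, severity), rate in sorted(rates.items()):
    print(f"{occlusion_type} severity {severity}: {rate:.2f}")
```

Reporting one rate per condition, rather than a single average, is what lets a reader see whether a model fails only at high severity or across the board.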

Viewpoint Imagination (VIM): Technical Overview

The proposed method, Viewpoint Imagination (VIM), addresses occlusion by generating a complementary view from the primary occluded observation. VIM conditions action prediction on both the observed and the imagined evidence, effectively providing the model with a more complete scene understanding. According to the authors, this approach improves robustness across task suites, occlusion types, and severity levels. Importantly, VIM does not require additional cameras at deployment time, meaning it can be applied to existing robotic systems without hardware modifications. The paper suggests that viewpoint imagination is a promising mechanism for perception completion in partially observable manipulation.
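The data flow described above can be sketched in a few lines. This is a structural illustration under heavy assumptions, not the authors' implementation: `imagine_view` and `predict_action` are hypothetical names, the "imagined" view is produced by a simple horizontal flip rather than a learned generator, the policy is a fixed averaging layer, and the 7-dimensional action is an assumed size. The only point it demonstrates is that the action is computed from both the observed and the imagined view, using a single camera input.

```python
import numpy as np


def imagine_view(primary_view: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a view generator.

    A real system would synthesize a complementary viewpoint; this placeholder
    only flips the image horizontally so that the shapes line up.
    """
    return primary_view[:, ::-1, :].copy()


def predict_action(primary_view, imagined_view, instruction, action_dim=7):
    """Hypothetical stand-in policy conditioned on both views.

    The instruction is accepted to mirror a VLA interface but is ignored here.
    """
    # Per-channel mean of each view, concatenated into one feature vector.
    features = np.concatenate(
        [primary_view.mean(axis=(0, 1)), imagined_view.mean(axis=(0, 1))]
    )
    # Fixed averaging weights stand in for a learned action head.
    weights = np.full((action_dim, features.size), 1.0 / features.size)
    return weights @ features


# A synthetic 64x64 RGB frame: left half bright, right half dark.
primary = np.zeros((64, 64, 3), dtype=np.float32)
primary[:, :32, :] = 1.0

imagined = imagine_view(primary)
action = predict_action(primary, imagined, "pick up the bowl")
print(action.shape)
```

Because the second view is generated from the first, the deployment loop needs no extra camera input, which is the property the article highlights.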

Implications for Robotics in Logistics and Supply Chain

Although the experiments are conducted on manipulation benchmarks, the ideas behind LIBERO-Occ and VIM may carry over to robotics in logistics and supply chain environments. Occlusion is common in warehouse automation, for example when a robotic arm picks items from cluttered bins or a mobile robot navigates tightly packed shelves. Generating a complementary view without extra cameras could improve the reliability of automated picking, packing, and sorting operations. The research provides a foundation for developing VLA models that are more resilient to the imperfect viewing conditions typical of industrial settings.

Component	Description
LIBERO-Occ	Occlusion-oriented extension of LIBERO benchmark for evaluating VLA models under scene-induced occlusion
Viewpoint Imagination (VIM)	Generates complementary view from occluded primary observation; conditions action prediction on observed and imagined evidence
Key Result	State-of-the-art VLAs suffer substantial degradation under occlusion; VIM improves robustness without additional cameras

The paper's authors (Li, Taishan; Zhang, Jiwen; Wang, Siyuan; Huang, Xuanjing; and Wei, Zhongyu) have released the benchmark and code publicly.
