Semantic Flip: Synthetic OOD Generation for Robust Refusal in Embodied Question Answering and Spatial Localization

The Semantic Flip framework trains a lightweight rejection module on top of a frozen vision-language model to detect unanswerable queries in embodied question answering and spatial localization. It synthesizes out-of-distribution pairs by independently transforming the query and the video memory, and reaches an F1 score of 0.9559 on the SpaceReject benchmark without external OOD annotations.

iGEN Editorial

June 16, 2026

Modern vision-language models (VLMs) powering embodied agents often produce overly confident answers even when the agent's visual memory cannot support the query. This overconfidence poses task-dependent risks: in Embodied Question Answering, the agent may provide misleading information, and in spatial reasoning for navigation, it may select an arbitrary coordinate and physically guide the user there. Despite these high stakes, only a few prior studies address when and how an embodied VLM should respond with 'I do not know,' according to a recent paper on arXiv.

The Problem of Overconfident VLMs

The paper, authored by Dongbin Na, Chanwoo Kim, Giyun Choi, and Dooyoung Hong, highlights that current VLMs lack a mechanism for reliable refusal. Detecting unanswerable user queries is essential for the reliable deployment of real-world embodied agents. Without such detection, agents can cause user confusion or even physical harm during navigation tasks.

The Semantic Flip Framework

To address this, the researchers propose Semantic Flip, a simple yet effective framework that synthesizes auxiliary out-of-distribution (OOD) samples for embodied refusal without requiring external OOD annotations. The key idea is to independently transform the query and video memory to construct auxiliary OOD pairs that lack sufficient visual grounding. These synthesized pairs enable training a lightweight rejection module on top of a frozen pretrained VLM. The module attaches to any existing VLM-based pipeline without retraining the underlying model.
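The pairing idea can be illustrated with a minimal Python sketch. The article does not describe the paper's actual transformations, so the code below swaps in a simple stand-in: it mismatches queries and memories drawn from different scenes. The `Sample` type, the `memory_id` field, and the `synthesize_ood_pairs` function are hypothetical names introduced for illustration only.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    query: str
    memory_id: str      # hypothetical handle for one video memory
    answerable: bool


def synthesize_ood_pairs(in_dist, seed=0):
    """Build auxiliary unanswerable pairs from answerable ones only.

    Assumption: mismatching is a placeholder for the paper's transforms,
    which the article does not specify.
    """
    rng = random.Random(seed)
    ood = []
    for s in in_dist:
        # Samples whose query is grounded in a different video memory.
        others = [t for t in in_dist if t.memory_id != s.memory_id]
        if not others:
            continue
        # Query-side change: keep the memory, swap in a foreign query.
        ood.append(Sample(rng.choice(others).query, s.memory_id, False))
        # Memory-side change: keep the query, swap in a foreign memory.
        ood.append(Sample(s.query, rng.choice(others).memory_id, False))
    return ood


in_dist = [
    Sample("Where is the red mug?", "kitchen_01", True),
    Sample("Where did I leave the drill?", "garage_02", True),
    Sample("Which shelf holds the atlas?", "library_03", True),
]
ood_pairs = synthesize_ood_pairs(in_dist)
```

A naive mismatch like this can produce pairs that are still answerable (a mug may appear in more than one scene), so any real pipeline would need a transformation that reliably removes the visual grounding. The resulting answerable and unanswerable pairs would then serve as labels for a small classifier trained on features from the frozen VLM.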

Benchmarking with SpaceReject

The paper introduces SpaceReject, a new refusal benchmark for spatial localization with deliberately unanswerable queries over long video memory. On this benchmark, Semantic Flip achieves an F1 score of 0.9559. Across two complementary benchmarks, Semantic Flip consistently outperforms strong prompting baselines. The source code and datasets are publicly available.

Benchmark	F1 Score
SpaceReject	0.9559
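For readers unfamiliar with the metric, the following sketch shows how an F1 score for refusal could be computed from binary labels. It assumes that "should refuse" is the positive class. The article does not state which class or averaging scheme the paper uses, so this is an illustration of the metric, not a reproduction of the reported number. The function name `refusal_f1` and the toy labels are hypothetical.

```python
def refusal_f1(y_true, y_pred):
    """F1 with 'refuse' (1) as the positive class; 0 means 'answer'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correct refusals
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # refused an answerable query
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # answered an unanswerable query
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: three unanswerable queries followed by two answerable ones.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
score = refusal_f1(y_true, y_pred)  # precision = recall = 2/3
```

Under this convention, a false negative is the dangerous case for navigation, because the agent answers a query its memory cannot support.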

Implications for AI Safety

By enabling robust refusal, Semantic Flip reduces the risk of misdirection in embodied agents. For enterprise applications such as warehouse robotics or autonomous inspection, this capability helps agents abstain from acting when their visual memory does not contain enough information to support the query. The framework's compatibility with frozen pretrained VLMs lowers deployment barriers, as it does not require retraining the entire model.

Because it synthesizes OOD data without external annotations, Semantic Flip offers a practical way to address the task-dependent risks of overconfidence. As embodied AI becomes more prevalent in industrial settings, refusal mechanisms will be critical for safe human-robot interaction.

Sources:

Semantic Flip: Synthetic OOD Generation for Robust Refusal in Embodied Question Answering and Spatial Localization
