Modern vision-language models (VLMs) powering embodied agents often produce overly confident answers even when the agent's visual memory cannot support the query. This overconfidence poses task-dependent risks: in Embodied Question Answering, the agent may provide misleading information, and in spatial reasoning for navigation, it may select an arbitrary coordinate and physically guide the user there. Despite these high stakes, only a few prior studies address when and how an embodied VLM should respond with 'I do not know,' according to a recent paper on arXiv.
The Problem of Overconfident VLMs
The paper, authored by Dongbin Na, Chanwoo Kim, Giyun Choi, and Dooyoung Hong, highlights that current VLMs lack a mechanism for reliable refusal. Detecting unanswerable user queries is essential for the reliable deployment of real-world embodied agents. Without such detection, agents can confuse users or even cause physical harm during navigation tasks.
The Semantic Flip Framework
To address this, the researchers propose Semantic Flip, a simple yet effective framework that synthesizes auxiliary out-of-distribution (OOD) samples for embodied refusal without requiring external OOD annotations. The key idea is to independently transform the query and video memory to construct auxiliary OOD pairs that lack sufficient visual grounding. These synthesized pairs enable training a lightweight rejection module on top of a frozen pretrained VLM. The module attaches to any existing VLM-based pipeline without retraining the underlying model.
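The idea of pairing independently transformed queries and memories and then fitting a small rejection module on frozen features can be illustrated with a toy sketch. This is not the authors' implementation: the transforms (`flip_query`, `flip_memory`), the word-overlap stand-in for a frozen VLM (`frozen_features`), the logistic-regression head, and the caption data are all assumptions chosen only to keep the example self-contained.

```python
import math

# Toy answerable pairs: (query, video memory as a list of frame captions).
# Captions stand in for real video frames; all data here is made up.
PAIRS = [
    ("where is the red mug", ["a red mug on the kitchen counter", "a kettle near the sink"]),
    ("where is the blue chair", ["a blue chair beside the window", "a lamp in the corner"]),
    ("where is the green plant", ["a green plant on the wooden shelf", "books on the shelf"]),
    ("where is the black laptop", ["a black laptop on the office desk", "a monitor on the desk"]),
]

def flip_query(index, pairs):
    """Hypothetical query transform: borrow the query from the next pair."""
    other = (index + 1) % len(pairs)
    return pairs[other][0], pairs[index][1]

def flip_memory(index, pairs):
    """Hypothetical memory transform: borrow the memory from the next pair."""
    other = (index + 1) % len(pairs)
    return pairs[index][0], pairs[other][1]

def frozen_features(query, memory):
    """Stand-in for a frozen VLM: fraction of query words found in memory."""
    memory_words = set(" ".join(memory).split())
    query_words = query.split()
    overlap = sum(word in memory_words for word in query_words)
    return [overlap / len(query_words)]

def build_training_set(pairs):
    """Label original pairs 0 (answer) and synthesized pairs 1 (refuse)."""
    examples = []
    for i, (query, memory) in enumerate(pairs):
        examples.append((frozen_features(query, memory), 0))
        examples.append((frozen_features(*flip_query(i, pairs)), 1))
        examples.append((frozen_features(*flip_memory(i, pairs)), 1))
    return examples

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_rejection_head(examples, steps=2000, lr=1.0):
    """Fit a logistic-regression head with plain batch gradient descent."""
    dim = len(examples[0][0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for features, label in examples:
            logit = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(logit) - label
            grad_w = [g + error * x for g, x in zip(grad_w, features)]
            grad_b += error
        weights = [w - lr * g / len(examples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(examples)
    return weights, bias

def should_refuse(query, memory, weights, bias):
    """Refuse when the head assigns at least 0.5 probability to 'unanswerable'."""
    features = frozen_features(query, memory)
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(logit) >= 0.5

# Only the small head is trained; the feature extractor is never updated.
WEIGHTS, BIAS = train_rejection_head(build_training_set(PAIRS))
```

In this sketch, `should_refuse("where is the purple sofa", PAIRS[0][1], WEIGHTS, BIAS)` flags a query whose object never appears in the memory, while the original query for that memory is answered. A real system would replace `frozen_features` with embeddings from the pretrained VLM and would use the paper's own transformations.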
Benchmarking with SpaceReject
The paper introduces SpaceReject, a new refusal benchmark for spatial localization with deliberately unanswerable queries over long video memory. On this benchmark, Semantic Flip achieves an F1 score of 0.9559. Across two complementary benchmarks, Semantic Flip consistently outperforms strong prompting baselines. The source code and datasets are publicly available.
| Benchmark | F1 Score |
|---|---|
| SpaceReject | 0.9559 |
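For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it for a refusal task, assuming that "refuse" is treated as the positive class; the article does not state which class the paper treats as positive, and the function name `refusal_f1` and the sample labels are purely illustrative.

```python
def refusal_f1(predicted, expected):
    """F1 for binary refusal decisions, with True meaning 'refuse'."""
    true_pos = sum(p and e for p, e in zip(predicted, expected))
    false_pos = sum(p and not e for p, e in zip(predicted, expected))
    false_neg = sum(e and not p for p, e in zip(predicted, expected))
    if true_pos == 0:
        return 0.0  # no correct refusals, so precision and recall are both zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Made-up decisions for five queries: two correct refusals,
# one unnecessary refusal, and one missed refusal.
predicted = [True, True, False, True, False]
expected = [True, False, False, True, True]
score = refusal_f1(predicted, expected)
```

A high F1 under this convention means the module both catches most unanswerable queries (recall) and rarely refuses answerable ones (precision).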
Implications for AI Safety
By enabling robust refusal, Semantic Flip reduces the risk of misdirection in embodied agents. For enterprise applications such as warehouse robotics or autonomous inspection, this capability can help agents abstain from acting when information is insufficient. Because the rejection module sits on top of a frozen pretrained VLM, the underlying model does not need to be retrained, which lowers deployment barriers.
Because it synthesizes OOD data without external annotations, Semantic Flip offers a practical response to the task-dependent risks of overconfidence that the paper describes. As embodied AI becomes more prevalent in industrial settings, refusal mechanisms will be critical for safe human-robot interaction.