GEASS: Gated Evidence-Adaptive Selective Caption Trust Tackles VLM Hallucination

Vision-language models often hallucinate objects, and feeding them their own captions can actually worsen accuracy. Researchers propose GEASS, a gated evidence-adaptive module that decides per query how much of the caption to trust, improving accuracy across four VLMs on two benchmarks without training or additional parameters.

iGEN Editorial

June 16, 2026

Vision-language models (VLMs) have a persistent problem: they hallucinate objects not present in an image. A common fix is to feed the model its own generated caption as auxiliary evidence, on the assumption that the caption, once available, is always helpful. But a new paper shows that naive caption appending can backfire, lowering accuracy by nearly ten points on the HallusionBench benchmark for the Qwen2.5-VL-3B model. To address this, the authors introduce GEASS (Gated Evidence-Adaptive Selective Caption Trust), a training-free, logit-level module that dynamically gates how much of the caption to trust per query.

The research, by Li and Zhang, was published on arXiv in May 2026. The team first built GD-Probe, a diagnostic dataset that pairs a global question and a detail question about the same image, which let them isolate how caption utility varies by query type. They found that the same caption helps global questions but harms detail questions, because the caption text competes with the image for the model's attention. Whether the caption helps or hurts depends on whether it covers the content being queried. Crucially, which of these two regimes a query falls into can be read from quantities the decoder already emits, without access to attention maps or grounding.

GEASS turns this insight into a decoding rule: it gates caption influence by the clean path's confidence, weights that influence by the entropy reduction the caption induces, and raises the evidence bar when the two pathways disagree. The method requires no training and adds only two forward passes, with no new parameters. Across four different VLMs and two benchmarks (POPE and HallusionBench), GEASS improves over both vanilla inference and contrastive decoding under a single fixed setting.
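To make the three ingredients concrete, here is a minimal Python sketch of one way such a gate could be wired together. It is not the authors' implementation: the article does not give the paper's formulas, so the function names, the choice of gate (one minus the clean path's top probability), the normalization by the maximum entropy, and the `disagreement_margin` value are all assumptions made for illustration. "Clean path" is taken here to mean the forward pass without the caption.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


def caption_trust_weight(clean_logits, caption_logits, disagreement_margin=0.1):
    """Hypothetical trust weight in [0, 1] for the caption-conditioned path.

    Assumes at least two candidate answers (e.g. yes/no logits).
    """
    p_clean = softmax(clean_logits)
    p_caption = softmax(caption_logits)

    # Assumed gate: a confident clean path leaves little room for the caption.
    gate = 1.0 - max(p_clean)

    # Entropy reduction induced by the caption (negative if it adds doubt).
    gain = entropy(p_clean) - entropy(p_caption)

    # Assumed rule: disagreement between the paths raises the evidence bar.
    disagree = argmax(p_clean) != argmax(p_caption)
    bar = disagreement_margin if disagree else 0.0

    # Only entropy reduction above the bar counts as evidence.
    evidence = max(0.0, gain - bar)

    # Normalize by the maximum possible entropy so the weight stays in [0, 1].
    return gate * evidence / math.log(len(clean_logits))


def gated_logits(clean_logits, caption_logits, disagreement_margin=0.1):
    """Blend the two paths' logits using the hypothetical trust weight."""
    w = caption_trust_weight(clean_logits, caption_logits, disagreement_margin)
    return [(1.0 - w) * c + w * k for c, k in zip(clean_logits, caption_logits)]
```

Under these assumptions, a caption that makes the model less certain gets zero weight and the clean logits pass through unchanged, while a caption that sharply reduces uncertainty on a query where the clean path was unsure gets the most influence. Both inputs come from ordinary forward passes, which matches the article's point that nothing beyond the decoder's own outputs is needed.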

Key results at a glance

Property	Detail
Problem addressed	VLM hallucination of non-present objects
Baseline issue	Naive caption appending drops Qwen2.5-VL-3B accuracy by ~10 points on HallusionBench
GEASS approach	Training-free, logit-level gating of caption trust per query
Additional cost	Two forward passes, zero new parameters
Benchmarks improved	POPE and HallusionBench
Models tested	Four VLMs (including Qwen2.5-VL-3B)
Comparison	Outperforms vanilla inference and contrastive decoding

For enterprise technology leaders evaluating VLMs for applications such as automated document processing, quality inspection, or image-based data extraction, hallucination is a critical reliability barrier. The GEASS method offers a lightweight, drop-in improvement that does not require retraining or infrastructure changes. By selectively trusting captions only when they reduce uncertainty, the module promises more trustworthy outputs without added complexity.

The paper notes that the mechanism is 'readable from quantities the decoder already emits,' meaning it can be implemented atop existing VLM pipelines. This makes it particularly attractive for organizations deploying models in production environments where reliability and cost-efficiency are paramount. As VLMs become more integrated into enterprise workflows, techniques like GEASS that enhance factual accuracy without retraining or extra parameters, at the cost of two additional forward passes, will be essential for scaling trustworthy AI.
