Vision-language models (VLMs) have a persistent problem: they hallucinate objects not present in an image. A common fix is to feed the model its own generated caption as auxiliary evidence, on the assumption that the caption, once available, is always helpful. But a new paper shows that naive caption appending can backfire, lowering accuracy by nearly ten points on the HallusionBench benchmark for the Qwen2.5-VL-3B model. To address this, the authors introduce GEASS (Gated Evidence-Adaptive Selective Caption Trust), a training-free, logit-level module that dynamically gates how much of the caption to trust per query.
The research, by Li and Zhang, was published on arXiv in May 2026. The team first built GD-Probe, a diagnostic dataset that pairs a global question and a detail question on the same image, which let them isolate how caption utility varies by query type. They found that the same caption helps global questions but harms detail questions, because the caption text competes with the image for the model's attention. Whether the caption helps or hurts depends on whether it covers the content being queried. Crucially, which of the two regimes applies can be read from quantities the decoder already emits, without access to attention maps or grounding.
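To make the paired-question idea concrete, here is a minimal sketch of how a caption's effect could be tallied separately for global and detail questions. The record layout, the field names (`qtype`, `correct_clean`, `correct_caption`), and the four toy rows are all invented for illustration; they are not GD-Probe's actual schema or data.

```python
from collections import defaultdict

# Toy records, invented purely for illustration (NOT GD-Probe data).
# Each image contributes one global and one detail question, answered
# once without the caption ("clean") and once with it appended.
records = [
    {"image": "img_001", "qtype": "global", "correct_clean": False, "correct_caption": True},
    {"image": "img_001", "qtype": "detail", "correct_clean": True, "correct_caption": False},
    {"image": "img_002", "qtype": "global", "correct_clean": True, "correct_caption": True},
    {"image": "img_002", "qtype": "detail", "correct_clean": True, "correct_caption": True},
]


def caption_effect_by_type(rows):
    """Return {qtype: accuracy_with_caption - accuracy_without_caption}."""
    totals = defaultdict(lambda: [0, 0, 0])  # [count, clean hits, caption hits]
    for row in rows:
        bucket = totals[row["qtype"]]
        bucket[0] += 1
        bucket[1] += int(row["correct_clean"])
        bucket[2] += int(row["correct_caption"])
    return {q: (cap - clean) / n for q, (n, clean, cap) in totals.items()}


effect = caption_effect_by_type(records)
for qtype, delta in effect.items():
    print(f"{qtype}: caption changes accuracy by {delta:+.2f}")
```

Because both question types share the same images and captions, a positive delta for one type and a negative delta for the other cannot be blamed on differences between images; it points to the query type itself.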
GEASS turns this insight into a decoding-time module: it gates caption influence by the clean path's confidence, weights it by the entropy reduction the caption induces, and raises the evidence bar when the two pathways disagree. The method requires no training and no new parameters, and adds only two forward passes. Across four VLMs and two benchmarks (POPE and HallusionBench), GEASS improves over both vanilla inference and contrastive decoding under a single fixed setting.
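The three ingredients described above (a confidence gate, an entropy-reduction weight, and a stricter bar on disagreement) can be sketched at the logit level as follows. This is not the paper's formula, which the article does not give. The blending rule, the function name `gated_caption_logits`, and the parameters `conf_threshold` and `disagree_penalty` are assumptions made for illustration. The sketch also assumes that the "clean path" is the pass without the caption and that next-token logits from both passes are already available.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def gated_caption_logits(clean_logits, caption_logits,
                         conf_threshold=0.9, disagree_penalty=2.0):
    """Hypothetical gate (not the paper's exact rule).

    clean_logits:   logits from the pass WITHOUT the caption (assumed).
    caption_logits: logits from the pass WITH the caption appended.
    Returns (fused_logits, caption_weight), with caption_weight in [0, 1].
    """
    p_clean = softmax(clean_logits)
    p_cap = softmax(caption_logits)

    # 1) Confidence gate: a confident clean path leaves the caption less room.
    gate = max(0.0, 1.0 - max(p_clean) / conf_threshold)

    # 2) Entropy weight: trust the caption only as far as it reduces uncertainty.
    h_clean, h_cap = entropy(p_clean), entropy(p_cap)
    reduction = max(0.0, h_clean - h_cap) / h_clean if h_clean > 0.0 else 0.0

    # 3) Disagreement: if the two paths prefer different tokens, shrink the weight.
    weight = gate * reduction
    if p_clean.index(max(p_clean)) != p_cap.index(max(p_cap)):
        weight /= disagree_penalty

    fused = [(1.0 - weight) * c + weight * k
             for c, k in zip(clean_logits, caption_logits)]
    return fused, weight


# Usage with two-token ("yes"/"no") toy logits.
fused, w = gated_caption_logits([0.1, 0.0], [3.0, 0.0])
print(f"caption weight: {w:.3f}, fused logits: {fused}")
```

With this rule, a clean path that is already near-certain gets a caption weight of zero, so the output falls back to vanilla decoding; the caption only shifts the logits when the model is unsure and the caption makes it more certain.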
Key results at a glance
| Property | Detail |
|---|---|
| Problem addressed | VLM hallucination of non-present objects |
| Baseline issue | Naive caption appending drops Qwen2.5-VL-3B accuracy by ~10 points on HallusionBench |
| GEASS approach | Training-free, logit-level gating of caption trust per query |
| Additional cost | Two forward passes, zero new parameters |
| Benchmarks improved | POPE and HallusionBench |
| Models tested | Four VLMs (including Qwen2.5-VL-3B) |
| Comparison | Outperforms vanilla inference and contrastive decoding |
For enterprise technology leaders evaluating VLMs for applications such as automated document processing, quality inspection, or image-based data extraction, hallucination is a critical reliability barrier. GEASS offers a lightweight, drop-in improvement that requires neither retraining nor new parameters; its cost is two additional forward passes at inference time. By trusting captions only when they reduce uncertainty, the module promises more trustworthy outputs without changes to the underlying model.
The paper notes that the mechanism is "readable from quantities the decoder already emits," meaning it can be implemented atop existing VLM pipelines. This makes it attractive for organizations deploying models in production, where reliability and cost-efficiency are paramount. As VLMs become more integrated into enterprise workflows, training-free techniques like GEASS that improve factual accuracy at modest inference cost will be important for scaling trustworthy AI.