
GEASS: Gated Evidence-Adaptive Selective Caption Trust Tackles VLM Hallucination

Vision-language models often hallucinate objects, and feeding them their own captions can actually worsen accuracy. Researchers propose GEASS, a gated evidence-adaptive module that decides per query how much of the caption to trust, improving accuracy across four VLMs on two benchmarks without training or additional parameters.

iGEN Editorial
June 16, 2026

Vision-language models (VLMs) have a persistent problem: they hallucinate objects not present in an image. A common fix is to feed the model its own generated caption as auxiliary evidence, on the assumption that the caption, once available, is always helpful. But a new paper shows that naive caption appending can backfire, lowering accuracy by nearly ten points on the HallusionBench benchmark for the Qwen2.5-VL-3B model. To address this, the authors introduce GEASS (Gated Evidence-Adaptive Selective Caption Trust), a training-free, logit-level module that dynamically gates how much of the caption to trust per query.

The research, by Li and Zhang, was published on arXiv in May 2026. The team first built GD-Probe, a diagnostic dataset that pairs a global question and a detail question on the same image, which let them isolate how caption utility varies by query type. They found that the same caption helps global questions but harms detail questions, because the caption text competes with the image for the model's attention. Whether the caption helps or hurts depends on whether it covers the content being queried. Crucially, which of the two regimes applies can be read from quantities the decoder already emits, without access to attention maps or grounding.
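
To make "quantities the decoder already emits" concrete, here is a minimal Python sketch of two such quantities computed from next-token logits: top-1 confidence and Shannon entropy. It is an illustration, not the paper's code. The function names and the toy yes/no logits are assumptions made for this example.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confidence(probs):
    """Top-1 probability of the distribution."""
    return max(probs)

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical logits over a yes/no answer pair (illustrative values only).
clean_probs = softmax([0.2, 0.0])    # image + question
caption_probs = softmax([2.0, 0.0])  # image + question + caption

# Positive when adding the caption makes the answer distribution sharper.
entropy_drop = entropy(clean_probs) - entropy(caption_probs)
```

Both values come straight from the output distribution, so nothing inside the model (attention maps, grounding heads) has to be inspected.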

GEASS turns this insight into a module that gates caption influence by the clean path's confidence, weights it by the entropy reduction the caption induces, and raises the evidence bar when the two pathways disagree. The method requires no training and no new parameters, adding only two forward passes. Across four different VLMs and two benchmarks (POPE and HallusionBench), GEASS improves over both vanilla inference and contrastive decoding under a single fixed setting.
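
The three ingredients named above (a confidence gate, an entropy-reduction weight, and a higher bar under disagreement) can be sketched in a few lines of Python. This is a hypothetical reading, not the authors' formula. The article does not give the exact functional form, so the mixing rule, the direction of the confidence gate, the thresholds `bar_agree` and `bar_disagree`, and the name `gated_logits` are all assumptions. "Clean path" is taken here to mean the pass without the caption.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def gated_logits(clean_logits, caption_logits,
                 bar_agree=0.05, bar_disagree=0.30):
    """Blend caption-path logits into clean-path logits (assumed form).

    Expects two equal-length logit lists with at least two entries.
    Returns (mixed_logits, weight), where weight is the caption trust in [0, 1).
    """
    p_clean = softmax(clean_logits)
    p_cap = softmax(caption_logits)

    # Assumed gate: a less confident clean path leaves more room for the caption.
    room = 1.0 - max(p_clean)

    # Entropy reduction caused by the caption, normalized by the maximum entropy.
    gain = max(0.0, entropy(p_clean) - entropy(p_cap)) / math.log(len(clean_logits))

    # Assumed rule: when the two paths pick different answers, demand more evidence.
    agree = p_clean.index(max(p_clean)) == p_cap.index(max(p_cap))
    bar = bar_agree if agree else bar_disagree

    weight = room * gain if gain >= bar else 0.0
    mixed = [c + weight * (k - c) for c, k in zip(clean_logits, caption_logits)]
    return mixed, weight

# Paths agree and the caption sharpens the answer: the caption is partly trusted.
mixed_a, w_a = gated_logits([0.2, 0.0], [2.0, 0.0])

# Paths disagree and the entropy drop is modest: the caption is ignored.
mixed_b, w_b = gated_logits([0.2, 0.0], [0.0, 1.0])
```

In this sketch a weight of zero reproduces vanilla decoding exactly, so the gate can only change predictions when the caption both sharpens the distribution and clears the applicable bar.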

Key results at a glance

| Property | Detail |
|---|---|
| Problem addressed | VLM hallucination of non-present objects |
| Baseline issue | Naive caption appending drops Qwen2.5-VL-3B accuracy by ~10 points on HallusionBench |
| GEASS approach | Training-free, logit-level gating of caption trust per query |
| Additional cost | Two forward passes, zero new parameters |
| Benchmarks improved | POPE and HallusionBench |
| Models tested | Four VLMs (including Qwen2.5-VL-3B) |
| Comparison | Outperforms vanilla inference and contrastive decoding |

For enterprise technology leaders evaluating VLMs for applications such as automated document processing, quality inspection, or image-based data extraction, hallucination is a critical reliability barrier. The GEASS method offers a lightweight, drop-in improvement that does not require retraining or infrastructure changes. By selectively trusting captions only when they reduce uncertainty, the module promises more trustworthy outputs without added complexity.

The paper notes that the mechanism is 'readable from quantities the decoder already emits,' meaning it can be implemented atop existing VLM pipelines. This makes it particularly attractive for organizations deploying models in production environments where reliability and cost-efficiency are paramount. As VLMs become more integrated into enterprise workflows, techniques like GEASS that improve factual accuracy without retraining or new parameters, at the cost of extra forward passes, will be essential for scaling trustworthy AI.

