Wasserstein Equilibrium Decoding Boosts Reliability in Medical Visual Question Answering

Researchers have extended game-theoretic decoding to vision-language models for medical visual question answering, introducing a Wasserstein stopping criterion. The method improves accuracy by up to 3.5 percentage points and, at matched accuracy, cuts convergence iterations by roughly 20%.

iGEN Editorial

June 16, 2026

Small vision-language models (2-8 billion parameters) suit clinical deployment because privacy constraints, limited connectivity, and low-latency requirements all favour on-device or on-premise inference. Their limited capacity, however, makes them more prone to plausible but incorrect outputs, a critical problem in medical applications. According to a research paper published on arXiv, a new decoding method called Wasserstein Equilibrium Decoding addresses this problem by extending game-theoretic decoding to multimodal models for open-ended medical visual question answering (VQA).

The Challenge of Hallucination in Medical VQA

Medical VQA requires models to answer questions about medical images accurately, yet small vision-language models often produce convincing but wrong answers. Previous game-theoretic decoding approaches were restricted to text-only, closed-ended NLP tasks. The researchers extend this framework to vision-language models and introduce a semantically aware Wasserstein stopping criterion that replaces lexical order matching. Decoding can then converge on semantic consensus among near-synonymous candidate answers, instead of running extra iterations whenever two clinically equivalent answers swap places in the ranking.

Technical Innovation: Wasserstein Stopping Criterion

The key innovation is the Wasserstein stopping criterion, which measures semantic distance between candidate answers using optimal transport theory. Instead of relying on exact word matches, it evaluates whether multiple candidates are clinically equivalent, stopping the decoding process once a stable consensus is reached. This reduces the number of iterations while preserving the game-theoretic equilibrium that discourages hallucination.
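The contrast between the two stopping rules can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the candidate answers, the hand-set cost matrix, the threshold of 0.05, and the use of Sinkhorn iterations (an entropy-regularised approximation of the Wasserstein distance) are all assumptions made for illustration. The sketch compares two probability distributions over the same candidate answers, for example from two successive decoding rounds.

```python
import math

def sinkhorn_distance(p, q, cost, eps=0.05, n_iters=200):
    """Approximate the Wasserstein distance between p and q.

    Uses Sinkhorn iterations (entropy-regularised optimal transport),
    an assumed stand-in for whatever solver the paper actually uses.
    """
    n, m = len(p), len(q)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [p[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [q[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport cost of the resulting plan u_i * kernel_ij * v_j.
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))

def same_order(p, q):
    """Lexical-style rule: stop only if both rankings are identical."""
    def rank(d):
        return sorted(range(len(d)), key=lambda i: -d[i])
    return rank(p) == rank(q)

def semantically_converged(p, q, cost, threshold=0.05):
    """Semantic rule: stop once little probability mass has to move far."""
    return sinkhorn_distance(p, q, cost) < threshold

# Hypothetical candidates; the first two are treated as near-synonyms.
candidates = ["pleural effusion", "fluid in the pleural space", "pneumothorax"]
# Hand-set semantic distances in [0, 1]; in practice these might come from
# text embeddings (for example, 1 - cosine similarity).
cost = [
    [0.00, 0.05, 0.90],
    [0.05, 0.00, 0.90],
    [0.90, 0.90, 0.00],
]
before = [0.5, 0.4, 0.1]  # candidate scores in one round
after = [0.4, 0.5, 0.1]   # the two near-synonyms have swapped ranks

print(same_order(before, after))                    # False
print(semantically_converged(before, after, cost))  # True
```

In this toy case the order-matching rule would keep iterating because the top two answers swapped, while the transport-based rule stops because the swap only moves probability between answers that are almost the same. A swap involving the dissimilar third candidate would produce a much larger distance and keep decoding running.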

Empirical Results: Accuracy and Efficiency Gains

On the VQA-RAD and PathVQA benchmarks, the method achieved consistent, statistically significant improvements over greedy and discriminative baselines. The following table summarises key results:

Dataset	Model	Metric	Result
VQA-RAD	Qwen3-VL-2B	Accuracy	+3.5 percentage points (p < 0.01)
VQA-RAD	Qwen3-VL-2B with Wasserstein	Accuracy	Surpasses greedy 4B model
PathVQA	Gemma-3-4B with BDG	Accuracy	Matches MedGemma-4B greedy (no fine-tuning)
Both	Wasserstein criterion vs. classic BDG	Convergence iterations	~20% reduction at accuracy parity

On VQA-RAD, the Wasserstein approach improved Qwen3-VL-2B by 3.5 percentage points, surpassing the performance of the larger greedy 4B model. Similar trends were observed at larger scales. On PathVQA, Gemma-3-4B with BDG matched the accuracy of MedGemma-4B under greedy decoding, despite no domain-specific fine-tuning. At parity with classic BDG accuracy, the Wasserstein criterion reduced the average number of convergence iterations by approximately 20%, improving inference efficiency while preserving the game-theoretic equilibrium behaviour.

Implications for Enterprise AI Deployment

The method's suitability for small vision-language models aligns with enterprise requirements for on-device or on-premise inference, particularly in healthcare where data privacy and low latency are paramount. The code is publicly available at the repository Wasserstein-BDG-medical-VQA, enabling organisations to evaluate and integrate the technique into their own clinical AI pipelines. By reducing both hallucination risk and computational cost, Wasserstein Equilibrium Decoding offers a practical path to more reliable medical AI systems.

Researchers named in the preprint include Luca Hagen, Johanna P. Müller, Weitong Zhang, Mengyun Qiao, and Bernhard Kainz. The work demonstrates that game-theoretic decoding can be successfully extended beyond text-only tasks to multimodal, open-ended medical VQA, potentially influencing how enterprise AI developers approach reliability-critical applications.
