Small vision-language models (2-8 billion parameters) are well suited to clinical deployment, where privacy constraints, limited connectivity, and low-latency requirements favour on-device or on-premise inference. However, their limited capacity makes them more prone to generating plausible but incorrect outputs, a critical problem in medical applications. A research paper published on arXiv proposes a decoding method called Wasserstein Equilibrium Decoding, which addresses this problem by extending game-theoretic decoding to multimodal models for open-ended medical visual question answering (VQA).
The Challenge of Hallucination in Medical VQA
Medical VQA requires models to answer questions about medical images accurately. Small vision-language models often produce convincing but wrong answers due to their limited capacity. Previous game-theoretic decoding approaches were restricted to text-only, closed-ended NLP tasks. The researchers extend this framework to vision-language models, introducing a semantically aware Wasserstein stopping criterion that replaces lexical order matching. This enables convergence based on semantic consensus among near-synonymous candidate answers, avoiding unnecessary iterations caused by clinically equivalent ranking swaps.
Technical Innovation: Wasserstein Stopping Criterion
The key innovation is the Wasserstein stopping criterion, which measures semantic distance between candidate answers using optimal transport theory. Instead of relying on exact word matches, it evaluates whether multiple candidates are clinically equivalent, stopping the decoding process once a stable consensus is reached. This reduces the number of iterations while preserving the game-theoretic equilibrium that discourages hallucination.
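To make the contrast with lexical order matching concrete, the following is a minimal sketch and not the authors' implementation. It assumes the criterion compares two probability distributions over the same candidate answers (for example, from successive iterations or from two players) and stops once the optimal-transport cost between them falls below a threshold. The candidate answers, the pairwise semantic cost matrix, the threshold value, and all function names are hypothetical. In practice the costs would presumably come from an embedding model.

```python
INF = float("inf")


def wasserstein_distance(p, q, cost, tol=1e-12):
    """Exact optimal-transport cost between discrete distributions p and q.

    Solved as a min-cost flow with successive shortest paths (Bellman-Ford),
    which is adequate for the handful of candidates considered here.
    """
    n, m = len(p), len(q)
    src, snk = n + m, n + m + 1
    # Each edge is [target, residual capacity, unit cost, index of reverse edge].
    graph = [[] for _ in range(n + m + 2)]

    def add_edge(a, b, cap, c):
        graph[a].append([b, cap, c, len(graph[b])])
        graph[b].append([a, 0.0, -c, len(graph[a]) - 1])

    for i in range(n):
        add_edge(src, i, p[i], 0.0)
        for j in range(m):
            add_edge(i, n + j, INF, cost[i][j])
    for j in range(m):
        add_edge(n + j, snk, q[j], 0.0)

    total = 0.0
    while True:
        # Bellman-Ford on the residual graph (reverse edges have negative cost).
        dist = [INF] * len(graph)
        prev = [None] * len(graph)
        dist[src] = 0.0
        for _ in range(len(graph) - 1):
            for a in range(len(graph)):
                for k, (b, cap, c, _) in enumerate(graph[a]):
                    if cap > tol and dist[a] + c < dist[b] - tol:
                        dist[b] = dist[a] + c
                        prev[b] = (a, k)
        if prev[snk] is None:
            break  # no probability mass left to move
        # Find the bottleneck capacity along the shortest path.
        flow, node = INF, snk
        while node != src:
            a, k = prev[node]
            flow = min(flow, graph[a][k][1])
            node = a
        # Push the flow and update the residual capacities.
        node = snk
        while node != src:
            a, k = prev[node]
            edge = graph[a][k]
            edge[1] -= flow
            graph[edge[0]][edge[3]][1] += flow
            node = a
        total += flow * dist[snk]
    return total


def order_converged(p, q):
    """Lexical-style criterion: stop only if both candidate rankings are identical."""
    def ranking(d):
        return sorted(range(len(d)), key=lambda i: -d[i])
    return ranking(p) == ranking(q)


def wasserstein_converged(p, q, cost, threshold=0.01):
    """Semantic criterion: stop once little mass moves between distant answers."""
    return wasserstein_distance(p, q, cost) <= threshold


# Hypothetical candidates and semantic costs (0 = same meaning, 1 = unrelated).
candidates = ["pleural effusion", "fluid in the pleural space", "pneumothorax"]
cost = [[0.00, 0.05, 0.90],
        [0.05, 0.00, 0.90],
        [0.90, 0.90, 0.00]]

before = [0.45, 0.40, 0.15]
after_swap = [0.40, 0.45, 0.15]   # the two near-synonyms trade places
after_shift = [0.15, 0.40, 0.45]  # mass moves to a clinically different answer

print(order_converged(before, after_swap))              # False: the ranking changed
print(wasserstein_converged(before, after_swap, cost))  # True: the swap is semantically cheap
print(wasserstein_converged(before, after_shift, cost)) # False: the answer really changed
```

In this toy setting the ranking swap between two near-synonymous answers would force another iteration under order matching, while the transport-based check treats it as converged. A shift of probability mass towards a clinically different answer still keeps the loop running.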
Empirical Results: Accuracy and Efficiency Gains
On the VQA-RAD and PathVQA benchmarks, the method achieved consistent, statistically significant improvements over greedy and discriminative baselines. The following table summarises key results:
| Dataset | Configuration | Metric | Result |
|---|---|---|---|
| VQA-RAD | Qwen3-VL-2B with Wasserstein criterion | Accuracy | +3.5 percentage points (p < 0.01) |
| VQA-RAD | Qwen3-VL-2B with Wasserstein criterion | Accuracy | Surpasses the larger 4B model under greedy decoding |
| PathVQA | Gemma-3-4B with BDG | Accuracy | Matches MedGemma-4B under greedy decoding, without domain-specific fine-tuning |
| Both | Wasserstein criterion vs. classic BDG | Convergence iterations | ~20% fewer at accuracy parity |
On VQA-RAD, the Wasserstein approach improved Qwen3-VL-2B by 3.5 percentage points, surpassing the performance of the larger greedy 4B model. Similar trends were observed at larger scales. On PathVQA, Gemma-3-4B with BDG matched the accuracy of MedGemma-4B under greedy decoding, despite no domain-specific fine-tuning. At parity with classic BDG accuracy, the Wasserstein criterion reduced the average number of convergence iterations by approximately 20%, improving inference efficiency while preserving the game-theoretic equilibrium behaviour.
Implications for Enterprise AI Deployment
The method's suitability for small vision-language models aligns with enterprise requirements for on-device or on-premise inference, particularly in healthcare where data privacy and low latency are paramount. The code is publicly available at the repository Wasserstein-BDG-medical-VQA, enabling organisations to evaluate and integrate the technique into their own clinical AI pipelines. By reducing both hallucination risk and computational cost, Wasserstein Equilibrium Decoding offers a practical path to more reliable medical AI systems.
Researchers named in the preprint include Luca Hagen, Johanna P. Müller, Weitong Zhang, Mengyun Qiao, and Bernhard Kainz. The work demonstrates that game-theoretic decoding can be successfully extended beyond text-only tasks to multimodal, open-ended medical VQA, potentially influencing how enterprise AI developers approach reliability-critical applications.