Real-world business documents frequently contain tables, charts, photographs, and diagrams arranged in diverse layouts. Yet most existing multimodal large language models (MLLMs) for document question answering produce text-only responses, underutilizing these visual elements. A recent arXiv paper by Young Rok Jang, Hyesoo Kong, Kyunghwan An, Jae Sub Huh, Gyeonghun Kim, and Stanley Jungkyu Choi introduces VinQA, a dataset for long-form answer generation in which cited visual elements are explicitly interleaved with supporting text and grounded in the relevant document pages.
Two Encoding Methods for Visual Citations
The VinQA study explores two encoding methods for feeding raw document page images into an MLLM, along with their visual-element citation mechanisms:
- Page Encoding: Directly encodes full-page images with bounding boxes of visual elements and treats these boxed regions as citable units.
- Modality Encoding: Parses each page to extract text and crop visual elements, encodes them separately, and uses these cropped elements as citable units.
Each method addresses different trade-offs in handling complex documents with long text, numerous visual elements, and diverse citation requirements.
| Method | Encoding Approach | Citable Units | Suitability |
|---|---|---|---|
| Page Encoding | Full-page image + bounding boxes | Boxed regions | Initially less robust; reaches a comparable level after training on VinQA |
| Modality Encoding | Separate text and cropped visuals | Cropped visual elements | Initially more robust for complex documents with long text and many visuals |
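
The difference between the two methods can be sketched as two ways of building model inputs from the same pages. This is a minimal illustration, not the authors' implementation: the `Page` and `VisualElement` types, the `V1`-style identifiers, the bounding-box format, and the dictionary layout are all hypothetical, and no real image encoding or parsing is performed.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical box format: (left, top, right, bottom) in page pixels.
BBox = Tuple[int, int, int, int]


@dataclass(frozen=True)
class VisualElement:
    element_id: str  # hypothetical citation tag, e.g. "V1"
    page: int
    kind: str        # "table", "chart", "photo", or "diagram"
    bbox: BBox


@dataclass(frozen=True)
class Page:
    number: int
    text: str                           # text a parser would extract
    visuals: Tuple[VisualElement, ...]  # visual elements found on the page


def page_encoding(pages: List[Page]) -> List[Dict]:
    """One full-page image per page; the boxed regions are the citable units."""
    return [
        {
            "type": "page_image",
            "page": p.number,
            "citable": [{"id": v.element_id, "bbox": v.bbox} for v in p.visuals],
        }
        for p in pages
    ]


def modality_encoding(pages: List[Page]) -> List[Dict]:
    """Parsed text and cropped visuals as separate inputs; crops are citable."""
    inputs: List[Dict] = []
    for p in pages:
        inputs.append({"type": "text", "page": p.number, "content": p.text})
        for v in p.visuals:
            inputs.append(
                {"type": "crop", "page": p.number, "id": v.element_id, "kind": v.kind}
            )
    return inputs


def citable_ids(inputs: List[Dict]) -> List[str]:
    """Collect the identifiers a model could cite under either encoding."""
    ids: List[str] = []
    for item in inputs:
        if item["type"] == "page_image":
            ids.extend(c["id"] for c in item["citable"])
        elif item["type"] == "crop":
            ids.append(item["id"])
    return ids


# Made-up two-page document used only for illustration.
pages = [
    Page(1, "Quarterly revenue summary.",
         (VisualElement("V1", 1, "chart", (40, 120, 560, 420)),)),
    Page(2, "Regional breakdown.",
         (VisualElement("V2", 2, "table", (30, 80, 580, 300)),
          VisualElement("V3", 2, "photo", (30, 320, 300, 520)))),
]
page_inputs = page_encoding(pages)
modality_inputs = modality_encoding(pages)
```

In this sketch both encodings expose the same citable identifiers; what changes is whether the model sees each element in the context of the whole page or as a separately encoded crop next to extracted text.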
Evaluation Framework: M-GroSE and Visual Source F1
To assess answer quality, the authors propose M-GroSE, a multimodal evaluation framework that extends GroUSE. M-GroSE scores answers along four dimensions: completeness, answer relevancy, faithfulness, and unanswerability. The authors also report Visual Source F1 to measure visual citation accuracy directly.
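As a rough illustration of a citation-level F1, the sketch below compares the set of visual elements cited in an answer against a reference set. The article does not give the paper's exact definition, so the set-based formulation, the `[V1]`-style citation tags, the handling of empty sets, and the function names are all assumptions.

```python
import re
from typing import Iterable, List


def cited_visuals(answer: str) -> List[str]:
    """Extract citation tags such as [V1] from an interleaved answer.

    The bracketed tag format is a hypothetical convention for this sketch.
    """
    return re.findall(r"\[(V\d+)\]", answer)


def visual_source_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Set-based F1 between cited and reference visual element identifiers."""
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        # Assumption: citing nothing when nothing should be cited is perfect.
        return 1.0
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up answer and reference citations for illustration.
answer = "Revenue rose in Q3 [V1], driven by the northern region [V2] and new stores [V4]."
score = visual_source_f1(cited_visuals(answer), ["V1", "V2", "V3"])
```

Here two of the three cited elements are correct and two of the three reference elements are recovered, so precision and recall are both 2/3 and the score is 2/3.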
Experimental Results: Fine-Tuning Narrows the Gap
In experiments on the VinQA test split, proprietary frontier models still achieve the best overall scores. However, fine-tuning open Qwen2.5-VL models on the training split substantially improves their performance and narrows this gap. Before fine-tuning, Modality Encoding is more robust for complex documents. After training on VinQA, Page Encoding reaches a comparable level, even without the explicit parsing that Modality Encoding relies on. Visual G-Eval, an MLLM-based judge, confirms that fine-tuned models insert visual elements at semantically appropriate positions with faithful supporting text.
Implications for Enterprise Document Processing
While the VinQA work is a research contribution, its focus on handling visual elements in document QA has direct relevance to enterprise scenarios where forms, invoices, contracts, and technical manuals mix text with graphics. Improving AI's ability to generate answers that cite both text and visual evidence could enhance automation in document review, compliance checking, and knowledge retrieval. The dataset and methods provide a benchmark for developing more capable MLLMs tailored to complex real-world documents.