VinQA Dataset Enables Multimodal Document QA with Interleaved Visual Elements for Enterprise AI

A new dataset called VinQA targets long-form answer generation in multimodal document QA, where cited visual elements are interleaved with text. The paper compares two encoding methods, proposes an evaluation framework, and shows that fine-tuning open Qwen2.5-VL models can approach proprietary frontier model performance.

iGEN Editorial

June 16, 2026

Real-world business documents frequently contain tables, charts, photographs, and diagrams arranged in diverse layouts. Yet most existing multimodal large language models (MLLMs) for document question answering produce text-only responses, underutilizing these visual elements. A recent arXiv paper by Young Rok Jang, Hyesoo Kong, Kyunghwan An, Jae Sub Huh, Gyeonghun Kim, and Stanley Jungkyu Choi introduces VinQA, a dataset designed for long-form answer generation in which cited visual elements are explicitly interleaved with supporting text and grounded in the relevant document pages.

Two Encoding Methods for Visual Citations

The VinQA study explores two encoding methods for feeding raw document page images into an MLLM, along with their visual-element citation mechanisms:

Page Encoding: Directly encodes full-page images with bounding boxes of visual elements and treats these boxed regions as citable units.
Modality Encoding: Parses each page to extract text and crop visual elements, encodes them separately, and uses these cropped elements as citable units.

Each method addresses different trade-offs in handling complex documents with long text, numerous visual elements, and diverse citation requirements.

Method	Encoding Approach	Citable Units	Robustness
Page Encoding	Full-page image + bounding boxes	Boxed regions	Initially less robust; reaches a comparable level after training on VinQA
Modality Encoding	Separately encoded text and cropped visuals	Cropped visual elements	Initially more robust for complex documents with long text and many visual elements
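
The difference between the two methods can be sketched as two ways of turning the same parsed pages into model inputs. This is a minimal illustration, not the authors' implementation: the `Page` and `VisualElement` classes, the dictionary layout, and ids such as "p1_chart1" are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class VisualElement:
    element_id: str                   # hypothetical citation id, e.g. "p1_chart1"
    bbox: Tuple[int, int, int, int]   # assumed (left, top, right, bottom) in pixels


@dataclass
class Page:
    number: int
    image_path: str
    text: str                         # text a parser would extract (Modality Encoding only)
    elements: List[VisualElement] = field(default_factory=list)


def page_encoding_inputs(pages: List[Page]) -> List[Dict]:
    """Page Encoding: one full-page image per page; boxed regions are the citable units."""
    return [
        {
            "type": "page_image",
            "image": p.image_path,
            "citable": [{"id": e.element_id, "bbox": e.bbox} for e in p.elements],
        }
        for p in pages
    ]


def modality_encoding_inputs(pages: List[Page]) -> List[Dict]:
    """Modality Encoding: parsed text and cropped visuals become separate inputs;
    each crop is a citable unit."""
    inputs: List[Dict] = []
    for p in pages:
        inputs.append({"type": "text", "page": p.number, "text": p.text})
        for e in p.elements:
            inputs.append(
                {"type": "crop", "id": e.element_id, "source": p.image_path, "bbox": e.bbox}
            )
    return inputs
```

In this sketch, Page Encoding needs only the page image and box coordinates, while Modality Encoding depends on an upstream parser to supply `text` and the crop regions, which is the explicit parsing step the article refers to.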

Evaluation Framework M-GroSE and Visual Source F1

To assess answer quality across multiple dimensions, the authors propose M-GroSE, a multimodal evaluation framework extending GroUSE. M-GroSE evaluates answers along four dimensions: completeness, answer relevancy, faithfulness, and unanswerability. Additionally, Visual Source F1 is reported to measure visual citation accuracy directly, giving a quantitative check on whether a model cites the right visual elements.
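
The article does not give the exact definition of Visual Source F1, so the following is only a sketch of one plausible reading: a set-based F1 between the visual elements a model cites and a reference set. The function name, the id strings, and the handling of the empty case are assumptions made for illustration.

```python
from typing import Iterable


def visual_source_f1(cited: Iterable[str], gold: Iterable[str]) -> float:
    """Set-based F1 between cited and reference visual element ids.

    Assumption: citations are compared as unordered sets of ids; the paper's
    actual metric may differ (for example, in how duplicates are counted).
    """
    cited_set, gold_set = set(cited), set(gold)
    if not cited_set and not gold_set:
        # Assumption: citing nothing when nothing should be cited is a perfect score.
        return 1.0
    true_positives = len(cited_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(cited_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# One of two cited elements is correct, and one of two reference elements is found:
# precision = 0.5, recall = 0.5, so F1 = 0.5.
score = visual_source_f1(["p2_table1", "p5_chart2"], ["p2_table1", "p3_fig1"])
```

A metric of this kind checks only which elements are cited. Whether they appear at sensible positions in the answer is a separate question, which the article attributes to the MLLM-based judge described in the results.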

Experimental Results: Fine-Tuning Narrows the Gap

In experiments on the VinQA test split, proprietary frontier models still achieve the best overall scores. However, fine-tuning open Qwen2.5-VL models on the training split substantially improves their performance and narrows this gap. Initially, Modality Encoding is more robust for complex documents. After training on VinQA, Page Encoding reaches a comparable level, competing effectively even without the explicit parsing used in Modality Encoding. Visual G-Eval, an MLLM-based judge, confirms that fine-tuned models insert visual elements at semantically appropriate positions with faithful supporting text.

Implications for Enterprise Document Processing

While the VinQA work is a research contribution, its focus on handling visual elements in document QA has direct relevance to enterprise scenarios where forms, invoices, contracts, and technical manuals mix text with graphics. Improving AI's ability to generate answers that cite both text and visual evidence could enhance automation in document review, compliance checking, and knowledge retrieval. The dataset and methods provide a benchmark for developing more capable MLLMs tailored to complex real-world documents.
