VinQA Dataset Enables Multimodal Document QA with Interleaved Visual Elements for Enterprise AI

A new dataset called VinQA targets long-form answer generation in multimodal document QA, where cited visual elements are interleaved with text. The paper compares two encoding methods, proposes an evaluation framework, and shows that fine-tuning open Qwen2.5-VL models narrows the gap to proprietary frontier models.

iGEN Editorial
June 16, 2026

Real-world business documents frequently contain tables, charts, photographs, and diagrams arranged in diverse layouts. Yet most existing multimodal large language models (MLLMs) for document question answering produce text-only responses, underutilizing these visual elements. A recent paper on arXiv by Young Rok Jang, Hyesoo Kong, Kyunghwan An, Jae Sub Huh, Gyeonghun Kim, and Stanley Jungkyu Choi introduces VinQA, a dataset designed for long-form answer generation in which cited visual elements are explicitly interleaved with supporting text and grounded in the relevant document pages.

Two Encoding Methods for Visual Citations

The VinQA study explores two encoding methods for feeding raw document page images into an MLLM, along with their visual-element citation mechanisms:

  • Page Encoding: Directly encodes full-page images with bounding boxes of visual elements and treats these boxed regions as citable units.
  • Modality Encoding: Parses each page to extract text and crop visual elements, encodes them separately, and uses these cropped elements as citable units.

Each method addresses different trade-offs in handling complex documents with long text, numerous visual elements, and diverse citation requirements.
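The difference between the two methods comes down to what the model receives and what it is allowed to cite. The sketch below illustrates that contrast with plain Python data structures. It is not the paper's implementation: the `VisualElement` class, the two helper functions, the dictionary keys, and the file names are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (left, top, right, bottom) in pixels; an assumed convention, not from the paper.
BBox = Tuple[int, int, int, int]


@dataclass(frozen=True)
class VisualElement:
    """A hypothetical record for one table, chart, photo, or diagram on a page."""
    element_id: str
    kind: str
    bbox: BBox


def page_encoding_input(page_image: str, elements: List[VisualElement]) -> Dict:
    """Page Encoding: pass the full-page image; boxed regions are the citable units."""
    return {
        "images": [page_image],
        "text": None,  # text stays inside the page image
        "citable_units": {
            e.element_id: {"page_image": page_image, "bbox": e.bbox}
            for e in elements
        },
    }


def modality_encoding_input(page_text: str, elements: List[VisualElement]) -> Dict:
    """Modality Encoding: pass parsed text plus one crop per element; crops are citable."""
    crops = {e.element_id: f"{e.element_id}.png" for e in elements}  # assumed crop paths
    return {
        "images": list(crops.values()),
        "text": page_text,
        "citable_units": {eid: {"crop_image": path} for eid, path in crops.items()},
    }


elements = [
    VisualElement("table_1", "table", (40, 120, 560, 300)),
    VisualElement("chart_1", "chart", (40, 340, 560, 620)),
]
page_in = page_encoding_input("page_3.png", elements)
mod_in = modality_encoding_input("Placeholder page text.", elements)

print(len(page_in["images"]), len(mod_in["images"]))
```

In this sketch both methods expose the same citable IDs; what differs is that Page Encoding sends one image per page while Modality Encoding sends one image per visual element plus separately parsed text, which is where the parsing cost mentioned above comes from.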

Method | Encoding Approach | Citable Units | Suitability
Page Encoding | Full-page image + bounding boxes | Boxed regions | Initially less robust; reaches a comparable level after training on VinQA
Modality Encoding | Separate text and cropped visuals | Cropped visual elements | Initially more robust for complex documents with long text and many visuals

Evaluation Framework: M-GroSE and Visual Source F1

To assess answer quality across multiple dimensions, the authors propose M-GroSE, a multimodal evaluation framework extending GroUSE. M-GroSE evaluates answers along four dimensions: completeness, answer relevancy, faithfulness, and unanswerability. In addition, the authors report Visual Source F1, which directly measures visual citation accuracy and so gives a quantitative check on the visual elements an answer cites.
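The article does not give the exact formula for Visual Source F1. A minimal sketch, assuming it is a standard set-based F1 between the visual elements an answer cites and the reference (gold) visual elements, might look like the following. The function name `visual_source_f1`, the element IDs, and the handling of empty sets are assumptions for illustration; the paper's definition may differ.

```python
from typing import Iterable


def visual_source_f1(predicted_ids: Iterable[str], gold_ids: Iterable[str]) -> float:
    """Set-based F1 over cited visual element IDs (assumed definition)."""
    pred, gold = set(predicted_ids), set(gold_ids)

    # Assumption: citing nothing when nothing should be cited counts as perfect.
    if not pred and not gold:
        return 1.0

    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0

    precision = true_positives / len(pred)  # share of cited elements that are correct
    recall = true_positives / len(gold)     # share of gold elements that were cited
    return 2 * precision * recall / (precision + recall)


# The answer cites three elements; two of them match the gold citations.
score = visual_source_f1(["fig_1", "table_2", "fig_3"], ["fig_1", "table_2"])
print(round(score, 2))  # 0.8
```

Here precision is 2/3 and recall is 1, giving an F1 of 0.8: the extra, incorrect citation lowers the score even though every gold element was found.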

Experimental Results: Fine-Tuning Narrows the Gap

In experiments on the VinQA test split, proprietary frontier models still achieve the best overall scores. However, fine-tuning open Qwen2.5-VL models on the training split substantially improves their performance and narrows this gap. Initially, Modality Encoding is more robust for complex documents. After training on VinQA, Page Encoding reaches a comparable level, competing effectively even without the explicit parsing used in Modality Encoding. Visual G-Eval, an MLLM-based judge, confirms that fine-tuned models insert visual elements at semantically appropriate positions with faithful supporting text.

Implications for Enterprise Document Processing

While the VinQA work is a research contribution, its focus on handling visual elements in document QA has direct relevance to enterprise scenarios where forms, invoices, contracts, and technical manuals mix text with graphics. Improving AI's ability to generate answers that cite both text and visual evidence could enhance automation in document review, compliance checking, and knowledge retrieval. The dataset and methods provide a benchmark for developing more capable MLLMs tailored to complex real-world documents.

