MAGE-RAG: Multigranular Adaptive Graph Evidence Framework Improves Long-Document Multimodal QA Accuracy

The MAGE-RAG research paper introduces a multigranular adaptive graph evidence framework for multimodal retrieval-augmented generation (RAG) in long-document question answering. By building an evidence graph with page and element nodes and using an online controller to iteratively activate and prune evidence, it balances coverage and noise. Experiments show accuracy improvements over existing methods on LongDocURL and MMLongBench-Doc benchmarks.

iGEN Editorial

June 16, 2026

Enterprises processing long, multimodal documents—such as PDFs containing text, tables, images, charts, and complex layouts—face a fundamental challenge: locating sparse evidence scattered across many pages while managing irrelevant information and controlling inference cost. Existing retrieval-augmented generation (RAG) methods rely on fixed Top-k retrieval over text chunks or entire pages, leading to a static trade-off among evidence coverage, noise, and cost. Text retrieval compresses context but often loses visual and layout information; page-level visual retrieval preserves the original page but introduces large irrelevant regions, degrading reader performance.

A new research paper published on arXiv proposes MAGE-RAG (Multigranular Adaptive Graph Evidence for Agentic Multimodal RAG in Long-Document QA) to address this problem. According to the paper, authored by Zuo Yilong, Li Xunkai, Yuan Jing, Dai Qiangqiang, Qin Hongchao, and Ronghua, the framework uses page retrieval as the entry point for query-time evidence construction. Offline, it builds an evidence graph with page nodes and element nodes, encoding containment, reading order, layout adjacency, section hierarchy, and semantic-neighbor relations. At query time, an online evidence controller iteratively activates, opens, searches, and prunes evidence under explicit budgets. The resulting evidence subgraph is rendered into structured multimodal reader input, allowing a large vision-language model (LVLM) to consume compact and relevant evidence within a limited context.

How MAGE-RAG Works

The architecture consists of two main phases:

Offline graph construction: An evidence graph is created with two types of nodes: page nodes (representing each page in a document) and element nodes (representing individual elements like text blocks, tables, images, and charts). Edges encode five relationship types: containment (element belongs to a page), reading order (sequential flow), layout adjacency (spatial proximity), section hierarchy (e.g., heading-subsection), and semantic-neighbor relations (based on content similarity).
Online evidence control: At query time, an evidence controller performs iterative steps—activation, opening, searching, and pruning—guided by explicit budgets (e.g., maximum number of pages or elements). This produces a query-specific evidence subgraph that is then flattened into structured input for the LVLM.
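As a rough illustration of the offline phase, the two node types and five edge types described above could be held in a small typed adjacency structure. This is a minimal sketch, not the authors' implementation: the class names (EvidenceGraph, Node, EdgeType), the node ID format, and the toy document contents are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class EdgeType(Enum):
    # The five relation types named in the article; identifiers are hypothetical.
    CONTAINS = "contains"                # page -> element on that page
    READING_ORDER = "reading_order"      # element -> next element in reading flow
    LAYOUT_ADJACENT = "layout_adjacent"  # spatially close elements
    SECTION = "section"                  # heading -> element under that heading
    SEMANTIC = "semantic_neighbor"       # elements with similar content


@dataclass(frozen=True)
class Node:
    node_id: str
    kind: str                # "page" or "element"
    page: int
    modality: str = "page"   # e.g. "text", "table", "image", "chart"
    content: str = ""


class EvidenceGraph:
    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(list)  # node_id -> [(EdgeType, neighbor_id)]

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, src, dst, edge_type, symmetric=False):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges[src].append((edge_type, dst))
        if symmetric:
            self.edges[dst].append((edge_type, src))

    def neighbors(self, node_id, edge_type=None):
        # Return neighbor IDs, optionally restricted to one relation type.
        return [dst for et, dst in self.edges.get(node_id, [])
                if edge_type is None or et == edge_type]


def build_toy_graph():
    # A two-page toy document (invented data) exercising every edge type.
    g = EvidenceGraph()
    g.add_node(Node("p1", "page", 1))
    g.add_node(Node("p1:e1", "element", 1, "text", "3. Results"))
    g.add_node(Node("p1:e2", "element", 1, "table", "Revenue by quarter"))
    g.add_node(Node("p2", "page", 2))
    g.add_node(Node("p2:e1", "element", 2, "chart", "Revenue trend chart"))
    g.add_edge("p1", "p1:e1", EdgeType.CONTAINS)
    g.add_edge("p1", "p1:e2", EdgeType.CONTAINS)
    g.add_edge("p2", "p2:e1", EdgeType.CONTAINS)
    g.add_edge("p1:e1", "p1:e2", EdgeType.READING_ORDER)
    g.add_edge("p1:e1", "p1:e2", EdgeType.LAYOUT_ADJACENT, symmetric=True)
    g.add_edge("p1:e1", "p1:e2", EdgeType.SECTION)
    g.add_edge("p1:e2", "p2:e1", EdgeType.SEMANTIC, symmetric=True)
    return g
```

Storing the relation type on each edge lets a query-time component follow only the relations it needs, for example containment edges when opening a page or semantic-neighbor edges when looking for related evidence on other pages.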

The paper states that this approach allows the system to dynamically adapt the evidence set, balancing dispersed evidence coverage with context-noise control.
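The activate, open, search, and prune cycle can be sketched as a small budgeted loop. This is a hedged illustration only: the function names, the token-overlap scoring, the stopping rule, and the budget parameters (max_elements, max_rounds) are assumptions for this example and are not taken from the paper, which does not detail its controller in the abstract.

```python
import re


def score(query, text):
    """Toy relevance: fraction of query tokens that appear in the text."""
    q = set(re.findall(r"\w+", query.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    return len(q & t) / len(q) if q else 0.0


def control_evidence(query, pages, links, seed_pages, max_elements=3, max_rounds=2):
    """pages: {page_id: {element_id: text}}; links: {element_id: [element_id, ...]}."""
    text_of = {eid: txt for elems in pages.values() for eid, txt in elems.items()}
    # Activate: start from the pages returned by page-level retrieval.
    active_pages = list(seed_pages)
    kept = {}
    trace = []
    for round_no in range(max_rounds):
        # Open: expose the elements of every active page.
        candidates = {eid for p in active_pages for eid in pages[p]}
        # Search: follow graph links out of the evidence kept so far.
        for eid in list(kept):
            candidates.update(links.get(eid, []))
        scored = {eid: score(query, text_of[eid]) for eid in candidates}
        # Prune: keep only the best-scoring elements within the budget.
        ranked = sorted(scored, key=lambda e: (-scored[e], e))
        new_kept = {e: scored[e] for e in ranked[:max_elements] if scored[e] > 0}
        trace.append((round_no, sorted(new_kept)))
        if new_kept.keys() == kept.keys():
            break  # evidence set is stable, stop early
        kept = new_kept
    return kept, trace


# Invented toy document: the relevant chart sits on a page that retrieval missed.
pages = {
    "p1": {"p1:e1": "Section 3 Results overview",
           "p1:e2": "Table: revenue by quarter 2023"},
    "p2": {"p2:e1": "Chart: revenue trend by quarter",
           "p2:e2": "Footer and page number"},
}
links = {"p1:e2": ["p2:e1"]}  # e.g. a semantic-neighbor edge
kept, trace = control_evidence("revenue by quarter", pages, links, seed_pages=["p1"])
```

In this toy run the first round keeps only the table on the seed page, and the second round reaches the chart on page 2 through the link, which is the kind of dispersed evidence a fixed Top-k page retrieval would miss. Lowering max_elements to 1 makes the loop stop with the table alone, showing how the budget trades coverage against context size.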

Benchmark Results

The authors established a unified comparison protocol covering four baseline methods: Direct MLLM (multimodal large language model without retrieval), Text RAG, Page-level Visual RAG, and Graph/Agentic RAG. Experiments were conducted on two long-document multimodal QA datasets: LongDocURL and MMLongBench-Doc.

Method	LongDocURL Overall Accuracy	MMLongBench-Doc Accuracy	MMLongBench-Doc F1
Direct MLLM	Not reported in source	Not reported in source	Not reported in source
Text RAG	Not reported in source	Not reported in source	Not reported in source
Page-level Visual RAG	Not reported in source	Not reported in source	Not reported in source
Graph/Agentic RAG	Not reported in source	Not reported in source	Not reported in source
MAGE-RAG	52.75	53.26	51.19

Note: The source provides MAGE-RAG scores but does not include baseline numbers in the abstract; the full paper likely contains comparative results.

According to the abstract, MAGE-RAG achieved 52.75 overall accuracy on LongDocURL, and 53.26 accuracy with 51.19 F1 on MMLongBench-Doc. The authors also report fine-grained breakdowns, budget-performance curves, ablations, and trace-based analysis, which they say show that query-time evidence subgraph construction balances dispersed evidence coverage with noise control.

Implications for Enterprise Document Processing

While the research is academic, the core technology directly applies to enterprise scenarios where long, multimodal documents are common—such as legal contracts, technical manuals, regulatory filings, and trade documentation. The ability to adaptively retrieve and structure evidence from mixed-format PDFs could reduce manual review time and improve accuracy in question-answering tasks. The evidence graph's encoding of layout and reading order is particularly relevant for documents where spatial arrangement carries meaning (e.g., tables spanning pages, charts with footnotes).

The code for MAGE-RAG is available on GitHub (link in the paper), enabling enterprises and integrators to evaluate and adapt the framework for their own data. The paper does not address supply chain document processing directly, so any integration with such systems would be an extension beyond the published work.

The paper is hosted on arXiv under a CC BY 4.0 license. The abstract does not specify the authors' institutional affiliations.

For CTOs and technology managers evaluating next-generation document AI, MAGE-RAG represents a promising direction for overcoming the limitations of fixed-retrieval RAG systems. The adaptive evidence control mechanism could be a key enabler for deploying multimodal RAG in production environments where cost-per-query and accuracy are both critical.
