Enterprises processing long, multimodal documents—such as PDFs containing text, tables, images, charts, and complex layouts—face a fundamental challenge: locating sparse evidence scattered across many pages while managing irrelevant information and controlling inference cost. Existing retrieval-augmented generation (RAG) methods rely on fixed Top-k retrieval over text chunks or entire pages, leading to a static trade-off among evidence coverage, noise, and cost. Text retrieval compresses context but often loses visual and layout information; page-level visual retrieval preserves the original page but introduces large irrelevant regions, degrading reader performance.
A new research paper published on arXiv proposes MAGE-RAG (Multigranular Adaptive Graph Evidence for Agentic Multimodal RAG in Long-Document QA) to address this problem. According to the paper, authored by Zuo Yilong, Li Xunkai, Yuan Jing, Dai Qiangqiang, Qin Hongchao, and Ronghua, the framework uses page retrieval as the entry point for query-time evidence construction. Offline, it builds an evidence graph with page nodes and element nodes, encoding containment, reading order, layout adjacency, section hierarchy, and semantic-neighbor relations. At query time, an online evidence controller iteratively activates, opens, searches, and prunes evidence under explicit budgets. The resulting evidence subgraph is rendered into structured multimodal reader input, allowing a large vision-language model (LVLM) to consume compact and relevant evidence within a limited context.
How MAGE-RAG Works
The architecture consists of two main phases:
Offline graph construction: An evidence graph is created with two types of nodes: page nodes (representing each page in a document) and element nodes (representing individual elements like text blocks, tables, images, and charts). Edges encode five relationship types: containment (element belongs to a page), reading order (sequential flow), layout adjacency (spatial proximity), section hierarchy (e.g., heading-subsection), and semantic-neighbor relations (based on content similarity).
Online evidence control: At query time, an evidence controller performs iterative steps—activation, opening, searching, and pruning—guided by explicit budgets (e.g., maximum number of pages or elements). This produces a query-specific evidence subgraph that is then flattened into structured input for the LVLM.
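The offline evidence graph described above can be illustrated with a small data structure. This is a minimal sketch, not the authors' implementation: the class names (`EdgeType`, `Node`, `EvidenceGraph`), the helper `build_toy_graph`, and the toy page contents are all hypothetical, chosen only to show two node kinds connected by the five relation types the paper lists.

```python
from dataclasses import dataclass, field
from enum import Enum


class EdgeType(Enum):
    """The five relation types named in the paper's abstract."""
    CONTAINMENT = "containment"
    READING_ORDER = "reading_order"
    LAYOUT_ADJACENCY = "layout_adjacency"
    SECTION_HIERARCHY = "section_hierarchy"
    SEMANTIC_NEIGHBOR = "semantic_neighbor"


@dataclass
class Node:
    node_id: str
    kind: str           # "page" or "element"
    modality: str = ""  # for elements: "text", "table", "image", "chart"
    content: str = ""


@dataclass
class EvidenceGraph:
    nodes: dict = field(default_factory=dict)
    edges: dict = field(default_factory=dict)  # node_id -> [(EdgeType, target_id)]

    def add_node(self, node):
        self.nodes[node.node_id] = node
        self.edges.setdefault(node.node_id, [])

    def add_edge(self, src, dst, edge_type):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before linking them")
        self.edges[src].append((edge_type, dst))

    def neighbors(self, node_id, edge_type=None):
        """Targets reachable from node_id, optionally filtered by edge type."""
        return [dst for etype, dst in self.edges[node_id]
                if edge_type is None or etype == edge_type]


def build_toy_graph():
    """Hypothetical one-page document: a heading, a table, and a chart."""
    g = EvidenceGraph()
    g.add_node(Node("p1", "page"))
    g.add_node(Node("p1_heading", "element", "text", "3. Revenue"))
    g.add_node(Node("p1_table", "element", "table", "Revenue by region"))
    g.add_node(Node("p1_chart", "element", "chart", "Revenue trend"))
    for element_id in ("p1_heading", "p1_table", "p1_chart"):
        g.add_edge("p1", element_id, EdgeType.CONTAINMENT)
    g.add_edge("p1_heading", "p1_table", EdgeType.READING_ORDER)
    g.add_edge("p1_table", "p1_chart", EdgeType.READING_ORDER)
    g.add_edge("p1_table", "p1_chart", EdgeType.LAYOUT_ADJACENCY)
    g.add_edge("p1_heading", "p1_table", EdgeType.SECTION_HIERARCHY)
    g.add_edge("p1_table", "p1_chart", EdgeType.SEMANTIC_NEIGHBOR)
    return g
```

Storing typed edges in an adjacency list keeps the sketch short and lets a controller filter by relation, for example following only reading-order edges from a heading. How the actual system detects elements or computes semantic neighbors is not described in the abstract and is left out here.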
The paper states that this approach allows the system to dynamically adapt the evidence set, balancing dispersed evidence coverage with context-noise control.
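To make the activate, open, search, and prune cycle concrete, the sketch below runs a budgeted loop over toy data. It is an illustration under stated assumptions, not the paper's controller: the function names, the budget parameters (`max_pages`, `max_elements`, `max_rounds`), the keyword-overlap relevance score, and all sample data are hypothetical stand-ins.

```python
def overlap_score(query, text):
    """Toy relevance (assumption): fraction of query words found in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def build_evidence(query, page_scores, page_elements, element_text,
                   element_links, max_pages=2, max_elements=3, max_rounds=2):
    """Return (activated pages, kept elements) under explicit budgets."""

    def prune(candidates):
        # Keep the highest-scoring elements; ties are broken by element id.
        ranked = sorted(
            candidates,
            key=lambda e: (-overlap_score(query, element_text[e]), e),
        )
        return ranked[:max_elements]

    # Activate: page retrieval scores choose the entry pages.
    active_pages = sorted(page_scores, key=page_scores.get, reverse=True)[:max_pages]
    # Open: expand each activated page into its elements.
    candidates = {e for p in active_pages for e in page_elements[p]}
    kept = prune(candidates)

    for _ in range(max_rounds):
        # Search: follow graph edges from kept elements to unseen neighbors.
        frontier = {n for e in kept for n in element_links.get(e, [])} - candidates
        if not frontier:
            break
        candidates |= frontier
        # Prune: re-rank so the element budget is never exceeded.
        kept = prune(candidates)
    return active_pages, kept


def demo():
    """Hypothetical three-page document with one cross-page link."""
    page_scores = {"p1": 0.9, "p2": 0.4, "p3": 0.7}
    page_elements = {
        "p1": ["p1_text", "p1_table"],
        "p2": ["p2_note"],
        "p3": ["p3_chart", "p3_text"],
    }
    element_text = {
        "p1_text": "overview of the company",
        "p1_table": "revenue by region 2023",
        "p2_note": "footnote revenue 2023 excludes region asia",
        "p3_chart": "revenue trend chart",
        "p3_text": "appendix contact details",
    }
    element_links = {"p1_table": ["p2_note"]}  # e.g. a table linked to its footnote
    return build_evidence("revenue by region 2023", page_scores,
                          page_elements, element_text, element_links)
```

In this toy run, `demo()` activates pages `p1` and `p3`, yet the final evidence set contains `p2_note`, an element on a page that fixed Top-k page retrieval would have skipped. It is reached by following an edge from the table and it displaces a lower-scoring element during pruning. That is the kind of behavior the adaptive approach is meant to enable, although the real controller's scoring and stopping rules may differ substantially.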
Benchmark Results
The authors established a unified comparison protocol covering four baseline methods: Direct MLLM (multimodal large language model without retrieval), Text RAG, Page-level Visual RAG, and Graph/Agentic RAG. Experiments were conducted on two long-document multimodal QA datasets: LongDocURL and MMLongBench-Doc.
| Method | LongDocURL Overall Accuracy | MMLongBench-Doc Accuracy | MMLongBench-Doc F1 |
|---|---|---|---|
| Direct MLLM | Not reported in source | Not reported in source | Not reported in source |
| Text RAG | Not reported in source | Not reported in source | Not reported in source |
| Page-level Visual RAG | Not reported in source | Not reported in source | Not reported in source |
| Graph/Agentic RAG | Not reported in source | Not reported in source | Not reported in source |
| MAGE-RAG | 52.75 | 53.26 | 51.19 |
Note: The source provides MAGE-RAG scores but does not include baseline numbers in the abstract; the full paper likely contains comparative results.
According to the abstract, MAGE-RAG achieved 52.75 overall accuracy on LongDocURL, and 53.26 accuracy with 51.19 F1 on MMLongBench-Doc. The authors also report fine-grained breakdowns, budget-performance curves, ablations, and trace-based analysis, which they say show that query-time evidence subgraph construction balances dispersed evidence coverage with noise control. Without the baseline figures, the size of the improvement over fixed Top-k retrieval cannot be judged from the abstract alone.
Implications for Enterprise Document Processing
While the research is academic, the approach is relevant to enterprise scenarios where long, multimodal documents are common, such as legal contracts, technical manuals, regulatory filings, and trade documentation. The ability to adaptively retrieve and structure evidence from mixed-format PDFs could reduce manual review time and improve accuracy in question-answering tasks, though the paper evaluates only on research benchmarks. The evidence graph's encoding of layout and reading order is particularly relevant for documents where spatial arrangement carries meaning (e.g., tables spanning pages, charts with footnotes).
The code for MAGE-RAG is available on GitHub (link in the paper), enabling enterprises and integrators to evaluate and adapt the framework for their own data. Future work may explore integration with supply chain document processing systems, though the paper does not address this directly.
The paper is hosted on arXiv under a CC BY 4.0 license. The abstract does not specify the authors' institutional affiliations.
For CTOs and technology managers evaluating next-generation document AI, MAGE-RAG represents a promising direction for overcoming the limitations of fixed-retrieval RAG systems. The adaptive evidence control mechanism could be a key enabler for deploying multimodal RAG in production environments where cost-per-query and accuracy are both critical.