MAGE-RAG: Multigranular Adaptive Graph Evidence Framework Improves Long-Document Multimodal QA Accuracy

The MAGE-RAG research paper introduces a multigranular adaptive graph evidence framework for multimodal retrieval-augmented generation (RAG) in long-document question answering. It builds an evidence graph of page and element nodes, then uses an online controller to iteratively activate and prune evidence so that coverage is balanced against noise. The authors report accuracy gains over existing methods on the LongDocURL and MMLongBench-Doc benchmarks.

iGEN Editorial
June 16, 2026
Enterprises processing long, multimodal documents—such as PDFs containing text, tables, images, charts, and complex layouts—face a fundamental challenge: locating sparse evidence scattered across many pages while managing irrelevant information and controlling inference cost. Existing retrieval-augmented generation (RAG) methods rely on fixed Top-k retrieval over text chunks or entire pages, leading to a static trade-off among evidence coverage, noise, and cost. Text retrieval compresses context but often loses visual and layout information; page-level visual retrieval preserves the original page but introduces large irrelevant regions, degrading reader performance.

A new research paper published on arXiv proposes MAGE-RAG (Multigranular Adaptive Graph Evidence for Agentic Multimodal RAG in Long-Document QA) to address this problem. According to the paper, authored by Zuo Yilong, Li Xunkai, Yuan Jing, Dai Qiangqiang, Qin Hongchao, and Ronghua, the framework uses page retrieval as the entry point for query-time evidence construction. Offline, it builds an evidence graph with page nodes and element nodes, encoding containment, reading order, layout adjacency, section hierarchy, and semantic-neighbor relations. At query time, an online evidence controller iteratively activates, opens, searches, and prunes evidence under explicit budgets. The resulting evidence subgraph is rendered into structured multimodal reader input, allowing a large vision-language model (LVLM) to consume compact and relevant evidence within a limited context.

How MAGE-RAG Works

The architecture consists of two main phases:

  • Offline graph construction: An evidence graph is created with two types of nodes: page nodes (representing each page in a document) and element nodes (representing individual elements like text blocks, tables, images, and charts). Edges encode five relationship types: containment (element belongs to a page), reading order (sequential flow), layout adjacency (spatial proximity), section hierarchy (e.g., heading-subsection), and semantic-neighbor relations (based on content similarity).

  • Online evidence control: At query time, an evidence controller performs iterative steps—activation, opening, searching, and pruning—guided by explicit budgets (e.g., maximum number of pages or elements). This produces a query-specific evidence subgraph that is then flattened into structured input for the LVLM.

The paper states that this approach allows the system to dynamically adapt the evidence set, balancing dispersed evidence coverage with context-noise control.
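To make the idea of budgeted, iterative evidence control concrete, here is a minimal sketch of an activate, open, search, and prune loop. Everything in it is an assumption for illustration: the function name `control_evidence`, the precomputed relevance scores, the budget parameters, and the toy data are hypothetical, and the paper's actual controller may score and select evidence quite differently.

```python
def control_evidence(page_scores, elem_scores, contains, neighbors,
                     max_pages=2, max_elements=2, max_steps=3):
    """Toy budgeted controller; assumes max_steps >= 1 so pruning runs at least once."""
    # Activate: pick the highest-scoring pages within the page budget.
    active_pages = sorted(page_scores, key=page_scores.get, reverse=True)[:max_pages]
    # Open: expose the elements contained in the activated pages.
    selected = {e for p in active_pages for e in contains.get(p, [])}
    for _ in range(max_steps):
        # Search: follow graph edges from the current evidence to new candidates.
        frontier = {n for e in selected for n in neighbors.get(e, [])} - selected
        candidates = selected | frontier
        # Prune: keep the best-scoring elements within the element budget
        # (ties are broken by element ID so the result is deterministic).
        ranked = sorted(candidates, key=lambda e: (-elem_scores.get(e, 0.0), e))
        pruned = set(ranked[:max_elements])
        if pruned == selected:
            break  # evidence set is stable, stop early
        selected = pruned
    return active_pages, sorted(selected)


# Made-up scores for a three-page document.
page_scores = {"p1": 0.9, "p2": 0.2, "p3": 0.6}
contains = {"p1": ["p1:text", "p1:table"], "p2": ["p2:note"], "p3": ["p3:chart"]}
elem_scores = {"p1:text": 0.3, "p1:table": 0.8, "p2:note": 0.75, "p3:chart": 0.7}
# The table on page 1 is linked to a note on page 2 (e.g. a semantic-neighbor edge).
neighbors = {"p1:table": ["p2:note"]}

pages, evidence = control_evidence(page_scores, elem_scores, contains, neighbors)
print(pages, evidence)  # ['p1', 'p3'] ['p1:table', 'p2:note']
```

In this toy run the note on page 2 ends up in the evidence set even though page 2 was never activated, because the search step reaches it through a graph edge. That is the behavior a fixed Top-k page retriever cannot reproduce.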

Benchmark Results

The authors established a unified comparison protocol covering four baseline methods: Direct MLLM (multimodal large language model without retrieval), Text RAG, Page-level Visual RAG, and Graph/Agentic RAG. Experiments were conducted on two long-document multimodal QA datasets: LongDocURL and MMLongBench-Doc.

Method                  LongDocURL Overall Accuracy   MMLongBench-Doc Accuracy   MMLongBench-Doc F1
Direct MLLM             Not reported in source        Not reported in source     Not reported in source
Text RAG                Not reported in source        Not reported in source     Not reported in source
Page-level Visual RAG   Not reported in source        Not reported in source     Not reported in source
Graph/Agentic RAG       Not reported in source        Not reported in source     Not reported in source
MAGE-RAG                52.75                         53.26                      51.19

Note: The source provides MAGE-RAG scores but does not include baseline numbers in the abstract; the full paper likely contains comparative results.

According to the abstract, MAGE-RAG achieved 52.75 overall accuracy on LongDocURL, and 53.26 accuracy with 51.19 F1 on MMLongBench-Doc. The authors add that fine-grained breakdowns, budget-performance curves, ablations, and trace-based analysis support their claim that building the evidence subgraph at query time balances coverage of dispersed evidence against noise control.

Implications for Enterprise Document Processing

Although the research is academic, the approach is relevant to enterprise settings where long, multimodal documents are common, such as legal contracts, technical manuals, regulatory filings, and trade documentation. Adaptively retrieving and structuring evidence from mixed-format PDFs could reduce manual review time and improve accuracy in question-answering tasks. The evidence graph's encoding of layout and reading order matters most for documents where spatial arrangement carries meaning (e.g., tables spanning pages, charts with footnotes).

The code for MAGE-RAG is available on GitHub (link in the paper), enabling enterprises and integrators to evaluate and adapt the framework for their own data. Future work may explore integration with supply chain document processing systems, though the paper does not address this directly.

The paper is hosted on arXiv under a CC BY 4.0 license. The abstract does not specify the authors' institutional affiliations.

For CTOs and technology managers evaluating next-generation document AI, MAGE-RAG represents a promising direction for overcoming the limitations of fixed-retrieval RAG systems. The adaptive evidence control mechanism could be a key enabler for deploying multimodal RAG in production environments where cost-per-query and accuracy are both critical.

