Topic
agentic
New Benchmark IRTS-ToolBench Tests LLMs on Irregular Time Series Question Answering
A research paper introduces IRTS-ToolBench, a benchmark of 1,700 questions spanning 10 task types across 13 domains to evaluate large language models (LLMs) and AI agents on irregular time series question answering (TSQA). The benchmark addresses a gap in existing TSQA benchmarks that assume regular sampling, providing standardized inputs and a reproducible evaluation protocol for verifiable agentic data science.
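To illustrate why the regular-sampling assumption matters, the following minimal sketch compares a naive mean with a time-weighted mean on unevenly spaced readings. It is not taken from the paper or the benchmark: the function names `naive_mean` and `time_weighted_mean` and the sample data are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def naive_mean(values: Sequence[float]) -> float:
    """Average that treats every reading as equally spaced in time."""
    return sum(values) / len(values)


def time_weighted_mean(times: Sequence[float], values: Sequence[float]) -> float:
    """Trapezoidal time-weighted average over [times[0], times[-1]].

    Hypothetical helper: assumes strictly increasing timestamps.
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two (time, value) pairs of equal length")
    area = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        # Area of the trapezoid between consecutive readings.
        area += 0.5 * (values[i] + values[i - 1]) * dt
    return area / (times[-1] - times[0])


# Illustrative data: three readings close together, then a long gap.
times = [0.0, 1.0, 2.0, 10.0]
values = [10.0, 10.0, 10.0, 20.0]

print(naive_mean(values))                 # 12.5
print(time_weighted_mean(times, values))  # 14.0
```

The two answers differ because the long gap between the last two readings carries most of the elapsed time, which a method that assumes regular sampling ignores.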
New Framework Automates Skill Construction for Agentic Large Language Models
A new framework called Collective Skill Tree Search (CSTS) automatically constructs reusable skills for large language model (LLM) agents. It uses two iterative phases, collective generation and collective assessment, to build a diverse, generalizable tree of skills that enhances agentic capabilities in planning, tool use, and environment interaction.
MAGE-RAG: Multigranular Adaptive Graph Evidence Framework Improves Long-Document Multimodal QA Accuracy
The MAGE-RAG research paper introduces a multigranular adaptive graph evidence framework for multimodal retrieval-augmented generation (RAG) in long-document question answering. It builds an evidence graph with page and element nodes and uses an online controller to iteratively activate and prune evidence, balancing evidence coverage against noise. Experiments show accuracy improvements over existing methods on the LongDocURL and MMLongBench-Doc benchmarks.
Visual-Seeker: Visual-Native AI Agent for Active Visual Reasoning in Multimodal Search
Researchers propose Visual-Seeker, a visual-native multimodal deep search agent that actively harvests fine-grained visual evidence during search. Trained on a synthesized dataset of 5K multimodal trajectories, it achieves state-of-the-art results on five benchmarks, outperforming several proprietary models.