New Benchmark IRTS-ToolBench Tests LLMs on Irregular Time Series Question Answering

A research paper introduces IRTS-ToolBench, a benchmark of 1,700 questions spanning 10 task types across 13 domains to evaluate large language models (LLMs) and AI agents on irregular time series question answering (TSQA). The benchmark addresses a gap in existing TSQA benchmarks that assume regular sampling, providing standardized inputs and a reproducible evaluation protocol for verifiable agentic data science.

iGEN Editorial

June 16, 2026

Time series data from real-world deployments is overwhelmingly irregular, with asynchronous observations, informative missing values, and variable sampling frequencies across sensors and operational windows. Yet existing Time Series Question Answering (TSQA) benchmarks mostly assume regularly sampled inputs, creating a fundamental gap in understanding how large language models (LLMs) and AI agents perform under irregular conditions.

To bridge this gap, a new research paper titled "Towards Verifiable Agentic Data Science: Solving Irregular TSQA Via Tool-Grounded Reasoning" introduces IRTS-ToolBench, a benchmark comprising 1,700 questions spanning 10 task types across 13 domains. According to the paper, IRTS-ToolBench is designed to be used independently by any researcher working on LLM-based irregular time series analysis, providing standardized inputs and a reproducible evaluation protocol.

The Benchmark: IRTS-ToolBench

The benchmark focuses on irregular time series question answering, where queries require reasoning over temporal data that is not uniformly sampled. The 1,700 questions cover 10 distinct task types, though the paper does not specify each type. The 13 domains are meant to represent a broad range of real-world scenarios, so that the benchmark also tests how well systems generalize. Key characteristics of IRTS-ToolBench are summarized in the table below.

Feature	Detail
Total questions	1,700
Task types	10
Domains	13
Focus	Irregular time series QA
Evaluation	Standardized, reproducible protocol

The paper emphasizes that the benchmark is designed to assess "tool-grounded reasoning", a method in which AI agents call external tools to compute answers so that outputs can be verified. This approach contrasts with relying solely on an LLM's internal knowledge, which can lead to hallucinated or unverifiable results.
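
To make the idea of tool-grounded reasoning concrete, the following minimal Python sketch shows an agent delegating a computation to a small deterministic tool instead of estimating the answer itself. It is an illustration only and is not taken from the paper or from the IRTS-ToolBench code: the function names (time_weighted_mean, naive_mean, run_tool), the ToolResult record, and the sample readings are all hypothetical. It also shows why irregular sampling matters: a plain average over-weights the densely sampled stretch, while a time-weighted average accounts for the uneven gaps.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: (timestamp in seconds, value), sorted by time.
Series = List[Tuple[float, float]]


def naive_mean(series: Series) -> float:
    """Plain average of the values; implicitly assumes regular sampling."""
    if not series:
        raise ValueError("need at least one observation")
    return sum(value for _, value in series) / len(series)


def time_weighted_mean(series: Series) -> float:
    """Trapezoidal time-weighted mean; accounts for uneven gaps."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    area = 0.0
    for (t0, v0), (t1, v1) in zip(series, series[1:]):
        if t1 <= t0:
            raise ValueError("timestamps must be strictly increasing")
        area += 0.5 * (v0 + v1) * (t1 - t0)
    return area / (series[-1][0] - series[0][0])


# Hypothetical tool registry that an agent could choose from.
TOOLS: Dict[str, Callable[[Series], float]] = {
    "naive_mean": naive_mean,
    "time_weighted_mean": time_weighted_mean,
}


@dataclass
class ToolResult:
    """Record of a tool call, kept so the answer can be audited later."""
    tool: str
    value: float
    n_points: int


def run_tool(name: str, series: Series) -> ToolResult:
    """Dispatch to a registered tool and return the value with its provenance."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return ToolResult(tool=name, value=TOOLS[name](series), n_points=len(series))


# Made-up sensor readings: four samples in three seconds, then a 57-second gap.
readings: Series = [(0.0, 10.0), (1.0, 10.0), (2.0, 10.0), (3.0, 10.0), (60.0, 20.0)]

naive = run_tool("naive_mean", readings)
weighted = run_tool("time_weighted_mean", readings)
print(naive)     # value is 12.0
print(weighted)  # value is 14.75
```

Here the two tools disagree (12.0 versus 14.75) because four of the five readings fall within the first three seconds. Returning the tool name and the number of points alongside the value is one simple way to leave an audit trail for the answer.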

Implications for Agentic Data Science

The introduction of IRTS-ToolBench addresses a critical limitation in current TSQA evaluation. As the paper notes, existing benchmarks assume regularly sampled inputs, which do not reflect real-world deployments where data is often irregular. For enterprise decision-makers, especially those in fields like supply chain, logistics, and IoT monitoring, irregular time series are the norm. These fields are offered here as illustrations only: the paper refers to "13 domains" without listing them.

By providing a standardized benchmark, IRTS-ToolBench enables researchers and practitioners to systematically evaluate how LLMs and AI agents handle irregular temporal data. The focus on "verifiable agentic data science" suggests a push toward AI systems that can be trusted to produce accurate, auditable answers, which is critical for high-stakes applications where errors can have significant operational or financial consequences.

Availability and Code

The authors have released the code for IRTS-ToolBench, accessible via the paper's arXiv page. The paper is published under a Creative Commons Attribution 4.0 International License. Researchers can use the benchmark to test their own LLM-based systems or agents, with a standardized protocol ensuring comparability across studies.

As organizations increasingly deploy AI for data analysis, benchmarks like IRTS-ToolBench will become essential tools for validating performance under realistic conditions. The shift from regular to irregular time series evaluation marks a step toward more practical and robust AI systems in enterprise environments.
