Time series data from real-world deployments is overwhelmingly irregular, with asynchronous observations, informative missing values, and variable sampling frequencies across sensors and operational windows. Yet existing Time Series Question Answering (TSQA) benchmarks mostly assume regularly sampled inputs, creating a fundamental gap in understanding how large language models (LLMs) and AI agents perform under irregular conditions.
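To make "irregular" concrete, here is a minimal Python sketch. The readings are made up for illustration and are not taken from the paper, and the helper names are hypothetical. It shows one reason uneven sampling matters: a plain average over the samples is dominated by a dense burst of readings, while an average weighted by how long each value was in effect tells a different story.

```python
from datetime import datetime

# Hypothetical sensor readings (timestamp, value); the gaps are deliberately uneven:
# one reading, a one-hour silence, then three readings one minute apart.
readings = [
    (datetime(2024, 1, 1, 0, 0), 40.0),
    (datetime(2024, 1, 1, 1, 0), 10.0),
    (datetime(2024, 1, 1, 1, 1), 10.0),
    (datetime(2024, 1, 1, 1, 2), 10.0),
]

def sampling_gaps_seconds(series):
    """Return the gap in seconds between consecutive observations."""
    return [(b[0] - a[0]).total_seconds() for a, b in zip(series, series[1:])]

def naive_mean(series):
    """Plain average over samples, ignoring when they were taken."""
    return sum(value for _, value in series) / len(series)

def time_weighted_mean(series):
    """Average in which each value is held until the next observation.

    Assumption: step-wise (forward-hold) interpolation; the final reading
    therefore receives no weight.
    """
    total = 0.0
    duration = 0.0
    for (t0, v0), (t1, _) in zip(series, series[1:]):
        dt = (t1 - t0).total_seconds()
        total += v0 * dt
        duration += dt
    return total / duration

gaps = sampling_gaps_seconds(readings)
plain = naive_mean(readings)
weighted = time_weighted_mean(readings)
print(gaps, plain, round(weighted, 2))
```

Under the forward-hold assumption, the value 40.0 was in effect for an hour, so the time-weighted average sits near 39, whereas the plain average is 17.5. Which of the two is "right" depends on the question being asked, which is the kind of ambiguity a benchmark on irregular data has to confront.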
To bridge this gap, a new research paper titled "Towards Verifiable Agentic Data Science: Solving Irregular TSQA Via Tool-Grounded Reasoning" introduces IRTS-ToolBench, a benchmark comprising 1,700 questions spanning 10 task types across 13 domains. According to the paper, IRTS-ToolBench is designed to be used independently by any researcher working on LLM-based irregular time series analysis, providing standardized inputs and a reproducible evaluation protocol.
The Benchmark: IRTS-ToolBench
The benchmark focuses on irregular time series question answering, where queries require reasoning over temporal data that is not uniformly sampled. The 1,700 questions cover 10 distinct task types, though the paper does not specify each type. The 13 domains are intended to cover a broad range of real-world scenarios, so that the benchmark can test how well systems generalize across settings. Key characteristics of IRTS-ToolBench are summarized in the table below.
| Feature | Detail |
|---|---|
| Total questions | 1,700 |
| Task types | 10 |
| Domains | 13 |
| Focus | Irregular time series QA |
| Evaluation | Standardized, reproducible protocol |
The paper emphasizes that the benchmark is designed to assess "tool-grounded reasoning," an approach in which AI agents call external tools to compute answers, so that outputs can be verified and reproduced. This contrasts with relying solely on an LLM's internal knowledge, in which case the model may hallucinate or produce results that cannot be checked.
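The idea of tool-grounded reasoning can be sketched in a few lines of Python. This is an illustrative sketch only: the tool names, the JSON call format, and the `run_tool_call` helper are assumptions made for this example and do not come from the paper or its code release. The point is that the model emits a structured request, a deterministic function computes the number, and the resulting record can be re-run and audited.

```python
import json

def max_gap_seconds(timestamps):
    """Largest gap between consecutive timestamps (given in seconds)."""
    ordered = sorted(timestamps)
    return max(b - a for a, b in zip(ordered, ordered[1:]))

def count_in_window(timestamps, start, end):
    """Number of observations with start <= t < end."""
    return sum(1 for t in timestamps if start <= t < end)

# Hypothetical tool registry; names are illustrative, not from the paper.
TOOLS = {
    "max_gap_seconds": max_gap_seconds,
    "count_in_window": count_in_window,
}

def run_tool_call(call_json, data):
    """Execute a model-emitted tool call and return an auditable record."""
    call = json.loads(call_json)
    name = call["tool"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    args = call.get("args", {})
    result = TOOLS[name](data, **args)
    return {"tool": name, "args": args, "result": result}

# Irregularly spaced observation times, in seconds (made-up data).
timestamps = [0, 30, 45, 400, 410]

# Stand-ins for what an LLM agent might emit instead of guessing the numbers itself.
gap_record = run_tool_call('{"tool": "max_gap_seconds"}', timestamps)
window_record = run_tool_call(
    '{"tool": "count_in_window", "args": {"start": 0, "end": 100}}', timestamps
)
print(gap_record)
print(window_record)
```

Because each record stores the tool name, its arguments, and the result, anyone holding the same data can replay the call and confirm the answer, which is what makes the output verifiable rather than merely plausible.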
Implications for Agentic Data Science
The introduction of IRTS-ToolBench addresses a critical limitation in current TSQA evaluation. As the paper notes, existing benchmarks mostly assume regularly sampled inputs, which do not reflect real-world deployments where data is often irregular. For enterprise decision-makers in fields such as supply chain, logistics, and IoT monitoring, irregular time series are the norm. These fields are offered here as illustrations only: the paper refers to "13 domains" without listing them, so it is not confirmed that any of them is covered by the benchmark.
By providing a standardized benchmark, IRTS-ToolBench enables researchers and practitioners to systematically evaluate how LLMs and AI agents handle irregular temporal data. The focus on "verifiable agentic data science" suggests a push toward AI systems that can be trusted to produce accurate, auditable answers, which is critical for high-stakes applications where errors can have significant operational or financial consequences.
Availability and Code
The authors have released the code for IRTS-ToolBench, accessible via the paper's arXiv page. The paper is published under a Creative Commons Attribution 4.0 International License. Researchers can use the benchmark to test their own LLM-based systems or agents, with a standardized protocol ensuring comparability across studies.
As organizations increasingly deploy AI for data analysis, benchmarks like IRTS-ToolBench will become essential tools for validating performance under realistic conditions. The shift from regular to irregular time series evaluation marks a step toward more practical and robust AI systems in enterprise environments.