New Benchmark IRTS-ToolBench Tests LLMs on Irregular Time Series Question Answering
A research paper introduces IRTS-ToolBench, a benchmark of 1,700 questions spanning 10 task types across 13 domains, designed to evaluate large language models (LLMs) and AI agents on irregular time series question answering (TSQA). Existing TSQA benchmarks assume regularly sampled data; IRTS-ToolBench addresses that gap by providing standardized inputs and a reproducible evaluation protocol for verifiable agentic data science.
Jun 16, 2026 2 sources