High-quality time series forecasting is pivotal for real-world decision-making, according to a new paper from researchers including Chen Zhi, Wang Yuxuan, and colleagues. However, traditional point-wise metrics often fail to reveal complex temporal patterns and align poorly with human intuitive preferences. To address this, the team explores using Vision-Language Models (VLMs) as judges for time series forecasting.
The paper, titled "TimeVista: Exploring and Exploiting Vision-Language Models as Judges for Time Series Forecasting," proposes a novel framework that integrates micro- and macro-level judgments informed by contextual information. The approach harnesses the ability of VLMs to comprehend time series plots grounded in textual information.
The Limits of Traditional Metrics
Conventional evaluation methods for time series forecasting models rely on point-wise error measures such as Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE). The researchers argue that these metrics fail to capture complex temporal patterns and often do not align with how humans intuitively assess forecast quality. This misalignment can lead to poor model selection in practice.
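A small worked example can make the limitation concrete. The sketch below uses made-up toy data, not anything from the paper: one forecast ignores a repeating cycle entirely, while the other reproduces the cycle's shape but arrives one step early. The series and the names `flat` and `shifted` are illustrative assumptions.

```python
import math

def mae(actual, forecast):
    """Mean Absolute Error: average of |actual - forecast|."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root Mean Squared Error: square root of the mean squared difference."""
    return math.sqrt(
        sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)
    )

# Toy data (illustrative only): a repeating up/down cycle.
actual = [0, 1, 0, -1, 0, 1, 0, -1]
flat = [0] * 8                         # ignores the cycle entirely
shifted = [1, 0, -1, 0, 1, 0, -1, 0]   # right shape, one step early

print("MAE  flat vs shifted:", mae(actual, flat), mae(actual, shifted))
print("RMSE flat vs shifted:", rmse(actual, flat), rmse(actual, shifted))
```

On this toy series the flat forecast scores better on both metrics (MAE 0.5 against 1.0), even though a person looking at a plot would likely say the shifted forecast captured the pattern and the flat one did not.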
VLM-as-a-Judge Framework
The proposed framework uses Vision-Language Models to evaluate time series forecasts by analyzing plots of the time series data. The VLMs provide micro-level judgments (evaluating specific points) and macro-level judgments (assessing overall patterns), both informed by contextual information. This mirrors how human evaluators weigh visual patterns and contextual clues.
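The flow described above might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the prompt wording, the `Judgment` structure, the `judge_forecast` function, and the `vlm` callable (any function that takes an image path and a prompt and returns text) are all hypothetical, and the plot is assumed to have been rendered to a file already.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical VLM interface: (image_path, prompt) -> response text.
VLMCall = Callable[[str, str], str]

# Assumed prompt wording; the paper's actual rubrics are not reproduced here.
MICRO_INSTRUCTION = "Assess the forecast at specific points in the plot."
MACRO_INSTRUCTION = "Assess how well the forecast follows the overall pattern."

@dataclass
class Judgment:
    micro: str  # verdict on specific points
    macro: str  # verdict on overall patterns

def build_prompt(instruction: str, context: str) -> str:
    """Combine a level-specific instruction with textual context."""
    return f"{instruction}\nContext: {context}"

def judge_forecast(vlm: VLMCall, plot_path: str, context: str) -> Judgment:
    """Ask the VLM for a micro-level and a macro-level judgment of one plot."""
    micro = vlm(plot_path, build_prompt(MICRO_INSTRUCTION, context))
    macro = vlm(plot_path, build_prompt(MACRO_INSTRUCTION, context))
    return Judgment(micro=micro.strip(), macro=macro.strip())

def stub_vlm(image_path: str, prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "micro-ok" if "specific points" in prompt else "macro-ok"

result = judge_forecast(stub_vlm, "forecast_plot.png", "Hourly energy load")
print(result)
```

Keeping the model behind a plain callable means a real VLM client could be swapped in for `stub_vlm` without changing the judging logic. How the two judgments are combined into a final verdict is left open here, since the article does not specify it.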
The TimeVista Benchmark
To support this evaluation paradigm, the researchers introduce TimeVista, a benchmark comprising 5,563 time series samples paired with detailed evaluation rubrics. The benchmark is designed to meta-evaluate the reliability of VLMs as judges. The authors report that VLMs are reliable judges on this benchmark, achieving significantly higher consistency with human preferences than conventional metrics.
| Aspect | Description |
|---|---|
| Benchmark Size | 5,563 time series samples |
| Evaluation Type | Micro-level and macro-level judgments |
| Comparison | VLMs vs. conventional point-wise metrics |
| Key Finding | VLMs achieve significantly higher consistency with human preferences |
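One simple way to quantify "consistency with human preferences" is an agreement rate over pairwise comparisons. The sketch below assumes each sample has a human-preferred forecast ("A" or "B") and that each judge also picks one. The pairwise setup, the function name `agreement_rate`, and all labels are illustrative assumptions, not the paper's protocol or data.

```python
def agreement_rate(judge_choices, human_choices):
    """Fraction of samples where the judge picks the human-preferred forecast."""
    if len(judge_choices) != len(human_choices):
        raise ValueError("choice lists must have the same length")
    if not human_choices:
        raise ValueError("at least one sample is required")
    matches = sum(j == h for j, h in zip(judge_choices, human_choices))
    return matches / len(human_choices)

# Made-up labels for five forecast pairs (not results from the paper).
human = ["A", "B", "A", "A", "B"]
vlm_judge = ["A", "B", "A", "B", "B"]    # hypothetical VLM picks
metric_pick = ["B", "B", "A", "B", "A"]  # hypothetical lower-MAE picks

print("VLM judge agreement:", agreement_rate(vlm_judge, human))
print("Metric agreement:   ", agreement_rate(metric_pick, human))
```

With these invented labels the VLM judge agrees with the human on 4 of 5 pairs and the metric on 2 of 5. The numbers only show how the comparison is computed and say nothing about the benchmark's actual results.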
Assessing Time Series Foundation Models
Building on the TimeVista benchmark, the researchers assessed recent Time Series Foundation Models (TSFMs) under the VLM-as-a-Judge paradigm. They conclude that VLMs serve as robust and interpretable judges, offering a human-aligned standard for evaluating time series models.
Implications for Enterprise Decision-Making
For decision-makers who rely on time series forecasting, this research points toward more human-aligned model evaluation. VLM-based judges could help enterprises assess forecast quality in contexts where complex temporal patterns matter, such as demand forecasting, inventory planning, or energy load prediction. The TimeVista benchmark offers a standardized way to compare models, potentially narrowing the gap between technical metrics and business value.