It's TIME: Towards the Next Generation of Time Series Forecasting Benchmarks

Anni Wang; Chenghao Liu; Ming Jin; Mingsheng Long; Qingsong Wen; Sheng Pan; Viktoriya Zhukova; Xudong Jiang; Yong Liu; Zhongzheng Qiao

arxiv: 2602.12147 · v4 · pith:EL7VGXBZnew · submitted 2026-02-12 · 💻 cs.LG

Zhongzheng Qiao, Sheng Pan, Anni Wang, Viktoriya Zhukova, Yong Liu, Xudong Jiang, Qingsong Wen, Mingsheng Long, Ming Jin, Chenghao Liu

classification 💻 cs.LG

keywords time, data, forecasting, evaluation, generalizable, series, task, analysis

Time series foundation models (TSFMs) are revolutionizing the forecasting landscape from specific dataset modeling to generalizable task evaluation. However, we contend that existing benchmarks exhibit common limitations in four dimensions: constrained data composition dominated by reused legacy sources, compromised data integrity lacking rigorous quality assurance, misaligned task formulations detached from real-world contexts, and rigid analysis perspectives that obscure generalizable insights. To bridge these gaps, we introduce TIME, a next-generation task-centric benchmark comprising 50 fresh datasets and 98 forecasting tasks, tailored for strict zero-shot TSFM evaluation free from data leakage. Integrating large language models and human expertise, we establish a human-in-the-loop benchmark construction pipeline to ensure high data integrity and redefine task formulation by aligning forecasting configurations with real-world operational requirements and variate predictability. Furthermore, we propose a novel pattern-level evaluation perspective that moves beyond traditional dataset-level evaluations based on static meta labels. By leveraging structural time series features to characterize intrinsic temporal properties, this approach offers generalizable insights into model capabilities across diverse patterns. We evaluate 12 TSFMs and establish a multi-granular leaderboard to facilitate in-depth analysis and visualized inspection. The leaderboard is available at https://huggingface.co/spaces/Real-TSF/TIME-leaderboard.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models
cs.LG · 2026-05 · unverdicted · novelty 8.0

TSFMAudit detects pretraining contamination in time series foundation models via probe adaptation dynamics (faster loss drop, smaller backbone shift), tested on 6 models and 187 datasets against 10 LLM-derived baselines.

Beyond IID: How General Are Tabular Foundation Models, Really?
cs.LG · 2026-06 · unverdicted · novelty 7.0

Tabular foundation models excel on tiny- to medium-sized IID data but are outperformed by traditional tree-based and deep learning models on non-IID, large, and high-dimensional datasets, based on evaluations across 1...

From Recognition to Understanding: Unlocking Cognitive Time Series Reasoning with LLMs
cs.CL · 2026-06 · unverdicted · novelty 7.0

Introduces the TSCognition benchmark for cognitive time series reasoning tasks and the TSAlign alignment framework, reporting outperformance over LLM, VLM, and time-series baselines on TSCognition and TimerBed with lo...

TS-ICL: A Flexible Time-Indexed Foundation Model for Time Series via In-Context Learning
cs.LG · 2026-06 · unverdicted · novelty 7.0

TS-ICL introduces a probabilistic in-context learning encoder-regressor Transformer that unifies forecasting and imputation for time series via timestamp-aligned regression trained on synthetic causal data.

Toto 2.0: Time Series Forecasting Enters the Scaling Era
cs.LG · 2026-05 · unverdicted · novelty 6.0

Toto 2.0 is a family of open time series foundation models that demonstrates reliable scaling and sets new state-of-the-art results on three forecasting benchmarks.

Toto 2.0: Time Series Forecasting Enters the Scaling Era
cs.LG · 2026-05 · unverdicted · novelty 5.0

Time series foundation models scale under a single training recipe, with forecast quality improving from 4M to 2.5B parameters and new SOTA results on BOOM, GIFT-Eval, and TIME benchmarks.