fev-bench: A Realistic Benchmark for Time Series Forecasting

Abdul Fatir Ansari; Caner Turkmen; Lorenzo Stella; Michael Bohlke-Schneider; Nick Erickson; Oleksandr Shchur; Pablo Guerron; Yuyang Wang

arxiv: 2509.26468 · v3 · pith:4OZHGM25new · submitted 2025-09-30 · 💻 cs.LG

Oleksandr Shchur, Abdul Fatir Ansari, Caner Turkmen, Lorenzo Stella, Nick Erickson, Pablo Guerron, Michael Bohlke-Schneider, Yuyang Wang

classification 💻 cs.LG

keywords benchmark · fev-bench · forecasting · evaluation · existing · aggregation · benchmarks · covariates

Benchmark quality is critical for meaningful evaluation and sustained progress in time series forecasting, particularly with the rise of pretrained models. Existing benchmarks often have limited domain coverage or overlook real-world settings such as tasks with covariates. Their aggregation procedures frequently lack statistical rigor, making it unclear whether observed performance differences reflect true improvements or random variation. Many benchmarks lack consistent evaluation infrastructure or are too rigid for integration into existing pipelines. To address these gaps, we propose fev-bench, a benchmark of 100 forecasting tasks across seven domains, including 46 with covariates. Supporting the benchmark, we introduce fev, a lightweight Python library for forecasting evaluation emphasizing reproducibility and integration with existing workflows. Using fev, fev-bench employs principled aggregation with bootstrapped confidence intervals to report performance along two dimensions: win rates and skill scores. We report results on fev-bench for pretrained, statistical, and baseline models and identify promising future research directions.
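The aggregation described in the abstract (win rates and skill scores with bootstrapped confidence intervals) can be illustrated with a minimal sketch. This is not the fev library's API: the function names `win_rate`, `skill_score`, and `bootstrap_ci` are hypothetical, and the exact definitions used here (ties counted as half a win, skill score as one minus the geometric mean of per-task error ratios against a baseline, resampling over tasks) are assumptions made for illustration rather than details stated in the abstract.

```python
import numpy as np


def win_rate(errors_a, errors_b):
    """Fraction of tasks where model A has lower error than model B.

    Assumption for this sketch: a tie counts as half a win.
    """
    a = np.asarray(errors_a, dtype=float)
    b = np.asarray(errors_b, dtype=float)
    return float(np.mean((a < b) + 0.5 * (a == b)))


def skill_score(errors_model, errors_baseline):
    """One minus the geometric mean of per-task error ratios (model / baseline).

    Assumption for this sketch: all errors are strictly positive.
    """
    m = np.asarray(errors_model, dtype=float)
    b = np.asarray(errors_baseline, dtype=float)
    return float(1.0 - np.exp(np.mean(np.log(m / b))))


def bootstrap_ci(stat_fn, errors_a, errors_b, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat_fn, resampling tasks with replacement."""
    rng = np.random.default_rng(seed)
    a = np.asarray(errors_a, dtype=float)
    b = np.asarray(errors_b, dtype=float)
    n = len(a)
    stats = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # one resample of task indices
        stats[i] = stat_fn(a[idx], b[idx])
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


# Made-up per-task errors, for illustration only.
model_errors = [0.8, 0.9, 1.2, 0.5]
baseline_errors = [1.0, 1.0, 1.0, 1.0]

print(win_rate(model_errors, baseline_errors))  # 0.75
print(skill_score(model_errors, baseline_errors))
print(bootstrap_ci(win_rate, model_errors, baseline_errors))
```

A wide interval from `bootstrap_ci` signals that an observed difference between two models may reflect random variation across tasks rather than a true improvement, which is the concern the abstract raises about existing aggregation procedures.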

Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond IID: How General Are Tabular Foundation Models, Really?
cs.LG 2026-06 unverdicted novelty 7.0

Tabular foundation models excel on tiny- to medium-sized IID data but are outperformed by traditional tree-based and deep learning models on non-IID, large, and high-dimensional datasets, based on evaluations across 1...
Explainable Load Forecasting with Covariate-Informed Time Series Foundation Models
cs.LG 2026-04 unverdicted novelty 7.0

Time series foundation models match the performance of specialized models for day-ahead load forecasting while providing explanations that match domain knowledge on weather and calendar effects.
Energy-Arena: A Dynamic Benchmark for Operational Energy Forecasting
econ.EM 2026-04 unverdicted novelty 7.0

Energy-Arena is a dynamic, forward-looking benchmarking platform that standardizes ex-ante submissions and rolling ex-post evaluations for operational energy forecasting to improve transparency and comparability.
AME-TS: Anchored Mixture-of-Experts for Time Series Forecasting
cs.LG 2026-05 unverdicted novelty 6.0

AME-TS is a structure-guided sparse MoE foundation model for time series that aligns expert routing with series-level temporal descriptors to achieve strong accuracy-efficiency tradeoffs on GIFT-Eval while improving s...
TabPFN-3: Technical Report
cs.LG 2026-05 unverdicted novelty 6.0

TabPFN-3 delivers state-of-the-art tabular prediction performance on benchmarks up to 1M rows, is up to 20x faster than prior versions, and introduces test-time scaling that beats non-TabPFN models by hundreds of Elo points.
FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting
cs.LG 2026-04 unverdicted novelty 6.0

Foundation models outperform dataset-specific machine learning in energy time series forecasting across 54 datasets in 9 categories.
A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks
cs.LG 2026-03 unverdicted novelty 6.0

iAmTime is a time-series foundation model that uses instruction-conditioned in-context learning from demonstrations to perform zero-shot adaptation on forecasting, imputation, classification, and related tasks.
Falcon-X: A Time Series Foundation Model for Heterogeneous Multivariate Modeling
cs.LG 2026-05 unverdicted novelty 5.0

Falcon-X introduces a latent prototype space with Unified Prototype Diff-Attention and Latent Entity Attention for heterogeneous multivariate time series forecasting.
Investigating simple target-covariate relationships for Chronos-2 and TabPFN-TS
cs.LG 2026-05 unverdicted novelty 5.0

TabPFN-TS captures simple target-covariate relationships more effectively than Chronos-2 in controlled experiments, especially for short horizons.
Heterogeneous Scientific Foundation Model Collaboration
cs.AI 2026-04 unverdicted novelty 5.0

Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
TimeRFT: Stimulating Generalizable Time Series Forecasting for TSFMs via Reinforcement Finetuning
eess.SP 2026-04 unverdicted novelty 5.0

TimeRFT applies reinforcement learning with multi-faceted step-wise rewards and informative sample selection to improve generalization and accuracy in TSFM adaptation beyond supervised fine-tuning.
A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks
cs.LG 2026-03 unverdicted novelty 5.0

iAmTime is a hierarchical transformer-based time series foundation model that uses semantic tokens and instruction-conditioned prompts to infer tasks from demonstrations, achieving improved zero-shot performance on fo...