arxiv: 2602.24238 · v2 · pith:JWC4GT6Snew · submitted 2026-02-27 · 💻 cs.LG

Time Series Foundation Models as Strong Baselines in Transportation Forecasting: A Large-Scale Benchmark Analysis

Pith reviewed 2026-05-21 11:28 UTC · model grok-4.3

classification 💻 cs.LG
keywords time series forecasting · foundation models · transportation · zero-shot learning · benchmark · traffic prediction · probabilistic forecasting

The pith

A general time series foundation model matches or beats specialized forecasters on transportation tasks without any task-specific training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks the zero-shot performance of Chronos-2 on ten real-world datasets covering highway traffic, urban speeds, bike-sharing demand, and electric vehicle charging. It shows that this model, used with no fine-tuning or architecture changes, reaches state-of-the-art or competitive accuracy on most datasets and often surpasses both classical statistical methods and purpose-built deep learning models, especially when predicting further into the future. The same model also supplies usable prediction intervals that quantify uncertainty without extra training. A sympathetic reader would care because the result suggests that a single pre-trained model can replace much of the dataset-by-dataset engineering currently standard in transportation forecasting.

Core claim

Under a consistent evaluation protocol across ten real-world transportation datasets, Chronos-2 without any task-specific fine-tuning delivers state-of-the-art or competitive accuracy, frequently outperforming classical statistical baselines and specialized deep learning architectures, particularly at longer horizons. It also supplies useful uncertainty quantification through its native probabilistic outputs.

What carries the argument

Zero-shot application of the Chronos-2 time series foundation model, which performs forecasting on new datasets using only its pre-trained weights and a fixed evaluation protocol.
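
A minimal sketch of what a zero-shot, fixed-protocol evaluation looks like in practice. It does not call Chronos-2; the forecaster is any callable that maps a context window to a forecast with no fitting step, and a seasonal-naive function stands in for the pre-trained model. The names `make_seasonal_naive` and `zero_shot_mae_by_horizon`, the rolling-origin windowing, and the choice of MAE are illustrative assumptions, not the paper's documented protocol.

```python
from typing import Callable, List, Sequence

# A zero-shot forecaster: takes a context window and a horizon, returns
# `horizon` point forecasts. It is never fitted on the target dataset.
Forecaster = Callable[[Sequence[float], int], List[float]]


def make_seasonal_naive(season: int) -> Forecaster:
    """Hypothetical stand-in model: repeat the value seen one season earlier."""
    def forecast(context: Sequence[float], horizon: int) -> List[float]:
        if len(context) < season:
            raise ValueError("context must cover at least one season")
        start = len(context) - season
        return [context[start + (h % season)] for h in range(horizon)]
    return forecast


def zero_shot_mae_by_horizon(
    series: Sequence[float],
    forecaster: Forecaster,
    context_length: int,
    horizon: int,
    stride: int = 1,
) -> List[float]:
    """Rolling-origin evaluation: mean absolute error for each forecast step."""
    abs_err_sums = [0.0] * horizon
    n_windows = 0
    for origin in range(context_length, len(series) - horizon + 1, stride):
        context = series[origin - context_length:origin]
        preds = forecaster(context, horizon)
        for h in range(horizon):
            abs_err_sums[h] += abs(preds[h] - series[origin + h])
        n_windows += 1
    if n_windows == 0:
        raise ValueError("series too short for the requested windows")
    return [total / n_windows for total in abs_err_sums]
```

Because the harness only sees a callable, the same loop can score a foundation model and a trained baseline under identical windows, which is the property the comparison depends on.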

If this is right

  • Transportation forecasting research can adopt a single general-purpose model as a strong, low-effort baseline instead of building new architectures for each dataset.
  • Specialized deep learning models may not be required for competitive performance when longer forecast horizons are the main concern.
  • Probabilistic forecasts with calibrated uncertainty become available for transportation tasks without dataset-specific training.
  • Foundation models shift the cost of forecasting from repeated model development to occasional evaluation of off-the-shelf performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Urban planners and infrastructure teams could deploy forecasting tools faster by starting with a foundation model and only adding custom work when clear gaps remain.
  • The same zero-shot approach might serve as a first test for other sequential prediction problems such as demand for ride-sharing or public transit.
  • If longer-horizon gains persist across domains, foundation models could reduce the need for horizon-specific model families in operational systems.

Load-bearing premise

The ten chosen transportation datasets together with the shared evaluation rules give a fair test that lets the zero-shot model be compared directly to models that were trained or tuned on exactly those same datasets.

What would settle it

Running the identical benchmark protocol on the same ten datasets and finding that one or more specialized models, after proper training and tuning, produce strictly lower error than Chronos-2 at every horizon and on every dataset would falsify the central claim.
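
The falsification condition above can be written as a check over a table of errors. This is a sketch under assumptions: the nested layout `errors[dataset][horizon][model]`, the function names, and the model labels are hypothetical, and lower error is taken to be better.

```python
from typing import Dict, List

# Hypothetical layout: errors[dataset][horizon][model] -> error (lower is better).
ErrorTable = Dict[str, Dict[int, Dict[str, float]]]


def strictly_dominates(errors: ErrorTable, challenger: str, reference: str) -> bool:
    """True if `challenger` has strictly lower error in every dataset/horizon cell."""
    for by_horizon in errors.values():
        for by_model in by_horizon.values():
            if challenger not in by_model:
                return False  # a missing result cannot count as a win
            if not by_model[challenger] < by_model[reference]:
                return False  # ties and losses both break strict dominance
    return True


def falsifying_models(errors: ErrorTable, reference: str) -> List[str]:
    """Models that beat `reference` everywhere, which would settle the claim."""
    models = set()
    for by_horizon in errors.values():
        for by_model in by_horizon.values():
            models.update(by_model)
    models.discard(reference)
    return sorted(m for m in models if strictly_dominates(errors, m, reference))
```

An empty result does not confirm the claim; it only means no single model met this deliberately strict condition.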

Original abstract

Accurate forecasting of transportation dynamics is essential for urban mobility and infrastructure planning. Although recent work has achieved strong performance with deep learning models, these methods typically require dataset-specific training, architecture design and hyper-parameter tuning. This paper evaluates whether general-purpose time-series foundation models can serve as forecasters for transportation tasks by benchmarking the zero-shot performance of the state-of-the-art model, Chronos-2, across ten real-world datasets covering highway traffic volume and flow, urban traffic speed, bike-sharing demand, and electric vehicle charging station data. Under a consistent evaluation protocol, we find that, even without any task-specific fine-tuning, Chronos-2 delivers state-of-the-art or competitive accuracy across most datasets, frequently outperforming classical statistical baselines and specialized deep learning architectures, particularly at longer horizons. Beyond point forecasting, we evaluate its native probabilistic outputs using prediction-interval coverage and sharpness, demonstrating that Chronos-2 also provides useful uncertainty quantification without dataset-specific training. In general, this study supports the adoption of time-series foundation models as a key baseline for transportation forecasting research.
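
The abstract names prediction-interval coverage and sharpness without defining them here. A minimal sketch under commonly used definitions, which are an assumption about the paper's exact metrics: coverage is the fraction of observations that fall inside the interval, and sharpness is the mean interval width. The function names are illustrative.

```python
from typing import Sequence


def interval_coverage(
    y_true: Sequence[float], lower: Sequence[float], upper: Sequence[float]
) -> float:
    """Fraction of observations inside [lower, upper], bounds inclusive."""
    if not (len(y_true) == len(lower) == len(upper)) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)


def interval_sharpness(lower: Sequence[float], upper: Sequence[float]) -> float:
    """Mean interval width; narrower is sharper."""
    if len(lower) != len(upper) or len(lower) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)
```

The two numbers are read together: an interval can reach any coverage by being very wide, so a useful forecast has coverage close to its nominal level while keeping the width small.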

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript benchmarks the zero-shot forecasting performance of the Chronos-2 time-series foundation model across ten real-world transportation datasets (highway traffic volume/flow, urban traffic speed, bike-sharing demand, and EV charging). Under a single consistent evaluation protocol it reports that Chronos-2 matches or exceeds both classical statistical baselines and specialized deep-learning models on point-forecast accuracy (especially at longer horizons) and also supplies well-calibrated probabilistic forecasts without any dataset-specific training or fine-tuning.

Significance. If the baseline comparisons are shown to be equitable, the result would be a useful empirical contribution: it supplies a large-scale, multi-domain demonstration that a single pre-trained foundation model can function as a strong, low-effort baseline for transportation forecasting, thereby lowering the barrier to entry for new studies. The consistent protocol across ten datasets and the joint evaluation of point forecasts plus prediction-interval coverage are clear strengths.

major comments (1)
  1. [§4 and §5] §4 (Experimental Setup) and §5 (Results): the manuscript states that a 'consistent evaluation protocol' was used for all methods, yet provides only high-level descriptions of the deep-learning baselines and does not report the hyper-parameter search budget, validation-based selection procedure, or number of random seeds used for each specialized architecture. Because the central claim is that Chronos-2 outperforms these models without task-specific tuning, the absence of these details leaves open the possibility that the reported gains are partly due to unequal optimization effort rather than model superiority; explicit documentation of tuning effort is required to substantiate the comparison.
minor comments (2)
  1. [Table 2] Table 2: the column headers for horizon lengths are not aligned with the numerical results in the rows; a small formatting correction would improve readability.
  2. [Figure 4] Figure 4: the y-axis label for coverage probability is missing the unit or range; adding '(%)' would clarify the plot.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments, which highlight an important aspect of ensuring equitable and transparent baseline comparisons. We address the major comment on the experimental setup below.

Point-by-point responses
  1. Referee: [§4 and §5] §4 (Experimental Setup) and §5 (Results): the manuscript states that a 'consistent evaluation protocol' was used for all methods, yet provides only high-level descriptions of the deep-learning baselines and does not report the hyper-parameter search budget, validation-based selection procedure, or number of random seeds used for each specialized architecture. Because the central claim is that Chronos-2 outperforms these models without task-specific tuning, the absence of these details leaves open the possibility that the reported gains are partly due to unequal optimization effort rather than model superiority; explicit documentation of tuning effort is required to substantiate the comparison.

    Authors: We agree that more granular documentation of the hyper-parameter tuning process for the deep-learning baselines is necessary to fully substantiate the fairness of the comparisons. In the revised manuscript, we will expand §4 with a new subsection that explicitly reports the hyper-parameter search budget (including the ranges explored and the optimization method), the validation-based selection procedure, the number of random seeds for each architecture, and the total computational effort allocated to tuning the specialized models. This addition will allow readers to directly evaluate whether the reported advantages of Chronos-2 stem from model capability rather than differences in optimization effort.

    Revision: yes
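
One way the promised documentation could be made checkable is a per-baseline record with a completeness test. This is a sketch only: the `TuningRecord` fields mirror the items listed in the response, but the class, its field names, and `undocumented_items` are hypothetical and do not come from the manuscript.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TuningRecord:
    """Hypothetical record of the tuning effort spent on one baseline."""
    model: str
    search_method: str                            # optimization method used
    n_trials: int                                 # search budget
    search_space: Dict[str, Tuple[float, float]]  # ranges explored per parameter
    selection_split: str                          # data used for model selection
    seeds: Tuple[int, ...]                        # random seeds that were run
    compute_hours: float                          # total tuning compute


def undocumented_items(record: TuningRecord) -> List[str]:
    """List the items from the referee's request that the record leaves empty."""
    problems: List[str] = []
    if record.n_trials <= 0:
        problems.append("search budget")
    if not record.search_space:
        problems.append("ranges explored")
    if not record.search_method.strip():
        problems.append("optimization method")
    if not record.selection_split.strip():
        problems.append("validation-based selection procedure")
    if not record.seeds:
        problems.append("random seeds")
    if record.compute_hours <= 0:
        problems.append("computational effort")
    return problems
```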

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or self-referential steps

full rationale

This paper reports a direct empirical benchmark of Chronos-2 zero-shot performance against statistical and deep-learning baselines on ten transportation datasets under a fixed evaluation protocol. No equations, parameter fittings, uniqueness theorems, or ansatzes are presented; the central claims rest on measured accuracy, coverage, and sharpness metrics obtained from independent model outputs on held-out data. The evaluation is therefore self-contained against external benchmarks and contains no load-bearing step that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the pre-trained Chronos-2 model from prior work together with standard machine-learning benchmarking assumptions such as representative datasets and appropriate evaluation metrics.

axioms (1)
  • Domain assumption: Standard time-series forecasting metrics and protocols allow fair comparison between zero-shot foundation models and task-specific baselines.
    Invoked when the paper states results under a consistent evaluation protocol.
