arxiv: 2605.20119 · v1 · pith:T3KMXW7Pnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI

Toto 2.0: Time Series Forecasting Enters the Scaling Era

Pith reviewed 2026-05-20 06:37 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords time series forecasting · foundation models · model scaling · forecasting benchmarks · open weights · observability

The pith

Time series foundation models improve forecast quality as they scale from 4 million to 2.5 billion parameters under one fixed training recipe.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that time series forecasting benefits from scaling in the same way other foundation model domains do. A consistent training approach yields steady accuracy gains across five model sizes without needing adjustments for each scale. This matters because it opens the door to larger models delivering better predictions on real-world tasks like observability and general forecasting. The authors back the claim with new state-of-the-art results on three benchmarks, including one built to resist data contamination, and release the models openly.

Core claim

A single training recipe produces reliable forecast-quality improvements from 4M to 2.5B parameters in time series foundation models, as shown by the Toto 2.0 family setting new state of the art on BOOM, GIFT-Eval, and the contamination-resistant TIME benchmark.

What carries the argument

The u-muP hyperparameter transfer pipeline that lets the same architecture, data handling, and training steps work across model sizes without per-size retuning.
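
To make the idea of hyperparameter transfer concrete, here is a minimal sketch of one width-scaling rule commonly associated with muP-style parametrizations under Adam-type optimizers: the learning rate of hidden weight matrices shrinks in inverse proportion to width. This is an illustration of the general mechanism only. The function name, the proxy width, and the learning rate are hypothetical, and the actual u-muP rules used for Toto 2.0 are not given on this page and may differ.

```python
def transfer_hidden_lr(base_lr: float, base_width: int, target_width: int) -> float:
    """Rescale a hidden-matrix learning rate tuned at base_width for target_width.

    Assumption: learning rate inversely proportional to width, a muP-style
    rule for Adam-type optimizers. This is NOT taken from the Toto 2.0 paper.
    """
    if base_width <= 0 or target_width <= 0:
        raise ValueError("widths must be positive")
    return base_lr * base_width / target_width


# Hypothetical values: a learning rate tuned once on a narrow proxy model.
proxy_lr = 1e-2
proxy_width = 256

# Reuse the tuned value at larger widths without a new search.
for width in (256, 512, 1024, 2048):
    print(width, transfer_hidden_lr(proxy_lr, proxy_width, width))
```

The point of such a rule is that the expensive search happens once at small scale; larger models inherit their settings by formula rather than by retuning.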

If this is right

  • Forecast accuracy keeps rising with added parameters when the recipe stays fixed.
  • The same design choices support deployment at any scale from small to very large models.
  • Open release of five checkpoints lets users test scaling directly on their own data.
  • New benchmarks confirm gains hold even when contamination is controlled for.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners could tune hyperparameters once at small scale and then train larger models for harder problems without repeating the search.
  • The approach might extend to related tasks such as time series classification or anomaly detection.
  • Larger models could reduce the need for task-specific fine-tuning in production forecasting systems.
  • Testing on longer horizons or multivariate series would show whether the scaling pattern continues.

Load-bearing premise

The observed forecast gains come from the models learning general patterns rather than from test data leaking into training or from overfitting to the specific benchmarks.

What would settle it

Results on a fresh, held-out time series forecasting dataset where the 2.5B model performs no better than the 4M model or falls short of prior non-scaled baselines.
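
The check described above can be sketched as a small comparison on held-out data. The metric (mean absolute error), the function names, and the toy numbers are illustrative assumptions; they are not results from the paper, and a real test would use each benchmark's own metric.

```python
def mae(actual, forecast):
    """Mean absolute error between two equal-length sequences."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def relative_improvement(actual, small_forecast, large_forecast):
    """Fraction by which the large model reduces MAE relative to the small one.

    A value at or below zero would mean the larger model is no better.
    """
    small_err = mae(actual, small_forecast)
    large_err = mae(actual, large_forecast)
    if small_err == 0:
        raise ValueError("small-model error is zero; ratio is undefined")
    return (small_err - large_err) / small_err


# Toy placeholder series, not real model output.
held_out = [10.0, 12.0, 11.0, 13.0]
small_model = [9.0, 13.0, 12.0, 11.0]
large_model = [10.0, 12.5, 11.0, 12.5]

gain = relative_improvement(held_out, small_model, large_model)
print("large model helps" if gain > 0 else "no scaling benefit", gain)
```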

Original abstract

We show that time series foundation models scale: a single training recipe produces reliable forecast-quality improvements from 4M to 2.5B parameters. We release Toto 2.0, a family of five open-weights forecasting models trained under this recipe. The Toto 2.0 family sets a new state of the art on three forecasting benchmarks: BOOM, our observability benchmark; GIFT-Eval, the standard general-purpose benchmark; and the recent contamination-resistant TIME benchmark. This report describes our experimental results and details the design decisions behind Toto 2.0: its architecture and training recipe, training data, and the u-muP hyperparameter transfer pipeline. All five base checkpoints are released under Apache 2.0.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that time series foundation models scale: a single fixed training recipe (including u-muP hyperparameter transfer) produces reliable forecast-quality gains as parameters increase from 4M to 2.5B. The authors release the Toto 2.0 family of five open-weights models, which set new state-of-the-art results on the BOOM observability benchmark, GIFT-Eval general-purpose benchmark, and contamination-resistant TIME benchmark. The paper details the architecture, training recipe, data handling, and experimental results.

Significance. If the scaling holds after isolating parameter count from data volume, this would provide concrete evidence that time series forecasting can enter a scaling regime comparable to language models, with practical value from the open model releases. The emphasis on the TIME benchmark for contamination resistance is a methodological strength that supports the generalization claim.

major comments (2)
  1. [Training recipe and data] Training recipe and data section: the central claim is that a single training recipe yields parameter-only scaling improvements from 4M to 2.5B. However, the manuscript does not report total training tokens or effective dataset size after sampling for each scale point. Without this, gains on BOOM/GIFT-Eval/TIME cannot be isolated from possible joint scaling with data exposure, which directly undermines the 'parameter scaling' interpretation asserted in the abstract.
  2. [Results] Results section: while the abstract claims 'reliable forecast-quality improvements' from 4M to 2.5B parameters, no quantitative scaling curves, per-scale token counts, or ablation isolating capacity from data are provided. This leaves the load-bearing claim of reliable parameter-driven improvement only partially supported by the reported benchmarks.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'reliable forecast-quality improvements' would benefit from a brief quantification (e.g., average MAE reduction or rank improvement) to make the claim more precise.
  2. [Architecture and u-muP pipeline] Notation: the u-muP pipeline is referenced but its exact hyperparameter transfer rules for time-series-specific components (e.g., patch size or horizon) could be clarified with a short equation or table.
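
One way to produce the quantitative scaling curve the referee asks for is a least-squares fit of error against parameter count in log-log space. The sketch below assumes a power-law form, error = a · N^(-b). Only the 4M and 2.5B endpoints come from the paper; the three intermediate sizes are hypothetical geometric spacings, and the error values are synthetic, generated from a known power law purely to show that the fit recovers the exponent.

```python
import math


def fit_power_law(param_counts, errors):
    """Fit log(error) = log(a) - b * log(N) by least squares; return (a, b)."""
    xs = [math.log(n) for n in param_counts]
    ys = [math.log(e) for e in errors]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Five sizes spanning 4M to 2.5B; the middle three are placeholders,
# not the actual Toto 2.0 model sizes.
sizes = [4e6 * (2.5e9 / 4e6) ** (i / 4) for i in range(5)]

# Synthetic errors from a known power law (a=2.0, b=0.1), not benchmark data.
true_a, true_b = 2.0, 0.1
errors = [true_a * n ** (-true_b) for n in sizes]

a_hat, b_hat = fit_power_law(sizes, errors)
print(f"fitted a={a_hat:.3f}, b={b_hat:.3f}")
```

With real per-scale benchmark scores in place of the synthetic values, the fitted exponent and the residuals would give the "brief quantification" requested in the minor comments.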

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments on strengthening the evidence for parameter scaling. We respond to each major point below and indicate planned revisions to improve transparency around data exposure and scaling curves.

point-by-point responses
  1. Referee: Training recipe and data section: the central claim is that a single training recipe yields parameter-only scaling improvements from 4M to 2.5B. However, the manuscript does not report total training tokens or effective dataset size after sampling for each scale point. Without this, gains on BOOM/GIFT-Eval/TIME cannot be isolated from possible joint scaling with data exposure, which directly undermines the 'parameter scaling' interpretation asserted in the abstract.

    Authors: We agree that explicit per-scale token counts are required to isolate parameter effects. The training pipeline employed a single fixed data mixture and sampling strategy for all five models, so effective data exposure was held constant while only model capacity varied. In the revised manuscript we will add a table in the Training recipe and data section reporting approximate total tokens processed by each scale (4M through 2.5B), confirming that data volume did not increase with parameter count. revision: yes

  2. Referee: Results section: while the abstract claims 'reliable forecast-quality improvements' from 4M to 2.5B parameters, no quantitative scaling curves, per-scale token counts, or ablation isolating capacity from data are provided. This leaves the load-bearing claim of reliable parameter-driven improvement only partially supported by the reported benchmarks.

    Authors: We acknowledge that a scaling curve figure would make the gains more visible. We will add such a plot in the Results section, with performance on BOOM, GIFT-Eval, and TIME shown against parameter count and annotated with the token counts from the new table. A full ablation that independently varies data volume at the 2.5B scale is not feasible given computational limits; the fixed-recipe design with u-muP already holds all non-capacity factors constant, providing the strongest evidence obtainable under our constraints. revision: partial

standing simulated objections not resolved
  • A complete ablation study that decouples data volume from parameter count at the largest scale, which is computationally prohibitive.
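
If data exposure really was identical at every scale, as the simulated rebuttal states, the tokens-per-parameter ratio must differ by a factor of 2.5B / 4M = 625 between the smallest and largest models. The sketch below shows the arithmetic a per-scale token table would make visible. The token budget is a hypothetical placeholder, since no token counts are reported on this page.

```python
def tokens_per_parameter(total_tokens: int, n_params: int) -> float:
    """Training tokens seen per model parameter."""
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return total_tokens / n_params


# Hypothetical fixed budget shared by all scales; NOT a figure from the paper.
ASSUMED_TOKEN_BUDGET = 100_000_000_000

# Only the two endpoint sizes are stated in the source.
ENDPOINT_SIZES = {"4M": 4_000_000, "2.5B": 2_500_000_000}

ratios = {
    name: tokens_per_parameter(ASSUMED_TOKEN_BUDGET, n_params)
    for name, n_params in ENDPOINT_SIZES.items()
}
spread = ratios["4M"] / ratios["2.5B"]
print(ratios, spread)
```

Whatever the true budget is, the spread between the endpoints stays at 625 under a fixed-data recipe, so the smallest model is trained far more heavily per parameter than the largest one.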

Circularity Check

0 steps flagged

No circularity: empirical scaling results are self-contained

full rationale

The paper reports direct training and evaluation of models from 4M to 2.5B parameters under a fixed recipe, with results measured on external benchmarks (BOOM, GIFT-Eval, TIME). No derivation chain, equation, or prediction reduces by construction to a fitted quantity defined in terms of itself; the scaling claim rests on observed performance deltas rather than any self-referential ansatz, uniqueness theorem, or renamed empirical pattern. The u-muP pipeline and architecture choices are described as design decisions, not as load-bearing self-citations that collapse the central result. This is the standard non-circular outcome for an empirical scaling study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard assumptions from transformer-based sequence modeling and scaling observations established in other domains; the main addition is empirical validation on time series data.

axioms (1)
  • domain assumption: Transformer-style architectures can be effectively adapted to time series data.
    This underpins the model family design described in the abstract.

pith-pipeline@v0.9.0 · 5709 in / 1163 out tokens · 79930 ms · 2026-05-20T06:37:54.480344+00:00 · methodology
