pith. machine review for the scientific record.

arxiv: 2603.04791 · v3 · submitted 2026-03-05 · 💻 cs.AI

Recognition: 2 theorem links


Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 16:45 UTC · model grok-4.3

classification 💻 cs.AI
keywords time series · foundation model · mixture of experts · forecasting · serial scaling · serial token prediction · TimeBench · GIFT-Eval

The pith

Timer-S1 scales time series models serially to achieve state-of-the-art forecasting on GIFT-Eval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents Timer-S1, a mixture-of-experts time series foundation model with 8.3 billion total parameters (0.75 billion activated per token). It applies serial scaling across three dimensions: the model architecture, the dataset (TimeBench, a curated corpus of one trillion time points), and the training pipeline. The core innovation is Serial-Token Prediction, a training objective that introduces serial computation to improve long-term forecasting while avoiding rolling-style inference and its error accumulation. On the GIFT-Eval leaderboard, Timer-S1 attains the best MASE and CRPS scores among pre-trained models, and the model is publicly released.

Core claim

Timer-S1 uses sparse TimeMoE blocks and TimeSTP blocks to implement Serial-Token Prediction as a training objective aligned with the serial nature of forecasting. This, together with serial scaling in three dimensions and a post-training stage, enables superior performance on long-horizon predictions compared to standard next-token prediction approaches.

What carries the argument

Serial-Token Prediction (STP), a generic training objective implemented through sparse TimeMoE and generic TimeSTP blocks, which introduces serial computation to improve long-term forecasting without costly rolling-style inference.
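The abstract states the objective but gives no equations (a gap the referee report below flags), so the contrast can only be illustrated, not reproduced. Below is a minimal sketch, assuming a toy sine series, linear least-squares "models", and arbitrary context and horizon sizes, of the general distinction at stake: rolling one-step forecasting feeds predictions back as inputs and lets errors compound, while a model supervised on the whole multi-step horizon in one pass does not. This is not the paper's STP objective or architecture.

```python
# Illustrative only: rolling one-step forecasting vs direct multi-step
# supervision. NOT the paper's Serial-Token Prediction, whose loss is not
# given in the abstract.
import numpy as np

rng = np.random.default_rng(0)
T, context, horizon = 2000, 64, 16
series = np.sin(np.arange(T) * 0.1) + 0.1 * rng.standard_normal(T)

def make_windows(x, c, h):
    X, Y = [], []
    for t in range(c, len(x) - h):
        X.append(x[t - c:t])
        Y.append(x[t:t + h])
    return np.array(X), np.array(Y)

X, Y = make_windows(series, context, horizon)

# One-step model: least-squares map from the context to the next value only.
w_one, *_ = np.linalg.lstsq(X, Y[:, 0], rcond=None)
# Multi-step model: one map from the context to all `horizon` future values,
# so every horizon is supervised directly in a single pass.
W_multi, *_ = np.linalg.lstsq(X, Y, rcond=None)

def rolling_forecast(ctx, steps):
    ctx, out = ctx.copy(), []
    for _ in range(steps):
        nxt = ctx @ w_one        # predict one step ahead
        out.append(nxt)
        ctx = np.roll(ctx, -1)
        ctx[-1] = nxt            # feed the prediction back in; errors compound
    return np.array(out)

rolling_mse = np.mean([(rolling_forecast(x, horizon) - y) ** 2
                       for x, y in zip(X[-100:], Y[-100:])])
direct_mse = np.mean((X[-100:] @ W_multi - Y[-100:]) ** 2)
print(f"rolling MSE: {rolling_mse:.4f}   direct multi-step MSE: {direct_mse:.4f}")
```

On toys like this, the directly supervised map usually degrades less at long horizons because nothing predicted is fed back as an input; whether that mechanism is what drives Timer-S1's gains is exactly what the referee asks the authors to demonstrate.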

If this is right

  • Improves long-term predictions by avoiding error accumulation from rolling forecasts.
  • Supports context lengths of 11.5K tokens through serial scaling.
  • Provides a pre-trained model that leads in MASE and CRPS on the GIFT-Eval benchmark.
  • Facilitates further research via public release of the model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Serial scaling methods could apply to other foundation model domains facing similar scalability issues.
  • The emphasis on unbiased data curation highlights the importance of dataset quality over sheer size in time series modeling.
  • Post-training stages may become a standard practice for optimizing both short-term and long-context performance in forecasting models.

Load-bearing premise

The curated TimeBench corpus with one trillion time points is high-quality and free from biases that could affect long-term predictive accuracy.

What would settle it

A significant drop in Timer-S1's MASE or CRPS scores when tested on time series data drawn from domains absent in TimeBench would indicate that the dataset curation introduced predictive bias.
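The settling test is a metric comparison, so it helps to fix what the metrics are. A minimal sketch using the textbook definitions of MASE (forecast MAE scaled by the in-sample seasonal-naive MAE) and a sample-based CRPS estimate; the series, horizon, and forecasts below are hypothetical, and the actual GIFT-Eval harness may differ in details such as aggregation:

```python
import numpy as np

def mase(y_true, y_pred, y_insample, season=1):
    """MAE of the forecast divided by the in-sample MAE of the seasonal-naive
    forecast (the standard Mean Absolute Scaled Error definition)."""
    scale = np.mean(np.abs(y_insample[season:] - y_insample[:-season]))
    return np.mean(np.abs(y_true - y_pred)) / scale

def crps_from_samples(y_true, samples):
    """Sample-based CRPS estimate, E|X - y| - 0.5 E|X - X'|, computed per step
    and averaged over the horizon; `samples` has shape [n_samples, horizon]."""
    term1 = np.mean(np.abs(samples - y_true), axis=0)
    term2 = 0.5 * np.mean(np.abs(samples[:, None, :] - samples[None, :, :]), axis=(0, 1))
    return np.mean(term1 - term2)

# Hypothetical numbers, only to show the call pattern.
rng = np.random.default_rng(1)
history = np.sin(np.arange(200) * 2 * np.pi / 24) + 0.2 * rng.standard_normal(200)
actual = np.sin(np.arange(200, 224) * 2 * np.pi / 24)
point_fc = actual + 0.05 * rng.standard_normal(24)         # a point forecast
prob_fc = actual + 0.05 * rng.standard_normal((100, 24))    # 100 sample paths

print("MASE:", mase(actual, point_fc, history, season=24))
print("CRPS:", crps_from_samples(actual, prob_fc))
```

A genuine settling experiment would compute these on series from domains absent from TimeBench and compare against the leaderboard figures.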

read the original abstract

We introduce Timer-S1, a strong Mixture-of-Experts (MoE) time series foundation model with 8.3B total parameters, 0.75B activated parameters for each token, and a context length of 11.5K. To overcome the scalability bottleneck in existing pre-trained time series foundation models, we perform Serial Scaling in three dimensions: model architecture, dataset, and training pipeline. Timer-S1 integrates sparse TimeMoE blocks and generic TimeSTP blocks for Serial-Token Prediction (STP), a generic training objective that adheres to the serial nature of forecasting. The proposed paradigm introduces serial computations to improve long-term predictions while avoiding costly rolling-style inference and pronounced error accumulation in the standard next-token prediction. Pursuing a high-quality and unbiased training dataset, we curate TimeBench, a corpus with one trillion time points, and apply meticulous data augmentation to mitigate predictive bias. We further pioneer a post-training stage, including continued pre-training and long-context extension, to enhance short-term and long-context performance. Evaluated on the large-scale GIFT-Eval leaderboard, Timer-S1 achieves state-of-the-art forecasting performance, attaining the best MASE and CRPS scores as a pre-trained model. Timer-S1 is released to facilitate further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Timer-S1, an 8.3B-parameter Mixture-of-Experts time series foundation model that applies Serial Scaling across architecture (TimeMoE and TimeSTP blocks), dataset (TimeBench corpus of 1T points), and training pipeline (Serial-Token Prediction objective plus post-training). It claims this yields state-of-the-art MASE and CRPS scores on the GIFT-Eval leaderboard as a pre-trained model, while avoiding error accumulation from rolling inference.

Significance. If the performance claims are supported by rigorous, reproducible experiments, the work would demonstrate the viability of billion-scale MoE models for time series with a serial prediction paradigm, potentially improving long-horizon forecasting efficiency and accuracy over standard next-token approaches.

major comments (3)
  1. [Evaluation on GIFT-Eval leaderboard] The SOTA MASE and CRPS claims on GIFT-Eval rest on leaderboard results without reported baselines, error bars, ablation studies, or statistical tests in the evaluation section, leaving the contribution of Serial Scaling unverifiable from the provided text.
  2. [TimeBench Curation and Data Augmentation] The TimeBench curation (1T points) and data augmentation pipeline are described as mitigating predictive bias, yet no quantitative diagnostics (distributional distances, leakage statistics, or augmentation ablations) are supplied to confirm that GIFT-Eval test windows remain out-of-distribution.
  3. [Serial-Token Prediction Objective] Serial-Token Prediction is presented as adhering to the serial nature of forecasting to reduce error accumulation, but the manuscript contains no equations, loss formulations, or derivations comparing STP to standard next-token prediction or rolling inference.
minor comments (2)
  1. [Abstract and Training Pipeline] The abstract and model description use terms such as 'meticulous data augmentation' and 'pioneer a post-training stage' without specifying the concrete techniques or hyperparameters employed.
  2. [Model Architecture] Clarify the exact parameter breakdown (8.3B total vs. 0.75B activated) and how TimeMoE blocks differ from generic TimeSTP blocks in the architecture diagram or section.
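On minor comment 2, the abstract fixes only the two totals: 8.3B parameters overall and 0.75B activated per token. The split below into always-active shared weights plus top-k-routed experts is an assumed decomposition chosen to reconcile those two figures, not the paper's actual architecture:

```python
# Hypothetical sparse-MoE parameter accounting. Only the 8.3B total and the
# 0.75B activated-per-token figures come from the abstract; the shared size,
# expert count, and routing top-k are assumptions for illustration.
total_params = 8.3e9
activated_params = 0.75e9
print(f"activated fraction: {activated_params / total_params:.1%}")   # about 9%

shared = 0.25e9            # assumed always-active (attention, embeddings, dense)
n_experts, top_k = 64, 4   # assumed expert count and routing
expert_size = (total_params - shared) / n_experts
print(f"implied expert size:      {expert_size / 1e6:.0f}M parameters")
print(f"implied activated params: {(shared + top_k * expert_size) / 1e9:.2f}B")
```

Many other decompositions reproduce the same totals, which is the referee's point: only the architecture section can pin down the real breakdown.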

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We appreciate the opportunity to strengthen the manuscript's rigor in evaluation, data validation, and technical exposition. We address each major comment below and commit to revisions that directly incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [Evaluation on GIFT-Eval leaderboard] The SOTA MASE and CRPS claims on GIFT-Eval rest on leaderboard results without reported baselines, error bars, ablation studies, or statistical tests in the evaluation section, leaving the contribution of Serial Scaling unverifiable from the provided text.

    Authors: We agree that the current presentation relies primarily on leaderboard references without sufficient internal verification. In the revised manuscript, we will expand the evaluation section to include explicit baseline comparisons drawn from the GIFT-Eval leaderboard, error bars computed over multiple independent runs, ablation studies that isolate the contributions of Serial Scaling (architecture, dataset, and training pipeline), and statistical significance tests (e.g., paired t-tests or Wilcoxon tests) against competing methods. These additions will make the performance claims and the role of Serial Scaling fully verifiable and reproducible (an illustrative sketch of such a paired test follows these responses). revision: yes

  2. Referee: [TimeBench Curation and Data Augmentation] The TimeBench curation (1T points) and data augmentation pipeline are described as mitigating predictive bias, yet no quantitative diagnostics (distributional distances, leakage statistics, or augmentation ablations) are supplied to confirm that GIFT-Eval test windows remain out-of-distribution.

    Authors: We acknowledge that the manuscript currently lacks quantitative support for the data curation claims. We will add a dedicated subsection with: (i) distributional distance metrics (e.g., Wasserstein distance and maximum mean discrepancy) between the TimeBench corpus and GIFT-Eval test windows, (ii) explicit leakage statistics and overlap checks, and (iii) ablation results on the data augmentation pipeline demonstrating its effect on bias mitigation. These diagnostics will confirm that GIFT-Eval test windows remain out-of-distribution. revision: yes

  3. Referee: [Serial-Token Prediction Objective] Serial-Token Prediction is presented as adhering to the serial nature of forecasting to reduce error accumulation, but the manuscript contains no equations, loss formulations, or derivations comparing STP to standard next-token prediction or rolling inference.

    Authors: We will incorporate the missing technical details. The revised manuscript will include the formal loss formulation for Serial-Token Prediction, the precise algorithmic description of serial token generation, and a direct comparison (with equations) to standard next-token prediction and rolling inference. We will also provide a brief derivation or illustrative analysis showing how the serial objective reduces error accumulation over long horizons. revision: yes
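As a concrete reading of response 1, the committed significance tests amount to pairing per-configuration scores between Timer-S1 and each baseline. A minimal sketch with invented MASE values (the real analysis would use the per-dataset GIFT-Eval results):

```python
# Hypothetical paired comparison of per-dataset MASE scores; the numbers are
# placeholders, not results from the paper or the leaderboard.
import numpy as np
from scipy import stats

timer_s1 = np.array([0.71, 0.83, 0.65, 0.92, 0.78, 0.60, 0.88, 0.74])
baseline = np.array([0.75, 0.85, 0.70, 0.91, 0.83, 0.66, 0.90, 0.80])

# Non-parametric Wilcoxon signed-rank test on the paired differences.
w_stat, w_p = stats.wilcoxon(timer_s1, baseline)
# Parametric counterpart: paired t-test.
t_stat, t_p = stats.ttest_rel(timer_s1, baseline)

print(f"Wilcoxon: statistic={w_stat:.2f}, p={w_p:.3f}")
print(f"Paired t: statistic={t_stat:.2f}, p={t_p:.3f}")
print(f"mean per-dataset MASE improvement: {np.mean(baseline - timer_s1):.3f}")
```

With only a handful of configurations, the signed-rank test has limited power, which is one more reason the promised error bars over multiple runs matter.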

Circularity Check

0 steps flagged

No significant circularity; claims rest on external benchmark evaluation

full rationale

The paper's central claims consist of empirical SOTA results on the independent GIFT-Eval leaderboard after training on the curated TimeBench corpus. No equations, derivations, or self-referential definitions appear that would reduce any prediction or result to its own inputs by construction. Data augmentation and post-training are presented as preprocessing steps rather than fitted parameters renamed as predictions. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked in the provided text. The performance metrics are externally falsifiable on a held-out leaderboard, so the central claims do not rest on a self-referential derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

The abstract introduces new named components (TimeMoE blocks, TimeSTP blocks, TimeBench) and a new training objective without providing independent external validation or formal definitions.

invented entities (3)
  • TimeMoE blocks · no independent evidence
    purpose: Sparse Mixture-of-Experts layers specialized for time series
    Introduced as part of the model architecture to enable scaling.
  • TimeSTP blocks · no independent evidence
    purpose: Generic blocks supporting Serial-Token Prediction
    New block type tied to the serial forecasting objective.
  • TimeBench · no independent evidence
    purpose: Training corpus containing one trillion time points
    Curated dataset claimed to be high-quality after augmentation.

pith-pipeline@v0.9.0 · 5553 in / 1269 out tokens · 38943 ms · 2026-05-15T16:45:48.748830+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Don't Learn the Shape: Forecasting Periodic Time Series by Rank-1 Decomposition

    cs.LG · 2026-05 · unverdicted · novelty 4.0

    A frozen average of the last two cycles matches or exceeds eight shape-learning alternatives on 97 GIFT-Eval configurations for periodic time series forecasting.
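The cited paper's baseline, as summarized above, is simple enough to sketch. The function below assumes a known period and a history containing at least two full cycles, and illustrates the described idea rather than that paper's exact protocol:

```python
import numpy as np

def last_two_cycle_forecast(history, period, horizon):
    """Tile the element-wise average of the last two full periods of the
    series: the frozen, shape-free baseline described in the citation above."""
    template = (history[-period:] + history[-2 * period:-period]) / 2.0
    reps = int(np.ceil(horizon / period))
    return np.tile(template, reps)[:horizon]

# Hypothetical usage on a noisy series with a 24-step period.
rng = np.random.default_rng(2)
y = np.sin(np.arange(500) * 2 * np.pi / 24) + 0.1 * rng.standard_normal(500)
print(last_two_cycle_forecast(y, period=24, horizon=48)[:5])
```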

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 1 Pith paper · 8 internal anchors
