pith. machine review for the scientific record.

arxiv: 2603.04791 · v3 · submitted 2026-03-05 · 💻 cs.AI

Recognition: 2 theorem links


Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 16:45 UTC · model grok-4.3

classification 💻 cs.AI
keywords time series · foundation model · mixture of experts · forecasting · serial scaling · serial token prediction · TimeBench · GIFT-Eval

The pith

Timer-S1 scales time series models serially to achieve state-of-the-art forecasting on GIFT-Eval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents Timer-S1, a mixture-of-experts time series foundation model with 8.3 billion total parameters (0.75 billion activated per token). It applies serial scaling across three dimensions: the model architecture, the dataset (TimeBench, a curated corpus of one trillion time points), and the training pipeline. The core innovation is Serial-Token Prediction, a training objective that introduces serial computation to improve long-term forecasting while avoiding rolling-style inference and its error accumulation. On the GIFT-Eval leaderboard, Timer-S1 attains the best MASE and CRPS scores among pre-trained models, and the model is publicly released.

Core claim

Timer-S1 uses sparse TimeMoE blocks and TimeSTP blocks to implement Serial-Token Prediction as a training objective aligned with the serial nature of forecasting. This, together with serial scaling in three dimensions and a post-training stage, enables superior performance on long-horizon predictions compared to standard next-token prediction approaches.

What carries the argument

Serial-Token Prediction (STP), a generic training objective implemented through sparse TimeMoE and generic TimeSTP blocks, which introduces serial computation to improve long-term forecasting without costly rolling-style inference.
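The abstract states the objective but gives no equations (a gap the referee report below flags), so the contrast can only be illustrated, not reproduced. Below is a minimal sketch, assuming a toy sine series, linear least-squares "models", and arbitrary context and horizon sizes, of the general distinction at stake: rolling one-step forecasting feeds predictions back as inputs and lets errors compound, while a model supervised on the whole multi-step horizon in one pass does not. This is not the paper's STP objective or architecture.

```python
# Illustrative only: rolling one-step forecasting vs direct multi-step
# supervision. NOT the paper's Serial-Token Prediction, whose loss is not
# given in the abstract.
import numpy as np

rng = np.random.default_rng(0)
T, context, horizon = 2000, 64, 16
series = np.sin(np.arange(T) * 0.1) + 0.1 * rng.standard_normal(T)

def make_windows(x, c, h):
    X, Y = [], []
    for t in range(c, len(x) - h):
        X.append(x[t - c:t])
        Y.append(x[t:t + h])
    return np.array(X), np.array(Y)

X, Y = make_windows(series, context, horizon)

# One-step model: least-squares map from the context to the next value only.
w_one, *_ = np.linalg.lstsq(X, Y[:, 0], rcond=None)
# Multi-step model: one map from the context to all `horizon` future values,
# so every horizon is supervised directly in a single pass.
W_multi, *_ = np.linalg.lstsq(X, Y, rcond=None)

def rolling_forecast(ctx, steps):
    ctx, out = ctx.copy(), []
    for _ in range(steps):
        nxt = ctx @ w_one        # predict one step ahead
        out.append(nxt)
        ctx = np.roll(ctx, -1)
        ctx[-1] = nxt            # feed the prediction back in; errors compound
    return np.array(out)

rolling_mse = np.mean([(rolling_forecast(x, horizon) - y) ** 2
                       for x, y in zip(X[-100:], Y[-100:])])
direct_mse = np.mean((X[-100:] @ W_multi - Y[-100:]) ** 2)
print(f"rolling MSE: {rolling_mse:.4f}   direct multi-step MSE: {direct_mse:.4f}")
```

On toys like this, the directly supervised map usually degrades less at long horizons because nothing predicted is fed back as an input; whether that mechanism is what drives Timer-S1's gains is exactly what the referee asks the authors to demonstrate.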

If this is right

  • Improves long-term predictions by avoiding error accumulation from rolling forecasts.
  • Supports context lengths of 11.5K tokens through serial scaling.
  • Provides a pre-trained model that leads in MASE and CRPS on the GIFT-Eval benchmark.
  • Facilitates further research via public release of the model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Serial scaling methods could apply to other foundation model domains facing similar scalability issues.
  • The emphasis on unbiased data curation highlights the importance of dataset quality over sheer size in time series modeling.
  • Post-training stages may become a standard practice for optimizing both short-term and long-context performance in forecasting models.

Load-bearing premise

The curated TimeBench corpus with one trillion time points is high-quality and free from biases that could affect long-term predictive accuracy.

What would settle it

A significant drop in Timer-S1's MASE or CRPS scores when tested on time series data drawn from domains absent in TimeBench would indicate that the dataset curation introduced predictive bias.
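The settling test is a metric comparison, so it helps to fix what the metrics are. A minimal sketch using the textbook definitions of MASE (forecast MAE scaled by the in-sample seasonal-naive MAE) and a sample-based CRPS estimate; the series, horizon, and forecasts below are hypothetical, and the actual GIFT-Eval harness may differ in details such as aggregation:

```python
import numpy as np

def mase(y_true, y_pred, y_insample, season=1):
    """MAE of the forecast divided by the in-sample MAE of the seasonal-naive
    forecast (the standard Mean Absolute Scaled Error definition)."""
    scale = np.mean(np.abs(y_insample[season:] - y_insample[:-season]))
    return np.mean(np.abs(y_true - y_pred)) / scale

def crps_from_samples(y_true, samples):
    """Sample-based CRPS estimate, E|X - y| - 0.5 E|X - X'|, computed per step
    and averaged over the horizon; `samples` has shape [n_samples, horizon]."""
    term1 = np.mean(np.abs(samples - y_true), axis=0)
    term2 = 0.5 * np.mean(np.abs(samples[:, None, :] - samples[None, :, :]), axis=(0, 1))
    return np.mean(term1 - term2)

# Hypothetical numbers, only to show the call pattern.
rng = np.random.default_rng(1)
history = np.sin(np.arange(200) * 2 * np.pi / 24) + 0.2 * rng.standard_normal(200)
actual = np.sin(np.arange(200, 224) * 2 * np.pi / 24)
point_fc = actual + 0.05 * rng.standard_normal(24)         # a point forecast
prob_fc = actual + 0.05 * rng.standard_normal((100, 24))    # 100 sample paths

print("MASE:", mase(actual, point_fc, history, season=24))
print("CRPS:", crps_from_samples(actual, prob_fc))
```

A genuine settling experiment would compute these on series from domains absent from TimeBench and compare against the leaderboard figures.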

read the original abstract

We introduce Timer-S1, a strong Mixture-of-Experts (MoE) time series foundation model with 8.3B total parameters, 0.75B activated parameters for each token, and a context length of 11.5K. To overcome the scalability bottleneck in existing pre-trained time series foundation models, we perform Serial Scaling in three dimensions: model architecture, dataset, and training pipeline. Timer-S1 integrates sparse TimeMoE blocks and generic TimeSTP blocks for Serial-Token Prediction (STP), a generic training objective that adheres to the serial nature of forecasting. The proposed paradigm introduces serial computations to improve long-term predictions while avoiding costly rolling-style inference and pronounced error accumulation in the standard next-token prediction. Pursuing a high-quality and unbiased training dataset, we curate TimeBench, a corpus with one trillion time points, and apply meticulous data augmentation to mitigate predictive bias. We further pioneer a post-training stage, including continued pre-training and long-context extension, to enhance short-term and long-context performance. Evaluated on the large-scale GIFT-Eval leaderboard, Timer-S1 achieves state-of-the-art forecasting performance, attaining the best MASE and CRPS scores as a pre-trained model. Timer-S1 is released to facilitate further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Timer-S1, an 8.3B-parameter Mixture-of-Experts time series foundation model that applies Serial Scaling across architecture (TimeMoE and TimeSTP blocks), dataset (TimeBench corpus of 1T points), and training pipeline (Serial-Token Prediction objective plus post-training). It claims this yields state-of-the-art MASE and CRPS scores on the GIFT-Eval leaderboard as a pre-trained model, while avoiding error accumulation from rolling inference.

Significance. If the performance claims are supported by rigorous, reproducible experiments, the work would demonstrate the viability of billion-scale MoE models for time series with a serial prediction paradigm, potentially improving long-horizon forecasting efficiency and accuracy over standard next-token approaches.

major comments (3)
  1. [Evaluation on GIFT-Eval leaderboard] The SOTA MASE and CRPS claims on GIFT-Eval rest on leaderboard results without reported baselines, error bars, ablation studies, or statistical tests in the evaluation section, leaving the contribution of Serial Scaling unverifiable from the provided text.
  2. [TimeBench Curation and Data Augmentation] The TimeBench curation (1T points) and data augmentation pipeline are described as mitigating predictive bias, yet no quantitative diagnostics (distributional distances, leakage statistics, or augmentation ablations) are supplied to confirm that GIFT-Eval test windows remain out-of-distribution.
  3. [Serial-Token Prediction Objective] Serial-Token Prediction is presented as adhering to the serial nature of forecasting to reduce error accumulation, but the manuscript contains no equations, loss formulations, or derivations comparing STP to standard next-token prediction or rolling inference.
minor comments (2)
  1. [Abstract and Training Pipeline] The abstract and model description use terms such as 'meticulous data augmentation' and 'pioneer a post-training stage' without specifying the concrete techniques or hyperparameters employed.
  2. [Model Architecture] Clarify the exact parameter breakdown (8.3B total vs. 0.75B activated) and how TimeMoE blocks differ from generic TimeSTP blocks in the architecture diagram or section.
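On minor comment 2, the abstract fixes only the two totals: 8.3B parameters overall and 0.75B activated per token. The split below into always-active shared weights plus top-k-routed experts is an assumed decomposition chosen to reconcile those two figures, not the paper's actual architecture:

```python
# Hypothetical sparse-MoE parameter accounting. Only the 8.3B total and the
# 0.75B activated-per-token figures come from the abstract; the shared size,
# expert count, and routing top-k are assumptions for illustration.
total_params = 8.3e9
activated_params = 0.75e9
print(f"activated fraction: {activated_params / total_params:.1%}")   # about 9%

shared = 0.25e9            # assumed always-active (attention, embeddings, dense)
n_experts, top_k = 64, 4   # assumed expert count and routing
expert_size = (total_params - shared) / n_experts
print(f"implied expert size:      {expert_size / 1e6:.0f}M parameters")
print(f"implied activated params: {(shared + top_k * expert_size) / 1e9:.2f}B")
```

Many other decompositions reproduce the same totals, which is the referee's point: only the architecture section can pin down the real breakdown.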

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We appreciate the opportunity to strengthen the manuscript's rigor in evaluation, data validation, and technical exposition. We address each major comment below and commit to revisions that directly incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [Evaluation on GIFT-Eval leaderboard] The SOTA MASE and CRPS claims on GIFT-Eval rest on leaderboard results without reported baselines, error bars, ablation studies, or statistical tests in the evaluation section, leaving the contribution of Serial Scaling unverifiable from the provided text.

    Authors: We agree that the current presentation relies primarily on leaderboard references without sufficient internal verification. In the revised manuscript, we will expand the evaluation section to include explicit baseline comparisons drawn from the GIFT-Eval leaderboard, error bars computed over multiple independent runs, ablation studies that isolate the contributions of Serial Scaling (architecture, dataset, and training pipeline), and statistical significance tests (e.g., paired t-tests or Wilcoxon tests) against competing methods. These additions will make the performance claims and the role of Serial Scaling fully verifiable and reproducible (an illustrative sketch of such a paired test follows these responses). revision: yes

  2. Referee: [TimeBench Curation and Data Augmentation] The TimeBench curation (1T points) and data augmentation pipeline are described as mitigating predictive bias, yet no quantitative diagnostics (distributional distances, leakage statistics, or augmentation ablations) are supplied to confirm that GIFT-Eval test windows remain out-of-distribution.

    Authors: We acknowledge that the manuscript currently lacks quantitative support for the data curation claims. We will add a dedicated subsection with: (i) distributional distance metrics (e.g., Wasserstein distance and maximum mean discrepancy) between the TimeBench corpus and GIFT-Eval test windows, (ii) explicit leakage statistics and overlap checks, and (iii) ablation results on the data augmentation pipeline demonstrating its effect on bias mitigation. These diagnostics will confirm that GIFT-Eval test windows remain out-of-distribution. revision: yes

  3. Referee: [Serial-Token Prediction Objective] Serial-Token Prediction is presented as adhering to the serial nature of forecasting to reduce error accumulation, but the manuscript contains no equations, loss formulations, or derivations comparing STP to standard next-token prediction or rolling inference.

    Authors: We will incorporate the missing technical details. The revised manuscript will include the formal loss formulation for Serial-Token Prediction, the precise algorithmic description of serial token generation, and a direct comparison (with equations) to standard next-token prediction and rolling inference. We will also provide a brief derivation or illustrative analysis showing how the serial objective reduces error accumulation over long horizons. revision: yes
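As a concrete reading of response 1, the committed significance tests amount to pairing per-configuration scores between Timer-S1 and each baseline. A minimal sketch with invented MASE values (the real analysis would use the per-dataset GIFT-Eval results):

```python
# Hypothetical paired comparison of per-dataset MASE scores; the numbers are
# placeholders, not results from the paper or the leaderboard.
import numpy as np
from scipy import stats

timer_s1 = np.array([0.71, 0.83, 0.65, 0.92, 0.78, 0.60, 0.88, 0.74])
baseline = np.array([0.75, 0.85, 0.70, 0.91, 0.83, 0.66, 0.90, 0.80])

# Non-parametric Wilcoxon signed-rank test on the paired differences.
w_stat, w_p = stats.wilcoxon(timer_s1, baseline)
# Parametric counterpart: paired t-test.
t_stat, t_p = stats.ttest_rel(timer_s1, baseline)

print(f"Wilcoxon: statistic={w_stat:.2f}, p={w_p:.3f}")
print(f"Paired t: statistic={t_stat:.2f}, p={t_p:.3f}")
print(f"mean per-dataset MASE improvement: {np.mean(baseline - timer_s1):.3f}")
```

With only a handful of configurations, the signed-rank test has limited power, which is one more reason the promised error bars over multiple runs matter.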

Circularity Check

0 steps flagged

No significant circularity; claims rest on external benchmark evaluation

full rationale

The paper's central claims consist of empirical SOTA results on the independent GIFT-Eval leaderboard after training on the curated TimeBench corpus. No equations, derivations, or self-referential definitions appear that would reduce any prediction or result to its own inputs by construction. Data augmentation and post-training are presented as preprocessing steps rather than fitted parameters renamed as predictions. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked in the provided text. The performance metrics are externally falsifiable on a held-out leaderboard, so the central claims do not rest on a self-referential derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

The abstract introduces new named components (TimeMoE blocks, TimeSTP blocks, TimeBench) and a new training objective without providing independent external validation or formal definitions.

invented entities (3)
  • TimeMoE blocks · no independent evidence
    purpose: Sparse Mixture-of-Experts layers specialized for time series
    Introduced as part of the model architecture to enable scaling.
  • TimeSTP blocks · no independent evidence
    purpose: Generic blocks supporting Serial-Token Prediction
    New block type tied to the serial forecasting objective.
  • TimeBench · no independent evidence
    purpose: Training corpus containing one trillion time points
    Curated dataset claimed to be high-quality after augmentation.

pith-pipeline@v0.9.0 · 5553 in / 1269 out tokens · 38943 ms · 2026-05-15T16:45:48.748830+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Don't Learn the Shape: Forecasting Periodic Time Series by Rank-1 Decomposition

    cs.LG · 2026-05 · unverdicted · novelty 4.0

    A frozen average of the last two cycles matches or exceeds eight shape-learning alternatives on 97 GIFT-Eval configurations for periodic time series forecasting.
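The cited paper's baseline, as summarized above, is simple enough to sketch. The function below assumes a known period and a history containing at least two full cycles, and illustrates the described idea rather than that paper's exact protocol:

```python
import numpy as np

def last_two_cycle_forecast(history, period, horizon):
    """Tile the element-wise average of the last two full periods of the
    series: the frozen, shape-free baseline described in the citation above."""
    template = (history[-period:] + history[-2 * period:-period]) / 2.0
    reps = int(np.ceil(horizon / period))
    return np.tile(template, reps)[:horizon]

# Hypothetical usage on a noisy series with a 24-step period.
rng = np.random.default_rng(2)
y = np.sin(np.arange(500) * 2 * np.pi / 24) + 0.1 * rng.standard_normal(500)
print(last_two_cycle_forecast(y, period=24, horizon=48)[:5])
```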

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 1 Pith paper · 8 internal anchors
