Recognition: 2 theorem links
Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling
Pith reviewed 2026-05-15 16:45 UTC · model grok-4.3
The pith
Timer-S1 scales time series models serially to achieve state-of-the-art forecasting on GIFT-Eval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Timer-S1 uses sparse TimeMoE blocks and generic TimeSTP blocks to implement Serial-Token Prediction, a training objective aligned with the serial nature of forecasting. Together with serial scaling across three dimensions (model architecture, dataset, and training pipeline) and a post-training stage, this yields stronger long-horizon forecasts than standard next-token prediction.
What carries the argument
Serial-Token Prediction (STP): a generic training objective, carried by sparse TimeMoE blocks and generic TimeSTP blocks, that introduces serial computations to improve long-term forecasts without costly rolling-style inference.
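The paper's text does not specify the block internals, but the abstract pins down the rough shape: each serial step re-reads the lookback series and intermediate representations and refreshes a shift-by-one, multi-horizon draft. The sketch below is a minimal, hypothetical rendering of that loop; serial_token_prediction, n_serial_steps, and the single weight matrix are invented stand-ins, not the authors' TimeSTP architecture.

```python
# Hypothetical sketch of the STP refinement loop described in the abstract;
# NOT the authors' implementation (all names and shapes here are invented).
import numpy as np

def serial_token_prediction(lookback, horizon, weights, n_serial_steps=4):
    """Refine a multi-horizon forecast over a few serial steps.

    Each step re-reads the full lookback window together with the current
    draft forecast, so later horizons are conditioned on earlier estimates
    without running one autoregressive rolling pass per predicted step.
    """
    forecast = np.zeros(horizon)
    for _ in range(n_serial_steps):
        state = np.concatenate([lookback, forecast])  # lookback + current draft
        forecast = np.tanh(state @ weights)           # shift-by-one style update
    return forecast

rng = np.random.default_rng(0)
L, H = 96, 24                                # toy lookback length and horizon
W = rng.normal(scale=0.05, size=(L + H, H))  # invented single-layer weights
series = np.sin(np.linspace(0.0, 12.0, L))   # placeholder input series
print(serial_token_prediction(series, H, W))
```

Even in this toy, the design point the abstract emphasizes is visible: inference costs a small fixed number of serial passes over the whole horizon, rather than one full forward pass per forecast step as in rolling next-token prediction.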
If this is right
- Improves long-term predictions by avoiding error accumulation from rolling forecasts.
- Supports context lengths of 11.5K tokens through serial scaling.
- Provides a pre-trained model that leads in MASE and CRPS on the GIFT-Eval benchmark (both metrics are sketched below this list).
- Facilitates further research via public release of the model.
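Because the leaderboard claim rests on these two metrics, their standard definitions are worth having at hand. The sketch below uses the textbook formulas (seasonal-naive scaling for MASE; the sample-based energy form for CRPS); GIFT-Eval's exact normalization and aggregation may differ, and all arrays are synthetic placeholders.

```python
# Standard-definition sketches of MASE and CRPS; GIFT-Eval's exact
# aggregation may differ. Data below are synthetic placeholders.
import numpy as np

def mase(y_true, y_pred, y_train, m=1):
    """Mean Absolute Scaled Error: forecast MAE divided by the in-sample
    MAE of a seasonal-naive forecast with period m."""
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y_true - y_pred)) / scale

def crps_samples(y_true, samples):
    """Sample-based CRPS estimate, E|X - y| - 0.5 * E|X - X'|, averaged
    over the horizon. `samples` has shape (n_samples, horizon)."""
    term1 = np.mean(np.abs(samples - y_true), axis=0)
    term2 = 0.5 * np.mean(
        np.abs(samples[:, None, :] - samples[None, :, :]), axis=(0, 1)
    )
    return float(np.mean(term1 - term2))

rng = np.random.default_rng(1)
y_train = rng.normal(size=200)
y_true = rng.normal(size=24)
y_pred = y_true + rng.normal(scale=0.1, size=24)
samples = y_true + rng.normal(scale=0.2, size=(100, 24))
print(mase(y_true, y_pred, y_train), crps_samples(y_true, samples))
```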
Where Pith is reading between the lines
- Serial scaling methods could apply to other foundation model domains facing similar scalability issues.
- The emphasis on unbiased data curation highlights the importance of dataset quality over sheer size in time series modeling.
- Post-training stages may become a standard practice for optimizing both short-term and long-context performance in forecasting models.
Load-bearing premise
The curated TimeBench corpus with one trillion time points is high-quality and free from biases that could affect long-term predictive accuracy.
What would settle it
A significant drop in Timer-S1's MASE or CRPS scores when tested on time series data drawn from domains absent in TimeBench would indicate that the dataset curation introduced predictive bias.
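A minimal version of that check, run off a per-domain score table, might look like the sketch below; the domain labels and score values are invented placeholders.

```python
# Toy falsification check: compare MASE on domains present in vs absent from
# the training corpus. Domain names and scores are invented placeholders.
import numpy as np

mase_by_domain = {
    "electricity (in TimeBench)": np.array([0.72, 0.69, 0.75]),
    "hospital admissions (absent)": np.array([0.97, 1.04, 0.98]),
}
for domain, scores in mase_by_domain.items():
    print(f"{domain}: mean MASE = {scores.mean():.2f}")
# A systematically larger error on absent domains would signal curation bias.
```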
Original abstract
We introduce Timer-S1, a strong Mixture-of-Experts (MoE) time series foundation model with 8.3B total parameters, 0.75B activated parameters for each token, and a context length of 11.5K. To overcome the scalability bottleneck in existing pre-trained time series foundation models, we perform Serial Scaling in three dimensions: model architecture, dataset, and training pipeline. Timer-S1 integrates sparse TimeMoE blocks and generic TimeSTP blocks for Serial-Token Prediction (STP), a generic training objective that adheres to the serial nature of forecasting. The proposed paradigm introduces serial computations to improve long-term predictions while avoiding costly rolling-style inference and pronounced error accumulation in the standard next-token prediction. Pursuing a high-quality and unbiased training dataset, we curate TimeBench, a corpus with one trillion time points, and apply meticulous data augmentation to mitigate predictive bias. We further pioneer a post-training stage, including continued pre-training and long-context extension, to enhance short-term and long-context performance. Evaluated on the large-scale GIFT-Eval leaderboard, Timer-S1 achieves state-of-the-art forecasting performance, attaining the best MASE and CRPS scores as a pre-trained model. Timer-S1 is released to facilitate further research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Timer-S1, an 8.3B-parameter Mixture-of-Experts time series foundation model that applies Serial Scaling across architecture (TimeMoE and TimeSTP blocks), dataset (TimeBench corpus of 1T points), and training pipeline (Serial-Token Prediction objective plus post-training). It claims this yields state-of-the-art MASE and CRPS scores on the GIFT-Eval leaderboard as a pre-trained model, while avoiding error accumulation from rolling inference.
Significance. If the performance claims are supported by rigorous, reproducible experiments, the work would demonstrate the viability of billion-scale MoE models for time series with a serial prediction paradigm, potentially improving long-horizon forecasting efficiency and accuracy over standard next-token approaches.
major comments (3)
- [Evaluation on GIFT-Eval leaderboard] The SOTA MASE and CRPS claims on GIFT-Eval rest on leaderboard results without reported baselines, error bars, ablation studies, or statistical tests in the evaluation section, leaving the contribution of Serial Scaling unverifiable from the provided text.
- [TimeBench Curation and Data Augmentation] The TimeBench curation (1T points) and data augmentation pipeline are described as mitigating predictive bias, yet no quantitative diagnostics (distributional distances, leakage statistics, or augmentation ablations) are supplied to confirm that GIFT-Eval test windows remain out-of-distribution.
- [Serial-Token Prediction Objective] Serial-Token Prediction is presented as adhering to the serial nature of forecasting to reduce error accumulation, but the manuscript contains no equations, loss formulations, or derivations comparing STP to standard next-token prediction or rolling inference.
minor comments (2)
- [Abstract and Training Pipeline] The abstract and model description use terms such as 'meticulous data augmentation' and 'pioneer a post-training stage' without specifying the concrete techniques or hyperparameters employed.
- [Model Architecture] Clarify the exact parameter breakdown (8.3B total vs. 0.75B activated) and how TimeMoE blocks differ from generic TimeSTP blocks in the architecture diagram or section.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We appreciate the opportunity to strengthen the manuscript's rigor in evaluation, data validation, and technical exposition. We address each major comment below and commit to revisions that directly incorporate the suggested improvements.
Point-by-point responses
-
Referee: [Evaluation on GIFT-Eval leaderboard] The SOTA MASE and CRPS claims on GIFT-Eval rest on leaderboard results without reported baselines, error bars, ablation studies, or statistical tests in the evaluation section, leaving the contribution of Serial Scaling unverifiable from the provided text.
Authors: We agree that the current presentation relies primarily on leaderboard references without sufficient internal verification. In the revised manuscript, we will expand the evaluation section to include explicit baseline comparisons drawn from the GIFT-Eval leaderboard, error bars computed over multiple independent runs, ablation studies that isolate the contributions of Serial Scaling (architecture, dataset, and training pipeline), and statistical significance tests (e.g., paired t-tests or Wilcoxon tests) against competing methods. These additions will make the performance claims and the role of Serial Scaling fully verifiable and reproducible. revision: yes
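One concrete shape for the promised tests is a paired Wilcoxon signed-rank test over per-dataset scores, as sketched below; the score arrays are invented placeholders, not GIFT-Eval results.

```python
# Hedged sketch of a paired significance test over per-dataset MASE scores.
# Scores are synthetic placeholders, not actual leaderboard numbers.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(42)
mase_timer_s1 = rng.uniform(0.6, 1.0, size=24)               # placeholder
mase_baseline = mase_timer_s1 + rng.normal(0.05, 0.02, 24)   # placeholder

stat, p_value = wilcoxon(mase_timer_s1, mase_baseline)
print(f"Wilcoxon statistic = {stat:.1f}, p = {p_value:.4f}")
```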
-
Referee: [TimeBench Curation and Data Augmentation] The TimeBench curation (1T points) and data augmentation pipeline are described as mitigating predictive bias, yet no quantitative diagnostics (distributional distances, leakage statistics, or augmentation ablations) are supplied to confirm that GIFT-Eval test windows remain out-of-distribution.
Authors: We acknowledge that the manuscript currently lacks quantitative support for the data curation claims. We will add a dedicated subsection with: (i) distributional distance metrics (e.g., Wasserstein distance and maximum mean discrepancy) between the TimeBench corpus and GIFT-Eval test windows, (ii) explicit leakage statistics and overlap checks, and (iii) ablation results on the data augmentation pipeline demonstrating its effect on bias mitigation. These diagnostics will confirm that GIFT-Eval test windows remain out-of-distribution. revision: yes
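Illustrative versions of the two promised diagnostics might look like the sketch below: a 1-D Wasserstein distance over pooled values (a marginal check only) and an RBF-kernel MMD over whole windows. The window arrays are synthetic stand-ins for TimeBench and GIFT-Eval data.

```python
# Hedged sketch of distributional diagnostics between train and test windows.
# All data are synthetic stand-ins, not TimeBench or GIFT-Eval series.
import numpy as np
from scipy.stats import wasserstein_distance

def rbf_mmd2(x, y, sigma=1.0):
    """Biased squared-MMD estimate with an RBF kernel over whole windows."""
    def k(a, b):
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

rng = np.random.default_rng(0)
train_windows = rng.normal(size=(200, 32))            # placeholder corpus
test_windows = rng.normal(0.3, 1.1, size=(150, 32))   # placeholder test set

wd = wasserstein_distance(train_windows.ravel(), test_windows.ravel())
print(f"Wasserstein = {wd:.3f}, "
      f"MMD^2 = {rbf_mmd2(train_windows, test_windows):.4f}")
```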
-
Referee: [Serial-Token Prediction Objective] Serial-Token Prediction is presented as adhering to the serial nature of forecasting to reduce error accumulation, but the manuscript contains no equations, loss formulations, or derivations comparing STP to standard next-token prediction or rolling inference.
Authors: We will incorporate the missing technical details. The revised manuscript will include the formal loss formulation for Serial-Token Prediction, the precise algorithmic description of serial token generation, and a direct comparison (with equations) to standard next-token prediction and rolling inference. We will also provide a brief derivation or illustrative analysis showing how the serial objective reduces error accumulation over long horizons. revision: yes
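For orientation, one plausible way to write the promised comparison is given below; since the manuscript provides no equations, the refinement map g_theta and the STP loss shape are assumptions, not the paper's actual formulation.

```latex
% A plausible sketch only: the paper's actual STP loss is not given in the
% text, so g_\theta, S, and H below are assumed notation, not the authors'.
% Standard next-token prediction (NTP) over a token sequence x_{1:T}:
\[
  \mathcal{L}_{\mathrm{NTP}}(\theta)
    = \sum_{t=1}^{T-1} \ell\bigl(f_\theta(x_{1:t}),\, x_{t+1}\bigr).
\]
% A serial objective with S refinement steps: step s re-reads the lookback
% x_{1:t} and the previous draft to update all H horizons at once,
\[
  \hat{x}^{(s)}_{t+1:t+H}
    = g_\theta\bigl(x_{1:t},\, \hat{x}^{(s-1)}_{t+1:t+H}\bigr),
  \qquad
  \mathcal{L}_{\mathrm{STP}}(\theta)
    = \sum_{s=1}^{S} \sum_{h=1}^{H} \ell\bigl(\hat{x}^{(s)}_{t+h},\, x_{t+h}\bigr),
\]
% so inference costs S serial passes (S small) instead of H rolling steps,
% avoiding the step-by-step error accumulation of rolling NTP inference.
```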
Circularity Check
No significant circularity; claims rest on external benchmark evaluation
Full rationale
The paper's central claims consist of empirical SOTA results on the independent GIFT-Eval leaderboard after training on the curated TimeBench corpus. No equations, derivations, or self-referential definitions appear that would reduce any prediction or result to its own inputs by construction. Data augmentation and post-training are presented as preprocessing steps rather than fitted parameters renamed as predictions. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked in the provided text. The performance metrics are externally falsifiable on a held-out leaderboard, making the derivation chain self-contained.
Axiom & Free-Parameter Ledger
invented entities (3)
- TimeMoE blocks · no independent evidence
- TimeSTP blocks · no independent evidence
- TimeBench · no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat inductive structure; embed_strictMono_of_one_lt; time-as-orbit certificate
Tag: echoes (the paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency).
"Serial-Token Prediction (STP), a serialized version of the Transformer block... each TimeSTP block refers to the initial lookback series and intermediate representations, and iteratively produces the shift-by-one prediction, thereby introducing progressive serial computations for multi-horizon forecasts."
-
IndisputableMonolith/Foundation/ArrowOfTime.lean · arrow_from_z; before_transitive; z_monotone_absolute
Tag: echoes (the paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency).
"Forecasting into the long term accumulates uncertainty, as the prediction of each step depends on all preceding estimations, which positions time series forecasting as a serial problem."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Don't Learn the Shape: Forecasting Periodic Time Series by Rank-1 Decomposition
A frozen average of the last two cycles matches or exceeds eight shape-learning alternatives on 97 GIFT-Eval configurations for periodic time series forecasting.
Reference graph
Works this paper leans on
- [1] Taha Aksu, Gerald Woo, Juncheng Liu, Xu Liu, Chenghao Liu, Silvio Savarese, Caiming Xiong, and Doyen Sahoo. GIFT-Eval: A benchmark for general time series forecasting model evaluation. In NeurIPS Workshop on Time Series in the Age of Large Models, 2024.
- [2] Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, et al. Chronos: Learning the language of time series. arXiv preprint arXiv:2403.07815, 2024.
- [3] Abdul Fatir Ansari, Oleksandr Shchur, Jaris Küken, Andreas Auer, Boran Han, Pedro Mercado, Syama Sundar Rangapuram, Huibin Shen, Lorenzo Stella, Xiyuan Zhang, et al. Chronos-2: From univariate to universal forecasting. arXiv preprint arXiv:2510.15821, 2025.
- [4] Sebastian Pineda Arango, Pedro Mercado, Shubham Kapoor, Abdul Fatir Ansari, Lorenzo Stella, Huibin Shen, Hugo Senetaire, Caner Turkmen, Oleksandr Shchur, Danielle C Maddix, et al. ChronosX: Adapting pretrained time series models with exogenous variables. arXiv preprint arXiv:2503.12107, 2025.
- [5] Andreas Auer, Patrick Podest, Daniel Klotz, Sebastian Böck, Günter Klambauer, and Sepp Hochreiter. TiRex: Zero-shot forecasting across long and short horizons with enhanced in-context learning. arXiv preprint arXiv:2505.23719, 2025.
- [6] Abdelhakim Benechehab, Vasilii Feofanov, Giuseppe Paolo, Albert Thomas, Maurizio Filippone, and Balázs Kégl. AdaPTS: Adapting univariate foundation models to probabilistic multivariate time series forecasting. arXiv preprint arXiv:2502.10235, 2025.
- [7] Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. A neural probabilistic language model. Advances in Neural Information Processing Systems, 13, 2000.
- [8] George Box. Box and Jenkins: time series analysis, forecasting and control. In A Very British Affair: Six Britons and the Development of Time Series Analysis During the 20th Century, pages 161–215. Springer, 2013.
- [9] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104, 2000.
- [10] Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks, 20(3):542–542, 2009.
- [11] Ben Cohen, Emaad Khwaja, Kan Wang, Charles Masson, Elise Ramé, Youssef Doubli, and Othmane Abou-Amal. Toto: Time series optimized transformer for observability. arXiv preprint arXiv:2407.07874, 2024.
- [12] Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. A decoder-only foundation model for time-series forecasting. arXiv preprint arXiv:2310.10688, 2023.
- [13] Abhimanyu Das, Matthew Faw, Rajat Sen, and Yichen Zhou. In-context fine-tuning for time-series foundation models. arXiv preprint arXiv:2410.24087, 2024.
- [14] Sarkar Snigdha Sarathi Das, Palash Goyal, Mihir Parmar, Yiwen Song, Long T Le, Lesly Miculicich, Jinsung Yoon, Rui Zhang, Hamid Palangi, and Tomas Pfister. Synapse: Adaptive arbitration of complementary expertise in time series foundational models. arXiv preprint arXiv:2511.05460, 2025.
- [15] Samuel Dooley, Gurnoor Singh Khurana, Chirag Mohapatra, Siddartha Naidu, and Colin White. ForecastPFN: Synthetically-trained zero-shot forecasting. arXiv preprint arXiv:2311.01933, 2023.
- [16] Graham Elliott, Thomas J. Rothenberg, and James H. Stock. Efficient tests for an autoregressive unit root. Econometrica, 1996.
- [17] Milton Friedman. The interpolation of time series by related series. Journal of the American Statistical Association, 57(300):729–757, 1962.
- [18] Azul Garza and Renée Rosillo. TimeCopilot. arXiv preprint arXiv:2509.00616, 2025.
- [19] Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. Better & faster large language models via multi-token prediction. arXiv preprint arXiv:2404.19737, 2024.
- [20] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
- [21] Georg Goerg. Forecastable component analysis. In International Conference on Machine Learning, pages 64–72. PMLR, 2013.
- [22] Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. MOMENT: A family of open time-series foundation models. arXiv preprint arXiv:2402.03885, 2024.
- [23] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081):633–638, 2025.
- [24] Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4246–4253, 2020.
- [25] RJ Hyndman. Forecasting: Principles and Practice. OTexts, 2018.
- [26] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.
- [27] Maurice George Kendall and A Bradford Hill. The analysis of economic time-series, part I: Prices. Journal of the Royal Statistical Society, Series A (General), 116(1):11–34, 1953.
- [28] Taesung Kim, Jinhee Kim, Yunwon Tae, Cheonbok Park, Jang-Ho Choi, and Jaegul Choo. Reversible instance normalization for accurate time-series forecasting against distribution shift. In International Conference on Learning Representations, 2021.
- [29] Bryan Lim and Stefan Zohren. Time-series forecasting with deep learning: a survey. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 379(2194), 2021.
- [30] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
- [31] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
- [32] Chenghao Liu, Taha Aksu, Juncheng Liu, Xu Liu, Hanshu Yan, Quang Pham, Silvio Savarese, Doyen Sahoo, Caiming Xiong, and Junnan Li. Moirai 2.0: When less is more for time series forecasting. arXiv preprint arXiv:2511.11698, 2025.
- [33] Xu Liu, Juncheng Liu, Gerald Woo, Taha Aksu, Yuxuan Liang, Roger Zimmermann, Chenghao Liu, Silvio Savarese, Caiming Xiong, and Doyen Sahoo. Moirai-MoE: Empowering time series foundation models with sparse mixture of experts. arXiv preprint arXiv:2410.10469, 2024.
- [34] Yong Liu, Haixu Wu, Jianmin Wang, and Mingsheng Long. Non-stationary transformers: Exploring the stationarity in time series forecasting. Advances in Neural Information Processing Systems, 35:9881–9893, 2022.
- [35] Yong Liu, Guo Qin, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. Timer-XL: Long-context transformers for unified time series forecasting. arXiv preprint arXiv:2410.04803, 2024.
- [36] Yong Liu, Haoran Zhang, Chenyu Li, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. Timer: Generative pre-trained transformers are large time series models. In Forty-first International Conference on Machine Learning, 2024.
- [37] Yong Liu, Guo Qin, Zhiyuan Shi, Zhi Chen, Caiyin Yang, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. Sundial: A family of highly capable time series foundation models. arXiv preprint arXiv:2502.00816, 2025.
- [38] Yuxi Liu, Konpat Preechakul, Kananart Kuwaranancharoen, and Yutong Bai. The serial scaling hypothesis. arXiv preprint arXiv:2507.12549, 2025.
- [39] Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, et al. VeOmni: Scaling any modality model training with model-centric distributed recipe zoo. arXiv preprint arXiv:2508.02317, 2025.
- [40] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730, 2022.
- [41] Steven Pinker. The Language Instinct: How the Mind Creates Language. Penguin UK, 2003.
- [42] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5:606–624, 2023.
- [43] Maurice Bertram Priestley. Non-linear and Non-stationary Time Series Analysis. London: Academic Press, 1988.
- [44] Guo Qin, Zhi Chen, Yong Liu, Zhiyuan Shi, Haixuan Liu, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. CoRA: Covariate-aware adaptation of time series foundation models. arXiv preprint arXiv:2510.12681, 2025.
- [45] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.
- [46] Jingzhe Shi, Qinwei Ma, Huan Ma, and Lei Li. Scaling law for time series forecasting. arXiv preprint arXiv:2405.15124, 2024.
- [47] Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, and Ming Jin. Time-MoE: Billion-scale time series foundation models with mixture of experts. arXiv preprint arXiv:2409.16040, 2024.
- [48] Ingo Steinwart and Andreas Christmann. Estimating conditional quantiles with the help of the pinball loss. 2011.
- [49] Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models. Advances in Neural Information Processing Systems, 31, 2018.
- [50] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
- [51] Shihao Tu, Yupeng Zhang, Jing Zhang, Zhendong Fu, Yin Zhang, and Yang Yang. PowerPM: Foundation model for power systems. Advances in Neural Information Processing Systems, 37:115233–115260, 2024.
- [52] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [53] Yuxuan Wang, Haixu Wu, Jiaxiang Dong, Yong Liu, Mingsheng Long, and Jianmin Wang. Deep time series models: A comprehensive survey and benchmark. arXiv preprint arXiv:2407.13278, 2024.
- [54] Qingsong Wen, Tian Zhou, Chaoli Zhang, Weiqi Chen, Ziqing Ma, Junchi Yan, and Liang Sun. Transformers in time series: A survey. arXiv preprint arXiv:2202.07125, 2022.
- [55] Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. Unified training of universal time series forecasting transformers. arXiv preprint arXiv:2402.02592, 2024.
- [56] Haixu Wu, Hang Zhou, Mingsheng Long, and Jianmin Wang. Interpretable weather forecasting for worldwide stations with a unified deep model. Nature Machine Intelligence, 5(6):602–611, 2023.
- [57] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pages 10524–10533. PMLR, 2020.
- [58] Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.
- [59] Xi Nicole Zhang, Yuan Pu, Yuki Kawamura, Andrew Loza, Yoshua Bengio, Dennis Shung, and Alexander Tong. Trajectory flow matching with applications to clinical time series modelling. Advances in Neural Information Processing Systems, 37:107198–107224, 2024.
- [60] Haokun Zhao, Xiang Zhang, Jiaqi Wei, Yiwei Xu, Yuting He, Siqi Sun, and Chenyu You. TimeSeriesScientist: A general-purpose AI agent for time series analysis. arXiv preprint arXiv:2510.01538, 2025.
- [61] Zhuohang Zhu, Haodong Chen, Qiang Qu, and Vera Chung. FinCast: A foundation model for financial time-series forecasting. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pages 4539–4549, 2025.