pith. sign in

arxiv: 2606.01999 · v1 · pith:NFKF6IB4new · submitted 2026-06-01 · 💻 cs.LG · cs.AI

Why Do Time Series Models Need Long Context Windows?

Pith reviewed 2026-06-28 15:49 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords time series forecastinggenerative process identificationcontext window sizeconditional forecastingprocess identificationmemory lengthlong context windowsforecasting architectures
0
0 comments X

The pith

Time series forecasting models require input windows longer than the processes' memory length to reach minimal error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that forecasting groups of time series has two parts: identifying which generative process produced the observed sequence and then making predictions conditional on the observations. Optimal forecasts average over all processes that could have generated the data, with weights given by how likely each process is based on the window. Longer windows matter because they make it easier to tell which process is the true one, lowering the uncertainty in the average. The authors prove that when all processes have memory length P, windows of size P or smaller cannot achieve the lowest possible error. They also show that separating the identification task from forecasting can make models faster without hurting accuracy.

Core claim

Forecasting groups of time series involves generative process identification, inferring the specific process generating the input sequence, and conditional forecasting. Optimal predictions are an average over plausible data-generating processes weighted by their likelihood given the input window. Even for processes with memory length P, an input window size strictly larger than P is necessary to achieve the minimum attainable error.

What carries the argument

The decomposition of the forecasting task into generative process identification (GPI) and conditional forecasting (CF), where GPI reduces uncertainty about the active process.

If this is right

  • Decoupling GPI from CF improves computational scalability without compromising accuracy.
  • Window sizes must be strictly larger than the memory length P to minimize error.
  • The benefit of long context holds for collections of finite-memory processes.
  • Experiments on synthetic and real-world data support the need for longer windows to resolve process identity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models could explicitly estimate the posterior over processes to achieve the optimal average forecast.
  • The requirement for extra context beyond memory length may generalize to other sequence modeling tasks with multiple possible sources.
  • Designing architectures that first classify the process type before forecasting could be more efficient.

Load-bearing premise

That the best prediction is the average over possible processes weighted by how well they explain the input window, and that longer windows reduce uncertainty about the active process.

What would settle it

An experiment on a mixture of processes each with memory P where the prediction error with window length P equals the error with windows longer than P.

Figures

Figures reproduced from arXiv: 2606.01999 by Andrea Cini, Cesare Alippi, Giovanni De Felice, Luca Butera.

Figure 1
Figure 1. Figure 1: Partial autocorrelation at different time lags in two [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Pre-trained foundation models MAE, H=24. For each pair of subplots, the left histogram (orange) shows the error for increasing window length W, without context, while the right one (blue) shows the error with the smallest W and increasing context length C. Evidence in foundation models Under the model in Eq. 3, GPI and CF are inherent to the problem of learning global models for groups of time series, and … view at source ↗
Figure 3
Figure 3. Figure 3: 1-step ahead forecasting error (MSE, 3 runs, [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: 1-step ahead forecasting (MSE, inductive, 3 runs ±std) for different combinations of NAR domains. Columns show error on a specific domain, color identifies the training domains. context. Notably, adding the context enables near-minimum error already at W =P, supporting the idea that, even if here additional observations beyond the last P time lags are not relevant for their temporal relationship with x n t… view at source ↗
Figure 5
Figure 5. Figure 5: 24-step ahead forecasting error (MAE, inductive, 3 runs ±std) against window length W. Foundation models (Foundation, dashed) against domain-specific global models (Specialized, solid). results in the foundation models requiring a much larger input window to match the performance of the domain-specific models. Note that the foundation models’ pretraining corpora contain time series from the considered doma… view at source ↗
Figure 6
Figure 6. Figure 6: 24-step ahead forecasting error (MAE, inductive, 3 runs ±std). Color/marker denotes the model. Line-style denotes the approach: decoupling (dashed) or standard (solid). (LRU) [36] (recurrent); details in App. F.1.2. The embedding module (Eq. 8a) is always implemented by PatchTST for consistency. Depending on the implementation of the base model, e n is integrated via summation or concatenation (see App. F.… view at source ↗
Figure 7
Figure 7. Figure 7: Pareto frontiers for the decoupled (blue) and standard [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Parity plots for Example C.1. Data generation hyper-parameters: [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Performance of a linear model on data from multiple AR(2) processes with [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Performance of a linear model on data from multiple AR(2) processes with [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: 24-step ahead forecasting error (MAE, 3 runs [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Example trajectories produced by foundation models (color-coded), on real-world datasets, when given as [PITH_FULL_IMAGE:figures/full_fig_p023_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: 1-step ahead forecasting error (MSE, 3 runs [PITH_FULL_IMAGE:figures/full_fig_p024_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: 1-step ahead forecasting error (MSE, 3 runs [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: 1-step ahead forecasting error (MSE, 3 runs [PITH_FULL_IMAGE:figures/full_fig_p025_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: 1-step ahead forecasting (MSE, transductive, 3 runs ±std) for different combinations of NAR domains. Columns show error on a specific domain, color identifies the training domains. Similarly, to complement [PITH_FULL_IMAGE:figures/full_fig_p025_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: 24-step ahead forecasting error (MAE, transductive, 3 runs ±std). Color/marker denotes the model. Line-style denotes the approach: decoupling (dashed) or standard (solid) [PITH_FULL_IMAGE:figures/full_fig_p026_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: 24-step ahead forecasting error (MAE, inductive, 3 runs ±std). Columns correspond to different models while rows represent different datasets. Line-color denotes decoupling (orange) or not (blue). 27 [PITH_FULL_IMAGE:figures/full_fig_p027_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: 24-step ahead forecasting error (MAE, transductive, 3 runs ±std). Columns correspond to different models while rows represent different datasets. Line-color denotes decoupling (orange) or not (blue). 28 [PITH_FULL_IMAGE:figures/full_fig_p028_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Pareto frontiers, inductive MSE (y-axis) vs FLOPs (MFLOPs, x-axis), for the decoupled (blue) and standard (orange) approach. Dots correspond to model/dataset pairs from [PITH_FULL_IMAGE:figures/full_fig_p029_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Pareto frontiers, inductive MSE (y-axis) vs CPU inference time (ms, x-axis), for the decoupled (blue) and standard (orange) approach. Dots correspond to model/dataset pairs from [PITH_FULL_IMAGE:figures/full_fig_p030_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Pareto frontiers, inductive MSE (y-axis) vs GPU inference time (ms, x-axis), for the decoupled (blue) and standard (orange) approach. Dots correspond to model/dataset pairs from [PITH_FULL_IMAGE:figures/full_fig_p031_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Pareto frontiers, inductive MSE (y-axis) vs peak GPU memory occupancy (MB, x-axis), for the decoupled (blue) and standard (orange) approach. Dots correspond to model/dataset pairs from [PITH_FULL_IMAGE:figures/full_fig_p032_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: 96-step ahead forecasting error (normalized MSE, 3 runs [PITH_FULL_IMAGE:figures/full_fig_p032_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Parity plots for Example C.1. Data generation hyper-parameters: [PITH_FULL_IMAGE:figures/full_fig_p033_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: Parity plots for Example C.1. Data generation hyper-parameters: [PITH_FULL_IMAGE:figures/full_fig_p033_26.png] view at source ↗
Figure 27
Figure 27. Figure 27: Parity plots for Example C.1. Data generation hyper-parameters: [PITH_FULL_IMAGE:figures/full_fig_p034_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: Parity plots for Example C.1. Data generation hyper-parameters: [PITH_FULL_IMAGE:figures/full_fig_p034_28.png] view at source ↗
read the original abstract

Modern deep learning models for forecasting groups of time series rely on increasingly longer observation windows. However, the benefit of increasing the window size is often simply attributed to capturing long-range dependencies, and broader discussion on how global forecasting models leverage input observations has been limited. In this paper, we show that forecasting groups of time series involves two objectives: (i) generative process identification (GPI), i.e., inferring the specific process generating the input sequence, and (ii) conditional forecasting (CF), i.e., predicting future values given input observations. From this perspective, optimal predictions can be interpreted as an average over plausible data-generating processes, weighted by their likelihood given the input window. This suggests another explanation for the benefits of long context windows: they reduce the uncertainty about which specific process is generating the input time series during operation. We prove that even for processes with memory length $P$, an input window size strictly larger than $P$ is necessary to achieve the minimum attainable error. Finally, we show how decoupling GPI and CF can improve computational scalability without compromising accuracy. Experiments on synthetic and real-world data validate our insights and their relevance for designing forecasting architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that forecasting groups of time series involves two objectives: generative process identification (GPI), i.e., inferring the specific process generating the input sequence, and conditional forecasting (CF). Optimal predictions are interpreted as averages over plausible data-generating processes weighted by their likelihood given the input window. It proves that even for processes with memory length P, an input window size strictly larger than P is necessary to achieve the minimum attainable error. It further shows that decoupling GPI and CF can improve computational scalability without compromising accuracy, with validation on synthetic and real-world data.

Significance. If the central result holds, the work supplies a theoretical explanation for the empirical value of long context windows in global time-series models that is distinct from long-range dependence capture, namely reduction of uncertainty over process identity. The attempted proof, the decoupling proposal, and the dual synthetic/real-data validation are explicit strengths that would be credited if the derivation is complete and the experiments are reproducible.

major comments (2)
  1. [Theoretical results section (the proof referenced in the abstract)] The proof that a window strictly larger than P is required for minimum attainable error is the load-bearing claim. The manuscript must supply the full derivation (including the precise definition of minimum attainable error, the measure over processes, and the conditions under which the posterior-average interpretation equals the Bayes-optimal predictor) so that the result can be verified rather than asserted at the level of the abstract.
  2. [Introduction and theoretical framework] The weakest modeling assumption—that optimal forecasts are exactly the likelihood-weighted mixture over plausible processes—underpins both the necessity proof and the GPI/CF distinction. The paper should state whether this holds only for squared-error loss, only for finite process classes, or more generally, and should indicate what happens when the model class is misspecified.
minor comments (2)
  1. [Abstract] The abstract states the main claims clearly but does not name the concrete process class (e.g., order-P Markov chains) used in the necessity proof; adding this would help readers assess scope.
  2. [Experiments] Experimental sections should report the exact procedure used to generate the synthetic data that isolates the GPI effect and the quantitative improvement obtained by the proposed decoupling.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The major comments correctly identify areas where the theoretical presentation requires expansion for verifiability and precision. We address each point below and will revise the manuscript to incorporate the requested clarifications and full derivations.

read point-by-point responses
  1. Referee: [Theoretical results section (the proof referenced in the abstract)] The proof that a window strictly larger than P is required for minimum attainable error is the load-bearing claim. The manuscript must supply the full derivation (including the precise definition of minimum attainable error, the measure over processes, and the conditions under which the posterior-average interpretation equals the Bayes-optimal predictor) so that the result can be verified rather than asserted at the level of the abstract.

    Authors: We agree that the full derivation must be supplied for independent verification. The current manuscript presents the key steps at a high level in the theoretical results section but omits some intermediate lemmas and explicit definitions. In the revised version we will expand this section to include: (i) the definition of minimum attainable error as the Bayes risk (expected squared error) under the posterior over the finite process class; (ii) the prior measure as a probability mass function on the process class; and (iii) the proof that the posterior-weighted average equals the Bayes-optimal predictor under squared loss. We will also add the complete argument showing why any window of length exactly P leaves positive posterior mass on multiple processes, preventing attainment of the minimum error. revision: yes

  2. Referee: [Introduction and theoretical framework] The weakest modeling assumption—that optimal forecasts are exactly the likelihood-weighted mixture over plausible processes—underpins both the necessity proof and the GPI/CF distinction. The paper should state whether this holds only for squared-error loss, only for finite process classes, or more generally, and should indicate what happens when the model class is misspecified.

    Authors: We appreciate this observation. The equivalence between the posterior-weighted predictor and the Bayes-optimal forecast holds specifically for squared-error loss and finite process classes equipped with a known prior; we will state these scope conditions explicitly in the revised introduction and theoretical framework. Under model misspecification (true process outside the class), GPI selects the maximum-likelihood process within the class and CF proceeds conditionally on that choice, but optimality is no longer guaranteed. We will add a short discussion of this limitation, noting that the necessity result for window length > P continues to apply in an approximate sense when misspecification is not severe. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper advances a theoretical argument and proof that window size must exceed memory length P to minimize error, grounded in the distinction between generative process identification and conditional forecasting. This follows from standard properties of Markov processes (a single sequence of length P can arise from multiple processes) and the resulting mixture predictor, without reducing to any fitted parameter, self-defined quantity, or load-bearing self-citation chain. The provided abstract and reader analysis confirm the derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that time series are generated by processes with finite memory length P that can be distinguished by longer observations, plus the interpretive step that optimal forecasts are posterior-weighted averages over those processes. No free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption Time series data are generated by one of a family of processes possessing finite memory length P that remain distinguishable given sufficiently long observations.
    Invoked directly in the proof that window size must exceed P.

pith-pipeline@v0.9.1-grok · 5735 in / 1315 out tokens · 33370 ms · 2026-06-28T15:49:49.158374+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

70 extracted references · 4 canonical work pages

  1. [1]

    Autoformer: Decomposition transformers with auto- correlation for long-term series forecasting.Advances in neural information processing systems, 34:22419–22430, 2021

    Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto- correlation for long-term series forecasting.Advances in neural information processing systems, 34:22419–22430, 2021

  2. [2]

    Deep learning for time series forecasting: Tutorial and literature survey.ACM Computing Surveys, 55(6):1–36, 2022

    Konstantinos Benidis, Syama Sundar Rangapuram, Valentin Flunkert, Yuyang Wang, Danielle Maddix, Caner Turkmen, Jan Gasthaus, Michael Bohlke-Schneider, David Salinas, Lorenzo Stella, et al. Deep learning for time series forecasting: Tutorial and literature survey.ACM Computing Surveys, 55(6):1–36, 2022

  3. [3]

    Deep time series models: A comprehensive survey and benchmark.arXiv preprint arXiv:2407.13278, 2024

    Yuxuan Wang, Haixu Wu, Jiaxiang Dong, Yong Liu, Mingsheng Long, and Jianmin Wang. Deep time series models: A comprehensive survey and benchmark.arXiv preprint arXiv:2407.13278, 2024

  4. [4]

    Some recent advances in forecasting and control.Journal of the Royal Statistical Society

    George EP Box and Gwilym M Jenkins. Some recent advances in forecasting and control.Journal of the Royal Statistical Society. Series C (Applied Statistics), 17(2):91–109, 1968

  5. [5]

    De Gooijer and Rob J

    Jan G. De Gooijer and Rob J. Hyndman. 25 years of time series forecasting.International Journal of Forecasting, 22(3):443–473, 2006. ISSN 0169-2070. doi:https://doi.org/10.1016/j.ijforecast.2006.01.001. URL https:// www.sciencedirect.com/science/article/pii/S0169207006000021. Twenty five years of forecasting

  6. [6]

    Recurrent neural networks for time series forecasting: Current status and future directions.International Journal of Forecasting, 37(1):388–427, 2021

    Hansika Hewamalage, Christoph Bergmeir, and Kasun Bandara. Recurrent neural networks for time series forecasting: Current status and future directions.International Journal of Forecasting, 37(1):388–427, 2021

  7. [7]

    Deepar: Probabilistic forecasting with autoregressive recurrent networks.International journal of forecasting, 36(3):1181–1191, 2020

    David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. Deepar: Probabilistic forecasting with autoregressive recurrent networks.International journal of forecasting, 36(3):1181–1191, 2020

  8. [8]

    International Journal of Forecasting 36, 54–74

    Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. The m4 competition: 100,000 time series and 61 forecasting methods.International Journal of Forecasting, 36(1):54–74, 2020. ISSN 0169-2070. doi:https://doi.org/10.1016/j.ijforecast.2019.04.014. URL https://www.sciencedirect.com/ science/article/pii/S0169207019301128. M4 Competition

  9. [9]

    Principles and algorithms for forecasting groups of time series: Locality and globality.International Journal of Forecasting, 37(4):1632–1653, 2021

    Pablo Montero-Manso and Rob J Hyndman. Principles and algorithms for forecasting groups of time series: Locality and globality.International Journal of Forecasting, 37(4):1632–1653, 2021. 9 Why Do Time Series Models Need Long Context Windows?PREPRINT

  10. [10]

    Informer: Beyond efficient transformer for long sequence time-series forecasting

    Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. InThe Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference, volume 35, pages 11106–11115. AAAI Press, 2021

  11. [11]

    A time series is worth 64 words: Long-term forecasting with transformers

    Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. InThe Eleventh International Conference on Learning Representations, 2022

  12. [12]

    FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting

    Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. InProc. 39th International Conference on Machine Learning (ICML 2022), 2022

  13. [13]

    Liu, and Schahram Dustdar

    Shizhan Liu, Hang Yu, Cong Liao, Jianguo Li, Weiyao Lin, Alex X. Liu, and Schahram Dustdar. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. InInternational Conference on Learning Representations, 2022. URLhttps://openreview.net/forum?id=0EXmFzUn5I

  14. [14]

    A decoder-only foundation model for time-series forecasting

    Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. A decoder-only foundation model for time-series forecasting. InForty-first International Conference on Machine Learning, 2024

  15. [15]

    Unified training of universal time series forecasting transformers

    Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. Unified training of universal time series forecasting transformers. InInternational Conference on Machine Learning, pages 53140–53164. PMLR, 2024

  16. [16]

    Moment: A family of open time-series foundation models.arXiv preprint arXiv:2402.03885, 2024

    Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. Moment: A family of open time-series foundation models.arXiv preprint arXiv:2402.03885, 2024

  17. [17]

    Foundation models for time series analysis: A tutorial and survey

    Yuxuan Liang, Haomin Wen, Yuqi Nie, Yushan Jiang, Ming Jin, Dongjin Song, Shirui Pan, and Qingsong Wen. Foundation models for time series analysis: A tutorial and survey. InProceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pages 6555–6565, 2024

  18. [18]

    An explanation of in-context learning as implicit bayesian inference

    Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit bayesian inference. InInternational Conference on Learning Representations, 2022. URL https: //openreview.net/forum?id=RdJVFCHjUMI

  19. [19]

    What learning algorithm is in-context learning? investigations with linear models

    Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models. InThe Eleventh International Conference on Learning Representations, 2023. URLhttps://openreview.net/forum?id=0g0X4H8yN4I

  20. [20]

    Transformers as algorithms: Generalization and stability in in-context learning

    Yingcong Li, Muhammed Emrullah Ildiz, Dimitris Papailiopoulos, and Samet Oymak. Transformers as algorithms: Generalization and stability in in-context learning. InInternational conference on machine learning, pages 19565–19594. PMLR, 2023

  21. [21]

    What and how does in-context learning learn? bayesian model averaging, parameterization, and generalization

    Yufeng Zhang, Fengzhuo Zhang, Zhuoran Yang, and Zhaoran Wang. What and how does in-context learning learn? bayesian model averaging, parameterization, and generalization. In Yingzhen Li, Stephan Mandt, Shipra Agrawal, and Emtiyaz Khan, editors,Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, volume 258 ofProceedi...

  22. [22]

    URLhttps://proceedings.mlr.press/v258/zhang25d.html

  23. [23]

    A survey on in-context learning

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. A survey on in-context learning. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1107–1128, Miami, Fl...

  24. [24]

    Chronos-2: From univariate to universal forecasting.arXiv preprint arXiv:2510.15821, 2025

    Abdul Fatir Ansari, Oleksandr Shchur, Jaris Küken, Andreas Auer, Boran Han, Pedro Mercado, Syama Sundar Rangapuram, Huibin Shen, Lorenzo Stella, Xiyuan Zhang, et al. Chronos-2: From univariate to universal forecasting.arXiv preprint arXiv:2510.15821, 2025

  25. [25]

    Making and evaluating point forecasts.Journal of the American Statistical Association, 106 (494):746–762, 2011

    Tilmann Gneiting. Making and evaluating point forecasts.Journal of the American Statistical Association, 106 (494):746–762, 2011

  26. [26]

    Chapman and Hall/CRC, 1995

    Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin.Bayesian data analysis. Chapman and Hall/CRC, 1995

  27. [27]

    Chan, Biao Zhang, Aleksandra Faust, and Hugo Larochelle

    Rishabh Agarwal, Avi Singh, Lei M Zhang, Bernd Bohnet, Luis Rosias, Stephanie C.Y . Chan, Biao Zhang, Aleksandra Faust, and Hugo Larochelle. Many-shot in-context learning. InICML 2024 Workshop on In-Context Learning, 2024. URLhttps://openreview.net/forum?id=goi7DFHlqS

  28. [28]

    Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Nee- lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020. 10 Why Do Time Series Models Need Long Context Windows?PREPRINT

  29. [29]

    Moirai 2.0: When less is more for time series forecasting.arXiv preprint arXiv:2511.11698, 2025

    Chenghao Liu, Taha Aksu, Juncheng Liu, Xu Liu, Hanshu Yan, Quang Pham, Doyen Sahoo, Caiming Xiong, Silvio Savarese, and Junnan Li. Moirai 2.0: When less is more for time series forecasting.arXiv preprint arXiv:2511.11698, 2025

  30. [30]

    Timer-xl: Long-context transformers for unified time series forecasting

    Yong Liu, Guo Qin, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. Timer-xl: Long-context transformers for unified time series forecasting. InThe Thirteenth International Conference on Learning Representations, 2025

  31. [31]

    Largest: A benchmark dataset for large-scale traffic forecasting.Advances in Neural Information Processing Systems, 36:75354–75371, 2023

    Xu Liu, Yutong Xia, Yuxuan Liang, Junfeng Hu, Yiwei Wang, Lei Bai, Chao Huang, Zhenguang Liu, Bryan Hooi, and Roger Zimmermann. Largest: A benchmark dataset for large-scale traffic forecasting.Advances in Neural Information Processing Systems, 36:75354–75371, 2023

  32. [32]

    Commission for Energy Regulation. CER. CER Smart Metering Project - Electricity Customer Behaviour Trial, 2009-2010 [dataset].Irish Social Science Data Archive. SN: 0012-00, 2016. URL https://www.ucd.ie/ issda/data/commissionforenergyregulationcer/

  33. [33]

    On the Properties of Neural Machine Translation: Encoder-Decoder Approaches , booktitle =

    Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neu- ral machine translation: Encoder–decoder approaches. In Dekai Wu, Marine Carpuat, Xavier Carreras, and Eva Maria Vecchi, editors,Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar...

  34. [34]

    Are transformers effective for time series forecasting? In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 11121–11128, 2023

    Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 11121–11128, 2023

  35. [35]

    Tsmixer: An all-mlp architecture for time series forecast-ing.Transactions on Machine Learning Research, 2023

    Si-An Chen, Chun-Liang Li, Sercan O Arik, Nathanael Christian Yoder, and Tomas Pfister. Tsmixer: An all-mlp architecture for time series forecast-ing.Transactions on Machine Learning Research, 2023

  36. [36]

    Moderntcn: A modern pure convolution structure for general time series analysis

    Donghao Luo and Xue Wang. Moderntcn: A modern pure convolution structure for general time series analysis. InThe twelfth international conference on learning representations, pages 1–43, 2024

  37. [37]

    Resurrecting recurrent neural networks for long sequences

    Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. InInternational Conference on Machine Learning, pages 26670–26698. PMLR, 2023

  38. [38]

    Foundation models for time series: A survey.arXiv preprint arXiv:2504.04011, 2025

    Siva Rama Krishna Kottapalli, Karthik Hubli, Sandeep Chandrashekhara, Garima Jain, Sunayana Hubli, Gayathri Botla, and Ramesh Doddaiah. Foundation models for time series: A survey.arXiv preprint arXiv:2504.04011, 2025

  39. [39]

    In-context fine-tuning for time-series foundation models

    Matthew Faw, Rajat Sen, Yichen Zhou, and Abhimanyu Das. In-context fine-tuning for time-series foundation models. InForty-second International Conference on Machine Learning, 2025

  40. [40]

    Zero-shot time series forecasting with covariates via in-context learning.arXiv preprint arXiv:2506.03128, 2025

    Andreas Auer, Raghul Parthipan, Pedro Mercado, Abdul Fatir Ansari, Lorenzo Stella, Bernie Wang, Michael Bohlke-Schneider, and Syama Sundar Rangapuram. Zero-shot time series forecasting with covariates via in-context learning.arXiv preprint arXiv:2506.03128, 2025

  41. [41]

    Context is key: A benchmark for forecasting with essential textual information

    Andrew Robert Williams, Arjun Ashok, Étienne Marcotte, Valentina Zantedeschi, Jithendaraa Subramanian, Roland Riachi, James Requeima, Alexandre Lacoste, Irina Rish, Nicolas Chapados, and Alexandre Drouin. Context is key: A benchmark for forecasting with essential textual information. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Be...

  42. [42]

    In-context time series predictor

    Jiecheng Lu, Yan Sun, and Shihao Yang. In-context time series predictor. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=dCcY2pyNIO

  43. [43]

    A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting

    Slawek Smyl. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International journal of forecasting, 36(1):75–85, 2020

  44. [44]

    Taming local effects in graph-based spatiotempo- ral forecasting.Advances in Neural Information Processing Systems, 36:55375–55393, 2023

    Andrea Cini, Ivan Marisca, Daniele Zambon, and Cesare Alippi. Taming local effects in graph-based spatiotempo- ral forecasting.Advances in Neural Information Processing Systems, 36:55375–55393, 2023

  45. [45]

    On the regularization of learnable embeddings for time series forecasting.Transactions on Machine Learning Research, 2025

    Luca Butera, Giovanni De Felice, Andrea Cini, and Cesare Alippi. On the regularization of learnable embeddings for time series forecasting.Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=F5ALCh3GWG

  46. [46]

    Revisiting long-term time series forecasting: An investigation on linear mapping.arXiv preprint arXiv:2305.10721, 2023

    Zhe Li, Shiyi Qi, Yiduo Li, and Zenglin Xu. Revisiting long-term time series forecasting: An investigation on linear mapping.arXiv preprint arXiv:2305.10721, 2023. 11 Why Do Time Series Models Need Long Context Windows?PREPRINT

  47. [47]

    Graph-based virtual sensing from sparse and partial multivariate observations

    Giovanni De Felice, Andrea Cini, Daniele Zambon, Vladimir V Gusev, and Cesare Alippi. Graph-based virtual sensing from sparse and partial multivariate observations. InInternational Conference on Learning Representations, 2024

  48. [48]

    Gift-eval: A benchmark for general time series forecasting model evaluation.arXiv preprint arXiv:2410.10393, 2024

    Taha Aksu, Gerald Woo, Juncheng Liu, Xu Liu, Chenghao Liu, Silvio Savarese, Caiming Xiong, and Doyen Sahoo. Gift-eval: A benchmark for general time series forecasting model evaluation.arXiv preprint arXiv:2410.10393, 2024

  49. [49]

    Timer: generative pre-trained transformers are large time series models

    Yong Liu, Haoran Zhang, Chenyu Li, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. Timer: generative pre-trained transformers are large time series models. InProceedings of the 41st International Conference on Machine Learning, pages 32369–32399, 2024

  50. [50]

    Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

  51. [51]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  52. [52]

    Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio

    Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. InInternational Conference on Learning Representations, 2020. URLhttps://openreview.net/forum?id=r1ecqn4YwB

  53. [53]

    One fits all: Power general time series analysis by pretrained LM

    Tian Zhou, Peisong Niu, Xue Wang, Liang Sun, and Rong Jin. One fits all: Power general time series analysis by pretrained LM. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=gMS6FVZvmF

  54. [54]

    Lag-llama: Towards foundation models for time series forecasting

    Kashif Rasul, Arjun Ashok, Andrew Robert Williams, Arian Khorasani, George Adamopoulos, Rishika Bhag- watkar, Marin Biloš, Hena Ghonia, Nadhir Hassen, Anderson Schneider, et al. Lag-llama: Towards foundation models for time series forecasting. InR0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models, 2023

  55. [55]

    Chronos: Learning the language of time series.Transactions on Machine Learning Research, 2024

    Abdul Fatir Ansari, Lorenzo Stella, Ali Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, et al. Chronos: Learning the language of time series.Transactions on Machine Learning Research, 2024

  56. [56]

    Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

  57. [57]

    Description based text classification with reinforcement learning

    Duo Chai, Wei Wu, Qinghong Han, Fei Wu, and Jiwei Li. Description based text classification with reinforcement learning. InInternational conference on machine learning, pages 1371–1382. PMLR, 2020

  58. [58]

    Finetuned language models are zero-shot learners

    Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. InInternational Conference on Learning Representations, 2022

  59. [59]

    Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

  60. [60]

    Meta-learning via language model in-context tuning

    Yanda Chen, Ruiqi Zhong, Sheng Zha, George Karypis, and He He. Meta-learning via language model in-context tuning. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 719–730, 2022

  61. [61]

    Metaicl: Learning to learn in context

    Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. Metaicl: Learning to learn in context. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2791–2809, 2022

  62. [62]

    In-context pretraining: Language modeling beyond document boundaries

    Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Xi Victoria Lin, Noah A Smith, Luke Zettlemoyer, Wen-tau Yih, and Mike Lewis. In-context pretraining: Language modeling beyond document boundaries. InThe Twelfth International Conference on Learning Representations, 2024

  63. [63]

    Fforma: Feature-based forecast model averaging.International Journal of Forecasting, 36(1):86–92, 2020

    Pablo Montero-Manso, George Athanasopoulos, Rob J Hyndman, and Thiyanga S Talagala. Fforma: Feature-based forecast model averaging.International Journal of Forecasting, 36(1):86–92, 2020

  64. [64]

    Learning to control fast-weight memories: An alternative to dynamic recurrent networks

    Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992

  65. [65]

    Hypernetworks

    David Ha, Andrew M Dai, and Quoc V Le. Hypernetworks. InInternational Conference on Learning Representa- tions, 2017. 12 Why Do Time Series Models Need Long Context Windows?PREPRINT

  66. [66]

    Meta-learning framework with applications to zero-shot time-series forecasting

    Boris N Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. Meta-learning framework with applications to zero-shot time-series forecasting. InProceedings of the AAAI conference on artificial intelligence, volume 35, pages 9242–9250, 2021

  67. [67]

    Model-agnostic meta-learning for fast adaptation of deep networks

    Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh, editors,Proceedings of the 34th International Conference on Machine Learning, volume 70 ofProceedings of Machine Learning Research, pages 1126–1135. PMLR, 06–11 Aug 2017. URLhttps://proceedings.mlr.press/v70/...

  68. [68]

    Meta-learning how to forecast time series

    Thiyanga S Talagala, Rob J Hyndman, and George Athanasopoulos. Meta-learning how to forecast time series. Journal of Forecasting, 42(6):1476–1501, 2023

  69. [69]

    identity

    Declan A Norton, Edward Ott, Andrew Pomerance, Brian Hunt, and Michelle Girvan. Tailored forecasting from short time series via meta-learning.arXiv preprint arXiv:2501.16325, 2025. A Terminology We distinguish here several related terms. • Generative process identificationrefers to inferring which latent stochastic process generated an observed sequence, ...

  70. [70]

    TheMoiraifamily [ 15, 28] further extended the line of probabilistic foundation models with encoder-only and decoder-only architectures

    introduced a unified backbone capable of addressing multiple time series tasks and efficiently adapting to new series via lightweight fine-tuning. TheMoiraifamily [ 15, 28] further extended the line of probabilistic foundation models with encoder-only and decoder-only architectures. Several decoder-based models followed [14, 48, 29], targeting point predi...