Why Do Time Series Models Need Long Context Windows?

Andrea Cini; Cesare Alippi; Giovanni De Felice; Luca Butera

arxiv: 2606.01999 · v1 · pith:NFKF6IB4new · submitted 2026-06-01 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-28 15:49 UTC · model grok-4.3

keywords time series forecasting · generative process identification · context window size · conditional forecasting · process identification · memory length · long context windows · forecasting architectures

The pith

Time series forecasting models require input windows longer than the processes' memory length to reach minimal error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that forecasting groups of time series has two parts: identifying which generative process produced the observed sequence and then making predictions conditional on the observations. Optimal forecasts average over all processes that could have generated the data, with weights given by how likely each process is based on the window. Longer windows matter because they make it easier to tell which process is the true one, lowering the uncertainty in the average. The authors prove that when all processes have memory length P, windows of size P or smaller cannot achieve the lowest possible error. They also show that separating the identification task from forecasting can make models faster without hurting accuracy.

Core claim

Forecasting groups of time series involves generative process identification, inferring the specific process generating the input sequence, and conditional forecasting. Optimal predictions are an average over plausible data-generating processes weighted by their likelihood given the input window. Even for processes with memory length P, an input window size strictly larger than P is necessary to achieve the minimum attainable error.
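
The weighted average in the core claim can be written out explicitly. The notation below is an assumption made for illustration (a finite class of K candidate processes indexed by k, a window of the last W observations, and squared-error loss); the paper's own symbols may differ.

```latex
% Sketch under assumed notation: K candidate processes, window x_{t-W+1:t}.
\hat{x}_{t+1}
  = \mathbb{E}\left[ x_{t+1} \mid x_{t-W+1:t} \right]
  = \sum_{k=1}^{K} p\left(k \mid x_{t-W+1:t}\right)\,
    \mathbb{E}\left[ x_{t+1} \mid x_{t-W+1:t},\, k \right],
\qquad
p\left(k \mid x_{t-W+1:t}\right) \propto p(k)\, p\left(x_{t-W+1:t} \mid k\right).
```

In this reading, the weights p(k | window) are the GPI part and the per-process expectations are the CF part. A longer window can sharpen the weights even when each per-process expectation depends only on the last P values.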

What carries the argument

The decomposition of the forecasting task into generative process identification (GPI) and conditional forecasting (CF), where GPI reduces uncertainty about the active process.

If this is right

Decoupling GPI from CF improves computational scalability without compromising accuracy.
Window sizes must be strictly larger than the memory length P to minimize error.
The benefit of long context holds for collections of finite-memory processes.
Experiments on synthetic and real-world data support the need for longer windows to resolve process identity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models could explicitly estimate the posterior over processes to achieve the optimal average forecast.
The requirement for extra context beyond memory length may generalize to other sequence modeling tasks with multiple possible sources.
Designing architectures that first classify the process type before forecasting could be more efficient.
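
A minimal sketch of the "identify first, then forecast" idea, assuming a toy family of AR(1) processes (memory length P = 1) with known candidate coefficients and a uniform prior. The function names `identify_process` and `conditional_forecast` are hypothetical, and this is not the architecture used in the paper; it only illustrates how the identification step can read a long context once while the forecasting step reads only the last P values.

```python
import numpy as np

def identify_process(context, coeffs, sigma=1.0):
    """GPI step (toy): posterior weights over candidate AR(1) coefficients.

    Uses only the transition likelihoods (the likelihood of the first value
    is ignored) and a uniform prior; both are simplifying assumptions.
    """
    context = np.asarray(context, dtype=float)
    coeffs = np.asarray(coeffs, dtype=float)
    # Residual of every observed transition under every candidate coefficient.
    resid = context[1:, None] - context[:-1, None] * coeffs[None, :]
    log_lik = -0.5 * np.sum(resid**2, axis=0) / sigma**2
    weights = np.exp(log_lik - log_lik.max())  # subtract max for stability
    return weights / weights.sum()

def conditional_forecast(last_value, weights, coeffs):
    """CF step (toy): one-step forecast from the last value and cached weights."""
    coeffs = np.asarray(coeffs, dtype=float)
    return float(np.dot(weights, coeffs * last_value))

if __name__ == "__main__":
    candidates = (0.9, -0.9)
    long_context = [1.0, 0.9, 0.81, 0.729]  # noise-free path with a = 0.9
    w = identify_process(long_context, candidates)   # computed once
    print(w, conditional_forecast(long_context[-1], w, candidates))
```

The weights can be cached and reused for many forecasts on the same series, so the cost of processing the long context is paid once rather than at every prediction step.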

Load-bearing premise

That the best prediction is the average over possible processes weighted by how well they explain the input window, and that longer windows reduce uncertainty about the active process.

What would settle it

An experiment on a mixture of processes each with memory P where the prediction error with window length P equals the error with windows longer than P.
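
One way such an experiment could be set up is sketched below. Everything here is an illustrative assumption rather than the paper's protocol: two stationary AR(1) processes (P = 1) with coefficients 0.9 and -0.9, unit noise variance, a uniform prior, and hypothetical helper names. The two processes share the same stationary distribution, so a window of length 1 carries no information about process identity and the posterior-weighted forecast collapses to 0.

```python
import numpy as np

def simulate_ar1(a, n_steps, rng, sigma=1.0):
    """Draw one stationary AR(1) path x[t] = a * x[t-1] + noise."""
    x = np.empty(n_steps)
    x[0] = rng.normal(0.0, sigma / np.sqrt(1.0 - a**2))  # stationary start
    for t in range(1, n_steps):
        x[t] = a * x[t - 1] + rng.normal(0.0, sigma)
    return x

def posterior_mean_forecast(window, coeffs, sigma=1.0):
    """Posterior-weighted one-step forecast under a uniform prior over coeffs."""
    window = np.asarray(window, dtype=float)
    coeffs = np.asarray(coeffs, dtype=float)
    stat_var = sigma**2 / (1.0 - coeffs**2)
    # Log-likelihood of the first value under each stationary distribution.
    log_post = -0.5 * np.log(stat_var) - 0.5 * window[0] ** 2 / stat_var
    # Log-likelihood of each observed transition (empty when len(window) == 1).
    resid = window[1:, None] - window[:-1, None] * coeffs[None, :]
    log_post = log_post - 0.5 * np.sum(resid**2, axis=0) / sigma**2
    weights = np.exp(log_post - log_post.max())
    weights /= weights.sum()
    return float(np.dot(weights, coeffs * window[-1])), weights

def mse_for_window(W, coeffs=(0.9, -0.9), n_series=2000, seed=0):
    """Monte Carlo one-step MSE of the posterior-weighted forecast with window W."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_series):
        a = rng.choice(coeffs)  # hidden process identity
        x = simulate_ar1(a, W + 1, rng)
        pred, _ = posterior_mean_forecast(x[:W], coeffs)
        errors.append((x[W] - pred) ** 2)
    return float(np.mean(errors))

if __name__ == "__main__":
    for W in (1, 2, 4, 8):
        print(W, round(mse_for_window(W), 3))
```

In this toy setting the error at W = 1 should sit near the stationary variance, 1 / (1 - 0.81), and fall toward the noise variance as W grows. Observing equal errors at W = P and W > P in a setup like this would contradict the claim.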

Figures

Figures reproduced from arXiv: 2606.01999 by Andrea Cini, Cesare Alippi, Giovanni De Felice, Luca Butera.

**Figure 2.** Pre-trained foundation models MAE, H=24. For each pair of subplots, the left histogram (orange) shows the error for increasing window length W, without context, while the right one (blue) shows the error with the smallest W and increasing context length C.

**Figure 3.** 1-step ahead forecasting error (MSE, 3 runs …

**Figure 4.** 1-step ahead forecasting (MSE, inductive, 3 runs ±std) for different combinations of NAR domains. Columns show error on a specific domain, color identifies the training domains.

**Figure 5.** 24-step ahead forecasting error (MAE, inductive, 3 runs ±std) against window length W. Foundation models (Foundation, dashed) against domain-specific global models (Specialized, solid).

**Figure 6.** 24-step ahead forecasting error (MAE, inductive, 3 runs ±std). Color/marker denotes the model. Line-style denotes the approach: decoupling (dashed) or standard (solid).

**Figure 7.** Pareto frontiers for the decoupled (blue) and standard …

**Figure 8.** Parity plots for Example C.1. Data generation hyper-parameters: …

**Figure 9.** Performance of a linear model on data from multiple AR(2) processes with …

**Figure 10.** Performance of a linear model on data from multiple AR(2) processes with …

**Figure 11.** 24-step ahead forecasting error (MAE, 3 runs …

**Figure 12.** Example trajectories produced by foundation models (color-coded), on real-world datasets, when given as …

**Figure 13.** 1-step ahead forecasting error (MSE, 3 runs …

**Figure 14.** 1-step ahead forecasting error (MSE, 3 runs …

**Figure 15.** 1-step ahead forecasting error (MSE, 3 runs …

**Figure 16.** 1-step ahead forecasting (MSE, transductive, 3 runs ±std) for different combinations of NAR domains. Columns show error on a specific domain, color identifies the training domains.

**Figure 17.** 24-step ahead forecasting error (MAE, transductive, 3 runs ±std). Color/marker denotes the model. Line-style denotes the approach: decoupling (dashed) or standard (solid).

**Figure 18.** 24-step ahead forecasting error (MAE, inductive, 3 runs ±std). Columns correspond to different models while rows represent different datasets. Line-color denotes decoupling (orange) or not (blue).

**Figure 19.** 24-step ahead forecasting error (MAE, transductive, 3 runs ±std). Columns correspond to different models while rows represent different datasets. Line-color denotes decoupling (orange) or not (blue).

**Figure 20.** Pareto frontiers, inductive MSE (y-axis) vs FLOPs (MFLOPs, x-axis), for the decoupled (blue) and standard (orange) approach. Dots correspond to model/dataset pairs from …

**Figure 21.** Pareto frontiers, inductive MSE (y-axis) vs CPU inference time (ms, x-axis), for the decoupled (blue) and standard (orange) approach. Dots correspond to model/dataset pairs from …

**Figure 22.** Pareto frontiers, inductive MSE (y-axis) vs GPU inference time (ms, x-axis), for the decoupled (blue) and standard (orange) approach. Dots correspond to model/dataset pairs from …

**Figure 23.** Pareto frontiers, inductive MSE (y-axis) vs peak GPU memory occupancy (MB, x-axis), for the decoupled (blue) and standard (orange) approach. Dots correspond to model/dataset pairs from …

**Figure 24.** 96-step ahead forecasting error (normalized MSE, 3 runs …

**Figure 25.** Parity plots for Example C.1. Data generation hyper-parameters: …

**Figure 26.** Parity plots for Example C.1. Data generation hyper-parameters: …

**Figure 27.** Parity plots for Example C.1. Data generation hyper-parameters: …

**Figure 28.** Parity plots for Example C.1. Data generation hyper-parameters: …

Original abstract

Modern deep learning models for forecasting groups of time series rely on increasingly longer observation windows. However, the benefit of increasing the window size is often simply attributed to capturing long-range dependencies, and broader discussion on how global forecasting models leverage input observations has been limited. In this paper, we show that forecasting groups of time series involves two objectives: (i) generative process identification (GPI), i.e., inferring the specific process generating the input sequence, and (ii) conditional forecasting (CF), i.e., predicting future values given input observations. From this perspective, optimal predictions can be interpreted as an average over plausible data-generating processes, weighted by their likelihood given the input window. This suggests another explanation for the benefits of long context windows: they reduce the uncertainty about which specific process is generating the input time series during operation. We prove that even for processes with memory length $P$, an input window size strictly larger than $P$ is necessary to achieve the minimum attainable error. Finally, we show how decoupling GPI and CF can improve computational scalability without compromising accuracy. Experiments on synthetic and real-world data validate our insights and their relevance for designing forecasting architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proves that even order-P processes need window size > P for minimal error because a length-P sequence leaves residual uncertainty over which process generated it.

The main new piece is the proof that window length must strictly exceed memory length P to reach the lowest attainable error. A single P-length sequence can be generated by multiple order-P Markov processes with positive probability, so the optimal predictor has to average over them and incurs extra error. The GPI/CF split (generative process identification plus conditional forecasting) is a clean way to frame why longer contexts help: they reduce uncertainty about which process is active.

The decoupling suggestion for scalability follows naturally from the split and looks practical. Experiments on synthetic and real data are reported, which grounds the claim beyond pure theory.

The argument is internally consistent. The stress-test note is right that the reasoning does not contain an obvious leap. The soft spot is the modeling choice that optimal forecasts are posterior averages over plausible processes; if real data or deployed models deviate from that, the result may not carry over as cleanly. The abstract does not show the full derivation, so a referee would need to verify the assumptions and any boundary cases.

This is for people designing or analyzing time series forecasting models who want a theoretical handle on context length. A reader who cares about why long windows matter beyond just "capturing dependencies" will find the framing useful. The work is coherent on its own terms and deserves peer review.

Referee Report

2 major / 2 minor

Summary. The paper claims that forecasting groups of time series involves two objectives: generative process identification (GPI), i.e., inferring the specific process generating the input sequence, and conditional forecasting (CF). Optimal predictions are interpreted as averages over plausible data-generating processes weighted by their likelihood given the input window. It proves that even for processes with memory length P, an input window size strictly larger than P is necessary to achieve the minimum attainable error. It further shows that decoupling GPI and CF can improve computational scalability without compromising accuracy, with validation on synthetic and real-world data.

Significance. If the central result holds, the work supplies a theoretical explanation for the empirical value of long context windows in global time-series models that is distinct from long-range dependence capture, namely reduction of uncertainty over process identity. The attempted proof, the decoupling proposal, and the dual synthetic/real-data validation are explicit strengths that would be credited if the derivation is complete and the experiments are reproducible.

major comments (2)

[Theoretical results section (the proof referenced in the abstract)] The proof that a window strictly larger than P is required for minimum attainable error is the load-bearing claim. The manuscript must supply the full derivation (including the precise definition of minimum attainable error, the measure over processes, and the conditions under which the posterior-average interpretation equals the Bayes-optimal predictor) so that the result can be verified rather than asserted at the level of the abstract.
[Introduction and theoretical framework] The weakest modeling assumption—that optimal forecasts are exactly the likelihood-weighted mixture over plausible processes—underpins both the necessity proof and the GPI/CF distinction. The paper should state whether this holds only for squared-error loss, only for finite process classes, or more generally, and should indicate what happens when the model class is misspecified.

minor comments (2)

[Abstract] The abstract states the main claims clearly but does not name the concrete process class (e.g., order-P Markov chains) used in the necessity proof; adding this would help readers assess scope.
[Experiments] Experimental sections should report the exact procedure used to generate the synthetic data that isolates the GPI effect and the quantitative improvement obtained by the proposed decoupling.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The major comments correctly identify areas where the theoretical presentation requires expansion for verifiability and precision. We address each point below and will revise the manuscript to incorporate the requested clarifications and full derivations.

Referee: [Theoretical results section (the proof referenced in the abstract)] The proof that a window strictly larger than P is required for minimum attainable error is the load-bearing claim. The manuscript must supply the full derivation (including the precise definition of minimum attainable error, the measure over processes, and the conditions under which the posterior-average interpretation equals the Bayes-optimal predictor) so that the result can be verified rather than asserted at the level of the abstract.

Authors: We agree that the full derivation must be supplied for independent verification. The current manuscript presents the key steps at a high level in the theoretical results section but omits some intermediate lemmas and explicit definitions. In the revised version we will expand this section to include: (i) the definition of minimum attainable error as the Bayes risk (expected squared error) under the posterior over the finite process class; (ii) the prior measure as a probability mass function on the process class; and (iii) the proof that the posterior-weighted average equals the Bayes-optimal predictor under squared loss. We will also add the complete argument showing why any window of length exactly P leaves positive posterior mass on multiple processes, preventing attainment of the minimum error. revision: yes
Referee: [Introduction and theoretical framework] The weakest modeling assumption—that optimal forecasts are exactly the likelihood-weighted mixture over plausible processes—underpins both the necessity proof and the GPI/CF distinction. The paper should state whether this holds only for squared-error loss, only for finite process classes, or more generally, and should indicate what happens when the model class is misspecified.

Authors: We appreciate this observation. The equivalence between the posterior-weighted predictor and the Bayes-optimal forecast holds specifically for squared-error loss and finite process classes equipped with a known prior; we will state these scope conditions explicitly in the revised introduction and theoretical framework. Under model misspecification (true process outside the class), GPI selects the maximum-likelihood process within the class and CF proceeds conditionally on that choice, but optimality is no longer guaranteed. We will add a short discussion of this limitation, noting that the necessity result for window length > P continues to apply in an approximate sense when misspecification is not severe. revision: yes
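
The squared-loss scope condition discussed in these responses rests on a standard identity, sketched here under assumed notation (w is the observed window, k indexes a finite process class, and \mu_k(w) is the conditional mean under process k). It is a textbook argument offered as a reading aid, not the paper's own derivation.

```latex
% For any forecast f computed from the window w:
\mathbb{E}\left[ (x_{t+1} - f)^2 \mid w \right]
  = \operatorname{Var}\left( x_{t+1} \mid w \right)
  + \left( \mathbb{E}[x_{t+1} \mid w] - f \right)^2
\;\Longrightarrow\;
f^\star = \mathbb{E}[x_{t+1} \mid w] = \sum_{k} p(k \mid w)\, \mu_k(w).

% Law of total variance, splitting the attainable error at f^\star:
\operatorname{Var}\left( x_{t+1} \mid w \right)
  = \underbrace{\sum_{k} p(k \mid w)\, \operatorname{Var}\left( x_{t+1} \mid w, k \right)}_{\text{forecasting noise}}
  + \underbrace{\sum_{k} p(k \mid w) \left( \mu_k(w) - f^\star \right)^2}_{\text{identification uncertainty}}.
```

The second term is strictly positive whenever the posterior places mass on at least two processes with different conditional means, which is the situation the responses describe for a window of length exactly P.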

Circularity Check

0 steps flagged

No significant circularity detected

The paper advances a theoretical argument and proof that window size must exceed memory length P to minimize error, grounded in the distinction between generative process identification and conditional forecasting. This follows from standard properties of Markov processes (a single sequence of length P can arise from multiple processes) and the resulting mixture predictor, without reducing to any fitted parameter, self-defined quantity, or load-bearing self-citation chain. The provided abstract and reader analysis confirm the derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that time series are generated by processes with finite memory length P that can be distinguished by longer observations, plus the interpretive step that optimal forecasts are posterior-weighted averages over those processes. No free parameters or new physical entities are introduced.

axioms (1)

Domain assumption: Time series data are generated by one of a family of processes possessing finite memory length P that remain distinguishable given sufficiently long observations.
Invoked directly in the proof that window size must exceed P.

pith-pipeline@v0.9.1-grok · 5735 in / 1315 out tokens · 33370 ms · 2026-06-28T15:49:49.158374+00:00 · methodology

Reference graph

Works this paper leans on

70 extracted references · 4 canonical work pages

[1] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in neural information processing systems, 34:22419–22430, 2021
[2] Konstantinos Benidis, Syama Sundar Rangapuram, Valentin Flunkert, Yuyang Wang, Danielle Maddix, Caner Turkmen, Jan Gasthaus, Michael Bohlke-Schneider, David Salinas, Lorenzo Stella, et al. Deep learning for time series forecasting: Tutorial and literature survey. ACM Computing Surveys, 55(6):1–36, 2022
[3] Yuxuan Wang, Haixu Wu, Jiaxiang Dong, Yong Liu, Mingsheng Long, and Jianmin Wang. Deep time series models: A comprehensive survey and benchmark. arXiv preprint arXiv:2407.13278, 2024
[4] George EP Box and Gwilym M Jenkins. Some recent advances in forecasting and control. Journal of the Royal Statistical Society. Series C (Applied Statistics), 17(2):91–109, 1968
[5] Jan G. De Gooijer and Rob J. Hyndman. 25 years of time series forecasting. International Journal of Forecasting, 22(3):443–473, 2006. ISSN 0169-2070. doi:https://doi.org/10.1016/j.ijforecast.2006.01.001. URL https://www.sciencedirect.com/science/article/pii/S0169207006000021. Twenty five years of forecasting
[6] Hansika Hewamalage, Christoph Bergmeir, and Kasun Bandara. Recurrent neural networks for time series forecasting: Current status and future directions. International Journal of Forecasting, 37(1):388–427, 2021
[7] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. Deepar: Probabilistic forecasting with autoregressive recurrent networks. International journal of forecasting, 36(3):1181–1191, 2020
[8] Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. The m4 competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting, 36(1):54–74, 2020. ISSN 0169-2070. doi:https://doi.org/10.1016/j.ijforecast.2019.04.014. URL https://www.sciencedirect.com/science/article/pii/S0169207019301128. M4 Competition
[9] Pablo Montero-Manso and Rob J Hyndman. Principles and algorithms for forecasting groups of time series: Locality and globality. International Journal of Forecasting, 37(4):1632–1653, 2021
[10] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference, volume 35, pages 11106–11115. AAAI Press, 2021
[11] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In The Eleventh International Conference on Learning Representations, 2022
[12] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proc. 39th International Conference on Machine Learning (ICML 2022), 2022
[13] Shizhan Liu, Hang Yu, Cong Liao, Jianguo Li, Weiyao Lin, Alex X. Liu, and Schahram Dustdar. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=0EXmFzUn5I
[14] Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. A decoder-only foundation model for time-series forecasting. In Forty-first International Conference on Machine Learning, 2024
[15] Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. Unified training of universal time series forecasting transformers. In International Conference on Machine Learning, pages 53140–53164. PMLR, 2024
[16] Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. Moment: A family of open time-series foundation models. arXiv preprint arXiv:2402.03885, 2024
[17] Yuxuan Liang, Haomin Wen, Yuqi Nie, Yushan Jiang, Ming Jin, Dongjin Song, Shirui Pan, and Qingsong Wen. Foundation models for time series analysis: A tutorial and survey. In Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pages 6555–6565, 2024
[18] Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit bayesian inference. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=RdJVFCHjUMI
[19] Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=0g0X4H8yN4I
[20] Yingcong Li, Muhammed Emrullah Ildiz, Dimitris Papailiopoulos, and Samet Oymak. Transformers as algorithms: Generalization and stability in in-context learning. In International conference on machine learning, pages 19565–19594. PMLR, 2023
[21] Yufeng Zhang, Fengzhuo Zhang, Zhuoran Yang, and Zhaoran Wang. What and how does in-context learning learn? bayesian model averaging, parameterization, and generalization. In Yingzhen Li, Stephan Mandt, Shipra Agrawal, and Emtiyaz Khan, editors, Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, volume 258 of Proceedi...
[22] URL https://proceedings.mlr.press/v258/zhang25d.html
[23] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. A survey on in-context learning. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1107–1128, Miami, Fl... doi:10.18653/v1/2024.emnlp-main.64
[24] Abdul Fatir Ansari, Oleksandr Shchur, Jaris Küken, Andreas Auer, Boran Han, Pedro Mercado, Syama Sundar Rangapuram, Huibin Shen, Lorenzo Stella, Xiyuan Zhang, et al. Chronos-2: From univariate to universal forecasting. arXiv preprint arXiv:2510.15821, 2025
[25] Tilmann Gneiting. Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494):746–762, 2011
[26] Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian data analysis. Chapman and Hall/CRC, 1995
[27] Rishabh Agarwal, Avi Singh, Lei M Zhang, Bernd Bohnet, Luis Rosias, Stephanie C.Y. Chan, Biao Zhang, Aleksandra Faust, and Hugo Larochelle. Many-shot in-context learning. In ICML 2024 Workshop on In-Context Learning, 2024. URL https://openreview.net/forum?id=goi7DFHlqS
[28] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020
[29] Chenghao Liu, Taha Aksu, Juncheng Liu, Xu Liu, Hanshu Yan, Quang Pham, Doyen Sahoo, Caiming Xiong, Silvio Savarese, and Junnan Li. Moirai 2.0: When less is more for time series forecasting. arXiv preprint arXiv:2511.11698, 2025
[30] Yong Liu, Guo Qin, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. Timer-xl: Long-context transformers for unified time series forecasting. In The Thirteenth International Conference on Learning Representations, 2025
[31]

Largest: A benchmark dataset for large-scale traffic forecasting.Advances in Neural Information Processing Systems, 36:75354–75371, 2023

Xu Liu, Yutong Xia, Yuxuan Liang, Junfeng Hu, Yiwei Wang, Lei Bai, Chao Huang, Zhenguang Liu, Bryan Hooi, and Roger Zimmermann. Largest: A benchmark dataset for large-scale traffic forecasting.Advances in Neural Information Processing Systems, 36:75354–75371, 2023

2023
[32]

Commission for Energy Regulation. CER. CER Smart Metering Project - Electricity Customer Behaviour Trial, 2009-2010 [dataset].Irish Social Science Data Archive. SN: 0012-00, 2016. URL https://www.ucd.ie/ issda/data/commissionforenergyregulationcer/

2009
[33]

On the Properties of Neural Machine Translation: Encoder-Decoder Approaches , booktitle =

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neu- ral machine translation: Encoder–decoder approaches. In Dekai Wu, Marine Carpuat, Xavier Carreras, and Eva Maria Vecchi, editors,Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar...

work page doi:10.3115/v1/w14-4012 2014
[34] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 11121–11128, 2023.
[35] Si-An Chen, Chun-Liang Li, Sercan O Arik, Nathanael Christian Yoder, and Tomas Pfister. TSMixer: An all-MLP architecture for time series forecasting. Transactions on Machine Learning Research, 2023.
[36] Donghao Luo and Xue Wang. ModernTCN: A modern pure convolution structure for general time series analysis. In The Twelfth International Conference on Learning Representations, pages 1–43, 2024.
[37] Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. In International Conference on Machine Learning, pages 26670–26698. PMLR, 2023.
[38] Siva Rama Krishna Kottapalli, Karthik Hubli, Sandeep Chandrashekhara, Garima Jain, Sunayana Hubli, Gayathri Botla, and Ramesh Doddaiah. Foundation models for time series: A survey. arXiv preprint arXiv:2504.04011, 2025.
[39] Matthew Faw, Rajat Sen, Yichen Zhou, and Abhimanyu Das. In-context fine-tuning for time-series foundation models. In Forty-second International Conference on Machine Learning, 2025.
[40] Andreas Auer, Raghul Parthipan, Pedro Mercado, Abdul Fatir Ansari, Lorenzo Stella, Bernie Wang, Michael Bohlke-Schneider, and Syama Sundar Rangapuram. Zero-shot time series forecasting with covariates via in-context learning. arXiv preprint arXiv:2506.03128, 2025.
[41] Andrew Robert Williams, Arjun Ashok, Étienne Marcotte, Valentina Zantedeschi, Jithendaraa Subramanian, Roland Riachi, James Requeima, Alexandre Lacoste, Irina Rish, Nicolas Chapados, and Alexandre Drouin. Context is key: A benchmark for forecasting with essential textual information. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Be... 2025.
[42] Jiecheng Lu, Yan Sun, and Shihao Yang. In-context time series predictor. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=dCcY2pyNIO
[43] Slawek Smyl. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International Journal of Forecasting, 36(1):75–85, 2020.
[44] Andrea Cini, Ivan Marisca, Daniele Zambon, and Cesare Alippi. Taming local effects in graph-based spatiotemporal forecasting. Advances in Neural Information Processing Systems, 36:55375–55393, 2023.
[45] Luca Butera, Giovanni De Felice, Andrea Cini, and Cesare Alippi. On the regularization of learnable embeddings for time series forecasting. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=F5ALCh3GWG
[46] Zhe Li, Shiyi Qi, Yiduo Li, and Zenglin Xu. Revisiting long-term time series forecasting: An investigation on linear mapping. arXiv preprint arXiv:2305.10721, 2023.
[47] Giovanni De Felice, Andrea Cini, Daniele Zambon, Vladimir V Gusev, and Cesare Alippi. Graph-based virtual sensing from sparse and partial multivariate observations. In International Conference on Learning Representations, 2024.
[48] Taha Aksu, Gerald Woo, Juncheng Liu, Xu Liu, Chenghao Liu, Silvio Savarese, Caiming Xiong, and Doyen Sahoo. GIFT-Eval: A benchmark for general time series forecasting model evaluation. arXiv preprint arXiv:2410.10393, 2024.
[49] Yong Liu, Haoran Zhang, Chenyu Li, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. Timer: Generative pre-trained transformers are large time series models. In Proceedings of the 41st International Conference on Machine Learning, pages 32369–32399, 2024.
[50] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[52] Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1ecqn4YwB
[53] Tian Zhou, Peisong Niu, Xue Wang, Liang Sun, and Rong Jin. One fits all: Power general time series analysis by pretrained LM. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=gMS6FVZvmF
[54] Kashif Rasul, Arjun Ashok, Andrew Robert Williams, Arian Khorasani, George Adamopoulos, Rishika Bhagwatkar, Marin Biloš, Hena Ghonia, Nadhir Hassen, Anderson Schneider, et al. Lag-Llama: Towards foundation models for time series forecasting. In R0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models, 2023.
[55] Abdul Fatir Ansari, Lorenzo Stella, Ali Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, et al. Chronos: Learning the language of time series. Transactions on Machine Learning Research, 2024.
[56] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[57] Duo Chai, Wei Wu, Qinghong Han, Fei Wu, and Jiwei Li. Description based text classification with reinforcement learning. In International Conference on Machine Learning, pages 1371–1382. PMLR, 2020.
[58] Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022.
[59] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
[60] Yanda Chen, Ruiqi Zhong, Sheng Zha, George Karypis, and He He. Meta-learning via language model in-context tuning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 719–730, 2022.
[61] Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. MetaICL: Learning to learn in context. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2791–2809, 2022.
[62] Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Xi Victoria Lin, Noah A Smith, Luke Zettlemoyer, Wen-tau Yih, and Mike Lewis. In-context pretraining: Language modeling beyond document boundaries. In The Twelfth International Conference on Learning Representations, 2024.
[63] Pablo Montero-Manso, George Athanasopoulos, Rob J Hyndman, and Thiyanga S Talagala. FFORMA: Feature-based forecast model averaging. International Journal of Forecasting, 36(1):86–92, 2020.
[64] Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
[65] David Ha, Andrew M Dai, and Quoc V Le. Hypernetworks. In International Conference on Learning Representations, 2017.
[66] Boris N Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. Meta-learning framework with applications to zero-shot time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 9242–9250, 2021.
[67] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/...
[68] Thiyanga S Talagala, Rob J Hyndman, and George Athanasopoulos. Meta-learning how to forecast time series. Journal of Forecasting, 42(6):1476–1501, 2023.
[69] Declan A Norton, Edward Ott, Andrew Pomerance, Brian Hunt, and Michelle Girvan. Tailored forecasting from short time series via meta-learning. arXiv preprint arXiv:2501.16325, 2025.

A Terminology

We distinguish here several related terms.

• Generative process identification refers to inferring which latent stochastic process generated an observed sequence, ...
[70]

introduced a unified backbone capable of addressing multiple time series tasks and efficiently adapting to new series via lightweight fine-tuning. The Moirai family [15, 28] further extended the line of probabilistic foundation models with encoder-only and decoder-only architectures. Several decoder-based models followed [14, 48, 29], targeting point predi...

[1] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34:22419–22430, 2021.

[2] Konstantinos Benidis, Syama Sundar Rangapuram, Valentin Flunkert, Yuyang Wang, Danielle Maddix, Caner Turkmen, Jan Gasthaus, Michael Bohlke-Schneider, David Salinas, Lorenzo Stella, et al. Deep learning for time series forecasting: Tutorial and literature survey. ACM Computing Surveys, 55(6):1–36, 2022.

[3] Yuxuan Wang, Haixu Wu, Jiaxiang Dong, Yong Liu, Mingsheng Long, and Jianmin Wang. Deep time series models: A comprehensive survey and benchmark. arXiv preprint arXiv:2407.13278, 2024.

[4] George EP Box and Gwilym M Jenkins. Some recent advances in forecasting and control. Journal of the Royal Statistical Society. Series C (Applied Statistics), 17(2):91–109, 1968.

[5] Jan G. De Gooijer and Rob J. Hyndman. 25 years of time series forecasting. International Journal of Forecasting, 22(3):443–473, 2006. ISSN 0169-2070. doi:10.1016/j.ijforecast.2006.01.001. URL https://www.sciencedirect.com/science/article/pii/S0169207006000021. Twenty five years of forecasting.

[6] Hansika Hewamalage, Christoph Bergmeir, and Kasun Bandara. Recurrent neural networks for time series forecasting: Current status and future directions. International Journal of Forecasting, 37(1):388–427, 2021.

[7] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.

[8] Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. The M4 competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting, 36(1):54–74, 2020. ISSN 0169-2070. doi:10.1016/j.ijforecast.2019.04.014. URL https://www.sciencedirect.com/science/article/pii/S0169207019301128. M4 Competition.

[9] Pablo Montero-Manso and Rob J Hyndman. Principles and algorithms for forecasting groups of time series: Locality and globality. International Journal of Forecasting, 37(4):1632–1653, 2021.

[10] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference, volume 35, pages 11106–11115. AAAI Press, 2021.

[11] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In The Eleventh International Conference on Learning Representations, 2022.

[12] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proc. 39th International Conference on Machine Learning (ICML 2022), 2022.

[13] Shizhan Liu, Hang Yu, Cong Liao, Jianguo Li, Weiyao Lin, Alex X. Liu, and Schahram Dustdar. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=0EXmFzUn5I

[14] Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. A decoder-only foundation model for time-series forecasting. In Forty-first International Conference on Machine Learning, 2024.

[15] Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. Unified training of universal time series forecasting transformers. In International Conference on Machine Learning, pages 53140–53164. PMLR, 2024.

[16] Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. MOMENT: A family of open time-series foundation models. arXiv preprint arXiv:2402.03885, 2024.

[17] Yuxuan Liang, Haomin Wen, Yuqi Nie, Yushan Jiang, Ming Jin, Dongjin Song, Shirui Pan, and Qingsong Wen. Foundation models for time series analysis: A tutorial and survey. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6555–6565, 2024.

[18] Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit Bayesian inference. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=RdJVFCHjUMI

[19] Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? Investigations with linear models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=0g0X4H8yN4I

[20] Yingcong Li, Muhammed Emrullah Ildiz, Dimitris Papailiopoulos, and Samet Oymak. Transformers as algorithms: Generalization and stability in in-context learning. In International Conference on Machine Learning, pages 19565–19594. PMLR, 2023.

[21] Yufeng Zhang, Fengzhuo Zhang, Zhuoran Yang, and Zhaoran Wang. What and how does in-context learning learn? Bayesian model averaging, parameterization, and generalization. In Yingzhen Li, Stephan Mandt, Shipra Agrawal, and Emtiyaz Khan, editors, Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, volume 258 of Proceedi...

[22] URL https://proceedings.mlr.press/v258/zhang25d.html

[23] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. A survey on in-context learning. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1107–1128, Miami, Fl... 2024. doi:10.18653/v1/2024.emnlp-main.64

[24] Abdul Fatir Ansari, Oleksandr Shchur, Jaris Küken, Andreas Auer, Boran Han, Pedro Mercado, Syama Sundar Rangapuram, Huibin Shen, Lorenzo Stella, Xiyuan Zhang, et al. Chronos-2: From univariate to universal forecasting. arXiv preprint arXiv:2510.15821, 2025.

[25] Tilmann Gneiting. Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494):746–762, 2011.

[26] Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian data analysis. Chapman and Hall/CRC, 1995.

[27] Rishabh Agarwal, Avi Singh, Lei M Zhang, Bernd Bohnet, Luis Rosias, Stephanie C.Y. Chan, Biao Zhang, Aleksandra Faust, and Hugo Larochelle. Many-shot in-context learning. In ICML 2024 Workshop on In-Context Learning, 2024. URL https://openreview.net/forum?id=goi7DFHlqS

[28] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

[29] [29]

Moirai 2.0: When less is more for time series forecasting.arXiv preprint arXiv:2511.11698, 2025

Chenghao Liu, Taha Aksu, Juncheng Liu, Xu Liu, Hanshu Yan, Quang Pham, Doyen Sahoo, Caiming Xiong, Silvio Savarese, and Junnan Li. Moirai 2.0: When less is more for time series forecasting.arXiv preprint arXiv:2511.11698, 2025

arXiv 2025

[30] [30]

Timer-xl: Long-context transformers for unified time series forecasting

Yong Liu, Guo Qin, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. Timer-xl: Long-context transformers for unified time series forecasting. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[31] [31]

Largest: A benchmark dataset for large-scale traffic forecasting.Advances in Neural Information Processing Systems, 36:75354–75371, 2023

Xu Liu, Yutong Xia, Yuxuan Liang, Junfeng Hu, Yiwei Wang, Lei Bai, Chao Huang, Zhenguang Liu, Bryan Hooi, and Roger Zimmermann. Largest: A benchmark dataset for large-scale traffic forecasting.Advances in Neural Information Processing Systems, 36:75354–75371, 2023

2023

[32] [32]

Commission for Energy Regulation. CER. CER Smart Metering Project - Electricity Customer Behaviour Trial, 2009-2010 [dataset].Irish Social Science Data Archive. SN: 0012-00, 2016. URL https://www.ucd.ie/ issda/data/commissionforenergyregulationcer/

2009

[33] [33]

On the Properties of Neural Machine Translation: Encoder-Decoder Approaches , booktitle =

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neu- ral machine translation: Encoder–decoder approaches. In Dekai Wu, Marine Carpuat, Xavier Carreras, and Eva Maria Vecchi, editors,Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar...

work page doi:10.3115/v1/w14-4012 2014

[34] [34]

Are transformers effective for time series forecasting? In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 11121–11128, 2023

Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 11121–11128, 2023

2023

[35] [35]

Tsmixer: An all-mlp architecture for time series forecast-ing.Transactions on Machine Learning Research, 2023

Si-An Chen, Chun-Liang Li, Sercan O Arik, Nathanael Christian Yoder, and Tomas Pfister. Tsmixer: An all-mlp architecture for time series forecast-ing.Transactions on Machine Learning Research, 2023

2023

[36] [36]

Moderntcn: A modern pure convolution structure for general time series analysis

Donghao Luo and Xue Wang. Moderntcn: A modern pure convolution structure for general time series analysis. InThe twelfth international conference on learning representations, pages 1–43, 2024

2024

[37] [37]

Resurrecting recurrent neural networks for long sequences

Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. InInternational Conference on Machine Learning, pages 26670–26698. PMLR, 2023

2023

[38] [38]

Foundation models for time series: A survey.arXiv preprint arXiv:2504.04011, 2025

Siva Rama Krishna Kottapalli, Karthik Hubli, Sandeep Chandrashekhara, Garima Jain, Sunayana Hubli, Gayathri Botla, and Ramesh Doddaiah. Foundation models for time series: A survey.arXiv preprint arXiv:2504.04011, 2025

arXiv 2025

[39] [39]

In-context fine-tuning for time-series foundation models

Matthew Faw, Rajat Sen, Yichen Zhou, and Abhimanyu Das. In-context fine-tuning for time-series foundation models. InForty-second International Conference on Machine Learning, 2025

2025

[40] [40]

Zero-shot time series forecasting with covariates via in-context learning.arXiv preprint arXiv:2506.03128, 2025

Andreas Auer, Raghul Parthipan, Pedro Mercado, Abdul Fatir Ansari, Lorenzo Stella, Bernie Wang, Michael Bohlke-Schneider, and Syama Sundar Rangapuram. Zero-shot time series forecasting with covariates via in-context learning.arXiv preprint arXiv:2506.03128, 2025

arXiv 2025

[41] [41]

Context is key: A benchmark for forecasting with essential textual information

Andrew Robert Williams, Arjun Ashok, Étienne Marcotte, Valentina Zantedeschi, Jithendaraa Subramanian, Roland Riachi, James Requeima, Alexandre Lacoste, Irina Rish, Nicolas Chapados, and Alexandre Drouin. Context is key: A benchmark for forecasting with essential textual information. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Be...

2025

[42] [42]

In-context time series predictor

Jiecheng Lu, Yan Sun, and Shihao Yang. In-context time series predictor. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=dCcY2pyNIO

2025

[43] [43]

A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting

Slawek Smyl. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International journal of forecasting, 36(1):75–85, 2020

2020

[44] [44]

Taming local effects in graph-based spatiotempo- ral forecasting.Advances in Neural Information Processing Systems, 36:55375–55393, 2023

Andrea Cini, Ivan Marisca, Daniele Zambon, and Cesare Alippi. Taming local effects in graph-based spatiotempo- ral forecasting.Advances in Neural Information Processing Systems, 36:55375–55393, 2023

2023

[45] [45]

On the regularization of learnable embeddings for time series forecasting.Transactions on Machine Learning Research, 2025

Luca Butera, Giovanni De Felice, Andrea Cini, and Cesare Alippi. On the regularization of learnable embeddings for time series forecasting.Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=F5ALCh3GWG

2025

[46] [46]

Revisiting long-term time series forecasting: An investigation on linear mapping.arXiv preprint arXiv:2305.10721, 2023

Zhe Li, Shiyi Qi, Yiduo Li, and Zenglin Xu. Revisiting long-term time series forecasting: An investigation on linear mapping.arXiv preprint arXiv:2305.10721, 2023. 11 Why Do Time Series Models Need Long Context Windows?PREPRINT

Pith/arXiv arXiv 2023

[47] [47]

Graph-based virtual sensing from sparse and partial multivariate observations

Giovanni De Felice, Andrea Cini, Daniele Zambon, Vladimir V Gusev, and Cesare Alippi. Graph-based virtual sensing from sparse and partial multivariate observations. InInternational Conference on Learning Representations, 2024

2024

[48] [48]

Gift-eval: A benchmark for general time series forecasting model evaluation.arXiv preprint arXiv:2410.10393, 2024

Taha Aksu, Gerald Woo, Juncheng Liu, Xu Liu, Chenghao Liu, Silvio Savarese, Caiming Xiong, and Doyen Sahoo. Gift-eval: A benchmark for general time series forecasting model evaluation.arXiv preprint arXiv:2410.10393, 2024

arXiv 2024

[49] [49]

Timer: generative pre-trained transformers are large time series models

Yong Liu, Haoran Zhang, Chenyu Li, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. Timer: generative pre-trained transformers are large time series models. InProceedings of the 41st International Conference on Machine Learning, pages 32369–32399, 2024

2024

[50] [50]

Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

Pith/arXiv arXiv 2014

[51] [51]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

2017

[52] [52]

Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio

Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. InInternational Conference on Learning Representations, 2020. URLhttps://openreview.net/forum?id=r1ecqn4YwB

2020

[53] [53]

One fits all: Power general time series analysis by pretrained LM

Tian Zhou, Peisong Niu, Xue Wang, Liang Sun, and Rong Jin. One fits all: Power general time series analysis by pretrained LM. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=gMS6FVZvmF

2023

[54] [54]

Lag-llama: Towards foundation models for time series forecasting

Kashif Rasul, Arjun Ashok, Andrew Robert Williams, Arian Khorasani, George Adamopoulos, Rishika Bhag- watkar, Marin Biloš, Hena Ghonia, Nadhir Hassen, Anderson Schneider, et al. Lag-llama: Towards foundation models for time series forecasting. InR0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models, 2023

2023

[55] [55]

Chronos: Learning the language of time series.Transactions on Machine Learning Research, 2024

Abdul Fatir Ansari, Lorenzo Stella, Ali Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, et al. Chronos: Learning the language of time series.Transactions on Machine Learning Research, 2024

2024

[56] [56]

Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

2019

[57] [57]

Description based text classification with reinforcement learning

Duo Chai, Wei Wu, Qinghong Han, Fei Wu, and Jiwei Li. Description based text classification with reinforcement learning. InInternational conference on machine learning, pages 1371–1382. PMLR, 2020

2020

[58] [58]

Finetuned language models are zero-shot learners

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. InInternational Conference on Learning Representations, 2022

2022

[59] [59]

Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

2022

[60]

Meta-learning via language model in-context tuning

Yanda Chen, Ruiqi Zhong, Sheng Zha, George Karypis, and He He. Meta-learning via language model in-context tuning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 719–730, 2022

2022

[61]

Metaicl: Learning to learn in context

Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. Metaicl: Learning to learn in context. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2791–2809, 2022

2022

[62]

In-context pretraining: Language modeling beyond document boundaries

Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Xi Victoria Lin, Noah A Smith, Luke Zettlemoyer, Wen-tau Yih, and Mike Lewis. In-context pretraining: Language modeling beyond document boundaries. In The Twelfth International Conference on Learning Representations, 2024

2024

[63]

Fforma: Feature-based forecast model averaging. International Journal of Forecasting, 36(1):86–92, 2020

Pablo Montero-Manso, George Athanasopoulos, Rob J Hyndman, and Thiyanga S Talagala. Fforma: Feature-based forecast model averaging. International Journal of Forecasting, 36(1):86–92, 2020

2020

[64]

Learning to control fast-weight memories: An alternative to dynamic recurrent networks

Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992

1992

[65]

Hypernetworks

David Ha, Andrew M Dai, and Quoc V Le. Hypernetworks. In International Conference on Learning Representations, 2017

2017

[66]

Meta-learning framework with applications to zero-shot time-series forecasting

Boris N Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. Meta-learning framework with applications to zero-shot time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9242–9250, 2021

2021

[67]

Model-agnostic meta-learning for fast adaptation of deep networks

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/...

2017

[68]

Meta-learning how to forecast time series

Thiyanga S Talagala, Rob J Hyndman, and George Athanasopoulos. Meta-learning how to forecast time series. Journal of Forecasting, 42(6):1476–1501, 2023

2023

[69]

Tailored forecasting from short time series via meta-learning

Declan A Norton, Edward Ott, Andrew Pomerance, Brian Hunt, and Michelle Girvan. Tailored forecasting from short time series via meta-learning. arXiv preprint arXiv:2501.16325, 2025

A Terminology

We distinguish here several related terms.

• Generative process identification refers to inferring which latent stochastic process generated an observed sequence, ...

arXiv 2025

[70]

The Moirai family [15, 28] further extended the line of probabilistic foundation models with encoder-only and decoder-only architectures

introduced a unified backbone capable of addressing multiple time series tasks and efficiently adapting to new series via lightweight fine-tuning. The Moirai family [15, 28] further extended the line of probabilistic foundation models with encoder-only and decoder-only architectures. Several decoder-based models followed [14, 48, 29], targeting point predi...

arXiv