Retrieval Augmented Time Series Forecasting
Pith reviewed 2026-05-23 16:55 UTC · model grok-4.3
The pith
Retrieving similar past time series and feeding them into foundation models raises zero-shot forecasting accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Retrieval Augmented Forecasting (RAF) is a framework that retrieves related time-series examples and incorporates them into the input of time-series foundation models (TSFMs). The paper reports that this improves forecasting accuracy across diverse domains and that the gains grow with TSFM size.
What carries the argument
Retrieval Augmented Forecasting (RAF) framework, which selects related time-series examples and augments the model input with them.
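To make the retrieve-then-augment idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the similarity measure (Euclidean distance between z-normalized windows), the concatenation-based augmentation, and the names `z_normalize`, `retrieve`, and `augment` are all assumptions chosen for illustration. The sketch also skips any rescaling of retrieved segments to the query's level.

```python
import math
from typing import List, Sequence, Tuple


def z_normalize(window: Sequence[float]) -> List[float]:
    """Rescale a window to zero mean and unit variance (constant windows map to zeros)."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0.0:
        return [0.0] * len(window)
    return [(x - mean) / std for x in window]


def retrieve(
    query: Sequence[float],
    database: Sequence[Sequence[float]],
    horizon: int,
    k: int = 1,
) -> List[Tuple[float, List[float]]]:
    """Return the k closest (distance, segment) pairs.

    Each segment is a matched window of len(query) points followed by the
    `horizon` points that came after it in the database series.
    """
    m = len(query)
    q = z_normalize(query)
    scored: List[Tuple[float, List[float]]] = []
    for series in database:
        for start in range(len(series) - m - horizon + 1):
            candidate = z_normalize(series[start:start + m])
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, candidate)))
            scored.append((dist, list(series[start:start + m + horizon])))
    scored.sort(key=lambda pair: pair[0])  # stable sort: ties keep scan order
    return scored[:k]


def augment(query: Sequence[float], retrieved: List[Tuple[float, List[float]]]) -> List[float]:
    """Build the model input: retrieved segments first (closest one last), then the query."""
    context: List[float] = []
    for _, segment in reversed(retrieved):
        context.extend(segment)
    return context + list(query)


if __name__ == "__main__":
    database = [[0, 1, 2, 3, 4, 5, 6, 7], [5, 5, 5, 5, 9, 5, 5, 5]]
    query = [10, 11, 12]
    hits = retrieve(query, database, horizon=2, k=1)
    print(augment(query, hits))
```

The augmented list would then be passed to the forecasting model in place of the bare query. Whether the paper concatenates in this way or uses another incorporation method is exactly what the referee's first minor comment asks the authors to clarify.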
If this is right
- RAF delivers measurable accuracy lifts on many different time-series domains.
- The accuracy improvement scales up with the size of the underlying time-series foundation model.
- The approach directly targets the dynamic and event-driven character of time-series data.
- It provides a route to stronger zero-shot forecasting without model retraining.
Where Pith is reading between the lines
- The same retrieval step could be run online so that the database grows with newly observed series.
- RAF might mitigate concept drift by preferentially retrieving recent matching examples.
- Smaller foundation models augmented by RAF could reach performance levels that currently require much larger models.
Load-bearing premise
The retrieved time-series examples are relevant and non-noisy enough that adding them raises accuracy instead of introducing harmful context or distribution shift.
What would settle it
A controlled test in which deliberately irrelevant or noisy retrieved series are supplied and forecast error rises above the no-retrieval baseline.
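A rough sketch of how such a controlled test could be wired up follows. Everything here is hypothetical scaffolding: `run_conditions` and `toy_forecaster` are illustrative names, and the toy forecaster (which predicts the mean of its context) merely stands in for a real TSFM. The relevant and irrelevant contexts are forced to the same length so that any difference cannot be explained by input length alone.

```python
from typing import Callable, Dict, List, Sequence

# A forecaster maps (context, horizon) to a list of `horizon` predictions.
Forecaster = Callable[[Sequence[float], int], List[float]]


def mae(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Mean absolute error between predictions and ground truth."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def run_conditions(
    forecast: Forecaster,
    query: Sequence[float],
    truth: Sequence[float],
    relevant: Sequence[float],
    irrelevant: Sequence[float],
) -> Dict[str, float]:
    """Score the same query under three contexts: none, relevant, irrelevant."""
    if len(relevant) != len(irrelevant):
        raise ValueError("retrieved contexts must be length-matched")
    horizon = len(truth)
    contexts = {
        "no_retrieval": list(query),
        "relevant": list(relevant) + list(query),
        "irrelevant": list(irrelevant) + list(query),
    }
    return {name: mae(forecast(ctx, horizon), truth) for name, ctx in contexts.items()}


def toy_forecaster(context: Sequence[float], horizon: int) -> List[float]:
    """Hypothetical stand-in for a TSFM: repeat the mean of the whole context."""
    level = sum(context) / len(context)
    return [level] * horizon


if __name__ == "__main__":
    errors = run_conditions(
        toy_forecaster,
        query=[9.0, 11.0, 9.0, 12.0],
        truth=[10.0, 10.0],
        relevant=[10.0, 10.0, 10.0, 10.0],
        irrelevant=[50.0, 50.0, 50.0, 50.0],
    )
    print(errors)
    print("irrelevant context hurts:", errors["irrelevant"] > errors["no_retrieval"])
```

With a real model, `toy_forecaster` would be swapped for a wrapper around the TSFM, and the comparison would be repeated over many queries before drawing any conclusion. The numbers from the toy forecaster say nothing about RAF itself.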
Original abstract
Retrieval-augmented generation (RAG) is a central component of modern LLM systems, particularly in scenarios where up-to-date information is crucial for accurately responding to user queries or when queries exceed the scope of the training data. The advent of time-series foundation models (TSFM), such as Chronos, and the need for effective zero-shot forecasting performance across various time-series domains motivates the question: Do benefits of RAG similarly carry over to time series forecasting? In this paper, we advocate that the dynamic and event-driven nature of time-series data makes RAG a crucial component of TSFMs and introduce a principled RAG framework for time-series forecasting, called Retrieval Augmented Forecasting (RAF). Within RAF, we develop efficient strategies for retrieving related time-series examples and incorporating them into forecast. Through experiments and mechanistic studies, we demonstrate that RAF indeed improves the forecasting accuracy across diverse time series domains and the improvement is more significant for larger TSFM sizes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Retrieval Augmented Forecasting (RAF), a RAG framework for time-series foundation models (TSFMs) such as Chronos. It develops retrieval strategies for related time-series examples and their incorporation into zero-shot forecasts, claiming via experiments and mechanistic studies that RAF improves accuracy across diverse domains with larger gains for bigger TSFM sizes.
Significance. If the results hold with proper verification of retrieval quality, the work would be significant for extending RAG benefits to non-stationary time-series forecasting and highlighting scale-dependent advantages in TSFMs.
major comments (3)
- [Abstract] Abstract: the central claim that RAF improves accuracy (and more for larger TSFMs) rests on unverified retrieval quality, yet the abstract supplies no information on baselines, datasets, statistical significance, or controls for retrieval failure modes such as distribution shift from non-stationary mismatched series.
- [Experiments] Experiments section: without explicit ablations or tests injecting noisy/irrelevant retrieved examples (e.g., via perturbed similarity metrics), gains cannot be attributed to RAF rather than input length or prompting artifacts, undermining the attribution to relevant context.
- [Mechanistic studies] Mechanistic studies: these must demonstrate that larger models better exploit retrieved patterns without overfitting noise; absent such controls, the scale-dependent improvement claim lacks support given time-series event-driven variability.
minor comments (2)
- Clarify the exact similarity metric and incorporation method (e.g., concatenation vs. attention) in the RAF framework description.
- Add missing references to prior RAG work in LLMs and existing TSFM baselines for context.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to strengthen the presentation of experimental details, attribution of gains, and mechanistic analysis. We address each major comment below and commit to revisions that incorporate additional controls and clarifications without misrepresenting our existing results.
- Referee: [Abstract] Abstract: the central claim that RAF improves accuracy (and more for larger TSFMs) rests on unverified retrieval quality, yet the abstract supplies no information on baselines, datasets, statistical significance, or controls for retrieval failure modes such as distribution shift from non-stationary mismatched series.
  Authors: We agree the abstract is concise and would benefit from additional context. In the revision, we will expand it to briefly note the datasets (multi-domain TSFM benchmarks), the baselines (zero-shot TSFM forecasts), the statistical significance of the improvements, and the retrieval quality controls (e.g., similarity thresholds and failure mode checks) already present in the main text and appendix. This will better support the central claim without altering its substance.
  Revision: yes
- Referee: [Experiments] Experiments section: without explicit ablations or tests injecting noisy/irrelevant retrieved examples (e.g., via perturbed similarity metrics), gains cannot be attributed to RAF rather than input length or prompting artifacts, undermining the attribution to relevant context.
  Authors: This is a valid point; our current experiments compare relevant retrieval against zero-shot forecasts but lack explicit noise-injection ablations. We will add these in the revised experiments section, including tests with perturbed similarity metrics and random/irrelevant retrieval, to show performance degradation and confirm attribution to relevant context rather than length or prompting effects.
  Revision: yes
- Referee: [Mechanistic studies] Mechanistic studies: these must demonstrate that larger models better exploit retrieved patterns without overfitting noise; absent such controls, the scale-dependent improvement claim lacks support given time-series event-driven variability.
  Authors: We acknowledge the need for stronger controls here. The existing mechanistic analysis shows scaling trends and attention patterns, but to directly test exploitation without noise overfitting, we will augment the section with comparisons of relevant vs. irrelevant retrieval across model sizes and an analysis of how larger models discriminate patterns amid event-driven variability.
  Revision: yes
Circularity Check
No circularity: empirical framework validated by experiments, no derivations or self-referential fits.
The paper proposes Retrieval Augmented Forecasting (RAF) as a practical framework for incorporating retrieved time-series examples into TSFM inference. All central claims of accuracy improvement are presented as outcomes of experiments and mechanistic studies across domains, with no equations, parameter fits, or derivations that reduce the reported gains to quantities defined by the same inputs. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes; the work contains no mathematical derivation chain at all. The reader's assessment of score 2.0 is consistent with an honest non-finding for an empirical contribution whose soundness depends on experimental controls rather than definitional equivalence.
Axiom & Free-Parameter Ledger
invented entities (1)
- RAF framework: no independent evidence
Reference graph
Works this paper leans on
- [1] Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, Jasper Zschiegner, Danielle C. Maddix, Hao Wang, Michael W. Mahoney, Kari Torkkola, Andrew Gordon Wilson, Michael Bohlke-Schneider, and Yuyang Wang. Chronos: Learning the language ... (2024)
- [2] George Athanasopoulos, Rob Hyndman, Haiyan Song, and Doris C. Wu. The tourism forecasting competition. International Journal of Forecasting, 27(3):822–844, 2011.
- [3] Timothy L Bailey, Mikael Boden, Fabian A Buske, Martin Frith, Charles E Grant, Luca Clementi, Jingyuan Ren, Wilfred W Li, and William S Noble. Meme suite: tools for motif discovery and searching. Nucleic acids research, 37(suppl_2):W202–W208, 2009.
- [4] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego De Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, ... Improving language models by retrieving from trillions of tokens (2022)
- [5] Si-An Chen, Chun-Liang Li, Nate Yoder, Sercan O. Arik, and Tomas Pfister. Tsmixer: An all-mlp architecture for time series forecasting, 2023.
- [6] Samuel Dooley, Gurnoor Singh Khurana, Chirag Mohapatra, Siddartha V Naidu, and Colin White. Forecastpfn: Synthetically-trained zero-shot forecasting. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 2403–2426. Curran Associates, Inc., 2023.
- [7] Yuntao Du, Jindong Wang, Wenjie Feng, Sinno Pan, Tao Qin, Renjun Xu, and Chongjun Wang. Adarnn: Adaptive learning and forecasting of time series, 2021.
- [8] Angela Fan, Claire Gardent, Chloé Braud, and Antoine Bordes. Augmenting transformers with knn-based composite memory for dialog. Transactions of the Association for Computational Linguistics, 9:82–99, 2021.
- [9] Azul Garza, Cristian Challu, and Max Mergenthaler-Canseco. Timegpt-1, 2024.
- [10] Rakshitha Godahewa, Christoph Bergmeir, Geoffrey I. Webb, Rob J. Hyndman, and Pablo Montero-Manso. Monash time series forecasting archive, 2021.
- [11] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2024.
- [12] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval-augmented language model pre-training, 2020.
- [13] Rob J. Hyndman and Anne B. Koehler. Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4):679–688, 2006.
- [14] Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. arXiv preprint arXiv:2007.01282, 2020.
- [15] Xiaoyong Jin, Youngsuk Park, Danielle Maddix, Hao Wang, and Yuyang Wang. Domain adaptation for time series forecasting via attention sharing. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learn... (2022)
- [16] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020.
- [17] Omar Khattab, Christopher Potts, and Matei Zaharia. Baleen: Robust multi-hop reasoning at scale via condensed retrieval, 2022.
- [18] Omar Khattab and Matei Zaharia. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 39–48, 2020.
- [19] Taesung Kim, Jinhee Kim, Yunwon Tae, Cheonbok Park, Jang-Ho Choi, and Jaegul Choo. Reversible instance normalization for accurate time-series forecasting against distribution shift. In International Conference on Learning Representations, 2022.
- [20] Lemi Leblond et al. Alphacode 2 technical report. Technical report, DeepMind, 2023.
- [21] Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Latent retrieval for weakly supervised open domain question answering, 2019.
- [22] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks, 2021.
- [23] Huayang Li, Yixuan Su, Deng Cai, Yan Wang, and Lemao Liu. A survey on retrieval-augmented text generation. arXiv preprint arXiv:2202.01110, 2022.
- [24] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran As... (2019)
- [25] Yuxuan Liang, Haomin Wen, Yuqi Nie, Yushan Jiang, Ming Jin, Dongjin Song, Shirui Pan, and Qingsong Wen. Foundation models for time series analysis: A tutorial and survey. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, volume 619 of KDD ’24, page 6555–6565. ACM, August 2024.
- [26] Bryan Lim, Sercan O. Arik, Nicolas Loeff, and Tomas Pfister. Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting, 37(4):1748–1764, 2021.
- [27] Shizhan Liu, Hang Yu, Cong Liao, Jianguo Li, Weiyao Lin, Alex X Liu, and Schahram Dustdar. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In International Conference on Learning Representations, 2022.
- [28] Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. itransformer: Inverted transformers are effective for time series forecasting, 2024.
- [29] Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. Query rewriting for retrieval-augmented large language models. arXiv preprint arXiv:2305.14283, 2023.
- [30] Spyros Makridakis, Michèle Hibon, and Claus Moser. Accuracy of forecasting: An empirical investigation. Journal of the Royal Statistical Society. Series A (General), 142(2):97–145, 1979.
- [31] Meinard Müller. Dynamic time warping. Information retrieval for music and motion, pages 69–84, 2007.
- [32]
- [33] Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In International Conference on Learning Representations, 2023.
- [34] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In The Eleventh International Conference on Learning Representations, 2023.
- [35] Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, Renqian Luo, Scott Mayer McKinney, Robert Osazuwa Ness, Hoifung Poon, Tao Qin, Naoto Usuyama, Chris White, and Eric Horvitz. Can generalist foundation models outcompete special-purpose tuning? case study in medicine, 2023.
- [36] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, a... In-context learning and induction heads, 2022.
- [37] Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. Meta-learning framework with applications to zero-shot time-series forecasting, 2020.
- [38] Bernardo Pérez Orozco and Stephen J Roberts. Zero-shot and few-shot time series forecasting with ordinal regression recurrent neural networks, 2020.
- [39] Egon Persak, Miguel F. Anjos, Sebastian Lautz, and Aleksandar Kolev. Multiple-resolution tokenization for time series forecasting with an application to pricing, 2024.
- [40] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019.
- [41] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023.
- [42] Kashif Rasul, Arjun Ashok, Andrew Robert Williams, Hena Ghonia, Rishika Bhagwatkar, Arian Khorasani, Mohammad Javad Darvishi Bayazi, George Adamopoulos, Roland Riachi, Nadhir Hassen, Marin Biloš, Sahil Garg, Anderson Schneider, Nicolas Chapados, Alexandre Drouin, Valentina Zantedeschi, Yuriy Nevmyvaka, and Irina Rish. Lag-llama: Towards foundation models for probabilistic time series forecasting, 2024.
- [43] Kashif Rasul, Calvin Seward, Ingmar Schuster, and Roland Vollgraf. Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting, 2021.
- [44] Chidaksh Ravuru, Sagar Srinivas Sakhinana, and Venkataramana Runkana. Agentic retrieval-augmented generation for time series analysis. arXiv preprint arXiv:2408.14484, 2024.
- [45] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. Deepar: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.
- [46] Gaia Saveri and Luca Bortolussi. Retrieval-augmented mining of temporal logic specifications from data. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 315–331. Springer, 2024.
- [47] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023.
- [48] Ege Onur Taga, Muhammed Emrullah Ildiz, and Samet Oymak. Timepfn: Effective multivariate time series forecasting with synthetic data. In NeurIPS Workshop on Time Series in the Age of Large Models, 2024.
- [49] Sabera Talukder, Yisong Yue, and Georgia Gkioxari. Totem: Tokenized time series embeddings for general time series analysis, 2024.
- [50] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization, 2017.
- [51] Tianfeng Wang and Gaojie Cui. Ratsf: Empowering customer service volume management through retrieval-augmented time-series forecasting. arXiv preprint arXiv:2403.04180, 2024.
- [52] Yuyang Wang, Alex Smola, Danielle Maddix, Jan Gasthaus, Dean Foster, and Tim Januschowski. Deep factors for forecasting. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6607–6617. PMLR, 09–15 Jun 2019.
- [53] Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. Unified training of universal time series forecasting transformers, 2024.
- [54] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with Auto-Correlation for long-term series forecasting. In Advances in Neural Information Processing Systems, 2021.
- [55] Chin-Chia Michael Yeh, Yan Zhu, Liudmila Ulanova, Nurjahan Begum, Yifei Ding, Hoang Anh Dau, Diego Furtado Silva, Abdullah Mueen, and Eamonn Keogh. Matrix profile i: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 1317–1322. IEEE, 2016.
- [56] Yunhao Zhang and Junchi Yan. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In International Conference on Learning Representations, 2023.
- [57] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference, volume 35, pages 11106–11115. AAAI Press, 2021.
- [58] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proc. 39th International Conference on Machine Learning (ICML 2022), 2022.
A Theoretical Results
A.1 TS-R Problem
In Section 2, we asserted that a two layer transformer architecture can solve th...
Set W2 = c·Φ⊥RΦ⊥. As c → ∞, we have g(f2(f1(x_{L−C+1}))) → Υ.
Proof 1. Realize that with W1, the positional encodings will be ignored and only token embeddings will remain. Moreover, N XtrΦ x_{L−C+1} / ∥x_{L−C+1}∥_{ℓ2} will return a vector v of size L − C with the largest element at j, for index j, corresponding to the matching retrieval motif. As c → ∞, the softmax will...
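The proof fragment above breaks off at the softmax step. As a purely numerical illustration of the standard property it appears to rely on (a softmax applied to scores multiplied by a growing scale c concentrates its weight on the largest score), here is a small sketch. The score vector and the helper name `softmax_with_scale` are made up for the example and are not taken from the paper.

```python
import math
from typing import List, Sequence


def softmax_with_scale(scores: Sequence[float], c: float) -> List[float]:
    """Softmax of c * scores, computed stably by subtracting the maximum."""
    scaled = [c * s for s in scores]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical match scores; index 1 plays the role of the matching motif j.
    scores = [0.2, 0.9, 0.5]
    for c in (1.0, 10.0, 100.0, 1000.0):
        weights = softmax_with_scale(scores, c)
        print(c, [round(w, 6) for w in weights])
```

As c grows, the printed weights approach a one-hot vector at the position of the largest score, which is how a softmax attention layer can act as a hard selector of the best-matching position.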
The data was sourced from the Johns Hopkins repository. The NN5 dataset ([10]) consists of 111 daily time series of cash withdrawals from Automated Teller Machines (ATMs) in the UK, and was utilized in the NN5 forecasting competition.
E.2 Benchmark II Datasets
The Tourism dataset ([10, 2]), derived from a Kaggle competition, includes 366 monthly and 427 quarterly...