Retrieval Mechanisms Surpass Long-Context Scaling in Time Series Forecasting

Kumar Prateek; Rishi Ahuja; Simranjit Singh; Vijay Kumar

arxiv: 2605.08217 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.IR

Pith reviewed 2026-05-12 01:20 UTC · model grok-4.3

keywords: time series forecasting · retrieval augmented forecasting · inverse scaling law · long context models · foundation models · ETTh1 benchmark · attention mechanisms

The pith

In stochastic time series forecasting, longer input contexts increase prediction error while selective retrieval from shorter windows reduces it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the assumption that more historical data always improves time series forecasts by testing long-context models on the ETTh1 benchmark. It finds that forecasting error rises as context length grows, with a 3000-step window causing over 68% worse performance. This occurs because attention mechanisms fail to ignore irrelevant noise in distant history. Instead, Retrieval-Augmented Forecasting (RAFT) uses a fixed 720-step window and selectively retrieves relevant past segments, achieving an MSE of 0.379 that beats both long-context setups and zero-shot models like Chronos and Moirai. The retrieval provides an inductive bias by treating relevant history as dynamic exogenous inputs, which raw sequences cannot supply.
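
One way to make the "over 68% worse" figure concrete is to read it as a relative increase in MSE against a reference context length. The summary does not state which reference configuration or error metric the percentage is measured against, so the symbols below (the reference length $L_0$, the number of evaluation windows $N$, and the forecast horizon $H$) are assumptions for illustration, not the paper's definitions.

```latex
\mathrm{MSE}(L) = \frac{1}{N H} \sum_{n=1}^{N} \sum_{h=1}^{H}
  \left( y_{n,h} - \hat{y}_{n,h}^{(L)} \right)^{2},
\qquad
\Delta_{\mathrm{rel}}(L) = \frac{\mathrm{MSE}(L) - \mathrm{MSE}(L_0)}{\mathrm{MSE}(L_0)}
```

Under this reading, $\Delta_{\mathrm{rel}}(3000) > 0.68$ would mean the 3,000-step configuration has an MSE more than 1.68 times that of the reference configuration.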

Core claim

The central discovery is that time series foundation models exhibit an inverse scaling law where forecasting error increases with longer context lengths on the ETTh1 dataset, contradicting the premise borrowed from NLP that more history enhances forecast quality in stochastic domains. RAFT counters this by employing selective retrieval to inject only the most relevant historical segments as dynamic exogenous variables, yielding superior performance with less computation.

What carries the argument

Retrieval-Augmented Forecasting (RAFT), which selects relevant historical segments from past data and feeds them as dynamic exogenous variables to a base model with a fixed context window.
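
The retrieval step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the source does not specify the similarity metric, the number of retrieved segments, or how they are combined, so Pearson correlation, `top_k`, and simple averaging are placeholders, and the function names `pearson` and `retrieve_exogenous` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def retrieve_exogenous(
    history: Sequence[float],
    query: Sequence[float],
    horizon: int,
    top_k: int = 3,
) -> Tuple[List[float], List[int]]:
    """Return (exogenous values, start indices of the matched windows).

    Assumed scheme: slide a window of len(query) over `history`, score each
    window against `query`, keep the `top_k` best matches, and average the
    `horizon` values that followed each match.
    """
    w = len(query)
    scored = []
    # Only windows whose follow-up segment lies fully inside `history` qualify.
    for start in range(len(history) - w - horizon + 1):
        scored.append((pearson(history[start:start + w], query), start))
    # Stable sort: windows with equal scores keep their chronological order.
    scored.sort(key=lambda item: item[0], reverse=True)
    starts = [start for _, start in scored[:top_k]]
    if not starts:
        return [0.0] * horizon, []
    follow_ups = [history[s + w:s + w + horizon] for s in starts]
    exogenous = [sum(step) / len(step) for step in zip(*follow_ups)]
    return exogenous, starts


# Toy usage on a synthetic period-4 series (not ETTh1 data).
history = [float(i % 4) for i in range(40)]
query = [0.0, 1.0, 2.0, 3.0]
exogenous, starts = retrieve_exogenous(history, query, horizon=2, top_k=3)
print(exogenous, starts)  # [0.0, 1.0] [0, 4, 8]
```

In a forecasting pipeline the returned values would be passed to the base model alongside its fixed context window; `history` should end before the query window begins so that the retrieved follow-ups never overlap the forecast target.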

If this is right

Models should prioritize selective retrieval over extending context length to handle noise in historical data.
Attention-based architectures are ill-suited for filtering irrelevant volatility in long time series.
Zero-shot foundation models can be surpassed by retrieval methods despite using fewer resources.
Incorporating retrieved segments provides an inductive bias that raw long sequences lack.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This inverse scaling may appear in other noisy sequential data tasks like natural language with irrelevant passages.
Hybrid models combining retrieval with long-context could further improve results.
The approach might reduce computational costs in deploying forecasting systems.
Generalization to multivariate or other benchmarks like ETTh2 would strengthen the case.

Load-bearing premise

That the observed performance drop with longer contexts is caused primarily by attention's difficulty in disregarding irrelevant historical noise rather than other factors like model capacity or data characteristics.

What would settle it

Running the same long-context models on additional time series datasets and checking if error consistently increases with context length beyond 720 steps, or testing RAFT against long-context at matched computational budgets.
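
A context-length sweep of the kind proposed here could be organized as in the sketch below. It is an assumption-laden illustration: `mean_forecaster` is a trivial placeholder standing in for a real model, `sweep_context_lengths` is a hypothetical helper, and the step-change series is synthetic, so the numbers say nothing about ETTh1. The point it shows is the protocol: every context length is scored on the same forecast origins, so differences in MSE come from the context alone.

```python
from typing import Callable, Dict, List, Sequence

Forecaster = Callable[[Sequence[float], int], List[float]]


def mean_forecaster(context: Sequence[float], horizon: int) -> List[float]:
    """Placeholder model: repeats the context mean for every future step."""
    mean = sum(context) / len(context)
    return [mean] * horizon


def sweep_context_lengths(
    series: Sequence[float],
    context_lengths: Sequence[int],
    horizon: int,
    forecaster: Forecaster,
    stride: int = 1,
) -> Dict[int, float]:
    """Return MSE per context length using identical forecast origins."""
    # Start at the longest context so every configuration sees the same targets.
    origins = range(max(context_lengths), len(series) - horizon + 1, stride)
    if len(origins) == 0:
        raise ValueError("series is too short for the longest context and horizon")
    results = {}
    for ctx in context_lengths:
        squared_error, count = 0.0, 0
        for t in origins:
            prediction = forecaster(series[t - ctx:t], horizon)
            target = series[t:t + horizon]
            squared_error += sum((p - y) ** 2 for p, y in zip(prediction, target))
            count += horizon
        results[ctx] = squared_error / count
    return results


# Synthetic series with one level shift: older history is stale after the shift.
series = [0.0] * 10 + [1.0] * 10
results = sweep_context_lengths(series, [2, 8], horizon=1, forecaster=mean_forecaster)
for ctx, mse in sorted(results.items()):
    print(f"context={ctx:>2}  MSE={mse:.4f}")
```

On this toy series the longer context scores worse simply because the placeholder averages over stale values. Swapping in attention-based and non-attention forecasters under the same harness is what would separate an architectural effect from a property of the data.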

Figures

Figures reproduced from arXiv: 2605.08217 by Kumar Prateek, Rishi Ahuja, Simranjit Singh, Vijay Kumar.

Original abstract

Time Series Foundation Models (TSFMs) have borrowed the long context paradigm from natural language processing under the premise that feeding more history into the model improves forecast quality. But in stochastic domains, distant history is often just high-frequency noise, not signal. Hence, the proposed work tests whether this premise actually holds by running continuous context architectures (PatchTST included) through the ETTh1 benchmark. The obtained results contradict the premise: an inverse scaling law shows up clearly, with forecasting error rising as context gets longer. A 3,000-step window causes performance to drop by over 68%, evidence that attention mechanisms are poor at ignoring irrelevant historical volatility. Retrieval-Augmented Forecasting (RAFT) is evaluated as an alternative. RAFT achieves a mean squared error (MSE) of 0.379 with a fixed 720-step window and selective retrieval, outperforming both long-context configurations and zero-shot foundation models (Chronos, Moirai) despite requiring far less computation. In addition, the retrieval step injects only the most relevant historical segments as dynamic exogenous variables, which gives the model a context-informed inductive bias it cannot build on its own from raw sequences. Therefore, foundation models going forward need to shift architecturally toward selective retrieval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows clear inverse scaling with context length on ETTh1 for attention-based models and RAFT beats them, but needs non-attention baselines to pin down why.

The main thing to know is that this work finds an inverse scaling law in time series forecasting on the ETTh1 benchmark: performance gets worse as the input context grows longer, with a 68% drop at 3000 steps for models like PatchTST. Their Retrieval-Augmented Forecasting (RAFT) approach, which pulls relevant past segments instead of using everything, achieves an MSE of 0.379 using only a 720-step window and beats both the long-context versions and some zero-shot foundation models like Chronos and Moirai.

What stands out is the empirical demonstration that more history isn't always better in noisy stochastic series, and the idea of using retrieval to inject an inductive bias. The numbers are concrete, and it highlights a potential efficiency win since RAFT uses less computation.

The soft spot is in the interpretation. The authors link the degradation to attention mechanisms struggling with irrelevant volatility, but the experiments stick to attention-based continuous-context models. Without running the same long windows through non-attention forecasters like DLinear or simple linear models, it's possible the drop comes from the signal itself rather than the architecture. That would change the takeaway from 'switch to retrieval' to 'be careful with long noisy contexts in general.' The full paper should clarify the baselines and ablations to make this stick.

This paper is for people building or evaluating time series foundation models who are thinking about context length and retrieval. A reader interested in scaling behaviors or alternatives to pure long-context transformers would get value from the benchmark results. It deserves a serious referee because the core observation is falsifiable and challenges a common assumption, even if the mechanistic story needs more controls. I'd recommend sending it to peer review with requests for those additional baselines.

Referee Report

2 major / 2 minor

Summary. The paper argues that the long-context paradigm from NLP does not transfer to time series forecasting in stochastic domains, as distant history often constitutes noise rather than signal. Experiments on the ETTh1 benchmark with continuous-context models (including PatchTST) demonstrate an inverse scaling law, with forecasting error rising as context length increases (a performance drop of over 68% at a 3,000-step window). The authors attribute this to attention mechanisms' inability to ignore irrelevant historical volatility. They evaluate Retrieval-Augmented Forecasting (RAFT) as an alternative. RAFT uses a fixed 720-step window with selective retrieval to inject relevant historical segments as dynamic exogenous variables, achieving an MSE of 0.379 and outperforming both long-context configurations and zero-shot foundation models such as Chronos and Moirai while requiring less computation.

Significance. If the empirical results hold after addressing controls, the work would provide a concrete challenge to the default long-context scaling assumption in time series foundation models and offer retrieval as a computationally lighter alternative that supplies an inductive bias unavailable from raw sequences. The demonstration of inverse scaling on ETTh1 and the reported RAFT performance numbers constitute falsifiable, benchmark-grounded evidence that could influence architectural choices in the field.

major comments (2)

[Experiments (ETTh1 benchmark)] Experiments section (ETTh1 benchmark with PatchTST and similar models): only attention-based continuous-context architectures are evaluated on long windows. No non-attention baselines (e.g., DLinear, linear models, or MLPs) are reported on the identical long-context inputs. If those models also exhibit performance degradation with increasing context length, the degradation would be a property of the stochastic signal rather than attention, weakening the mechanistic claim that attention is 'poor at ignoring irrelevant historical volatility' and the architectural recommendation for retrieval.
[Abstract and Results] Abstract and results reporting: concrete performance numbers are given (68% drop at 3,000 steps, RAFT MSE of 0.379) but without accompanying details on training procedures, hyperparameter selection, number of runs, error bars, or ablation studies on the retrieval mechanism itself. These omissions make it difficult to assess whether the inverse scaling and RAFT gains are robust or sensitive to implementation choices.

minor comments (2)

[Method] The paper evaluates RAFT but provides limited description of how the retrieval step is implemented (e.g., similarity metric, database construction, or integration as exogenous variables). Adding a dedicated methods subsection with pseudocode or a diagram would improve clarity.
Notation for context lengths (e.g., 720-step vs. 3,000-step windows) and performance metrics should be consistently defined early in the paper to avoid ambiguity when comparing configurations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments. These have highlighted important areas for strengthening the experimental controls and reporting in our manuscript. We address each major comment below and commit to revisions that improve the robustness and clarity of the work without altering its core claims.

Referee: Experiments section (ETTh1 benchmark with PatchTST and similar models): only attention-based continuous-context architectures are evaluated on long windows. No non-attention baselines (e.g., DLinear, linear models, or MLPs) are reported on the identical long-context inputs. If those models also exhibit performance degradation with increasing context length, the degradation would be a property of the stochastic signal rather than attention, weakening the mechanistic claim that attention is 'poor at ignoring irrelevant historical volatility' and the architectural recommendation for retrieval.

Authors: We acknowledge the value of this control experiment. Our evaluation deliberately targeted attention-based architectures because these form the backbone of the long-context scaling paradigm in current time series foundation models (e.g., PatchTST and similar transformer variants). The inverse scaling we observe therefore directly challenges the assumptions underlying those models. That said, we agree that non-attention baselines would help isolate whether the degradation is architecture-specific or inherent to the stochastic properties of ETTh1. In the revised manuscript we will add results for DLinear and MLP models trained on the identical long-context inputs (up to 3000 steps) to provide this comparison and refine the mechanistic interpretation.

Revision: yes

Referee: Abstract and results reporting: concrete performance numbers are given (68% drop at 3,000 steps, RAFT MSE of 0.379) but without accompanying details on training procedures, hyperparameter selection, number of runs, error bars, or ablation studies on the retrieval mechanism itself. These omissions make it difficult to assess whether the inverse scaling and RAFT gains are robust or sensitive to implementation choices.

Authors: We agree that additional methodological details are essential for reproducibility and for allowing readers to judge robustness. The current manuscript provides the headline numbers but omits the supporting experimental protocol. In the revision we will add a dedicated subsection detailing: (i) the full training procedure and optimizer settings, (ii) the hyperparameter search strategy and ranges, (iii) the number of independent runs (five random seeds), (iv) error bars as standard deviations across runs, and (v) ablation studies on the retrieval mechanism (varying retrieval window size, similarity metric, and number of retrieved segments). These additions will directly address concerns about sensitivity to implementation choices.

Revision: yes
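
The reporting promised in items (iii) and (iv) amounts to a mean and a standard deviation over per-seed MSE values. The sketch below shows one way to compute them. The helper name `summarize_runs` is hypothetical, the choice of the sample standard deviation (n - 1 denominator) is an assumption since the rebuttal does not say which estimator is meant, and the five input values are arbitrary placeholders, not results from the paper.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(mse_per_seed: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-seed MSE values."""
    if len(mse_per_seed) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    # statistics.stdev uses the n - 1 denominator (sample standard deviation).
    return statistics.mean(mse_per_seed), statistics.stdev(mse_per_seed)


# Arbitrary placeholder values for five seeds; NOT measurements from the paper.
placeholder_mses = [0.1, 0.2, 0.3, 0.4, 0.5]
mean_mse, std_mse = summarize_runs(placeholder_mses)
report = f"{mean_mse:.3f} +/- {std_mse:.3f}"
print(report)  # 0.300 +/- 0.158
```

Reporting both numbers for each context length and for RAFT lets a reader judge whether a gap between configurations is larger than the seed-to-seed spread.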

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark results

The paper reports direct experimental measurements on the ETTh1 benchmark: forecasting error increases with context length for PatchTST and similar continuous-context models, and RAFT with fixed 720-step retrieval achieves lower MSE than long-context or zero-shot baselines. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing premises, or ansatzes are present in the provided text. All claims reduce to observable benchmark outcomes rather than any internal definitional or self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the ETTh1 benchmark being representative of stochastic domains and on the interpretation that selective retrieval supplies an inductive bias unavailable from raw long sequences.

axioms (1)

Domain assumption: Distant history in stochastic time series domains consists primarily of high-frequency noise rather than useful signal.
This premise is used to explain why long contexts degrade performance and to motivate retrieval.

invented entities (1)

Retrieval-Augmented Forecasting (RAFT): no independent evidence
Purpose: To inject only the most relevant historical segments as dynamic exogenous variables for improved inductive bias.
Introduced in the abstract as the proposed architectural alternative to long-context models.

pith-pipeline@v0.9.0 · 5526 in / 1475 out tokens · 63174 ms · 2026-05-12T01:20:13.441705+00:00 · methodology

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages

[1] Unified Training of Universal Time Series Forecasting Transformers. Forty-first International Conference on Machine Learning.
[2] Chronos: Learning the Language of Time Series. Transactions on Machine Learning Research.
[3] Context is Key: A Benchmark for Forecasting with Essential Textual Information. Proceedings of the 42nd International Conference on Machine Learning, 2025.
[4] ChronosX: Adapting Pretrained Time Series Models with Exogenous Variables. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, 2025.
[5] A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. International Conference on Learning Representations.
[6] Retrieval Augmented Time Series Forecasting. Proceedings of the 42nd International Conference on Machine Learning, 2025.
[7] Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. AAAI Conference on Artificial Intelligence.
[8] Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval.
