Retrieval Mechanisms Surpass Long-Context Scaling in Time Series Forecasting
Pith reviewed 2026-05-12 01:20 UTC · model grok-4.3
The pith
On the ETTh1 benchmark, forecasting error rises as the input context grows, while a fixed 720-step window combined with selective retrieval achieves lower error.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central finding is an inverse scaling law: on the ETTh1 dataset, the forecasting error of time series foundation models increases with context length. This contradicts the premise, borrowed from NLP, that more history improves forecast quality in stochastic domains. RAFT counters this by using selective retrieval to inject only the most relevant historical segments as dynamic exogenous variables, yielding better performance with less computation.
What carries the argument
Retrieval-Augmented Forecasting (RAFT), which selects relevant historical segments from past data and feeds them as dynamic exogenous variables to a base model with a fixed context window.
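The retrieval step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the summary does not state which similarity metric RAFT uses, so Pearson correlation over z-normalized windows is an assumption, and the names `znorm`, `retrieve_segments`, and the parameter `k` are hypothetical.

```python
import numpy as np

def znorm(x, eps=1e-8):
    """Z-normalize along the last axis so matching ignores level and scale."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def retrieve_segments(history, query, horizon, k=3):
    """Return the k continuations whose preceding window best matches `query`.

    history: 1-D array of past values (the retrieval pool).
    query:   1-D array holding the most recent window (length w).
    horizon: number of steps that follow each candidate window.
    The similarity metric is an assumption (Pearson correlation).
    """
    w = len(query)
    n_candidates = len(history) - w - horizon + 1
    if n_candidates < k:
        raise ValueError("history too short for the requested k")
    # Candidate windows of length w that still have a full continuation.
    windows = np.lib.stride_tricks.sliding_window_view(history, w)[:n_candidates]
    # Mean product of z-normalized values equals the Pearson correlation.
    scores = znorm(windows) @ znorm(query) / w
    top = np.argsort(scores)[::-1][:k]
    # Continuations following each selected window, shape (k, horizon).
    continuations = np.stack([history[i + w : i + w + horizon] for i in top])
    return top, scores[top], continuations

# Toy usage on a synthetic sine wave (not ETTh1 data).
t = np.arange(600)
series = np.sin(2 * np.pi * t / 24)
top, top_scores, exog = retrieve_segments(series[:500], series[500:548], horizon=12)
```

The returned `exog` array is what a base model would receive as extra covariates alongside its fixed context window; how RAFT actually integrates them is not specified in the text above.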
If this is right
- Models should prioritize selective retrieval over extending context length to handle noise in historical data.
- Attention-based architectures are ill-suited for filtering irrelevant volatility in long time series.
- Retrieval methods can surpass zero-shot foundation models while using fewer resources.
- Incorporating retrieved segments provides an inductive bias that raw long sequences lack.
Where Pith is reading between the lines
- This inverse scaling may appear in other noisy sequential data tasks like natural language with irrelevant passages.
- Hybrid models that combine retrieval with long-context inputs could further improve results.
- The approach might reduce computational costs in deploying forecasting systems.
- Generalization to multivariate or other benchmarks like ETTh2 would strengthen the case.
Load-bearing premise
That the observed performance drop with longer contexts is caused primarily by attention's difficulty in disregarding irrelevant historical noise rather than other factors like model capacity or data characteristics.
What would settle it
Running the same long-context models on additional time series datasets and checking whether error consistently increases as context length grows beyond 720 steps, or comparing RAFT against long-context models at matched computational budgets.
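One way to make that check concrete is a small helper that takes measured test MSE per context length and reports whether error never decreases, along with the relative change against the shortest context. This is a sketch under stated assumptions: the function names are hypothetical, and the numbers in the usage lines are placeholders chosen for illustration, not results from the paper.

```python
def relative_increase(mse_by_context):
    """Relative MSE change of each context length versus the shortest one."""
    base = mse_by_context[min(mse_by_context)]
    return {c: (m - base) / base for c, m in sorted(mse_by_context.items())}

def is_inverse_scaling(mse_by_context):
    """True if MSE never decreases as context length grows."""
    ordered = [m for _, m in sorted(mse_by_context.items())]
    return all(later >= earlier for earlier, later in zip(ordered, ordered[1:]))

# Placeholder values for illustration only (not from the paper).
placeholder = {100: 1.0, 200: 1.5, 400: 2.0}
changes = relative_increase(placeholder)
monotone = is_inverse_scaling(placeholder)
```

Under this reading, a "68% drop" would correspond to a relative increase of 0.68 against the reference context; the paper summary does not say which reference length that figure uses.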
Original abstract
Time Series Foundation Models (TSFMs) have borrowed the long context paradigm from natural language processing under the premise that feeding more history into the model improves forecast quality. But in stochastic domains, distant history is often just high-frequency noise, not signal. Hence, the proposed work tests whether this premise actually holds by running continuous context architectures (PatchTST included) through the ETTh1 benchmark. The obtained results contradict the premise: an inverse scaling law shows up clearly, with forecasting error rising as context gets longer. A 3,000-step window causes performance to drop by over 68%, evidence that attention mechanisms are poor at ignoring irrelevant historical volatility. Retrieval-Augmented Forecasting (RAFT) is evaluated as an alternative. RAFT achieves a mean squared error (MSE) of 0.379 with a fixed 720-step window and selective retrieval, outperforming both long-context configurations and zero-shot foundation models (Chronos, Moirai) despite requiring far less computation. In addition, the retrieval step injects only the most relevant historical segments as dynamic exogenous variables, which gives the model a context-informed inductive bias it cannot build on its own from raw sequences. Therefore, foundation models going forward need to shift architecturally toward selective retrieval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that the long-context paradigm from NLP does not transfer to time series forecasting in stochastic domains, as distant history often constitutes noise rather than signal. Experiments on the ETTh1 benchmark with continuous-context models (including PatchTST) demonstrate an inverse scaling law, with forecasting error rising as context length increases (a performance drop of over 68% at a 3,000-step window). The authors attribute this to attention mechanisms' inability to ignore irrelevant historical volatility. They evaluate Retrieval-Augmented Forecasting (RAFT), which pairs a fixed 720-step window with selective retrieval to inject relevant historical segments as dynamic exogenous variables, achieving an MSE of 0.379 and outperforming both long-context configurations and zero-shot foundation models such as Chronos and Moirai while requiring less computation.
Significance. If the empirical results hold after addressing controls, the work would provide a concrete challenge to the default long-context scaling assumption in time series foundation models and offer retrieval as a computationally lighter alternative that supplies an inductive bias unavailable from raw sequences. The demonstration of inverse scaling on ETTh1 and the reported RAFT performance numbers constitute falsifiable, benchmark-grounded evidence that could influence architectural choices in the field.
major comments (2)
- [Experiments (ETTh1 benchmark)] Experiments section (ETTh1 benchmark with PatchTST and similar models): only attention-based continuous-context architectures are evaluated on long windows. No non-attention baselines (e.g., DLinear, linear models, or MLPs) are reported on the identical long-context inputs. If those models also exhibit performance degradation with increasing context length, the degradation would be a property of the stochastic signal rather than attention, weakening the mechanistic claim that attention is 'poor at ignoring irrelevant historical volatility' and the architectural recommendation for retrieval.
- [Abstract and Results] Abstract and results reporting: concrete performance numbers are given (68% drop at 3,000 steps, RAFT MSE of 0.379) but without accompanying details on training procedures, hyperparameter selection, number of runs, error bars, or ablation studies on the retrieval mechanism itself. These omissions make it difficult to assess whether the inverse scaling and RAFT gains are robust or sensitive to implementation choices.
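The non-attention control requested in the first major comment could look roughly like the sketch below: a closed-form ridge-regularized linear forecaster fitted at several context lengths on the same series. This is an illustrative stand-in, not DLinear and not the authors' protocol. The helper names, the `ridge` value, and the synthetic sine data are assumptions.

```python
import numpy as np

def make_xy(series, context, horizon):
    """Slice a 1-D series into (context window, next-horizon target) pairs."""
    win = np.lib.stride_tricks.sliding_window_view(series, context + horizon)
    return win[:, :context], win[:, context:]

def linear_mse_by_context(train, test, contexts, horizon, ridge=1e-3):
    """Fit a ridge linear forecaster per context length; return test MSE per length."""
    results = {}
    for c in contexts:
        x_tr, y_tr = make_xy(train, c, horizon)
        x_te, y_te = make_xy(test, c, horizon)
        # Closed-form ridge solution: (X^T X + ridge * I)^-1 X^T Y
        gram = x_tr.T @ x_tr + ridge * np.eye(c)
        weights = np.linalg.solve(gram, x_tr.T @ y_tr)
        results[c] = float(np.mean((x_te @ weights - y_te) ** 2))
    return results

# Toy usage on a noiseless sine wave (not ETTh1 data).
t = np.arange(1600)
wave = np.sin(2 * np.pi * t / 24)
baseline = linear_mse_by_context(wave[:1200], wave[1200:], contexts=[24, 48], horizon=12)
```

If a model of this kind also degraded with longer contexts on ETTh1, that would point to the data rather than to attention, which is the distinction the referee asks the authors to test.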
minor comments (2)
- [Method] The paper introduces RAFT as a new method but provides limited description of how the retrieval step is implemented (e.g., similarity metric, database construction, or integration as exogenous variables). Adding a dedicated methods subsection with pseudocode or a diagram would improve clarity.
- Notation for context lengths (e.g., 720-step vs. 3,000-step windows) and performance metrics should be consistently defined early in the paper to avoid ambiguity when comparing configurations.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed comments. These have highlighted important areas for strengthening the experimental controls and reporting in our manuscript. We address each major comment below and commit to revisions that improve the robustness and clarity of the work without altering its core claims.
- Referee: Experiments section (ETTh1 benchmark with PatchTST and similar models): only attention-based continuous-context architectures are evaluated on long windows. No non-attention baselines (e.g., DLinear, linear models, or MLPs) are reported on the identical long-context inputs. If those models also exhibit performance degradation with increasing context length, the degradation would be a property of the stochastic signal rather than attention, weakening the mechanistic claim that attention is 'poor at ignoring irrelevant historical volatility' and the architectural recommendation for retrieval.
  Authors: We acknowledge the value of this control experiment. Our evaluation deliberately targeted attention-based architectures because these form the backbone of the long-context scaling paradigm in current time series foundation models (e.g., PatchTST and similar transformer variants). The inverse scaling we observe therefore directly challenges the assumptions underlying those models. That said, we agree that non-attention baselines would help isolate whether the degradation is architecture-specific or inherent to the stochastic properties of ETTh1. In the revised manuscript we will add results for DLinear and MLP models trained on the identical long-context inputs (up to 3,000 steps) to provide this comparison and refine the mechanistic interpretation.
  revision: yes
- Referee: Abstract and results reporting: concrete performance numbers are given (68% drop at 3,000 steps, RAFT MSE of 0.379) but without accompanying details on training procedures, hyperparameter selection, number of runs, error bars, or ablation studies on the retrieval mechanism itself. These omissions make it difficult to assess whether the inverse scaling and RAFT gains are robust or sensitive to implementation choices.
  Authors: We agree that additional methodological details are essential for reproducibility and for allowing readers to judge robustness. The current manuscript provides the headline numbers but omits the supporting experimental protocol. In the revision we will add a dedicated subsection detailing: (i) the full training procedure and optimizer settings, (ii) the hyperparameter search strategy and ranges, (iii) the number of independent runs (five random seeds), (iv) error bars as standard deviations across runs, and (v) ablation studies on the retrieval mechanism (varying retrieval window size, similarity metric, and number of retrieved segments). These additions will directly address concerns about sensitivity to implementation choices.
  revision: yes
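The error bars promised in items (iii) and (iv) amount to a mean and standard deviation over per-seed test MSE values. A minimal sketch follows. The function name is hypothetical, the use of the sample standard deviation (`ddof=1`) is an assumption since the rebuttal does not say which estimator is meant, and the input values are placeholders rather than reported results.

```python
import numpy as np

def summarize_runs(mse_per_seed):
    """Return (mean, sample standard deviation) of per-seed test MSE values."""
    values = np.asarray(mse_per_seed, dtype=float)
    if values.size < 2:
        raise ValueError("need at least two runs to estimate spread")
    # ddof=1 gives the sample standard deviation (an assumed choice).
    return float(values.mean()), float(values.std(ddof=1))

# Placeholder values for five seeds (not from the paper).
mean_mse, std_mse = summarize_runs([1.0, 2.0, 3.0, 4.0, 5.0])
```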
Circularity Check
No circularity: purely empirical benchmark results
The paper reports direct experimental measurements on the ETTh1 benchmark: forecasting error increases with context length for PatchTST and similar continuous-context models, and RAFT with fixed 720-step retrieval achieves lower MSE than long-context or zero-shot baselines. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing premises, or ansatzes are present in the provided text. All claims reduce to observable benchmark outcomes rather than any internal definitional or self-referential reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Distant history in stochastic time series domains consists primarily of high-frequency noise rather than useful signal.
invented entities (1)
- Retrieval-Augmented Forecasting (RAFT): no independent evidence