Recency Biased Causal Attention for Time-series Forecasting
Pith reviewed 2026-05-23 04:18 UTC · model grok-4.3
The pith
Reweighting attention scores with a smooth heavy-tailed decay adds recency bias to causal Transformers for time-series forecasting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that reweighting attention scores with a smooth heavy-tailed decay introduces recency bias into causal attention, strengthening local temporal dependencies for time-series data while preserving flexibility to model data-specific broader correlations, and that this leads to consistent improvements in sequential modeling and competitive or superior performance on forecasting benchmarks.
What carries the argument
Recency-biased causal attention, which reweights standard attention scores by a smooth heavy-tailed decay function to emphasize recent observations.
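As a rough illustration of the mechanism described above, the following NumPy sketch adds a log-space power-law bias to causal attention scores before the softmax. The function name, the `alpha` exponent, and the specific decay form `(1 + lag)^(-alpha)` are assumptions chosen for illustration. The abstract does not state the paper's exact decay function or whether its parameters are learned.

```python
import numpy as np

def recency_biased_causal_attention(q, k, v, alpha=1.0):
    """Single-head causal attention with a power-law recency prior.

    q, k, v: arrays of shape (T, d). alpha is a hypothetical decay exponent.
    Returns the attended values and the (T, T) attention weights.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)              # key-query overlap, shape (T, T)
    idx = np.arange(T)
    lag = idx[:, None] - idx[None, :]          # lag[i, j] = i - j
    causal = lag >= 0                          # query i may only see keys j <= i
    # Adding -alpha * log(1 + lag) to the logits multiplies the softmax
    # weights by (1 + lag)^(-alpha), a smooth heavy-tailed decay (assumed form).
    bias = -alpha * np.log1p(np.where(causal, lag, 0))
    scores = np.where(causal, scores + bias, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Setting `alpha=0` recovers ordinary causal attention, so in this sketch the decay acts as a prior that sufficiently large learned scores can still override for individual key-query pairs.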
If this is right
- The reweighting consistently improves sequential modeling by aligning attention more closely with the read, ignore, and write operations of RNNs.
- Local temporal dependencies are strengthened while the model retains capacity for broader and data-specific correlations.
- The approach achieves competitive and often superior performance on challenging time-series forecasting benchmarks.
Where Pith is reading between the lines
- If the decay reliably favors recent timesteps, the method could reduce the effective context length needed for accurate forecasts on many datasets.
- The same reweighting might transfer to other causal sequential tasks where local structure dominates but occasional long-range links remain useful.
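The first speculation above can be probed with a small calculation on the decay prior alone. The sketch below assumes the same hypothetical `(1 + lag)^(-alpha)` decay and asks how many of the most recent lags hold a given share of the prior's total weight. It ignores the learned attention scores entirely, so it only bounds what the prior by itself would favor.

```python
import numpy as np

def effective_context(T, alpha, mass=0.9):
    """Smallest number of most-recent lags holding `mass` of the prior weight.

    T is the context length; alpha is a hypothetical power-law exponent.
    """
    lags = np.arange(T)
    w = (1.0 + lags) ** (-alpha)          # assumed decay over lags 0..T-1
    cdf = np.cumsum(w) / w.sum()          # cumulative share, most recent first
    n = int(np.searchsorted(cdf, mass)) + 1
    return min(n, T)

# Compare a steep and a shallow decay over the same context length.
for a in (0.5, 1.0, 2.0):
    print(a, effective_context(512, a))
```

Because the tail is heavy, a shallow exponent spreads the mass over a window that keeps growing with `T`, which is one reason the speculated context reduction may not hold uniformly across datasets.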
Load-bearing premise
That reweighting attention scores with a smooth heavy-tailed decay reliably strengthens local temporal dependencies without introducing new failure modes or requiring task-specific tuning of the decay shape.
What would settle it
A head-to-head comparison on multiple time-series benchmarks where the recency-biased model shows no improvement or degrades performance relative to unmodified causal attention would falsify the central claim.
Original abstract
Recency bias is a useful inductive prior for sequential modeling: it emphasizes nearby observations and can still allow longer-range dependencies. Standard Transformer attention lacks this property, relying on all-to-all interactions that overlook the causal and often local structure of temporal data. We propose a simple mechanism to introduce recency bias by reweighting attention scores with a smooth heavy-tailed decay. This adjustment strengthens local temporal dependencies without sacrificing the flexibility to capture broader and data-specific correlations. We show that recency-biased attention consistently improves sequential modeling, aligning Transformer more closely with the read, ignore, and write operations of RNNs. Finally, we demonstrate that our approach achieves competitive and often superior performance on challenging time-series forecasting benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes recency-biased causal attention for time-series forecasting. Standard Transformer attention is modified by reweighting scores with a smooth heavy-tailed decay to emphasize nearby observations while preserving flexibility for longer-range dependencies. The authors claim this aligns attention more closely with RNN read/ignore/write operations and yields competitive or superior results on challenging forecasting benchmarks.
Significance. If the performance claims hold under rigorous evaluation, the method would supply a lightweight inductive bias that could improve Transformer applicability to temporal data without architectural overhaul or heavy hyperparameter search.
major comments (2)
- [Abstract] The claim of 'consistent improvements' and 'competitive and often superior performance' is asserted without any reported experimental protocol, dataset list, baseline implementations, statistical significance tests, or ablation results, so the central empirical claim cannot be evaluated from the provided text.
- [Abstract] The reweighting rule is introduced as an independent design choice, yet the weakest assumption (that a fixed smooth heavy-tailed decay reliably strengthens local dependencies without new failure modes or task-specific tuning) is left unexamined and unsupported by any analysis or sensitivity study.
minor comments (1)
- [Abstract] The abstract could more precisely specify the functional form of the heavy-tailed decay and whether its parameters are learned or fixed a priori.
Simulated Author's Rebuttal
We thank the referee for their comments. We address each major comment point by point below, drawing on the full manuscript for clarification.
- Referee: [Abstract] The claim of 'consistent improvements' and 'competitive and often superior performance' is asserted without any reported experimental protocol, dataset list, baseline implementations, statistical significance tests, or ablation results, so the central empirical claim cannot be evaluated from the provided text.
  Authors: The abstract is a concise summary. The full manuscript details the experimental protocol, datasets (ETTh1/2, ETTm1/2, Electricity, Traffic, Weather), baselines, statistical tests, and ablations in Section 4 and the appendix. These elements support the abstract claims. We will revise the abstract to briefly note the evaluation on standard forecasting benchmarks.
  Revision: yes
- Referee: [Abstract] The reweighting rule is introduced as an independent design choice, yet the weakest assumption (that a fixed smooth heavy-tailed decay reliably strengthens local dependencies without new failure modes or task-specific tuning) is left unexamined and unsupported by any analysis or sensitivity study.
  Authors: Section 4.3 and the appendix contain sensitivity analysis on the decay parameter together with ablations across horizons and datasets. These show consistent gains from the fixed heavy-tailed decay without task-specific tuning and without introducing new failure modes relative to standard attention. The analysis therefore supports the design choice as presented.
  Revision: no
Circularity Check
No significant circularity identified
The paper introduces recency-biased attention as an explicit design choice (reweighting attention scores with a smooth heavy-tailed decay) presented as an independent inductive prior. No equations, predictions, or performance claims in the abstract or description reduce by construction to fitted parameters, self-referential definitions, or load-bearing self-citations. The central claim of improved performance on benchmarks is framed as an empirical outcome of the proposed mechanism rather than a tautological restatement of inputs. This is the most common honest finding for papers whose core contribution is a modeling heuristic rather than a derived theorem.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Our model is motivated by the observation that many physical systems exhibit heavy-tailed autocorrelations, e.g., the pairwise correlation strength may decay as a power law distribution, as the time delay grows [9]. ... we add a temporally decaying mask to the attention mechanism, specifically to the key-query overlap ... The mask decays attention weights and pairwise dependencies to resemble a power law
- IndisputableMonolith/Foundation/BranchSelection.lean, branch_selection (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Powerformer is powered by various weighting schemes: power-law decays and Butterworth filters. The former resembles naturally occurring time-series
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Neural equilibria for long-term prediction of nonlinear conservation laws
  NeurDE learns the equilibrium closure within a kinetic solver to outperform larger neural models on long-term predictions of nonlinear conservation laws including shocks.
Reference graph
Works this paper leans on
- [1] Md Atik Ahamed and Qiang Cheng. Tscmamba: Mamba meets multi-view learning for time series classification, 2024
- [2] Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, Jasper Zschiegner, Danielle C. Maddix, Hao Wang, Michael W. Mahoney, Kari Torkkola, Andrew Gordon Wilson, Michael Bohlke-Schneider, and Yuyang Wang. Chronos: Learning the langu...
- [3] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. CoRR, abs/1803.01271, 2018
- [4] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, S. Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen A. Creel, Jared Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Ste...
- [5] George E. P. Box and Gwilym M. Jenkins. Time Series Analysis: Forecasting and Control. Holden-Day, 1970
- [6] Stephen Butterworth. On the theory of filter amplifiers. Experimental Wireless and the Wireless Engineer, 7:536–541, 1930
- [7] Cristian Challu, Kin G. Olivares, Boris N. Oreshkin, Federico Garza Ramirez, Max Mergenthaler Canseco, and Artur Dubrawski. Nhits: Neural hierarchical interpolation for time series forecasting. Proceedings of the AAAI Conference on Artificial Intelligence, 37(6):6989–6997, Jun. 2023
- [8] Si-An Chen, Chun-Liang Li, Sercan O Arik, Nathanael Christian Yoder, and Tomas Pfister. TSMixer: An all-MLP architecture for time series forecasting. Transactions on Machine Learning Research, 2023
- [9] Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661–703, 2009
- [10] Abhimanyu Das, Weihao Kong, Andrew Leach, Shaan K Mathur, Rajat Sen, and Rose Yu. Long-term forecasting with tiDE: Time-series dense encoder. Transactions on Machine Learning Research, 2023
- [11] Carson Eisenach, Yagna Patel, and Dhruv Madeka. Mqtransformer: Multi-horizon forecasts with context dependent and feedback-aware attention, 2022
- [12] Jean-Yves Franceschi, Aymeric Dieuleveut, and Martin Jaggi. Unsupervised scalable representation learning for multivariate time series. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019
- [13] Azul Garza, Cristian Challu, and Max Mergenthaler-Canseco. Timegpt-1, 2024
- [14] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2024
- [15] Yangdong He and Jiabao Zhao. Temporal convolutional networks for anomaly detection in time series. Journal of Physics: Conference Series, 1213(4):042050, Jun. 2019
- [16] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, Nov. 1997
- [17] Charles C. Holt. Forecasting Seasonals and Trends by Exponentially Weighted Moving Averages. O.N.R. research memorandum. Defense Technical Information Center, 1957
- [18] R.J. Hyndman and G. Athanasopoulos. Forecasting: principles and practice. OTexts, 2018
- [19] Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In International Conference on Learning Representations, 2020
- [20] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché Buc, Edward A. Fox, and Roman Garnett, editors, Advances in Neural Information Proces...
- [21] Zhe Li, Shiyi Qi, Yiduo Li, and Zenglin Xu. Revisiting long-term time series forecasting: An investigation on linear mapping, 2023
- [22] Minhao Liu, Ailing Zeng, Muxi Chen, Zhijian Xu, Qiuxia Lai, Lingna Ma, and Qiang Xu. SCINet: Time series modeling and forecasting with sample convolution and interaction. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022
- [23] Shizhan Liu, Hang Yu, Cong Liao, Jianguo Li, Weiyao Lin, Alex X. Liu, and Schahram Dustdar. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In International Conference on Learning Representations, 2022
- [24] Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. itransformer: Inverted transformers are effective for time series forecasting. In The Twelfth International Conference on Learning Representations, 2024
- [25] Yong Liu, Guo Qin, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. Autotimes: Autoregressive time series forecasters via large language models, 2024
- [26] Yong Liu, Haixu Wu, Jianmin Wang, and Mingsheng Long. Non-stationary transformers: Exploring the stationarity in time series forecasting. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 9881–9893. Curran Associates, Inc., 2022
- [27] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In The Eleventh International Conference on Learning Representations, 2023
- [28] Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. N-beats: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations, 2020
- [29] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019
- [30] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. Deepar: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020
- [31] Lifeng Shen, Zhuocong Li, and James Kwok. Timeseries anomaly detection using temporal hierarchical one-class network. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 13016–13026. Curran Associates, Inc., 2020
- [32] Sabera Talukder, Yisong Yue, and Georgia Gkioxari. Totem: Tokenized time series embeddings for general time series analysis, 2024
- [33] Sean J. Taylor and Benjamin Letham. Forecasting at scale. The American Statistician, 72(1):37–45, 2018
- [34] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alexander Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. In Arxiv, 2016
- [35] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017
- [36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017
- [37] Zihan Wang, Fanheng Kong, Shi Feng, Ming Wang, Han Zhao, Daling Wang, and Yifei Zhang. Is mamba effective for time series forecasting? ArXiv, abs/2403.11144, 2024
- [38] Peter R. Winters. Forecasting sales by exponentially weighted moving averages. Management Science, 6(3):324–342, 1960
- [39] Gerald Woo, Chenghao Liu, Doyen Sahoo, Akshat Kumar, and Steven Hoi. Etsformer: Exponential smoothing transformers for time-series forecasting, 2022
- [40] Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. Timesnet: Temporal 2d-variation modeling for general time series analysis. In The Eleventh International Conference on Learning Representations, 2023
- [41] Haixu Wu, Jialong Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Flowformer: Linearizing transformers with conservation flows. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning...
- [42] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021
- [43] Jiehui Xu, Haixu Wu, Jianmin Wang, and Mingsheng Long. Anomaly transformer: Time series anomaly detection with association discrepancy. In International Conference on Learning Representations, 2022
- [44] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? Proceedings of the AAAI Conference on Artificial Intelligence, 37(9):11121–11128, Jun. 2023
- [45] Michael Zhang, Khaled Kamal Saab, Michael Poli, Tri Dao, Karan Goel, and Christopher Re. Effectively modeling time series with simple discrete state spaces. In The Eleventh International Conference on Learning Representations, 2023
- [46] Tianping Zhang, Yizhuo Zhang, Wei Cao, Jiang Bian, Xiaohan Yi, Shun Zheng, and Jian Li. Less is more: Fast multivariate time series forecasting with light sampling-oriented mlp structures, 2022
- [47] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. Proceedings of the AAAI Conference on Artificial Intelligence, 35(12):11106–11115, May 2021
- [48] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedi...
- [49] Tian Zhou, Peisong Niu, Xue Wang, Liang Sun, and Rong Jin. One fits all: Power general time series analysis by pretrained LM. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Appendix A Butterworth Filter
The Butterworth filter [6] is often used in signal processing for low-, high-, and band-pass filters. It is designed ...
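For reference, the textbook squared magnitude response of an order-n low-pass Butterworth filter with cutoff frequency \omega_c is given below. This is the standard signal-processing form, not necessarily the parameterization used in the paper, whose appendix text is truncated here. How the filter is mapped onto time lags in the attention mask cannot be recovered from this page.

```latex
|H(j\omega)|^2 = \frac{1}{1 + \left( \omega / \omega_c \right)^{2n}}
```

The response is maximally flat near \omega = 0 and rolls off smoothly beyond \omega_c, with larger n giving a sharper transition.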
These measurements are sampled every hour. We show the pairwise correlation dependence in Fig. A3.
Traffic [42] provides occupancy rates on San Francisco Bay Area freeways from 826 sensors. This data comes from the California Department of Transportation and is sampled hourly. We show the pairwise correlation dependence in Fig. A2.
C Architecture Details...