Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation
Pith reviewed 2026-05-18 18:53 UTC · model grok-4.3
The pith
Attention as robust state estimation cuts perplexity vs RoPE
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Each token is treated as a noisy observation of a latent trajectory governed by a linear stochastic differential equation (SDE). Attention weights are determined by consistency under this model rather than by static feature similarity. Under isotropic noise and decay assumptions, Robust Filter Attention (RFA) matches the computational complexity of standard attention. On language modeling benchmarks it achieves lower perplexity than RoPE within the training window and remains stable under zero-shot extrapolation to longer contexts. The framework also gives positional mechanisms a dynamical interpretation, tying rotational embeddings and recency biases to transport and uncertainty propagation.
What carries the argument
The key mechanism is precision-weighted state estimation, in which attention weights reflect how consistent each token is with the linear SDE model of token trajectories.
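As a hedged illustration of what "precision-weighted" can mean, consider a scalar Ornstein-Uhlenbeck process, one simple instance of a linear SDE with isotropic noise and decay. The symbols a, σ, r and the specific form below are assumptions chosen for illustration; they are not the paper's equations.

```latex
\begin{aligned}
dx_t &= -a\,x_t\,dt + \sigma\,dW_t, \qquad z_j = x_{t_j} + v_j,\quad v_j \sim \mathcal{N}(0, r),\\
\hat{x}_{i\mid j} &= e^{-a\Delta_{ij}}\,z_j, \qquad \Delta_{ij} = t_i - t_j \ge 0,\\
s_{ij}^2 &= e^{-2a\Delta_{ij}}\,r + \frac{\sigma^2}{2a}\bigl(1 - e^{-2a\Delta_{ij}}\bigr),\\
\hat{x}_i &= \frac{\sum_{j \le i} s_{ij}^{-2}\,\hat{x}_{i\mid j}}{\sum_{j \le i} s_{ij}^{-2}}.
\end{aligned}
```

Here s_ij² is the error variance of the estimate propagated from position j to position i, so older observations receive lower precision whenever r < σ²/(2a). The last line treats the propagated estimates as independent, which is a simplification and not the optimal filter. Any residual-based consistency term is left out of this sketch.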
If this is right
- RFA achieves lower perplexity than RoPE within the training window on language modeling benchmarks.
- RFA remains stable under zero-shot extrapolation to longer contexts.
- The framework provides a dynamical interpretation of standard positional mechanisms such as rotational embeddings.
- Recency biases connect to uncertainty propagation induced by stochastic dynamics.
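The link between recency bias and uncertainty propagation can be sketched numerically. The snippet below assumes a scalar Ornstein-Uhlenbeck model (dx = -a x dt + sigma dW, observation noise variance r) and inverse-variance weighting; the function names and parameter values are hypothetical and are not taken from the paper.

```python
import numpy as np


def propagated_variance(lag, a=0.5, sigma=1.0, r=0.1):
    """Error variance of predicting the state `lag` time units ahead
    from a single noisy observation (assumed scalar OU model)."""
    lag = np.asarray(lag, dtype=float)
    decay = np.exp(-2.0 * a * lag)
    # Decayed observation noise plus process noise accumulated over the lag.
    return decay * r + sigma**2 * (1.0 - decay) / (2.0 * a)


def recency_weights(n, **kwargs):
    """Normalized precision weights for a query at position n-1 over keys 0..n-1."""
    lags = (n - 1) - np.arange(n)
    precision = 1.0 / propagated_variance(lags, **kwargs)
    return precision / precision.sum()


weights = recency_weights(8)
print(weights)
```

With these illustrative defaults the stationary variance sigma²/(2a) exceeds r, so the propagated variance grows with lag and the weights rise monotonically toward the most recent position. No explicit positional bias term is involved; the recency preference comes only from the growth of uncertainty.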
Where Pith is reading between the lines
- This formulation could inspire attention variants that use more advanced SDE models to capture complex dependencies in sequences.
- Applying the state estimation view to other domains like time series or graph data might yield similar robustness benefits.
- Testing RFA on tasks requiring very long contexts could reveal if the stability advantage scales further.
Load-bearing premise
The approach depends on isotropic noise and decay assumptions. These are what allow RFA to match the computational complexity of standard attention while still setting weights by consistency with the model.
What would settle it
Running the same language modeling experiments without the isotropic noise assumption and checking if perplexity rises or extrapolation fails would test the claim.
Original abstract
We introduce Robust Filter Attention (RFA), a formulation of self-attention as a robust state estimator. Each token is treated as a noisy observation of a latent trajectory governed by a linear stochastic differential equation (SDE), and attention weights are determined by consistency under this model rather than static feature similarity. Under isotropic noise and decay assumptions, RFA matches the computational complexity of standard attention. On language modeling benchmarks, RFA achieves lower perplexity than RoPE within the training window while remaining stable under zero-shot extrapolation to longer contexts. The framework also provides a dynamical interpretation of standard positional mechanisms, connecting rotational embeddings and recency biases to transport and uncertainty propagation induced by stochastic dynamics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Robust Filter Attention (RFA), a formulation of self-attention as precision-weighted state estimation. Tokens are modeled as noisy observations of a latent trajectory governed by a linear SDE; attention weights are derived from consistency under this model rather than static similarity. Under isotropic noise and exponential decay assumptions, RFA matches the O(n²) complexity of standard attention. On language modeling benchmarks, RFA reports lower perplexity than RoPE within the training window and improved stability under zero-shot extrapolation to longer contexts. The work also supplies a dynamical interpretation of positional mechanisms such as rotational embeddings and recency biases in terms of transport and uncertainty propagation.
Significance. If the derivation and empirical results hold under the stated assumptions, the paper supplies a principled dynamical-systems view that could unify disparate positional encodings and motivate new attention variants with stronger extrapolation properties. The explicit link to state estimation offers a route for theoretical analysis of attention stability. The reported perplexity gains and zero-shot robustness, if reproducible across scales and tasks, would be of interest to researchers seeking interpretable and robust transformer components.
major comments (3)
- [§2.3, Eq. (12)] The equivalence to standard attention complexity is shown only after imposing isotropic noise and a specific exponential decay form; the manuscript does not demonstrate that the resulting precision-weighted estimator retains O(n²) cost or the claimed robustness when these assumptions are relaxed to accommodate the anisotropic correlations and discrete jumps present in token sequences.
- [§4.2, Table 1] The perplexity improvements over RoPE are presented without an ablation that isolates the contribution of the SDE-derived weighting from other implementation choices (e.g., initialization, optimizer settings). It is therefore unclear whether the gains can be attributed to the dynamical interpretation rather than hyper-parameter differences.
- [§3.1] The claim that RFA remains stable under zero-shot length extrapolation rests on the decay assumption propagating uncertainty correctly; the paper should supply a concrete counter-example or sensitivity test showing behavior when the linear SDE observation model is violated by structured language dependencies.
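To make the complexity point in the first comment concrete, here is a minimal sketch of why an isotropic precision keeps the pairwise cost at the level of standard attention. It assumes Gaussian-style weights of the form exp(-p_ij ||q_i - k_j||² / 2) with a scalar precision p_ij per pair; this form, and the function names, are illustrative assumptions and not the manuscript's Eq. (12).

```python
import numpy as np


def pairwise_sq_residuals(q, k):
    """Squared residuals ||q_i - k_j||^2 for all pairs using one matrix product."""
    q_norm = np.sum(q * q, axis=1, keepdims=True)    # shape (n, 1)
    k_norm = np.sum(k * k, axis=1, keepdims=True).T  # shape (1, n)
    return q_norm + k_norm - 2.0 * (q @ k.T)         # shape (n, n)


def isotropic_precision_attention(q, k, v, precision):
    """Attention with weights from precision-scaled squared residuals.

    `precision` is an (n, n) array of scalar precisions (assumed given).
    """
    logits = -0.5 * precision * pairwise_sq_residuals(q, k)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w = w / w.sum(axis=1, keepdims=True)                 # rows sum to one
    return w @ v


rng = np.random.default_rng(0)
q, k = rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
v = rng.normal(size=(5, 3))
out = isotropic_precision_attention(q, k, v, np.ones((5, 5)))
```

Because the precision is a scalar per pair, the Mahalanobis residual collapses to a scaled Euclidean distance that a single n × n × d matrix product can supply. With a dense d × d precision matrix per pair, each residual would instead need a full quadratic form, which costs on the order of d² per pair even when the matrices are already available.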
minor comments (2)
- [§2.1] The notation for the precision matrix in §2.1 could be clarified with an explicit definition of how it is computed from the SDE parameters to avoid ambiguity with standard attention scaling.
- [Figure 3] Figure 3 caption should state the number of random seeds and report standard deviation for the extrapolation curves.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate planned revisions to improve the manuscript.
- Referee: [§2.3, Eq. (12)] The equivalence to standard attention complexity is shown only after imposing isotropic noise and a specific exponential decay form; the manuscript does not demonstrate that the resulting precision-weighted estimator retains O(n²) cost or the claimed robustness when these assumptions are relaxed to accommodate the anisotropic correlations and discrete jumps present in token sequences.
  Authors: We agree that the O(n²) complexity and robustness properties are derived specifically under the isotropic noise and exponential decay assumptions stated in §2.3. These choices enable an efficient, closed-form implementation while linking attention to state estimation. In the revised manuscript we will expand the discussion following Eq. (12) to explicitly note the limitations under relaxed assumptions, including potential complexity increases for anisotropic noise and the handling of discrete jumps, and we will frame the current formulation as a tractable baseline for such extensions.
  Revision: yes
- Referee: [§4.2, Table 1] The perplexity improvements over RoPE are presented without an ablation that isolates the contribution of the SDE-derived weighting from other implementation choices (e.g., initialization, optimizer settings). It is therefore unclear whether the gains can be attributed to the dynamical interpretation rather than hyper-parameter differences.
  Authors: The referee correctly identifies the absence of targeted ablations. Although we matched hyper-parameters across models, we did not isolate the effect of the precision-weighted estimator from other implementation details. We will add ablation experiments in the revised §4.2 that systematically vary initialization, optimizer settings, and a non-dynamical weighting baseline, allowing clearer attribution of the reported perplexity gains.
  Revision: yes
- Referee: [§3.1] The claim that RFA remains stable under zero-shot length extrapolation rests on the decay assumption propagating uncertainty correctly; the paper should supply a concrete counter-example or sensitivity test showing behavior when the linear SDE observation model is violated by structured language dependencies.
  Authors: We acknowledge that the stability claim relies on the decay assumption and that a systematic sensitivity analysis for violations by structured language dependencies would strengthen the work. Our zero-shot extrapolation results on the evaluated benchmarks provide supporting empirical evidence. A comprehensive counter-example study across all possible dependency structures lies beyond the present scope; we will add a limitations paragraph in §3.1 discussing this point and suggesting directions for future sensitivity tests.
  Revision: partial
- A full sensitivity analysis or concrete counter-examples demonstrating RFA behavior when the linear SDE observation model is violated by arbitrary structured language dependencies.
Circularity Check
Derivation from linear SDE model is self-contained with explicit assumptions; no circular reduction
The paper derives Robust Filter Attention by modeling each token as a noisy observation of a latent trajectory governed by a linear SDE, with attention weights obtained from consistency under that model. The match to standard attention complexity is obtained only after imposing the stated isotropic noise and decay assumptions, which are presented as modeling choices rather than fitted quantities or self-definitions. No equations reduce by construction to the target performance metrics, no load-bearing self-citations are invoked to justify uniqueness, and no ansatz is smuggled via prior work. The reported perplexity and extrapolation results are therefore empirical outcomes of the formulation, not tautological re-statements of its inputs. The derivation chain remains independent of the final benchmark numbers.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Each token is a noisy observation of a latent trajectory governed by a linear stochastic differential equation.
- Ad hoc to paper: Isotropic noise and decay assumptions hold.