pith. sign in

arxiv: 2509.04154 · v5 · submitted 2025-09-04 · 💻 cs.LG · cs.AI

Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation

Pith reviewed 2026-05-18 18:53 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords robust filter attentionself-attentionstate estimationstochastic differential equationlanguage modelingperplexityextrapolationpositional embeddings
0
0 comments X p. Extension

The pith

Attention as robust state estimation cuts perplexity vs RoPE

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Robust Filter Attention as a way to view self-attention through the lens of state estimation. Tokens are seen as noisy measurements of a hidden path evolving according to a linear stochastic differential equation. Weights come from how well each token fits the expected dynamics and uncertainty, instead of just matching features. This keeps the same computational cost as regular attention under basic assumptions about noise. Experiments show it gets better language modeling scores than standard positional methods and handles much longer sequences without retraining.

Core claim

Each token is treated as a noisy observation of a latent trajectory governed by a linear stochastic differential equation. Attention weights are determined by consistency under this model rather than static feature similarity. Under isotropic noise and decay assumptions, RFA matches the computational complexity of standard attention while achieving lower perplexity than RoPE on language modeling benchmarks and remaining stable for zero-shot longer contexts. It also interprets positional mechanisms dynamically through transport and uncertainty propagation.

What carries the argument

The key mechanism is precision-weighted state estimation where attention weights reflect consistency with the linear SDE model of token trajectories.

If this is right

  • RFA achieves lower perplexity than RoPE within the training window on language modeling benchmarks.
  • RFA remains stable under zero-shot extrapolation to longer contexts.
  • The framework provides a dynamical interpretation of standard positional mechanisms such as rotational embeddings.
  • Recency biases connect to uncertainty propagation induced by stochastic dynamics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This formulation could inspire attention variants that use more advanced SDE models to capture complex dependencies in sequences.
  • Applying the state estimation view to other domains like time series or graph data might yield similar robustness benefits.
  • Testing RFA on tasks requiring very long contexts could reveal if the stability advantage scales further.

Load-bearing premise

The approach depends on isotropic noise and decay assumptions that allow matching the speed of standard attention while setting weights by model consistency.

What would settle it

Running the same language modeling experiments without the isotropic noise assumption and checking if perplexity rises or extrapolation fails would test the claim.

Figures

Figures reproduced from arXiv: 2509.04154 by Peter Racioppo.

Figure 1
Figure 1. Figure 1: Filter performance on different 2D systems: ground-truth trajectory (black), measured (blue), and predicted (red). (a) system with only measurement noise (σ 2 = 0.0, η2 = 1.0). (b) system with both process noise and measurement noise (σ 2 = 0.3, η2 = 0.5). (c) higher noise (σ 2 = 0.5, η2 = 2.0). To check that the model has learned the right dynamics, we compute "pulled-forward" estimates at four time point… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of an AFA layer’s "pulled-forward" state estimates at different stages of training. The true trajectory is shown as a solid black line, and the pulled￾forward estimates as colored point clouds. (a) State estimates early in training. (b) State estimates midway through training. (c) State estimates after training is complete. and interpretable attention matrices of AFA. We compare the attention ma… view at source ↗
Figure 3
Figure 3. Figure 3: Attention matrices produced by training standard attention and AFA on a 2D LTI with process and measurement noise. (a) First layer of standard attention (b) Second layer of standard attention. (c) Single layer of AFA. These experiments, along with animations of the predicted trajectories, pulled-forward estimates, and attention matrices during the course of training are available at the same Github link pr… view at source ↗
read the original abstract

We introduce Robust Filter Attention (RFA), a formulation of self-attention as a robust state estimator. Each token is treated as a noisy observation of a latent trajectory governed by a linear stochastic differential equation (SDE), and attention weights are determined by consistency under this model rather than static feature similarity. Under isotropic noise and decay assumptions, RFA matches the computational complexity of standard attention. On language modeling benchmarks, RFA achieves lower perplexity than RoPE within the training window while remaining stable under zero-shot extrapolation to longer contexts. The framework also provides a dynamical interpretation of standard positional mechanisms, connecting rotational embeddings and recency biases to transport and uncertainty propagation induced by stochastic dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Robust Filter Attention (RFA), a formulation of self-attention as precision-weighted state estimation. Tokens are modeled as noisy observations of a latent trajectory governed by a linear SDE; attention weights are derived from consistency under this model rather than static similarity. Under isotropic noise and exponential decay assumptions, RFA matches the O(n²) complexity of standard attention. On language modeling benchmarks, RFA reports lower perplexity than RoPE within the training window and improved stability under zero-shot extrapolation to longer contexts. The work also supplies a dynamical interpretation of positional mechanisms such as rotational embeddings and recency biases in terms of transport and uncertainty propagation.

Significance. If the derivation and empirical results hold under the stated assumptions, the paper supplies a principled dynamical-systems view that could unify disparate positional encodings and motivate new attention variants with stronger extrapolation properties. The explicit link to state estimation offers a route for theoretical analysis of attention stability. The reported perplexity gains and zero-shot robustness, if reproducible across scales and tasks, would be of interest to researchers seeking interpretable and robust transformer components.

major comments (3)
  1. [§2.3, Eq. (12)] §2.3, Eq. (12): The equivalence to standard attention complexity is shown only after imposing isotropic noise and a specific exponential decay form; the manuscript does not demonstrate that the resulting precision-weighted estimator retains O(n²) cost or the claimed robustness when these assumptions are relaxed to accommodate the anisotropic correlations and discrete jumps present in token sequences.
  2. [§4.2, Table 1] §4.2, Table 1: The perplexity improvements over RoPE are presented without an ablation that isolates the contribution of the SDE-derived weighting from other implementation choices (e.g., initialization, optimizer settings). It is therefore unclear whether the gains can be attributed to the dynamical interpretation rather than hyper-parameter differences.
  3. [§3.1] §3.1: The claim that RFA remains stable under zero-shot length extrapolation rests on the decay assumption propagating uncertainty correctly; the paper should supply a concrete counter-example or sensitivity test showing behavior when the linear SDE observation model is violated by structured language dependencies.
minor comments (2)
  1. [§2.1] The notation for the precision matrix in §2.1 could be clarified with an explicit definition of how it is computed from the SDE parameters to avoid ambiguity with standard attention scaling.
  2. [Figure 3] Figure 3 caption should state the number of random seeds and report standard deviation for the extrapolation curves.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate planned revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [§2.3, Eq. (12)] The equivalence to standard attention complexity is shown only after imposing isotropic noise and a specific exponential decay form; the manuscript does not demonstrate that the resulting precision-weighted estimator retains O(n²) cost or the claimed robustness when these assumptions are relaxed to accommodate the anisotropic correlations and discrete jumps present in token sequences.

    Authors: We agree that the O(n²) complexity and robustness properties are derived specifically under the isotropic noise and exponential decay assumptions stated in §2.3. These choices enable an efficient, closed-form implementation while linking attention to state estimation. In the revised manuscript we will expand the discussion following Eq. (12) to explicitly note the limitations under relaxed assumptions, including potential complexity increases for anisotropic noise and the handling of discrete jumps, and we will frame the current formulation as a tractable baseline for such extensions. revision: yes

  2. Referee: [§4.2, Table 1] The perplexity improvements over RoPE are presented without an ablation that isolates the contribution of the SDE-derived weighting from other implementation choices (e.g., initialization, optimizer settings). It is therefore unclear whether the gains can be attributed to the dynamical interpretation rather than hyper-parameter differences.

    Authors: The referee correctly identifies the absence of targeted ablations. Although we matched hyper-parameters across models, we did not isolate the effect of the precision-weighted estimator from other implementation details. We will add ablation experiments in the revised §4.2 that systematically vary initialization, optimizer settings, and a non-dynamical weighting baseline, allowing clearer attribution of the reported perplexity gains. revision: yes

  3. Referee: [§3.1] The claim that RFA remains stable under zero-shot length extrapolation rests on the decay assumption propagating uncertainty correctly; the paper should supply a concrete counter-example or sensitivity test showing behavior when the linear SDE observation model is violated by structured language dependencies.

    Authors: We acknowledge that the stability claim relies on the decay assumption and that a systematic sensitivity analysis for violations by structured language dependencies would strengthen the work. Our zero-shot extrapolation results on the evaluated benchmarks provide supporting empirical evidence. A comprehensive counter-example study across all possible dependency structures lies beyond the present scope; we will add a limitations paragraph in §3.1 discussing this point and suggesting directions for future sensitivity tests. revision: partial

standing simulated objections not resolved
  • A full sensitivity analysis or concrete counter-examples demonstrating RFA behavior when the linear SDE observation model is violated by arbitrary structured language dependencies.

Circularity Check

0 steps flagged

Derivation from linear SDE model is self-contained with explicit assumptions; no circular reduction

full rationale

The paper derives Robust Filter Attention by modeling each token as a noisy observation of a latent trajectory governed by a linear SDE, with attention weights obtained from consistency under that model. The match to standard attention complexity is obtained only after imposing the stated isotropic noise and decay assumptions, which are presented as modeling choices rather than fitted quantities or self-definitions. No equations reduce by construction to the target performance metrics, no load-bearing self-citations are invoked to justify uniqueness, and no ansatz is smuggled via prior work. The reported perplexity and extrapolation results are therefore empirical outcomes of the formulation, not tautological re-statements of its inputs. The derivation chain remains independent of the final benchmark numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on treating token sequences as observations of a latent linear SDE trajectory and on the isotropic noise plus decay assumptions that enable both the attention formulation and the complexity match.

axioms (2)
  • domain assumption Each token is a noisy observation of a latent trajectory governed by a linear stochastic differential equation
    Invoked in the abstract to define how attention weights are determined by model consistency.
  • ad hoc to paper Isotropic noise and decay assumptions hold
    Explicitly required for RFA to match standard attention complexity.

pith-pipeline@v0.9.0 · 5632 in / 1387 out tokens · 35629 ms · 2026-05-18T18:53:04.188467+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

102 extracted references · 102 canonical work pages · 25 internal anchors

  1. [1]

    Cope: A lightweight complex positional encoding, 2025

    Avinash Amballa. Cope: A lightweight complex positional encoding, 2025. URL https://arxiv.org/abs/2508.18308

  2. [2]

    Anderson and J.B

    B.D.O. Anderson and J.B. Moore . Optimal Filtering. Prentice-Hall, 1979

  3. [3]

    Neural continuous-discrete state space models for irregularly-sampled time series, 2023

    Abdul Fatir Ansari, Alvin Heng, Andre Lim, and Harold Soh. Neural continuous-discrete state space models for irregularly-sampled time series, 2023. URL https://arxiv.org/abs/2301.11308

  4. [4]

    Element-wise attention layers: an option for optimization, 2023

    Giovanni Araujo Bacochina and Rodrigo Clemente Thom de Souza. Element-wise attention layers: an option for optimization, 2023. URL https://arxiv.org/abs/2302.05488

  5. [5]

    Neural Machine Translation by Jointly Learning to Align and Translate

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate, 2016. URL https://arxiv.org/abs/1409.0473

  6. [6]

    Zico Kolter, and Vladlen Koltun

    Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models, 2019. URL https://arxiv.org/abs/1909.01377

  7. [7]

    Theory and implementation of complex-valued neural networks, 2023

    Jose Agustin Barrachina, Chengfang Ren, Gilles Vieillard, Christele Morisseau, and Jean-Philippe Ovarlez. Theory and implementation of complex-valued neural networks, 2023. URL https://arxiv.org/abs/2302.08286

  8. [8]

    A survey of complex-valued neural networks, 2021

    Joshua Bassey, Lijun Qian, and Xianfang Li. A survey of complex-valued neural networks, 2021. URL https://arxiv.org/abs/2101.12249

  9. [9]

    Learning Stochastic Recurrent Networks

    Justin Bayer and Christian Osendorfer. Learning stochastic recurrent networks, 2015. URL https://arxiv.org/abs/1411.7610

  10. [10]

    Mambamixer: Efficient selective state space models with dual token and channel selection, 2024

    Ali Behrouz, Michele Santacatterina, and Ramin Zabih. Mambamixer: Efficient selective state space models with dual token and channel selection, 2024. URL https://arxiv.org/abs/2403.19888

  11. [11]

    Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation

    Arthur S. Bianchessi, Rodrigo C. Barros, and Lucas S. Kupssinskü. Bayesian attention mechanism: A probabilistic framework for positional encoding and context length extrapolation, 2025. URL https://arxiv.org/abs/2505.22842

  12. [12]

    Li, Eric P

    Aviv Bick, Kevin Y. Li, Eric P. Xing, J. Zico Kolter, and Albert Gu. Transformers to SSMs : Distilling quadratic knowledge to subquadratic models, 2025. URL https://arxiv.org/abs/2408.10189

  13. [13]

    On the expressivity role of LayerNorm in Transformers' attention, 2023

    Shaked Brody, Uri Alon, and Eran Yahav. On the expressivity role of LayerNorm in Transformers' attention, 2023. URL https://arxiv.org/abs/2305.02582

  14. [14]

    Revisiting kernel attention with correlated Gaussian process representation, 2025

    Long Minh Bui, Tho Tran Huu, Duy Dinh, Tan Minh Nguyen, and Trong Nghia Hoang. Revisiting kernel attention with correlated Gaussian process representation, 2025. URL https://arxiv.org/abs/2502.20525

  15. [15]

    Sidney Burrus, J

    C. Sidney Burrus, J. A. Barreto, and Ivan W. Selesnick. Iterative reweighted least-squares design of FIR filters. IEEE Transactions on Signal Processing, 42 0 (11): 0 2926--2936, Nov 1994. doi:10.1109/78.326612

  16. [16]

    Learning self-modulating attention in continuous time space with applications to sequential recommendation, 2022

    Chao Chen, Haoyu Geng, Nianzu Yang, Junchi Yan, Daiyue Xue, Jianping Yu, and Xiaokang Yang. Learning self-modulating attention in continuous time space with applications to sequential recommendation, 2022. URL https://arxiv.org/abs/2204.06517

  17. [17]

    Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations, 2019. URL https://arxiv.org/abs/1806.07366

  18. [18]

    Yingyi Chen, Qinghua Tao, Francesco Tonin, and Johan A. K. Suykens. Self-attention through kernel-eigen pair sparse variational Gaussian processes, 2024. URL https://arxiv.org/abs/2402.01476

  19. [19]

    Continuous-time attention for sequential learning

    Jen-Tzung Chien and Yi-Hsiang Chen. Continuous-time attention for sequential learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7116--7124. AAAI Press, 2021. doi:10.1609/aaai.v35i8.16875. URL https://doi.org/10.1609/aaai.v35i8.16875

  20. [20]

    A Recurrent Latent Variable Model for Sequential Data

    Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data, 2016. URL https://arxiv.org/abs/1506.02216

  21. [21]

    Adaptive Kalman -informed Transformer

    Nadav Cohen and Itzik Klein. Adaptive Kalman -informed Transformer . arXiv preprint arXiv:2401.09987, 2024. URL https://doi.org/10.48550/arXiv.2401.09987. Version v2: 7 Mar 2025

  22. [22]

    Le, and Ruslan Salakhutdinov

    Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL : Attentive language models beyond a fixed-length context, 2019

  23. [23]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. Transformers are SSMs : Generalized models and efficient algorithms through structured state space duality, 2024. URL https://arxiv.org/abs/2405.21060

  24. [24]

    Building blocks for a complex-valued transformer architecture

    Florian Eilers and Xiaoyi Jiang. Building blocks for a complex-valued transformer architecture. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), page 1–5. IEEE, June 2023. doi:10.1109/icassp49357.2023.10095349. URL http://dx.doi.org/10.1109/ICASSP49357.2023.10095349

  25. [25]

    Element-wise attention is all you need, 2025

    Guoxin Feng. Element-wise attention is all you need, 2025. URL https://arxiv.org/abs/2501.05730

  26. [26]

    Sequential Neural Models with Stochastic Layers

    Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers, 2016. URL https://arxiv.org/abs/1605.07571

  27. [27]

    Neural Processes

    Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J. Rezende, S. M. Ali Eslami, and Yee Whye Teh. Neural processes, 2018. URL https://arxiv.org/abs/1807.01622

  28. [28]

    A mathematical perspective on Transformers , 2024

    Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. A mathematical perspective on Transformers , 2024. URL https://arxiv.org/abs/2312.10794

  29. [29]

    Can a Transformer represent a Kalman filter?, 2024

    Gautam Goel and Peter Bartlett. Can a Transformer represent a Kalman filter?, 2024. URL https://arxiv.org/abs/2312.06937

  30. [30]

    Learning fast approximations of sparse coding

    Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 399--406. Omnipress, 2010. URL https://icml.cc/Conferences/2010/papers/449.pdf

  31. [31]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2024. URL https://arxiv.org/abs/2312.00752

  32. [32]

    Hippo: Recurrent memory with optimal polynomial projections, 2020

    Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Re. Hippo: Recurrent memory with optimal polynomial projections, 2020. URL https://arxiv.org/abs/2008.07669

  33. [33]

    Efficiently Modeling Long Sequences with Structured State Spaces

    Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces, 2022. URL https://arxiv.org/abs/2111.00396

  34. [34]

    High-resolution Image Synthesis with Latent Diffusion Models,

    Hongji Guo, Hanjing Wang, and Qiang Ji. Uncertainty-guided probabilistic Transformer for complex action recognition. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20020--20029, 2022. doi:10.1109/CVPR52688.2022.01942

  35. [35]

    Geometric interpretation of layer normalization and a comparative analysis with RMSNorm , 2025

    Akshat Gupta, Atahan Ozdemir, and Gopala Anumanchipalli. Geometric interpretation of layer normalization and a comparative analysis with RMSNorm , 2025. URL https://arxiv.org/abs/2409.12951

  36. [36]

    M. M. Hammad. Comprehensive survey of complex-valued neural networks: Insights into backpropagation and activation functions, 2024. URL https://arxiv.org/abs/2407.19258

  37. [37]

    Kalman Filtering and Neural Networks

    Simon Haykin, editor. Kalman Filtering and Neural Networks. John Wiley & Sons, Inc., New York, 2001. ISBN 9780471369981. doi:10.1002/0471221546

  38. [38]

    Query-key normalization for transformers

    Alex Henry, Prudhvi Raj Dachapally, Shubham Pawar, and Yuxuan Chen. Query-key normalization for Transformers , 2020. URL https://arxiv.org/abs/2010.04245

  39. [39]

    Uncertainty-aware attention for reliable interpretation and prediction, 2018

    Jay Heo, Hae Beom Lee, Saehoon Kim, Juho Lee, Kwang Joon Kim, Eunho Yang, and Sung Ju Hwang. Uncertainty-aware attention for reliable interpretation and prediction, 2018. URL https://arxiv.org/abs/1805.09653

  40. [40]

    Complex-Valued Neural Networks

    Akira Hirose. Complex-Valued Neural Networks. Studies in Computational Intelligence. Springer Berlin, Heidelberg, 2 edition, 2012. doi:10.1007/978-3-642-27632-3

  41. [41]

    Gaussian adaptive attention is all you need: Robust contextual representations across multiple modalities, 2024

    Georgios Ioannides, Aman Chadha, and Aaron Elkins. Gaussian adaptive attention is all you need: Robust contextual representations across multiple modalities, 2024. URL https://arxiv.org/html/2401.11143v3

  42. [42]

    Hadi Jahanshahi and Zheng H. Zhu. Uncertainty propagation networks for neural ordinary differential equations, 2025. URL https://arxiv.org/abs/2508.16815

  43. [43]

    ACE-NODE : Attentive co-evolving neural ordinary differential equations, 2021 a

    Sheo Yon Jhin, Minju Jo, Taeyong Kong, Jinsung Jeon, and Noseong Park. ACE-NODE : Attentive co-evolving neural ordinary differential equations, 2021 a . URL https://arxiv.org/abs/2105.14953

  44. [44]

    Attentive neural controlled differential equations for time-series classification and forecasting, 2021 b

    Sheo Yon Jhin, Heejoo Shin, Seoyoung Hong, Solhee Park, and Noseong Park. Attentive neural controlled differential equations for time-series classification and forecasting, 2021 b . URL https://arxiv.org/abs/2109.01876

  45. [45]

    R. E. Kalman and R. S. Bucy. New results in linear filtering and prediction theory. Journal of Basic Engineering, 83 0 (1): 0 95--108, 1961. doi:10.1115/1.3658902. URL http://dx.doi.org/10.1115/1.3658902

  46. [46]

    A new approach to linear filtering and prediction problems

    Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME--Journal of Basic Engineering, 82 0 (Series D): 0 35--45, 1960

  47. [47]

    Transformers are RNNs : Fast autoregressive Transformers with linear attention, 2020

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs : Fast autoregressive Transformers with linear attention, 2020. URL https://arxiv.org/abs/2006.16236

  48. [48]

    Attentive Neural Processes

    Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh. Attentive neural processes, 2019. URL https://arxiv.org/abs/1901.05761

  49. [49]

    Deep Kalman Filters

    Rahul G. Krishnan, Uri Shalit, and David Sontag. Deep Kalman filters, 2015. URL https://arxiv.org/abs/1511.05121

  50. [50]

    Structured Inference Networks for Nonlinear State Space Models

    Rahul G. Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models, 2016. URL https://arxiv.org/abs/1609.09869

  51. [51]

    Unveiling the power of complex-valued Transformers in wireless communications, 2025

    Yang Leng, Qingfeng Lin, Long-Yin Yung, Jingreng Lei, Yang Li, and Yik-Chung Wu. Unveiling the power of complex-valued Transformers in wireless communications, 2025. URL https://arxiv.org/abs/2502.11151

  52. [52]

    Scaled-dot-product attention as one-sided entropic optimal transport, 2025

    Elon Litman. Scaled-dot-product attention as one-sided entropic optimal transport, 2025. URL https://arxiv.org/abs/2508.08369

  53. [53]

    Alvarez, and Hongpeng Zhou

    Haiping Liu, Lijing Lin, Jingyuan Sun, Zhegong Shangguan, Mauricio A. Alvarez, and Hongpeng Zhou. Rethinking RoPE : A mathematical blueprint for n-dimensional positional embedding, 2025. URL https://arxiv.org/abs/2504.06308

  54. [54]

    Neural extended Kalman filters for learning and predicting dynamics of structural systems

    Wei Liu, Zhilu Lai, Kiran Bacsa, and Eleni Chatzi. Neural extended Kalman filters for learning and predicting dynamics of structural systems. Structural Health Monitoring, 23 0 (2): 0 1037–1052, June 2023. ISSN 1741-3168. doi:10.1177/14759217231179912. URL http://dx.doi.org/10.1177/14759217231179912

  55. [55]

    Learning to encode position for transformer with continuous dynamical model, 2020

    Xuanqing Liu, Hsiang-Fu Yu, Inderjit Dhillon, and Cho-Jui Hsieh. Learning to encode position for transformer with continuous dynamical model, 2020. URL https://arxiv.org/abs/2003.09229

  56. [56]

    ngpt: Normalized transformer with representation learning on the hypersphere.arXiv preprint arXiv:2410.01131, 2024

    Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, and Boris Ginsburg. ngpt: Normalized transformer with representation learning on the hypersphere, 2025. URL https://arxiv.org/abs/2410.01131

  57. [57]

    Inverse distance weighting attention, 2023

    Calvin McCarter. Inverse distance weighting attention, 2023. URL https://arxiv.org/abs/2310.18805

  58. [58]

    R. Mehra. On the identification of variances and adaptive Kalman filtering. IEEE Transactions on Automatic Control, 15 0 (2): 0 175--184, 1970. doi:10.1109/TAC.1970.1099422

  59. [59]

    SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient

    Aaron Mishkin, Frederik Kunstner, Didrik Nielsen, Mark Schmidt, and Mohammad Emtiyaz Khan. Slang: Fast structured covariance approximations for Bayesian deep learning with natural gradient, 2019. URL https://arxiv.org/abs/1811.04504

  60. [60]

    Traveling words: A geometric interpretation of Transformers , 2023

    Raul Molina. Traveling words: A geometric interpretation of Transformers , 2023. URL https://arxiv.org/abs/2309.07315

  61. [61]

    Movellan and Prasad Gabbur

    Javier R. Movellan and Prasad Gabbur. Probabilistic Transformers , 2020. URL https://arxiv.org/abs/2010.15583

  62. [62]

    Identification and control of dynamical systems using neural networks

    Kumpati S Narendra and Kannan Parthasarathy. Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1 0 (1): 0 4--27, Mar 1990. doi:10.1109/72.80202

  63. [63]

    Nielsen, Laziz U

    Stefan K. Nielsen, Laziz U. Abdullaev, Rachel S. Y. Teo, and Tan M. Nguyen. Elliptical attention, 2024. URL https://arxiv.org/abs/2406.13770

  64. [64]

    Norcliffe, C

    Alexander Norcliffe, Cristian Bodnar, Ben Day, Jacob Moss, and Pietro Liò. Neural ODE processes, 2021. URL https://arxiv.org/abs/2103.12413

  65. [65]

    Moseley, Akshay Chaudhari, and Curtis Langlotz

    Sophie Ostmeier, Brian Axelrod, Maya Varma, Michael E. Moseley, Akshay Chaudhari, and Curtis Langlotz. Liere: Lie rotational positional encodings, 2025. URL https://arxiv.org/abs/2406.10322

  66. [66]

    RWKV: Reinventing RNNs for the Transformer Era

    Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Jiaju Lin, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guangyu Song, Xiangru Tang, ...

  67. [67]

    Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

    Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation, 2022. URL https://arxiv.org/abs/2108.12409

  68. [68]

    Toeplitz neural network for sequence modeling, 2023

    Zhen Qin, Xiaodong Han, Weixuan Sun, Bowen He, Dong Li, Dongxu Li, Yuchao Dai, Lingpeng Kong, and Yiran Zhong. Toeplitz neural network for sequence modeling, 2023. URL https://arxiv.org/abs/2305.04749

  69. [69]

    Adaptive filter attention

    Peter Racioppo. Adaptive filter attention. Master's thesis, University of California, Los Angeles, Los Angeles, CA, 2025. URL https://escholarship.org/content/qt0xn6488h/qt0xn6488h.pdf

  70. [70]

    Hopfield Networks is All You Need

    Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Thomas Adler, Lukas Gruber, Markus Holzleitner, Milena Pavlović, Geir Kjetil Sandve, Victor Greiff, David Kreil, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. Hopfield networks is all you need, 2021. URL https://arxiv.org/abs/2008.02217

  71. [71]

    Provable benefits of complex parameterizations for structured state space models, 2024

    Yuval Ran-Milo, Eden Lumbroso, Edo Cohen-Karlik, Raja Giryes, Amir Globerson, and Nadav Cohen. Provable benefits of complex parameterizations for structured state space models, 2024. URL https://arxiv.org/abs/2410.14067

  72. [72]

    H. E. Rauch, F. Tung, and C. T. Striebel. Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 3 0 (8): 0 1445--1450, 1965. doi:10.2514/3.3166. URL https://doi.org/10.2514/3.3166

  73. [73]

    Guy Revach, Nir Shlezinger, Xiaoyong Ni, Adria Lopez Escoriza, Ruud J. G. van Sloun, and Yonina C. Eldar. Kalmannet: Neural network aided Kalman filtering for partially known dynamics. IEEE Transactions on Signal Processing, 70: 0 1532–1547, 2022. ISSN 1941-0476. doi:10.1109/tsp.2022.3158588. URL http://dx.doi.org/10.1109/TSP.2022.3158588

  74. [74]

    Towards understanding how attention mechanism works in deep learning, 2024

    Tianyu Ruan and Shihua Zhang. Towards understanding how attention mechanism works in deep learning, 2024. URL https://arxiv.org/abs/2412.18288

  75. [75]

    Yulia Rubanova, Ricky T. Q. Chen, and David Duvenaud. Latent ODEs for irregularly-sampled time series, 2019. URL https://arxiv.org/abs/1907.03907

  76. [76]

    Probabilistic attention based on Gaussian processes for deep multiple instance learning, 2023

    Arne Schmidt, Pablo Morales-Álvarez, and Rafael Molina. Probabilistic attention based on Gaussian processes for deep multiple instance learning, 2023. URL https://arxiv.org/abs/2302.04061

  77. [77]

    Schweppe

    F. Schweppe. Evaluation of likelihood functions for Gaussian signals. IEEE Transactions on Information Theory, 11 0 (1): 0 61--70, 1965. doi:10.1109/TIT.1965.1053737

  78. [78]

    Patrick Seifner, Kostadin Cvejoski, David Berghaus, Cesar Ojeda, and Ramses J. Sanchez. Foundation inference models for stochastic differential equations: A Transformer -based approach for zero-shot function estimation, 2025. https://doi.org/10.48550/arXiv.2502.19049

  79. [79]

    ComplexFormer : Disruptively advancing Transformer inference ability via head-specific complex vector attention, 2025

    Jintian Shao, Hongyi Huang, Jiayi Wu, Beiwen Zhang, ZhiYu Wu, You Shan, and MingKai Zheng. ComplexFormer : Disruptively advancing Transformer inference ability via head-specific complex vector attention, 2025. URL https://arxiv.org/abs/2505.10222

  80. [80]

    Self-Attention with Relative Position Representations

    Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations, 2018. URL https://arxiv.org/abs/1803.02155

Showing first 80 references.