Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation

arxiv: 2509.04154 · v5 · submitted 2025-09-04 · 💻 cs.LG · cs.AI

Peter Racioppo

Pith reviewed 2026-05-18 18:53 UTC · model grok-4.3

keywords robust filter attention · self-attention · state estimation · stochastic differential equation · language modeling · perplexity · extrapolation · positional embeddings

The pith

Attention as robust state estimation cuts perplexity vs RoPE

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Robust Filter Attention (RFA) as a way to view self-attention through the lens of state estimation. Tokens are treated as noisy measurements of a latent trajectory evolving according to a linear stochastic differential equation. Attention weights come from how well each token fits the expected dynamics and uncertainty, rather than from feature similarity alone. Under isotropic noise and decay assumptions, this keeps the computational complexity of standard attention. On language modeling benchmarks, the paper reports lower perplexity than RoPE within the training window and stable behavior under zero-shot extrapolation to longer contexts.

Core claim

Each token is treated as a noisy observation of a latent trajectory governed by a linear stochastic differential equation. Attention weights are determined by consistency under this model rather than static feature similarity. Under isotropic noise and decay assumptions, RFA matches the computational complexity of standard attention while achieving lower perplexity than RoPE on language modeling benchmarks and remaining stable for zero-shot longer contexts. It also interprets positional mechanisms dynamically through transport and uncertainty propagation.
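To make the observation model concrete, here is a minimal sketch of a scalar version: a latent state following dx = -alpha·x dt + sigma dW, observed with additive Gaussian noise. The function name `simulate_ou`, the decay rate `alpha`, and the scalar setting are illustrative assumptions, not taken from the paper; `sigma2` and `eta2` follow the process-noise and measurement-noise variances (σ², η²) quoted in the Figure 1 caption.

```python
import math
import random


def simulate_ou(n_steps, dt, alpha, sigma2, eta2, x0=1.0, seed=0):
    """Simulate a scalar linear SDE dx = -alpha*x dt + sigma dW with noisy readings.

    Returns (states, observations). The update uses the exact discretisation
    of the Ornstein-Uhlenbeck process, so dt does not have to be small.
    All names here are hypothetical and chosen for illustration.
    """
    rng = random.Random(seed)
    decay = math.exp(-alpha * dt)
    # Variance of the process noise accumulated over one step of length dt.
    if alpha > 0:
        step_var = sigma2 * (1.0 - math.exp(-2.0 * alpha * dt)) / (2.0 * alpha)
    else:
        step_var = sigma2 * dt  # limit alpha -> 0: plain Brownian motion
    states, observations = [], []
    x = x0
    for _ in range(n_steps):
        states.append(x)
        # Each "token" is the latent state plus measurement noise of variance eta2.
        observations.append(x + rng.gauss(0.0, math.sqrt(eta2)))
        # Advance the latent state by one step.
        x = decay * x + rng.gauss(0.0, math.sqrt(step_var))
    return states, observations
```

Setting `sigma2 = 0` and `eta2 = 0` recovers the noiseless decay x0·exp(-alpha·k·dt), which is a quick sanity check on the discretisation.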

What carries the argument

The key mechanism is precision-weighted state estimation where attention weights reflect consistency with the linear SDE model of token trajectories.
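A rough sketch of the precision-weighting half of this idea, under simplifying assumptions that are mine rather than the paper's: a scalar state with decay rate `alpha`, process-noise variance `sigma2`, measurement-noise variance `eta2 > 0`, and pulled-forward estimates treated as independent. The weights here depend only on the time lag; the paper's weights also depend on how consistent each token is with the model, which this sketch leaves out. Function names are hypothetical.

```python
import math


def pulled_forward_variance(lag, alpha, sigma2, eta2):
    """Error variance of exp(-alpha*lag) * z_j as an estimate of the state `lag` later."""
    shrink = math.exp(-2.0 * alpha * lag)
    if alpha > 0:
        process = sigma2 * (1.0 - shrink) / (2.0 * alpha)
    else:
        process = sigma2 * lag  # limit alpha -> 0
    # Accumulated process noise plus the decayed measurement noise.
    return process + shrink * eta2


def precision_weighted_estimate(times, observations, query_index, alpha, sigma2, eta2):
    """Causal inverse-variance average of past observations moved to the query time.

    Assumes eta2 > 0 so that the zero-lag variance is non-zero.
    Returns (estimate, normalised_weights).
    """
    t_i = times[query_index]
    estimates, precisions = [], []
    for t_j, z_j in zip(times[: query_index + 1], observations[: query_index + 1]):
        lag = t_i - t_j
        estimates.append(math.exp(-alpha * lag) * z_j)  # move z_j to time t_i
        precisions.append(1.0 / pulled_forward_variance(lag, alpha, sigma2, eta2))
    total = sum(precisions)
    weights = [p / total for p in precisions]  # rows sum to one, like attention
    estimate = sum(w * e for w, e in zip(weights, estimates))
    return estimate, weights
```

With `alpha = 0`, `sigma2 = 1`, `eta2 = 1` and unit spacing, the third token weights its three predecessors (itself included) by 2/11, 3/11 and 6/11, so older tokens count for less purely because their uncertainty has grown.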

If this is right

RFA achieves lower perplexity than RoPE within the training window on language modeling benchmarks.
RFA remains stable under zero-shot extrapolation to longer contexts.
The framework provides a dynamical interpretation of standard positional mechanisms such as rotational embeddings.
Recency biases connect to uncertainty propagation induced by stochastic dynamics.
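The link between rotational embeddings and transport can be illustrated with the standard RoPE identity: rotating query and key by angles proportional to their positions gives a score that depends only on the relative offset, which is the same as applying one rotation by that offset. A purely rotational linear system moves a state in exactly this way. The sketch below checks the identity in two dimensions; `rotate`, `rope_score`, `relative_score`, and the frequency `omega` are hypothetical names for illustration, and this is the textbook RoPE property rather than the paper's derivation.

```python
import math


def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def rope_score(query, key, q_pos, k_pos, omega):
    """Dot product after rotating query and key by omega times their positions."""
    q = rotate(query, omega * q_pos)
    k = rotate(key, omega * k_pos)
    return q[0] * k[0] + q[1] * k[1]


def relative_score(query, key, offset, omega):
    """Same score from a single rotation of the key by the offset q_pos - k_pos."""
    k = rotate(key, -omega * offset)
    return query[0] * k[0] + query[1] * k[1]
```

Shifting both positions by the same amount leaves `rope_score` unchanged, and `relative_score` reproduces it from the offset alone.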

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This formulation could inspire attention variants that use more advanced SDE models to capture complex dependencies in sequences.
Applying the state estimation view to other domains like time series or graph data might yield similar robustness benefits.
Testing RFA on tasks requiring very long contexts could reveal if the stability advantage scales further.

Load-bearing premise

The approach depends on isotropic noise and decay assumptions that allow matching the speed of standard attention while setting weights by model consistency.
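One way to see why isotropy matters is to work through a simplified case. The derivation below assumes a per-coordinate decay rate α > 0 and uses σ² and η² for the process and measurement noise variances as in the Figure 1 caption; the parameterization is an illustrative assumption and the paper's exact form may differ.

```latex
\[
\begin{aligned}
dx_t &= -\alpha\, x_t\, dt + \sigma\, dW_t, \qquad
  z_j = x_{t_j} + v_j, \quad v_j \sim \mathcal{N}(0, \eta^2 I), \\
x_{t_i} &= e^{-\alpha \Delta} x_{t_j} + w, \qquad
  w \sim \mathcal{N}\!\left(0,\ \tfrac{\sigma^2}{2\alpha}\bigl(1 - e^{-2\alpha\Delta}\bigr) I\right),
  \quad \Delta = t_i - t_j \ge 0, \\
e^{-\alpha \Delta} z_j - x_{t_i} &= e^{-\alpha \Delta} v_j - w, \\
V(\Delta) &= \tfrac{\sigma^2}{2\alpha}\bigl(1 - e^{-2\alpha\Delta}\bigr) + \eta^2 e^{-2\alpha\Delta}.
\end{aligned}
\]
```

Because the error covariance is V(Δ)·I, each token pair needs only one scalar precision 1/V(Δ) that depends on the lag alone, so the n² pairwise weights cost about as much as a relative-position bias. With anisotropic noise the same quantity would generally become a full matrix per pair.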

What would settle it

Running the same language modeling experiments without the isotropic noise assumption and checking if perplexity rises or extrapolation fails would test the claim.

Figures

Figures reproduced from arXiv: 2509.04154 by Peter Racioppo.

**Figure 1.** Filter performance on different 2D systems: ground-truth trajectory (black), measured (blue), and predicted (red). (a) System with only measurement noise (σ² = 0.0, η² = 1.0). (b) System with both process noise and measurement noise (σ² = 0.3, η² = 0.5). (c) Higher noise (σ² = 0.5, η² = 2.0). To check that the model has learned the right dynamics, we compute "pulled-forward" estimates at four time point…

**Figure 2.** Comparison of an AFA layer’s "pulled-forward" state estimates at different stages of training. The true trajectory is shown as a solid black line, and the pulled-forward estimates as colored point clouds. (a) State estimates early in training. (b) State estimates midway through training. (c) State estimates after training is complete.

**Figure 3.** Attention matrices produced by training standard attention and AFA on a 2D LTI with process and measurement noise. (a) First layer of standard attention. (b) Second layer of standard attention. (c) Single layer of AFA. These experiments, along with animations of the predicted trajectories, pulled-forward estimates, and attention matrices during the course of training are available at the same GitHub link pr…

Abstract

We introduce Robust Filter Attention (RFA), a formulation of self-attention as a robust state estimator. Each token is treated as a noisy observation of a latent trajectory governed by a linear stochastic differential equation (SDE), and attention weights are determined by consistency under this model rather than static feature similarity. Under isotropic noise and decay assumptions, RFA matches the computational complexity of standard attention. On language modeling benchmarks, RFA achieves lower perplexity than RoPE within the training window while remaining stable under zero-shot extrapolation to longer contexts. The framework also provides a dynamical interpretation of standard positional mechanisms, connecting rotational embeddings and recency biases to transport and uncertainty propagation induced by stochastic dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

RFA maps self-attention to precision-weighted state estimation under a linear SDE, which is a distinct framing, but the isotropic noise and decay assumptions look like the load-bearing part for both complexity and claimed robustness.

The main point is that this paper derives attention weights from consistency under a linear SDE trajectory model rather than feature similarity, treating tokens as noisy observations of a latent process. That gives a dynamical-systems reading of self-attention and, more interestingly, of positional mechanisms like RoPE, which it ties to transport and uncertainty propagation from the stochastic dynamics.

The abstract states that under isotropic noise and decay the method keeps the same O(n²) cost as standard attention while showing lower perplexity than RoPE inside the training window and better zero-shot stability on longer contexts. Those are the concrete claims worth checking. The framing itself is new relative to the cited prior work and supplies a first-principles story for why certain positional schemes help with extrapolation. The derivations appear to be parameter-free once the SDE model is fixed, which is a plus for reproducibility.

The soft spot is the set of assumptions needed to recover both the complexity match and the robustness. Isotropic noise plus a specific exponential decay lets the estimator stay efficient and stable, but language data has anisotropic correlations, context-dependent structure, and discrete jumps that sit awkwardly with a continuous linear SDE observation model. If those assumptions are relaxed or replaced by data-driven noise, the paper would need to show whether the mechanism still equals standard attention cost or remains a direct state estimator. Without that sensitivity check, it is hard to attribute the reported gains cleanly to the dynamical interpretation rather than to the particular noise model chosen.

This is worth a reading group for anyone working on long-context transformers or on mechanistic explanations of attention. A reader who cares about connecting core components to stochastic processes will get usable ideas even if the experiments require more controls. I would send it to peer review so referees can verify the derivations and test how well the SDE assumptions actually fit discrete token statistics.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Robust Filter Attention (RFA), a formulation of self-attention as precision-weighted state estimation. Tokens are modeled as noisy observations of a latent trajectory governed by a linear SDE; attention weights are derived from consistency under this model rather than static similarity. Under isotropic noise and exponential decay assumptions, RFA matches the O(n²) complexity of standard attention. On language modeling benchmarks, RFA reports lower perplexity than RoPE within the training window and improved stability under zero-shot extrapolation to longer contexts. The work also supplies a dynamical interpretation of positional mechanisms such as rotational embeddings and recency biases in terms of transport and uncertainty propagation.

Significance. If the derivation and empirical results hold under the stated assumptions, the paper supplies a principled dynamical-systems view that could unify disparate positional encodings and motivate new attention variants with stronger extrapolation properties. The explicit link to state estimation offers a route for theoretical analysis of attention stability. The reported perplexity gains and zero-shot robustness, if reproducible across scales and tasks, would be of interest to researchers seeking interpretable and robust transformer components.

major comments (3)

[§2.3, Eq. (12)] The equivalence to standard attention complexity is shown only after imposing isotropic noise and a specific exponential decay form; the manuscript does not demonstrate that the resulting precision-weighted estimator retains O(n²) cost or the claimed robustness when these assumptions are relaxed to accommodate the anisotropic correlations and discrete jumps present in token sequences.
[§4.2, Table 1] The perplexity improvements over RoPE are presented without an ablation that isolates the contribution of the SDE-derived weighting from other implementation choices (e.g., initialization, optimizer settings). It is therefore unclear whether the gains can be attributed to the dynamical interpretation rather than hyper-parameter differences.
[§3.1] The claim that RFA remains stable under zero-shot length extrapolation rests on the decay assumption propagating uncertainty correctly; the paper should supply a concrete counter-example or sensitivity test showing behavior when the linear SDE observation model is violated by structured language dependencies.

minor comments (2)

[§2.1] The notation for the precision matrix in §2.1 could be clarified with an explicit definition of how it is computed from the SDE parameters to avoid ambiguity with standard attention scaling.
[Figure 3] Figure 3 caption should state the number of random seeds and report standard deviation for the extrapolation curves.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate planned revisions to improve the manuscript.

Referee: [§2.3, Eq. (12)] The equivalence to standard attention complexity is shown only after imposing isotropic noise and a specific exponential decay form; the manuscript does not demonstrate that the resulting precision-weighted estimator retains O(n²) cost or the claimed robustness when these assumptions are relaxed to accommodate the anisotropic correlations and discrete jumps present in token sequences.

Authors: We agree that the O(n²) complexity and robustness properties are derived specifically under the isotropic noise and exponential decay assumptions stated in §2.3. These choices enable an efficient, closed-form implementation while linking attention to state estimation. In the revised manuscript we will expand the discussion following Eq. (12) to explicitly note the limitations under relaxed assumptions, including potential complexity increases for anisotropic noise and the handling of discrete jumps, and we will frame the current formulation as a tractable baseline for such extensions.

revision: yes

Referee: [§4.2, Table 1] The perplexity improvements over RoPE are presented without an ablation that isolates the contribution of the SDE-derived weighting from other implementation choices (e.g., initialization, optimizer settings). It is therefore unclear whether the gains can be attributed to the dynamical interpretation rather than hyper-parameter differences.

Authors: The referee correctly identifies the absence of targeted ablations. Although we matched hyper-parameters across models, we did not isolate the effect of the precision-weighted estimator from other implementation details. We will add ablation experiments in the revised §4.2 that systematically vary initialization, optimizer settings, and a non-dynamical weighting baseline, allowing clearer attribution of the reported perplexity gains.

revision: yes

Referee: [§3.1] The claim that RFA remains stable under zero-shot length extrapolation rests on the decay assumption propagating uncertainty correctly; the paper should supply a concrete counter-example or sensitivity test showing behavior when the linear SDE observation model is violated by structured language dependencies.

Authors: We acknowledge that the stability claim relies on the decay assumption and that a systematic sensitivity analysis for violations by structured language dependencies would strengthen the work. Our zero-shot extrapolation results on the evaluated benchmarks provide supporting empirical evidence. A comprehensive counter-example study across all possible dependency structures lies beyond the present scope; we will add a limitations paragraph in §3.1 discussing this point and suggesting directions for future sensitivity tests.

revision: partial

standing simulated objections not resolved

A full sensitivity analysis or concrete counter-examples demonstrating RFA behavior when the linear SDE observation model is violated by arbitrary structured language dependencies.
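A toy version of such a test, far short of the analysis requested, could compare a plain precision-weighted average against a residual-aware one when a single observation breaks the model. The sketch below uses a Cauchy-style weight, a standard choice in robust statistics that is assumed here for illustration and not confirmed as the paper's formula; all function names and the `scale` parameter are hypothetical.

```python
def cauchy_weights(residuals, variances, scale=1.0):
    """Normalised weights: precision damped by the squared standardised residual."""
    raw = []
    for r, v in zip(residuals, variances):
        standardised_sq = (r * r) / v
        # Large residuals relative to the expected variance shrink the weight.
        raw.append((1.0 / v) / (1.0 + standardised_sq / scale))
    total = sum(raw)
    return [w / total for w in raw]


def weighted_mean(values, weights):
    """Weighted sum of `values`; `weights` are assumed to sum to one."""
    return sum(w * x for w, x in zip(weights, values))


def compare_estimators(values, variances, reference):
    """Return (plain, robust) estimates of a state expected to lie near `reference`."""
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    plain = weighted_mean(values, [p / total for p in precisions])
    residuals = [x - reference for x in values]
    robust = weighted_mean(values, cauchy_weights(residuals, variances))
    return plain, robust
```

For example, `compare_estimators([1.0, 1.1, 0.9, 5.0, 1.0], [1.0] * 5, 1.0)` gives a plain estimate pulled toward the outlier at 5.0, while the robust estimate stays close to 1.0. A real sensitivity study would replace the hand-placed outlier with structured violations drawn from language data.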

Circularity Check

0 steps flagged

Derivation from linear SDE model is self-contained with explicit assumptions; no circular reduction

The paper derives Robust Filter Attention by modeling each token as a noisy observation of a latent trajectory governed by a linear SDE, with attention weights obtained from consistency under that model. The match to standard attention complexity is obtained only after imposing the stated isotropic noise and decay assumptions, which are presented as modeling choices rather than fitted quantities or self-definitions. No equations reduce by construction to the target performance metrics, no load-bearing self-citations are invoked to justify uniqueness, and no ansatz is smuggled via prior work. The reported perplexity and extrapolation results are therefore empirical outcomes of the formulation, not tautological re-statements of its inputs. The derivation chain remains independent of the final benchmark numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on treating token sequences as observations of a latent linear SDE trajectory and on the isotropic noise plus decay assumptions that enable both the attention formulation and the complexity match.

axioms (2)

domain assumption: Each token is a noisy observation of a latent trajectory governed by a linear stochastic differential equation.
Invoked in the abstract to define how attention weights are determined by model consistency.
ad hoc to paper: Isotropic noise and decay assumptions hold.
Explicitly required for RFA to match standard attention complexity.

pith-pipeline@v0.9.0 · 5632 in / 1387 out tokens · 35629 ms · 2026-05-18T18:53:04.188467+00:00 · methodology

Reference graph

Works this paper leans on

102 extracted references · 102 canonical work pages · 25 internal anchors

[1]

Cope: A lightweight complex positional encoding, 2025

Avinash Amballa. Cope: A lightweight complex positional encoding, 2025. URL https://arxiv.org/abs/2508.18308

work page arXiv 2025
[2]

Anderson and J.B

B.D.O. Anderson and J.B. Moore . Optimal Filtering. Prentice-Hall, 1979

work page 1979
[3]

Neural continuous-discrete state space models for irregularly-sampled time series, 2023

Abdul Fatir Ansari, Alvin Heng, Andre Lim, and Harold Soh. Neural continuous-discrete state space models for irregularly-sampled time series, 2023. URL https://arxiv.org/abs/2301.11308

work page arXiv 2023
[4]

Element-wise attention layers: an option for optimization, 2023

Giovanni Araujo Bacochina and Rodrigo Clemente Thom de Souza. Element-wise attention layers: an option for optimization, 2023. URL https://arxiv.org/abs/2302.05488

work page arXiv 2023
[5]

Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate, 2016. URL https://arxiv.org/abs/1409.0473

work page internal anchor Pith review Pith/arXiv arXiv 2016
[6]

Zico Kolter, and Vladlen Koltun

Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models, 2019. URL https://arxiv.org/abs/1909.01377

work page arXiv 2019
[7]

Theory and implementation of complex-valued neural networks, 2023

Jose Agustin Barrachina, Chengfang Ren, Gilles Vieillard, Christele Morisseau, and Jean-Philippe Ovarlez. Theory and implementation of complex-valued neural networks, 2023. URL https://arxiv.org/abs/2302.08286

work page arXiv 2023
[8]

A survey of complex-valued neural networks, 2021

Joshua Bassey, Lijun Qian, and Xianfang Li. A survey of complex-valued neural networks, 2021. URL https://arxiv.org/abs/2101.12249

work page arXiv 2021
[9]

Learning Stochastic Recurrent Networks

Justin Bayer and Christian Osendorfer. Learning stochastic recurrent networks, 2015. URL https://arxiv.org/abs/1411.7610

work page internal anchor Pith review Pith/arXiv arXiv 2015
[10]

Mambamixer: Efficient selective state space models with dual token and channel selection, 2024

Ali Behrouz, Michele Santacatterina, and Ramin Zabih. Mambamixer: Efficient selective state space models with dual token and channel selection, 2024. URL https://arxiv.org/abs/2403.19888

work page arXiv 2024
[11]

Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation

Arthur S. Bianchessi, Rodrigo C. Barros, and Lucas S. Kupssinskü. Bayesian attention mechanism: A probabilistic framework for positional encoding and context length extrapolation, 2025. URL https://arxiv.org/abs/2505.22842

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Li, Eric P

Aviv Bick, Kevin Y. Li, Eric P. Xing, J. Zico Kolter, and Albert Gu. Transformers to SSMs : Distilling quadratic knowledge to subquadratic models, 2025. URL https://arxiv.org/abs/2408.10189

work page arXiv 2025
[13]

On the expressivity role of LayerNorm in Transformers' attention, 2023

Shaked Brody, Uri Alon, and Eran Yahav. On the expressivity role of LayerNorm in Transformers' attention, 2023. URL https://arxiv.org/abs/2305.02582

work page arXiv 2023
[14]

Revisiting kernel attention with correlated Gaussian process representation, 2025

Long Minh Bui, Tho Tran Huu, Duy Dinh, Tan Minh Nguyen, and Trong Nghia Hoang. Revisiting kernel attention with correlated Gaussian process representation, 2025. URL https://arxiv.org/abs/2502.20525

work page arXiv 2025
[15]

Sidney Burrus, J

C. Sidney Burrus, J. A. Barreto, and Ivan W. Selesnick. Iterative reweighted least-squares design of FIR filters. IEEE Transactions on Signal Processing, 42 0 (11): 0 2926--2936, Nov 1994. doi:10.1109/78.326612

work page doi:10.1109/78.326612 1994
[16]

Learning self-modulating attention in continuous time space with applications to sequential recommendation, 2022

Chao Chen, Haoyu Geng, Nianzu Yang, Junchi Yan, Daiyue Xue, Jianping Yu, and Xiaokang Yang. Learning self-modulating attention in continuous time space with applications to sequential recommendation, 2022. URL https://arxiv.org/abs/2204.06517

work page arXiv 2022
[17]

Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations, 2019. URL https://arxiv.org/abs/1806.07366

work page internal anchor Pith review Pith/arXiv arXiv 2019
[18]

Yingyi Chen, Qinghua Tao, Francesco Tonin, and Johan A. K. Suykens. Self-attention through kernel-eigen pair sparse variational Gaussian processes, 2024. URL https://arxiv.org/abs/2402.01476

work page arXiv 2024
[19]

Continuous-time attention for sequential learning

Jen-Tzung Chien and Yi-Hsiang Chen. Continuous-time attention for sequential learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7116--7124. AAAI Press, 2021. doi:10.1609/aaai.v35i8.16875. URL https://doi.org/10.1609/aaai.v35i8.16875

work page doi:10.1609/aaai.v35i8.16875 2021
[20]

A Recurrent Latent Variable Model for Sequential Data

Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data, 2016. URL https://arxiv.org/abs/1506.02216

work page internal anchor Pith review Pith/arXiv arXiv 2016
[21]

Adaptive Kalman -informed Transformer

Nadav Cohen and Itzik Klein. Adaptive Kalman -informed Transformer . arXiv preprint arXiv:2401.09987, 2024. URL https://doi.org/10.48550/arXiv.2401.09987. Version v2: 7 Mar 2025

work page doi:10.48550/arxiv.2401.09987 2024
[22]

Le, and Ruslan Salakhutdinov

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL : Attentive language models beyond a fixed-length context, 2019

work page 2019
[23]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Tri Dao and Albert Gu. Transformers are SSMs : Generalized models and efficient algorithms through structured state space duality, 2024. URL https://arxiv.org/abs/2405.21060

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Building blocks for a complex-valued transformer architecture

Florian Eilers and Xiaoyi Jiang. Building blocks for a complex-valued transformer architecture. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), page 1–5. IEEE, June 2023. doi:10.1109/icassp49357.2023.10095349. URL http://dx.doi.org/10.1109/ICASSP49357.2023.10095349

work page doi:10.1109/icassp49357.2023.10095349 2023
[25]

Element-wise attention is all you need, 2025

Guoxin Feng. Element-wise attention is all you need, 2025. URL https://arxiv.org/abs/2501.05730

work page arXiv 2025
[26]

Sequential Neural Models with Stochastic Layers

Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers, 2016. URL https://arxiv.org/abs/1605.07571

work page internal anchor Pith review Pith/arXiv arXiv 2016
[27]

Neural Processes

Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J. Rezende, S. M. Ali Eslami, and Yee Whye Teh. Neural processes, 2018. URL https://arxiv.org/abs/1807.01622

work page internal anchor Pith review Pith/arXiv arXiv 2018
[28]

A mathematical perspective on Transformers , 2024

Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. A mathematical perspective on Transformers , 2024. URL https://arxiv.org/abs/2312.10794

work page arXiv 2024
[29]

Can a Transformer represent a Kalman filter?, 2024

Gautam Goel and Peter Bartlett. Can a Transformer represent a Kalman filter?, 2024. URL https://arxiv.org/abs/2312.06937

work page arXiv 2024
[30]

Learning fast approximations of sparse coding

Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 399--406. Omnipress, 2010. URL https://icml.cc/Conferences/2010/papers/449.pdf

work page 2010
[31]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2024. URL https://arxiv.org/abs/2312.00752

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Hippo: Recurrent memory with optimal polynomial projections, 2020

Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Re. Hippo: Recurrent memory with optimal polynomial projections, 2020. URL https://arxiv.org/abs/2008.07669

work page arXiv 2020
[33]

Efficiently Modeling Long Sequences with Structured State Spaces

Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces, 2022. URL https://arxiv.org/abs/2111.00396

work page internal anchor Pith review Pith/arXiv arXiv 2022
[34]

High-resolution Image Synthesis with Latent Diffusion Models,

Hongji Guo, Hanjing Wang, and Qiang Ji. Uncertainty-guided probabilistic Transformer for complex action recognition. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20020--20029, 2022. doi:10.1109/CVPR52688.2022.01942

work page doi:10.1109/cvpr52688.2022.01942 2022
[35]

Geometric interpretation of layer normalization and a comparative analysis with RMSNorm , 2025

Akshat Gupta, Atahan Ozdemir, and Gopala Anumanchipalli. Geometric interpretation of layer normalization and a comparative analysis with RMSNorm , 2025. URL https://arxiv.org/abs/2409.12951

work page arXiv 2025
[36]

M. M. Hammad. Comprehensive survey of complex-valued neural networks: Insights into backpropagation and activation functions, 2024. URL https://arxiv.org/abs/2407.19258

work page arXiv 2024
[37]

Kalman Filtering and Neural Networks

Simon Haykin, editor. Kalman Filtering and Neural Networks. John Wiley & Sons, Inc., New York, 2001. ISBN 9780471369981. doi:10.1002/0471221546

work page doi:10.1002/0471221546 2001
[38]

Query-key normalization for transformers

Alex Henry, Prudhvi Raj Dachapally, Shubham Pawar, and Yuxuan Chen. Query-key normalization for Transformers , 2020. URL https://arxiv.org/abs/2010.04245

work page arXiv 2020
[39]

Uncertainty-aware attention for reliable interpretation and prediction, 2018

Jay Heo, Hae Beom Lee, Saehoon Kim, Juho Lee, Kwang Joon Kim, Eunho Yang, and Sung Ju Hwang. Uncertainty-aware attention for reliable interpretation and prediction, 2018. URL https://arxiv.org/abs/1805.09653

work page arXiv 2018
[40]

Complex-Valued Neural Networks

Akira Hirose. Complex-Valued Neural Networks. Studies in Computational Intelligence. Springer Berlin, Heidelberg, 2 edition, 2012. doi:10.1007/978-3-642-27632-3

work page doi:10.1007/978-3-642-27632-3 2012
[41]

Gaussian adaptive attention is all you need: Robust contextual representations across multiple modalities, 2024

Georgios Ioannides, Aman Chadha, and Aaron Elkins. Gaussian adaptive attention is all you need: Robust contextual representations across multiple modalities, 2024. URL https://arxiv.org/html/2401.11143v3

work page arXiv 2024
[42]

Hadi Jahanshahi and Zheng H. Zhu. Uncertainty propagation networks for neural ordinary differential equations, 2025. URL https://arxiv.org/abs/2508.16815

work page arXiv 2025
[43]

ACE-NODE : Attentive co-evolving neural ordinary differential equations, 2021 a

Sheo Yon Jhin, Minju Jo, Taeyong Kong, Jinsung Jeon, and Noseong Park. ACE-NODE : Attentive co-evolving neural ordinary differential equations, 2021 a . URL https://arxiv.org/abs/2105.14953

work page arXiv 2021
[44]

Attentive neural controlled differential equations for time-series classification and forecasting, 2021 b

Sheo Yon Jhin, Heejoo Shin, Seoyoung Hong, Solhee Park, and Noseong Park. Attentive neural controlled differential equations for time-series classification and forecasting, 2021 b . URL https://arxiv.org/abs/2109.01876

work page arXiv 2021
[45]

R. E. Kalman and R. S. Bucy. New results in linear filtering and prediction theory. Journal of Basic Engineering, 83 0 (1): 0 95--108, 1961. doi:10.1115/1.3658902. URL http://dx.doi.org/10.1115/1.3658902

work page doi:10.1115/1.3658902 1961
[46]

A new approach to linear filtering and prediction problems

Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME--Journal of Basic Engineering, 82 0 (Series D): 0 35--45, 1960

work page 1960
[47]

Transformers are RNNs : Fast autoregressive Transformers with linear attention, 2020

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs : Fast autoregressive Transformers with linear attention, 2020. URL https://arxiv.org/abs/2006.16236

work page arXiv 2020
[48]

Attentive Neural Processes

Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh. Attentive neural processes, 2019. URL https://arxiv.org/abs/1901.05761

work page internal anchor Pith review Pith/arXiv arXiv 2019
[49]

Deep Kalman Filters

Rahul G. Krishnan, Uri Shalit, and David Sontag. Deep Kalman filters, 2015. URL https://arxiv.org/abs/1511.05121

work page internal anchor Pith review Pith/arXiv arXiv 2015
[50]

Structured Inference Networks for Nonlinear State Space Models

Rahul G. Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models, 2016. URL https://arxiv.org/abs/1609.09869

work page internal anchor Pith review Pith/arXiv arXiv 2016
[51]

Unveiling the power of complex-valued Transformers in wireless communications, 2025

Yang Leng, Qingfeng Lin, Long-Yin Yung, Jingreng Lei, Yang Li, and Yik-Chung Wu. Unveiling the power of complex-valued Transformers in wireless communications, 2025. URL https://arxiv.org/abs/2502.11151

work page arXiv 2025
[52]

Scaled-dot-product attention as one-sided entropic optimal transport, 2025

Elon Litman. Scaled-dot-product attention as one-sided entropic optimal transport, 2025. URL https://arxiv.org/abs/2508.08369

work page arXiv 2025
[53]

Alvarez, and Hongpeng Zhou

Haiping Liu, Lijing Lin, Jingyuan Sun, Zhegong Shangguan, Mauricio A. Alvarez, and Hongpeng Zhou. Rethinking RoPE : A mathematical blueprint for n-dimensional positional embedding, 2025. URL https://arxiv.org/abs/2504.06308

work page arXiv 2025
[54]

Neural extended Kalman filters for learning and predicting dynamics of structural systems

Wei Liu, Zhilu Lai, Kiran Bacsa, and Eleni Chatzi. Neural extended Kalman filters for learning and predicting dynamics of structural systems. Structural Health Monitoring, 23 0 (2): 0 1037–1052, June 2023. ISSN 1741-3168. doi:10.1177/14759217231179912. URL http://dx.doi.org/10.1177/14759217231179912

work page doi:10.1177/14759217231179912 2023
[55]

Learning to encode position for transformer with continuous dynamical model, 2020

Xuanqing Liu, Hsiang-Fu Yu, Inderjit Dhillon, and Cho-Jui Hsieh. Learning to encode position for transformer with continuous dynamical model, 2020. URL https://arxiv.org/abs/2003.09229

work page arXiv 2020
[56]

ngpt: Normalized transformer with representation learning on the hypersphere.arXiv preprint arXiv:2410.01131, 2024

Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, and Boris Ginsburg. ngpt: Normalized transformer with representation learning on the hypersphere, 2025. URL https://arxiv.org/abs/2410.01131

work page arXiv 2025
[57]

Inverse distance weighting attention, 2023

Calvin McCarter. Inverse distance weighting attention, 2023. URL https://arxiv.org/abs/2310.18805

work page arXiv 2023
[58]

R. Mehra. On the identification of variances and adaptive Kalman filtering. IEEE Transactions on Automatic Control, 15 0 (2): 0 175--184, 1970. doi:10.1109/TAC.1970.1099422

[59]

Aaron Mishkin, Frederik Kunstner, Didrik Nielsen, Mark Schmidt, and Mohammad Emtiyaz Khan. SLANG: Fast structured covariance approximations for Bayesian deep learning with natural gradient, 2019. URL https://arxiv.org/abs/1811.04504

[60]

Raul Molina. Traveling words: A geometric interpretation of Transformers, 2023. URL https://arxiv.org/abs/2309.07315

[61]

Javier R. Movellan and Prasad Gabbur. Probabilistic Transformers, 2020. URL https://arxiv.org/abs/2010.15583

[62]

Kumpati S. Narendra and Kannan Parthasarathy. Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1):4--27, Mar 1990. doi:10.1109/72.80202

[63]

Stefan K. Nielsen, Laziz U. Abdullaev, Rachel S. Y. Teo, and Tan M. Nguyen. Elliptical attention, 2024. URL https://arxiv.org/abs/2406.13770

[64]

Alexander Norcliffe, Cristian Bodnar, Ben Day, Jacob Moss, and Pietro Liò. Neural ODE processes, 2021. URL https://arxiv.org/abs/2103.12413

[65]

Sophie Ostmeier, Brian Axelrod, Maya Varma, Michael E. Moseley, Akshay Chaudhari, and Curtis Langlotz. LieRE: Lie rotational positional encodings, 2025. URL https://arxiv.org/abs/2406.10322

[66]

RWKV: Reinventing RNNs for the Transformer Era

Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Jiaju Lin, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guangyu Song, Xiangru Tang, ...

[67]

Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation, 2022. URL https://arxiv.org/abs/2108.12409

[68]

Zhen Qin, Xiaodong Han, Weixuan Sun, Bowen He, Dong Li, Dongxu Li, Yuchao Dai, Lingpeng Kong, and Yiran Zhong. Toeplitz neural network for sequence modeling, 2023. URL https://arxiv.org/abs/2305.04749

[69]

Peter Racioppo. Adaptive filter attention. Master's thesis, University of California, Los Angeles, Los Angeles, CA, 2025. URL https://escholarship.org/content/qt0xn6488h/qt0xn6488h.pdf

[70]

Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Thomas Adler, Lukas Gruber, Markus Holzleitner, Milena Pavlović, Geir Kjetil Sandve, Victor Greiff, David Kreil, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. Hopfield networks is all you need, 2021. URL https://arxiv.org/abs/2008.02217

[71]

Yuval Ran-Milo, Eden Lumbroso, Edo Cohen-Karlik, Raja Giryes, Amir Globerson, and Nadav Cohen. Provable benefits of complex parameterizations for structured state space models, 2024. URL https://arxiv.org/abs/2410.14067

[72]

H. E. Rauch, F. Tung, and C. T. Striebel. Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 3(8):1445--1450, 1965. doi:10.2514/3.3166. URL https://doi.org/10.2514/3.3166

[73]

Guy Revach, Nir Shlezinger, Xiaoyong Ni, Adria Lopez Escoriza, Ruud J. G. van Sloun, and Yonina C. Eldar. KalmanNet: Neural network aided Kalman filtering for partially known dynamics. IEEE Transactions on Signal Processing, 70:1532–1547, 2022. ISSN 1941-0476. doi:10.1109/tsp.2022.3158588. URL http://dx.doi.org/10.1109/TSP.2022.3158588

[74]

Tianyu Ruan and Shihua Zhang. Towards understanding how attention mechanism works in deep learning, 2024. URL https://arxiv.org/abs/2412.18288

[75]

Yulia Rubanova, Ricky T. Q. Chen, and David Duvenaud. Latent ODEs for irregularly-sampled time series, 2019. URL https://arxiv.org/abs/1907.03907

[76]

Arne Schmidt, Pablo Morales-Álvarez, and Rafael Molina. Probabilistic attention based on Gaussian processes for deep multiple instance learning, 2023. URL https://arxiv.org/abs/2302.04061

[77]

F. Schweppe. Evaluation of likelihood functions for Gaussian signals. IEEE Transactions on Information Theory, 11(1):61--70, 1965. doi:10.1109/TIT.1965.1053737

[78]

Patrick Seifner, Kostadin Cvejoski, David Berghaus, Cesar Ojeda, and Ramses J. Sanchez. Foundation inference models for stochastic differential equations: A Transformer-based approach for zero-shot function estimation, 2025. URL https://doi.org/10.48550/arXiv.2502.19049

[79]

Jintian Shao, Hongyi Huang, Jiayi Wu, Beiwen Zhang, ZhiYu Wu, You Shan, and MingKai Zheng. ComplexFormer: Disruptively advancing Transformer inference ability via head-specific complex vector attention, 2025. URL https://arxiv.org/abs/2505.10222

[80]

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations, 2018. URL https://arxiv.org/abs/1803.02155

