Recognition: no theorem link
RT-Transformer: The Transformer Block as a Spherical State Estimator
Pith reviewed 2026-05-13 00:54 UTC · model grok-4.3
The pith
The Transformer block emerges as a geometric estimator when latent states are modeled as directions on a hypersphere.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Modeling the latent state as a direction on the hypersphere, with noise defined in the tangent plane at the current estimate, yields a precision-weighted directional inference procedure in which attention aggregates evidence, residual connections implement incremental state updates, and normalization retracts the updated state back onto the hypersphere.
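Read literally, the claim describes a three-step update: project evidence into the tangent plane at the current estimate, take a precision-weighted average, add it as a residual, and renormalize. A minimal sketch of that reading, assuming isotropic scalar precisions per evidence vector (the function and variable names here are illustrative, not taken from the paper):

```python
# Illustrative sketch only: one "Transformer-block-like" update on the unit hypersphere,
# assuming scalar per-vector precisions. Names are ours, not the paper's.
import numpy as np

def spherical_block_step(x_hat, evidence, precisions):
    """x_hat: (d,) unit-norm state estimate; evidence: (n, d); precisions: (n,) nonnegative."""
    # Project each evidence vector onto the tangent plane at x_hat: (I - x x^T) v
    tangent = evidence - np.outer(evidence @ x_hat, x_hat)
    # Precision-weighted aggregation of tangent-space evidence ("attention-like" step)
    w = precisions / precisions.sum()
    update = w @ tangent
    # Residual-style incremental update, then retraction back onto the sphere
    # ("normalization-like" step)
    x_new = x_hat + update
    return x_new / np.linalg.norm(x_new)

rng = np.random.default_rng(0)
x = rng.normal(size=8); x /= np.linalg.norm(x)
V = rng.normal(size=(5, 8))
p = rng.uniform(size=5)
print(spherical_block_step(x, V, p))
```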
What carries the argument
Hyperspherical latent state with tangent-plane noise, which directly induces the precision-weighted directional inference procedure.
If this is right
- Attention computes a precision-weighted average of evidence vectors in the tangent space.
- Residual connections add the aggregated update to the previous state estimate.
- Normalization projects the result back onto the unit hypersphere after each update.
- The full block follows from solving one estimation problem on the manifold rather than from independent architectural decisions.
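In symbols, one hedged reading of the update these bullets describe (notation ours, not the paper's), with the weights w_i standing in for whatever precision structure the derivation assigns to each evidence vector:

```latex
% Tangent-plane projector and one combined update step (our notation).
\[
  P_{\hat x} = I - \hat x \hat x^{\top}, \qquad
  \hat x^{+} \;=\;
  \frac{\hat x + \sum_i w_i \, P_{\hat x} v_i}
       {\bigl\lVert \hat x + \sum_i w_i \, P_{\hat x} v_i \bigr\rVert},
  \qquad w_i \propto \text{precision of } v_i .
\]
```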
Where Pith is reading between the lines
- Alternative manifolds or noise models could yield new attention or normalization variants that preserve the same geometric derivation.
- The view links Transformer updates to classical manifold-based filtering methods used in robotics and signal processing.
- It suggests that scaling attention or changing normalization strength should be constrained by the tangent-space precision rather than chosen empirically.
Load-bearing premise
The latent state can be usefully represented as a direction on the hypersphere and the relevant noise lives exactly in the tangent plane at the current estimate.
What would settle it
Training a Transformer variant without normalization and observing whether hidden-state magnitudes grow without bound while performance remains unchanged would test whether the retraction step is required by the geometry.
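A toy version of this check, reusing the update sketched earlier: run the same incremental update with and without the final renormalization and track the state magnitude. This only probes norm drift under illustrative assumptions, not task performance, and is not the proposed experiment itself:

```python
# Illustrative toy check: does the state magnitude drift without the retraction step?
import numpy as np

def step(x, evidence, precisions, retract=True):
    u = x / np.linalg.norm(x)                       # direction of the current state
    tangent = evidence - np.outer(evidence @ u, u)  # project evidence onto the tangent plane at u
    w = precisions / precisions.sum()
    x_new = x + w @ tangent                         # residual-style incremental update
    return x_new / np.linalg.norm(x_new) if retract else x_new

rng = np.random.default_rng(1)
x_norm = x_free = rng.normal(size=8)
for t in range(1, 201):
    V = rng.normal(size=(5, 8))
    p = rng.uniform(size=5)
    x_norm = step(x_norm, V, p, retract=True)
    x_free = step(x_free, V, p, retract=False)
    if t % 50 == 0:
        print(t, np.linalg.norm(x_norm), np.linalg.norm(x_free))
```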
Original abstract
We show that the core components of the Transformer block -- attention, residual connections, and normalization -- arise naturally from a single geometric estimation problem. Modeling the latent state as a direction on the hypersphere, with noise defined in the tangent plane at the current estimate, yields a precision-weighted directional inference procedure in which attention aggregates evidence, residual connections implement incremental state updates, and normalization retracts the updated state back onto the hypersphere. Together, these components follow from the geometry of the estimation problem rather than being introduced as independent architectural choices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that the core components of the Transformer block—attention, residual connections, and normalization—arise naturally from a single geometric estimation problem. By modeling the latent state as a unit direction on the hypersphere with noise in the tangent plane at the current estimate, a precision-weighted directional inference procedure is obtained in which attention aggregates evidence, residuals perform incremental updates, and normalization retracts the state onto the sphere.
Significance. If the derivation is shown to be exact and parameter-free, the result would supply a unified geometric account of why these architectural elements co-occur, potentially clarifying inductive biases in Transformers and motivating spherical variants. The approach is conceptually aligned with existing lines of work that ground network primitives in probabilistic geometry, but its impact hinges on whether the mapping reproduces standard scaled dot-product attention without auxiliary modeling choices.
major comments (3)
- [Abstract] Abstract: the assertion that the components 'arise naturally' and 'yield' the inference procedure is unsupported by any displayed equations, derivation steps, or verification that the resulting procedure exactly recovers the standard Transformer block; this gap prevents assessment of whether the geometry produces the architecture or merely accommodates it.
- [Derivation] Derivation of attention (presumably the central technical section): under tangent-plane Gaussian noise the MAP update produces attention weights proportional to the inverse covariance between the current estimate and each evidence vector; the manuscript must demonstrate that the required precision matrices emerge from the spherical geometry alone rather than being set to the outer products of learned query/key projections, as the latter choice would render the derivation circular.
- [Verification] Verification of equivalence: no explicit proof or numerical check is supplied showing that the spherical MAP procedure, after the residual update and retraction, reproduces the exact functional form of multi-head scaled dot-product attention plus layer normalization used in practice.
minor comments (1)
- [Notation] Notation for the tangent-plane noise covariance and the retraction operator should be introduced with explicit definitions and a small worked example to aid readability.
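For concreteness, the definitions the comment asks for might look as follows; the notation is ours and the paper's own symbols may differ:

```latex
% Tangent-plane projector and retraction on the unit sphere S^{d-1} (our notation).
\[
  P_{\hat x} \;=\; I - \hat x \hat x^{\top}
  \quad\text{(tangent-plane projector)}, \qquad
  R_{\hat x}(v) \;=\; \frac{\hat x + v}{\lVert \hat x + v \rVert}
  \quad\text{(retraction)}.
\]
% Worked example in d = 2: with \hat x = (1,0)^\top and v = (0, 3/4)^\top,
% P_{\hat x} v = (0, 3/4)^\top and
% R_{\hat x}(v) = \frac{1}{\sqrt{1 + 9/16}}\,(1, 3/4)^\top = (0.8,\, 0.6)^\top .
```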
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive critique. The comments correctly identify that the current manuscript presents the geometric derivation at a conceptual level without sufficient explicit steps, equations, or verification. We will perform a major revision to supply the missing technical details while preserving the core claim.
Point-by-point responses
-
Referee: [Abstract] Abstract: the assertion that the components 'arise naturally' and 'yield' the inference procedure is unsupported by any displayed equations, derivation steps, or verification that the resulting procedure exactly recovers the standard Transformer block; this gap prevents assessment of whether the geometry produces the architecture or merely accommodates it.
Authors: We agree that the abstract is too high-level. In the revision we will expand it to include a one-sentence outline of the key steps (tangent-plane MAP estimation producing precision-weighted aggregation, residual as incremental update, and normalization as retraction) and will add a forward reference to the new derivation subsection. revision: yes
-
Referee: [Derivation] Derivation of attention (presumably the central technical section): under tangent-plane Gaussian noise the MAP update produces attention weights proportional to the inverse covariance between the current estimate and each evidence vector; the manuscript must demonstrate that the required precision matrices emerge from the spherical geometry alone rather than being set to the outer products of learned query/key projections, as the latter choice would render the derivation circular.
Authors: The current text does not contain the explicit construction of the precision matrix from the spherical metric. We will add the intermediate derivation showing that the covariance is induced by the orthogonal projection onto the tangent plane at the current estimate; the query and key vectors then arise as the coordinates of the evidence vectors in that tangent basis. This removes the circularity by deriving the form of the weights directly from the geometry before any learned parameters are introduced. revision: yes
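One hedged reading of the construction the authors describe (our notation, not a quote from the planned revision): if the noise covariance is supported on the tangent plane, a natural choice is proportional to the projector, with precision given by the pseudoinverse on that subspace:

```latex
% Tangent-plane noise covariance and its pseudoinverse precision (our notation).
\[
  \Sigma_{\hat x} \;=\; \sigma^{2}\, P_{\hat x}
  \;=\; \sigma^{2}\bigl(I - \hat x \hat x^{\top}\bigr),
  \qquad
  \Sigma_{\hat x}^{+} \;=\; \sigma^{-2}\, P_{\hat x},
\]
% so precision weighting acts only on tangent-projected evidence; query/key coordinates
% would then enter merely as a parameterization of those weights, if the authors'
% derivation proceeds as their response suggests.
```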
-
Referee: [Verification] Verification of equivalence: no explicit proof or numerical check is supplied showing that the spherical MAP procedure, after the residual update and retraction, reproduces the exact functional form of multi-head scaled dot-product attention plus layer normalization used in practice.
Authors: We acknowledge the lack of a formal equivalence statement or numerical check. The revision will include a new theorem with proof sketch establishing that the composition of the MAP update, residual addition, and spherical retraction recovers the standard scaled dot-product attention plus layer-norm functional form. We will also add a small-scale numerical verification on synthetic directional data confirming exact numerical agreement (within floating-point tolerance) between the two implementations. revision: yes
Circularity Check
No significant circularity; derivation framed as interpretive geometry without reduction to inputs
full rationale
The abstract and description present the Transformer block components as following directly from modeling latent state as a hyperspherical direction with tangent-plane noise, yielding attention as precision-weighted aggregation, residuals as incremental updates, and normalization as retraction. No equations, sections, or self-citations are available that demonstrate a fitted parameter renamed as prediction, a self-definitional loop (e.g., output defined via input), or an ansatz introduced only to recover the architecture. The derivation chain is self-contained as a geometric reinterpretation rather than a tautology, consistent with the most common honest outcome for such papers. Potential concerns about specific noise covariance choices remain outside the circularity criteria, as they would require explicit quotes showing the choice is forced by construction rather than assumed for the model.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The latent state is usefully represented as a unit vector (direction) on the hypersphere.
- domain assumption Noise acts only in the tangent plane at the current estimate.