arXiv: 2604.13656 · v1 · submitted 2026-04-15 · cs.LG · cs.AI · math.ST · stat.ML · stat.TH

Ordinary Least Squares is a Special Case of Transformer

Pith reviewed 2026-05-10 13:30 UTC · model grok-4.3

keywords: transformer, ordinary least squares, attention mechanism, linear transformer, statistical inference, Hopfield network, memory mechanism, covariance decomposition

The pith

Ordinary least squares regression reduces to one forward pass through a linear transformer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that ordinary least squares can be recovered exactly as a special case of the single-layer linear transformer by choosing parameters derived from the spectral decomposition of the data covariance matrix. If this equivalence holds, it shows that the attention mechanism can carry out classical linear regression in a single step rather than through iteration or optimization loops. A reader would care because the result supplies a direct algebraic bridge between transformer attention and the closed-form solution of least-squares problems. This framing positions the transformer not merely as a flexible approximator but as an architecture that embeds familiar statistical operations inside its forward pass. The authors extend the prototype to identify separate slow and fast memory components and trace how the linear case evolves into standard transformers with higher-capacity memory.

Core claim

We prove that Ordinary Least Squares is a special case of the single-layer Linear Transformer. By constructing parameters from the spectral decomposition of the empirical covariance matrix, the attention forward pass becomes identical to the OLS closed-form projection. This allows the model to solve the regression task in one step. From this linear prototype we identify a decoupled slow-and-fast memory mechanism inside transformers and show how the architecture progresses toward standard transformers, converting the Hopfield energy function from linear to exponential memory capacity.

What carries the argument

The parameter setting in the linear transformer, obtained directly from the spectral decomposition of the empirical covariance matrix, that renders the attention output mathematically identical to the OLS projection.
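
One way such a parameter setting can be written down is sketched below. The notation (a design matrix X with n rows, responses y, a query point x_q, and scalar values W_V y_i) and the specific weight choice are illustrative assumptions, not the paper's own construction, and the sketch requires the empirical covariance to be invertible.

```latex
\begin{align*}
\hat{\Sigma} &= \tfrac{1}{n} X^\top X = U \Lambda U^\top,
\qquad
\hat{\beta}_{\mathrm{OLS}} = (X^\top X)^{-1} X^\top y
  = \tfrac{1}{n}\, U \Lambda^{-1} U^\top X^\top y, \\
W_Q &= W_K = \Lambda^{-1/2} U^\top, \qquad W_V = \tfrac{1}{n}, \\
\mathrm{LinAttn}(x_q)
  &= \sum_{i=1}^{n} (W_V\, y_i)\,(W_K x_i)^\top (W_Q x_q)
   = \tfrac{1}{n} \sum_{i=1}^{n} y_i\, x_i^\top U \Lambda^{-1} U^\top x_q
   = x_q^\top \hat{\beta}_{\mathrm{OLS}} .
\end{align*}
```

Under this choice the key-query product reproduces the inverse covariance, and the sum over context pairs supplies the X^T y factor, so the two halves of the OLS closed form live in different places: one in the weights, one in the context.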

If this is right

  • Attention performs exact OLS regression in a single forward pass rather than by iterative solving.
  • Transformers contain an identifiable decoupled slow-and-fast memory mechanism.
  • The linear prototype evolves continuously into standard transformers that realize exponential memory capacity through the Hopfield energy function.
  • Modern transformer architectures maintain a direct continuity with classical statistical inference methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other linear statistical procedures such as ridge regression could be realized by analogous parameter constructions inside the same attention block.
  • Training dynamics of linear transformers might implicitly discover covariance-based projections when the loss encourages regression-like behavior.
  • Multi-layer transformers could be interpreted as iterated or composed versions of the single-layer OLS prototype.
  • Interpretability analyses of attention weights in the linear case could map directly onto covariance eigenvectors.
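
The ridge regression extension in the first bullet can be probed numerically. The sketch below is an assumption-laden illustration, not something the paper provides: it reuses a hypothetical spectral construction (query and key weights equal to the inverse square root of the covariance spectrum) and shifts each eigenvalue by a hypothetical penalty `lam`. The function name and the data are invented for the example.

```python
import numpy as np

def ridge_attention_predict(X, y, X_query, lam):
    """Assumed construction: W_Q = W_K = (Lambda + lam*I)^{-1/2} U^T, W_V = 1/n."""
    n = X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(X.T @ X / n)   # covariance = U diag(eigvals) U^T
    W = (eigvecs / np.sqrt(eigvals + lam)).T         # shifted inverse square root
    scores = (X_query @ W.T) @ (X @ W.T).T           # linear attention scores, no softmax
    return scores @ (y / n)

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))
y = X @ rng.normal(size=4) + 0.1 * rng.normal(size=60)
lam = 0.3

# Closed-form ridge with penalty n * lam, which is what the eigenvalue shift corresponds to.
beta_ridge = np.linalg.solve(X.T @ X + 60 * lam * np.eye(4), X.T @ y)
ridge_gap = np.max(np.abs(ridge_attention_predict(X, y, X, lam) - X @ beta_ridge))
print(ridge_gap)
```

The factor of n in the penalty arises because the covariance is normalized by n while the usual ridge normal equations are not.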

Load-bearing premise

The specific parameter values taken from the covariance spectral decomposition can be realized exactly inside the linear transformer architecture without structural violation or approximation.

What would settle it

Take any finite dataset, compute its OLS solution, set the transformer weights and biases according to the derived spectral construction, run the attention forward pass on the same inputs, and verify whether every output coordinate matches the OLS result to machine precision.
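
A minimal version of this check might look like the following. It is a sketch under stated assumptions: the weight construction (`spectral_attention_weights`), the softmax-free attention form, and the synthetic data are hypothetical stand-ins, since the paper's exact parameterization (including any biases) is not reproduced here.

```python
import numpy as np

def spectral_attention_weights(X):
    """Hypothetical construction: W_Q = W_K = Lambda^{-1/2} U^T, W_V = 1/n."""
    n = X.shape[0]
    cov = X.T @ X / n                        # empirical (uncentered) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # cov = U diag(eigvals) U^T
    W = (eigvecs / np.sqrt(eigvals)).T       # Lambda^{-1/2} U^T
    return W, W, 1.0 / n

def linear_attention(X_ctx, y_ctx, X_query, W_q, W_k, w_v):
    """Softmax-free attention: sum_i v_i * (k_i . q) for each query row."""
    Q = X_query @ W_q.T      # queries, shape (m, d)
    K = X_ctx @ W_k.T        # keys, shape (n, d)
    V = w_v * y_ctx          # scalar values, shape (n,)
    return (Q @ K.T) @ V     # predictions, shape (m,)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # reference OLS solution
W_q, W_k, w_v = spectral_attention_weights(X)
attn_pred = linear_attention(X, y, X, W_q, W_k, w_v)
max_abs_err = np.max(np.abs(attn_pred - X @ beta_ols))
print(max_abs_err)
```

If the construction is correct, `max_abs_err` should sit at the level of floating-point round-off for well-conditioned data; an ill-conditioned covariance would loosen that bound.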

Figures

Figures reproduced from arXiv: 2604.13656 by Xiaojun Tan, Yuchen Zhao.

Figure 1: Empirical validation of training reachability for the OLS-Transformer.

Original abstract

The statistical essence of the Transformer architecture has long remained elusive: Is it a universal approximator, or a neural network version of known computational algorithms? Through rigorous algebraic proof, we show that the latter better describes Transformer's basic nature: Ordinary Least Squares (OLS) is a special case of the single-layer Linear Transformer. Using the spectral decomposition of the empirical covariance matrix, we construct a specific parameter setting where the attention mechanism's forward pass becomes mathematically equivalent to the OLS closed-form projection. This means attention can solve the problem in one forward pass, not by iterating. Building upon this prototypical case, we further uncover a decoupled slow and fast memory mechanism within Transformers. Finally, the evolution from our established linear prototype to standard Transformers is discussed. This progression facilitates the transition of the Hopfield energy function from linear to exponential memory capacity, thereby establishing a clear continuity between modern deep architectures and classical statistical inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that Ordinary Least Squares (OLS) is a special case of the single-layer Linear Transformer. Using the spectral decomposition of the empirical covariance matrix, the authors construct a specific parameter setting for the attention mechanism (Q, K, V matrices) such that its forward pass is mathematically equivalent to the OLS closed-form projection. They extend this to identify a decoupled slow and fast memory mechanism in transformers and discuss the progression to standard (non-linear) transformers and connections to Hopfield networks.

Significance. If the central equivalence can be shown to hold with fixed, input-independent parameters without presupposing the OLS solution, the result would provide a concrete algebraic bridge between transformer attention and classical statistical estimation, potentially explaining one-pass inference in linear attention variants. The discussion of memory mechanisms offers a possible route to interpreting capacity in terms of linear vs. exponential regimes, which could be useful for theoretical analysis of transformers. However, the construction as described appears to encode a precomputed solution rather than derive it dynamically.

major comments (3)
  1. [parameter construction section] In the central construction (the section on parameter setting via spectral decomposition of the empirical covariance), the Q, K, V matrices are defined directly from the eigenvectors and eigenvalues of the data covariance. Standard linear transformer attention applies fixed parameters to arbitrary inputs; embedding data-specific spectral quantities into these parameters requires external computation of the covariance decomposition before weight assignment. This makes the equivalence hold by construction for a given dataset rather than demonstrating that attention dynamics compute OLS for general inputs.
  2. [abstract and equivalence proof] The abstract asserts a 'rigorous algebraic proof' that OLS is a special case, yet the key step reduces to choosing parameters so that attention equals the OLS projection. The paper should explicitly state whether the resulting weights remain valid for new inputs drawn from the same distribution or only for the training data used in the decomposition; without this, the claim that attention 'can solve the problem in one forward pass' lacks generality.
  3. [memory mechanism section] The extension to a 'decoupled slow and fast memory mechanism' (§ on memory mechanisms) is built directly on the linear prototype. If the prototype equivalence relies on data-dependent parameter construction, the memory interpretation inherits the same limitation and requires separate justification that the slow/fast split emerges independently of the covariance embedding.
minor comments (2)
  1. [abstract] The abstract introduces the linear transformer without first defining its exact forward-pass equations or notation for the attention output; a brief equation block before the construction would improve readability.
  2. [discussion section] The transition from the linear prototype to standard transformers is described at a high level; adding a short comparison table of the attention formulations (linear vs. softmax) would clarify the claimed continuity with Hopfield energy functions.
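
For orientation, the formulations that minor comment 2 asks the authors to compare are commonly written as below. This is standard textbook notation rather than the paper's own; the symbols q, k_i, v_i (query, keys, values), the key dimension d, and the inverse temperature β (unrelated to the OLS coefficient vector) are assumptions of this sketch, and the modern Hopfield energy is given only up to an additive constant.

```latex
\begin{align*}
\text{linear attention:}\quad
  & o(q) = \sum_{i=1}^{n} v_i \,\bigl(k_i^\top q\bigr) \\
\text{softmax attention:}\quad
  & o(q) = \sum_{i=1}^{n} v_i \,
    \frac{\exp\!\bigl(k_i^\top q / \sqrt{d}\bigr)}
         {\sum_{j=1}^{n} \exp\!\bigl(k_j^\top q / \sqrt{d}\bigr)} \\
\text{classical Hopfield energy:}\quad
  & E(q) = -\tfrac{1}{2} \sum_{i=1}^{n} \bigl(k_i^\top q\bigr)^2 \\
\text{modern Hopfield energy:}\quad
  & E(q) = -\beta^{-1} \log \sum_{i=1}^{n} \exp\!\bigl(\beta\, k_i^\top q\bigr)
    + \tfrac{1}{2}\, q^\top q + \text{const}
\end{align*}
```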

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments correctly identify that our central construction relies on a data-dependent parameter setting, and we will revise the manuscript to clarify the scope and limitations of the claimed equivalence. We respond to each major comment below.

Point-by-point responses
  1. Referee: [parameter construction section] In the central construction (the section on parameter setting via spectral decomposition of the empirical covariance), the Q, K, V matrices are defined directly from the eigenvectors and eigenvalues of the data covariance. Standard linear transformer attention applies fixed parameters to arbitrary inputs; embedding data-specific spectral quantities into these parameters requires external computation of the covariance decomposition before weight assignment. This makes the equivalence hold by construction for a given dataset rather than demonstrating that attention dynamics compute OLS for general inputs.

    Authors: We agree that the construction requires an external computation of the spectral decomposition of the empirical covariance matrix to set the Q, K, V parameters. This is by design: the result establishes an exact algebraic equivalence showing that OLS is realizable as a special case of linear transformer attention, rather than claiming that the attention mechanism discovers or computes the OLS solution dynamically from arbitrary inputs without such initialization. Once the parameters are fixed, the forward pass applies the equivalent projection to any input, including new points. We will revise the relevant section to explicitly note the external computation step and to frame the contribution as a representational equivalence rather than a dynamic inference procedure. revision: yes

  2. Referee: [abstract and equivalence proof] The abstract asserts a 'rigorous algebraic proof' that OLS is a special case, yet the key step reduces to choosing parameters so that attention equals the OLS projection. The paper should explicitly state whether the resulting weights remain valid for new inputs drawn from the same distribution or only for the training data used in the decomposition; without this, the claim that attention 'can solve the problem in one forward pass' lacks generality.

    Authors: The algebraic equivalence holds for the forward pass once parameters are set from the training covariance; the resulting fixed weights implement the OLS projection matrix and therefore apply to arbitrary inputs, including new samples from the same distribution. The one-forward-pass claim refers to inference after this configuration step. We will revise the abstract and the proof section to state this scope explicitly and to qualify that the construction presupposes the covariance (as the closed-form OLS solution itself does). revision: yes

  3. Referee: [memory mechanism section] The extension to a 'decoupled slow and fast memory mechanism' (§ on memory mechanisms) is built directly on the linear prototype. If the prototype equivalence relies on data-dependent parameter construction, the memory interpretation inherits the same limitation and requires separate justification that the slow/fast split emerges independently of the covariance embedding.

    Authors: The slow/fast memory discussion is an interpretive reading of the linear prototype and does inherit the data-dependent initialization of the prototype. We will add a clarifying paragraph in the memory section that separates the conceptual decoupling (arising from the structure of the attention update) from the specific covariance-based parameter choice, and we will note that further analysis would be needed to establish the split in the nonlinear case without relying on the same embedding. revision: partial
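
The claim in responses 1 and 2 that frozen weights carry over to new inputs can be illustrated as follows. This is a sketch under assumptions: the weight construction is a hypothetical one (inverse square root of the training covariance spectrum), the training pairs remain in the context as keys and values, and only the query points are new.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 4
X_train = rng.normal(size=(n, d))
y_train = X_train @ rng.normal(size=d) + 0.05 * rng.normal(size=n)
X_new = rng.normal(size=(10, d))   # query points not used to build the weights

# Assumed construction: weights set once from the training covariance, then frozen.
eigvals, eigvecs = np.linalg.eigh(X_train.T @ X_train / n)
W_qk = (eigvecs / np.sqrt(eigvals)).T   # Lambda^{-1/2} U^T, shared by queries and keys
w_v = 1.0 / n                           # scalar value weight

def predict(X_query):
    """Linear attention with frozen weights; training pairs act as keys and values."""
    scores = (X_query @ W_qk.T) @ (X_train @ W_qk.T).T   # shape (m, n)
    return scores @ (w_v * y_train)

beta_ols = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
gap_new = np.max(np.abs(predict(X_new) - X_new @ beta_ols))
print(gap_new)
```

Agreement on `X_new` supports the authors' reading for new queries against the same context; it does not address what happens when the context itself changes, which is the referee's concern.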

Circularity Check

1 step flagged

OLS equivalence achieved by constructing transformer parameters from covariance eigendecomposition

specific steps
  1. self-definitional [Abstract]
    "Using the spectral decomposition of the empirical covariance matrix, we construct a specific parameter setting where the attention mechanism's forward pass becomes mathematically equivalent to the OLS closed-form projection."

    The parameter setting is explicitly built from the eigendecomposition of the input covariance (the same matrix that appears in the OLS closed-form solution). This forces the attention output to match OLS by design, rather than deriving the equivalence from the transformer's structure applied to arbitrary inputs with fixed, data-independent weights.

full rationale

The paper's central claim is that OLS is a special case of the linear transformer, shown by constructing a specific parameter setting from the spectral decomposition of the empirical covariance matrix so that the attention forward pass equals the OLS projection. This construction directly embeds data-dependent quantities (the eigenvectors and eigenvalues of the covariance) into the fixed weights Q, K, V. Because transformer weights must be input-independent, the equivalence holds only after the covariance eigendecomposition, which is the core of the OLS closed form, has been computed externally and injected into the model parameters. The derivation therefore reduces to an equivalence imposed by construction rather than an independent demonstration that the architecture computes OLS.
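
The data dependence flagged here can be made concrete with a small experiment. The sketch below assumes a hypothetical spectral construction (query and key weights equal to the inverse square root of a covariance spectrum) and compares weights built from the dataset being regressed on with weights built from a different dataset; the function name and both datasets are invented for illustration.

```python
import numpy as np

def attention_predict(X_weights, X_ctx, y_ctx, X_query):
    """Linear attention whose Q/K weights come from the covariance of X_weights."""
    n = X_weights.shape[0]                   # both datasets below have the same size
    eigvals, eigvecs = np.linalg.eigh(X_weights.T @ X_weights / n)
    W = (eigvecs / np.sqrt(eigvals)).T       # Lambda^{-1/2} U^T
    scores = (X_query @ W.T) @ (X_ctx @ W.T).T
    return scores @ (y_ctx / n)

rng = np.random.default_rng(2)
X_a = rng.normal(size=(80, 3))
X_b = rng.normal(size=(80, 3)) * np.array([1.0, 2.0, 3.0])   # different covariance
y_b = X_b @ np.array([0.5, -1.0, 2.0])

ols_b = X_b @ np.linalg.lstsq(X_b, y_b, rcond=None)[0]
matched = attention_predict(X_b, X_b, y_b, X_b)      # weights built from the same data
mismatched = attention_predict(X_a, X_b, y_b, X_b)   # weights built from another dataset

err_matched = np.max(np.abs(matched - ols_b))
err_mismatched = np.max(np.abs(mismatched - ols_b))
print(err_matched, err_mismatched)
```

Under these assumptions the matched error should be at round-off level while the mismatched error is not, which is the sense in which the weights encode the dataset rather than a general procedure.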

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existence of a parameter mapping from the covariance spectral decomposition into the linear transformer weights; this mapping is introduced by construction rather than derived from external constraints.

free parameters (1)
  • Linear transformer attention parameters
    Explicitly constructed from the eigenvectors and eigenvalues of the empirical covariance matrix to enforce exact equivalence with the OLS solution.
axioms (1)
  • standard math: The empirical covariance matrix admits a spectral decomposition that can be directly embedded into the linear attention weight matrices.
    Invoked in the abstract to construct the parameter setting that achieves the OLS equivalence.

pith-pipeline@v0.9.0 · 5455 in / 1311 out tokens · 37669 ms · 2026-05-10T13:30:47.711173+00:00 · methodology
