pith. machine review for the scientific record.

arxiv: 2603.13381 · v2 · submitted 2026-03-11 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Beyond Linearity in Attention Projections: The Case for Nonlinear Queries


Pith reviewed 2026-05-15 12:58 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords transformer attention · query projection · nonlinear residual · bottleneck MLP · language modeling · decoder-only models

The pith

Replacing the linear query projection with identity plus a small bottleneck MLP improves validation log-loss by 2.4 percent in GPT-style models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the query projection in transformer attention can be made nonlinear while preserving the property that basis changes are absorbed by adjacent layers. It does this by defining the query as the input plus a residual bottleneck MLP with d² + O(d) parameters. Experiments on small decoder-only models show this change yields lower loss and perplexity than the standard linear query and also beats a wider baseline that adds more parameters. A reader would care because the result indicates that attention need not stay fully linear to function, opening a low-cost route to greater expressivity inside the attention block.
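
For concreteness, here is one way the stated count arises, under an assumption the abstract leaves open (it does not fix the MLP architecture; see the referee's first minor comment): a one-hidden-layer bottleneck MLP of width b, with biases, has

    |θ| = (db + b) + (bd + d) = 2bd + b + d,

so the stated d² + O(d) budget corresponds to choosing b = d/2. That width is an inference, not something the paper states.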

Core claim

The central claim is that setting the query to Q(X) = X + f_θ(X), where f_θ is a bottleneck MLP, allows the attention mechanism to benefit from nonlinearity while the identity term keeps the algebraic absorption of basis transformations intact. On GPT-3-small-style models this replacement produces 2.40 percent lower validation log-loss and 6.81 percent lower perplexity, and the gain exceeds what is obtained by increasing non-embedding parameters by 12.5 percent.

What carries the argument

The nonlinear residual query Q(X) = X + f_θ(X), where f_θ is a bottleneck MLP; the identity term anchors the function so that linear basis changes can still be absorbed by neighboring layers.
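
A minimal PyTorch sketch of this construction, under illustrative assumptions the abstract does not fix (one hidden layer, GELU activation, bottleneck width d/2; the referee's first minor comment flags exactly these omissions):

    import torch
    import torch.nn as nn

    class NonlinearResidualQuery(nn.Module):
        """Q(X) = X + f_theta(X), with f_theta a bottleneck MLP.

        Hidden width, depth, and activation are assumptions, not values
        from the paper; width d//2 is chosen so the parameter count
        lands at d^2 + O(d), matching the stated budget.
        """
        def __init__(self, d: int):
            super().__init__()
            self.f_theta = nn.Sequential(
                nn.Linear(d, d // 2),  # d*(d/2) + d/2 parameters
                nn.GELU(),
                nn.Linear(d // 2, d),  # (d/2)*d + d parameters
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # The identity term anchors the query to the known-good linear case.
            return x + self.f_theta(x)

    d = 768  # GPT-3-small-style model width
    q = NonlinearResidualQuery(d)
    print(sum(p.numel() for p in q.parameters()), d * d)  # 590976 vs 589824

Keys and values would keep their usual linear projections; only the query path changes.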

If this is right

  • The performance advantage holds against a linear model that uses 12.5 percent more non-embedding parameters.
  • The identity anchor allows the rest of the network to continue absorbing linear changes without retraining the entire stack.
  • Only the query projection needs the nonlinearity; keys and values can stay linear.
  • The added parameter count remains modest because the MLP is bottlenecked.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same residual construction could be tested on the key or value projections to see if further gains appear.
  • If the absorption property generalizes, the approach might be combined with other low-rank or sparse attention variants.
  • At very large scales the optimal bottleneck width inside the MLP may need to grow, changing the parameter-efficiency trade-off.

Load-bearing premise

The absorption of basis transformations by adjacent layers remains valid once the query projection is replaced by the nonlinear residual, and the added MLP does not create optimization instabilities at the tested model scale.
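
The linear half of this premise can be checked directly. A minimal sketch, assuming single-head dot-product logits of the form X W_Q W_Kᵀ Xᵀ: any linear W_Q folds exactly into a redefined key projection, which is the sense in which W_Q can be set to identity.

    import torch

    torch.manual_seed(0)
    n, d = 5, 16
    X = torch.randn(n, d)
    W_Q, W_K = torch.randn(d, d), torch.randn(d, d)

    # Logits depend on W_Q only through the product W_Q @ W_K.T, so the
    # query projection can be absorbed into the adjacent key projection.
    logits = (X @ W_Q) @ (X @ W_K).T
    W_K_absorbed = W_K @ W_Q.T
    logits_identity_query = X @ (X @ W_K_absorbed).T

    assert torch.allclose(logits, logits_identity_query, atol=1e-4)

With Q(X) = X + f_θ(X), the terms involving f_θ cannot be moved into W_K this way; whether the anchor keeps the network close to this absorbable regime is exactly what the ||f_θ(X)|| / ||X|| measurements promised in the rebuttal would test.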

What would settle it

Train the same nonlinear-query model at substantially larger scale or on a different modality and check whether the reported loss and perplexity gains disappear.

Figures

Figures reproduced from arXiv: 2603.13381 by Marko Karbevski.

Figure 1. Training dynamics (steps 5k–60k). Solid: validation; dashed: training curve for the best con…

Figure 2. Relative improvement over baseline (steps 1k to 59k). Nonlinear configurations: 84.97M param…
Original abstract

Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the Query projection $W_Q$ may be set to identity without noticeable performance deterioration. This is possible because attention depends on $X$ only through the products $XW_Q, XW_K, XW_V$, allowing basis transformations to be absorbed by adjacent layers and propagated through the network. We replace $W_Q \in \mathbb{R}^{d \times d}$ with a nonlinear residual of the form $Q(X) = X + f_\theta(X)$, where $f_\theta$ is a bottleneck MLP with $d^2 + O(d)$ parameters. The identity term anchors the nonlinearity to a known-good prior. Experiments on GPT-3 small style models show consistent improvement over the baseline ($2.40\%$ lower validation log-loss, $6.81\%$ lower perplexity), comfortably outperforming a model with 12.5\% more non-embedding parameters. These results motivate investigation at larger scales and across modalities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that the linear query projection W_Q in decoder-only transformers can be replaced by a nonlinear residual Q(X) = X + f_θ(X) (bottleneck MLP with d² + O(d) parameters) without loss of the algebraic absorption property that justifies the identity anchor. Experiments on GPT-3-small-style models report 2.40% lower validation log-loss and 6.81% lower perplexity versus the linear baseline, while also outperforming a model with 12.5% more non-embedding parameters.

Significance. If the improvement is shown to arise from the controlled nonlinearity rather than added capacity, the result would indicate that modest, identity-anchored nonlinearities in attention projections can be beneficial at small scale and motivate scaling studies. The identity residual is a constructive design choice that reduces optimization risk relative to an unconstrained nonlinear projection.

major comments (3)
  1. [Abstract] The algebraic argument that any basis change in W_Q can be absorbed into adjacent linear layers holds only for the linear case. Replacing W_Q by the nonlinear residual Q(X) = X + f_θ(X) introduces terms that cannot be absorbed by the same mechanism; the manuscript does not demonstrate that the learned f_θ remains small enough for the identity anchor to dominate or that the effective query stays approximately linear.
  2. [Experiments] The reported 2.40% log-loss gain is compared against a baseline and against a model with 12.5% more non-embedding parameters, but no control is described that adds an equivalent number of linear parameters (e.g., a wider linear projection or extra linear layer) while keeping the query strictly linear. Without this control it is impossible to separate the effect of nonlinearity from the effect of extra capacity; one possible matched-capacity construction is sketched after this report.
  3. [Experiments] The abstract states concrete percentage improvements, yet no information is given on whether the baseline and the nonlinear model were trained with identical optimizer settings, learning-rate schedules, number of seeds, or data order. These details are required to assess whether the observed gap is statistically reliable.
minor comments (2)
  1. [Abstract] The precise bottleneck width, depth, and activation function of f_θ are not stated in the abstract; these hyperparameters should be reported explicitly so that the added parameter count can be verified.
  2. [Method] Notation: the symbol f_θ is introduced without an explicit equation for the MLP architecture (e.g., hidden dimension, residual connections inside the MLP). Adding this equation would improve reproducibility.
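
One concrete reading of the control requested in major comment 2: keep the bottleneck shapes of f_θ but delete the activation, so the query stays strictly linear at an identical parameter count. A sketch under the same illustrative width assumption as above; this is not a construction the paper describes.

    import torch.nn as nn

    def count_params(m: nn.Module) -> int:
        return sum(p.numel() for p in m.parameters())

    d = 768  # illustrative GPT-3-small-style width

    # Nonlinear residual branch (assumed architecture).
    nonlinear = nn.Sequential(nn.Linear(d, d // 2), nn.GELU(), nn.Linear(d // 2, d))

    # Matched-capacity *linear* control: identical shapes, activation removed.
    # Q(X) = X + (X A + a) B + b contains no elementwise nonlinearity, so any
    # remaining gap against this control isolates the nonlinearity itself
    # rather than the extra parameters.
    linear_control = nn.Sequential(nn.Linear(d, d // 2), nn.Linear(d // 2, d))

    assert count_params(nonlinear) == count_params(linear_control)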

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each point below and will revise the manuscript to incorporate additional analysis and controls.

Point-by-point responses
  1. Referee: [Abstract] The algebraic argument that any basis change in W_Q can be absorbed into adjacent linear layers holds only for the linear case. Replacing W_Q by the nonlinear residual Q(X) = X + f_θ(X) introduces terms that cannot be absorbed by the same mechanism; the manuscript does not demonstrate that the learned f_θ remains small enough for the identity anchor to dominate or that the effective query stays approximately linear.

    Authors: We agree that the strict algebraic absorption property applies only to linear projections. The identity residual is intended to keep the nonlinearity small and close to the linear case. In the revision we will add measurements of ||f_θ(X)|| relative to ||X|| across layers and training steps to quantify how closely the effective query remains linear; a measurement sketch follows these responses. revision: partial

  2. Referee: [Experiments] The reported 2.40% log-loss gain is compared against a baseline and against a model with 12.5% more non-embedding parameters, but no control is described that adds an equivalent number of linear parameters (e.g., a wider linear projection or extra linear layer) while keeping the query strictly linear. Without this control it is impossible to separate the effect of nonlinearity from the effect of extra capacity.

    Authors: The referee is correct that a matched-capacity linear control is needed to isolate the nonlinearity. We will add this experiment in the revision by training a model with an expanded linear W_Q whose parameter count matches the nonlinear residual. revision: yes

  3. Referee: [Experiments] The abstract states concrete percentage improvements, yet no information is given on whether the baseline and the nonlinear model were trained with identical optimizer settings, learning-rate schedules, number of seeds, or data order. These details are required to assess whether the observed gap is statistically reliable.

    Authors: We will expand the experimental details section to report identical optimizer and learning-rate schedules, the use of three random seeds, and identical data ordering for all compared models. revision: yes
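
The measurement promised in response 1 is cheap to instrument. A minimal sketch; the module and attribute names here are hypothetical, not the paper's code. The per-token ratio ||f_θ(x)|| / ||x|| tracks how far the effective query drifts from the identity anchor.

    import torch

    @torch.no_grad()
    def residual_ratio(f_theta: torch.nn.Module, x: torch.Tensor) -> float:
        """Mean per-token ||f_theta(x)|| / ||x|| for one layer's input.

        Ratios well below 1 mean the identity anchor dominates and the
        effective query stays near-linear; ratios near or above 1 would
        undercut the absorption argument.
        """
        return (f_theta(x).norm(dim=-1) / x.norm(dim=-1)).mean().item()

    # Logged per layer and per checkpoint, e.g. (hypothetical names):
    # for i, layer in enumerate(model.layers):
    #     log(step, i, residual_ratio(layer.attn.query.f_theta, inputs[i]))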

Circularity Check

0 steps flagged

No circularity: empirical result independent of algebraic justification

full rationale

The paper's core claim is an empirical performance gain (2.40% lower validation log-loss) from replacing linear W_Q with the nonlinear residual Q(X) = X + f_θ(X). The algebraic justification for the identity anchor is presented as background from 'recent algebraic analysis' rather than a derivation internal to this work that reduces to fitted parameters by construction. No equation in the provided text equates the reported improvement to the input capacity or to a self-citation chain; the experiments include a control with 12.5% more parameters, supplying an external benchmark. The result therefore does not collapse to a tautology or renamed fit.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that linear basis changes can still be absorbed when the query projection is replaced by a nonlinear residual, plus the empirical observation that the added MLP improves optimization. No new physical constants or particles are introduced.

free parameters (1)
  • MLP bottleneck width and depth
    The architecture of f_θ is chosen by hand; its exact hidden dimension and number of layers are free parameters that affect the reported gains (see the width sweep after this ledger).
axioms (1)
  • domain assumption: Any linear transformation applied to queries can be absorbed into subsequent layers without changing the overall function.
    Invoked to justify setting the linear part of Q to identity.
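
To make the free parameter concrete, a quick sweep over candidate widths for a hypothetical one-hidden-layer f_θ (the same assumed shape as the sketches above) shows how the added parameter count scales against the d² cost of the linear W_Q it replaces:

    d = 768  # illustrative model width

    for b in (d // 8, d // 4, d // 2, d):
        added = 2 * d * b + b + d  # (d*b + b) + (b*d + d), with biases
        print(f"width {b:4d}: {added:9,d} params ({added / d**2:.2f} x d^2)")

Only b = d/2 reproduces the stated d² + O(d) budget; at larger widths the parameter-efficiency argument weakens, which is the trade-off flagged under "reading between the lines".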

pith-pipeline@v0.9.0 · 5466 in / 1310 out tokens · 30714 ms · 2026-05-15T12:58:43.278825+00:00 · methodology


Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 5 internal anchors

  1. [1] Maksym Andriushchenko, Francesco D’Angelo, Aditya Varre, and Nicolas Flammarion. Why do we need weight decay in modern deep learning? In Advances in Neural Information Processing Systems, 2024.
  2. [2] Stella Biderman, Sid Black, Charles Foster, Leo Gao, Eric Hallahan, Horace He, Ben Wang, and Phil Wang. Rotary embeddings: A relative revolution. EleutherAI Blog, April 2021. https://blog.eleuther.ai/rotary-embeddings/
  3. [3] Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT-NeoX-20B: An open-source autoregressive language model. In Proceedings of the ACL Workshop on C…
  4. [4] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with Performers. In International Conference on Learning Representations, 2021.
  5. [5] Haonan Dong, Wenhao Zhu, Guojie Song, and Liang Wang. AuroRA: Breaking low-rank bottleneck of LoRA with nonlinear mapping. In Advances in Neural Information Processing Systems, 2025.
  6. [6] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2021.
  7. [7] Nils Graef. Transformer tricks: Removing weights for skipless transformers. arXiv preprint arXiv:2404.12362, 2024.
  8. [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  9. [9] Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andrew Brock, Samuel L. Smith, and Yee Whye Teh. Deep transformers without shortcuts: Modifying self-attention for faithful signal propagation. In International Conference on Learning Representations, 2023.
  10. [10] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
  11. [11] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
  12. [12] Yiping Ji, Hemanth Saratchandran, Cameron Gordon, Zeyu Zhang, and Simon Lucey. Efficient learning with sine-activated low-rank matrices. arXiv preprint arXiv:2403.19243, 2024.
  13. [13] Yiping Ji, Hemanth Saratchandran, Peyman Moghadam, and Simon Lucey. Always skip attention. In ICCV, 2025.
  14. [14] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  15. [15] Marko Karbevski and Antonij Mijoski. Key and value weights are probably all you need: On the necessity of the query, key, value weight triplet in self-attention transformers. arXiv preprint arXiv:2510.23912, 2025.
  16. [16] Andrej Karpathy. NanoGPT. https://github.com/karpathy/nanoGPT, 2023.
  17. [17] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165, 2020.
  18. [18] Yuchen Li, Yuanzhi Li, and Andrej Risteski. How do transformers learn topic structure: Towards a mechanistic understanding. In International Conference on Machine Learning, pages 19689–19729, 2023.
  19. [19] Yinqiao Li, Linqi Song, and Hanxu Hou. LoRAN: Improved low-rank adaptation by a non-linear transformation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3134–3143, 2024.
  20. [20] Clare Lyle, Zeyu Zheng, Evgenii Nikishin, Bernardo Avila Pires, Razvan Pascanu, and Will Dabney. Normalization and effective learning rates in reinforcement learning. In Advances in Neural Information Processing Systems, 2024.
  21. [21] Alireza Morsali, Moein Heidari, Samin Heydarian, and Tohid Abedini. MLP-Attention: Improving transformer architectures with MLP attention weights. Tiny Papers @ ICLR, 2023.
  22. [22] Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. In NeurIPS, 2025.
  23. [23] David So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V. Le. Primer: Searching for efficient transformers for language modeling. In NeurIPS, 2021.
  24. [24] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
  25. [25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
  26. [26] Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Kuai Yu, et al. mHC: Manifold-constrained hyper-connections. arXiv preprint arXiv:2512.24880, 2025.
  27. [27] Muhan Zhang. Neural attention: Enhancing QKV calculation in self-attention mechanism with neural networks. arXiv preprint arXiv:2310.11398, 2023.
  28. [28] Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, and Xun Zhou. Hyper-Connections. In International Conference on Learning Representations, 2025.