pith. machine review for the scientific record.

arxiv: 2603.13381 · v2 · submitted 2026-03-11 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Beyond Linearity in Attention Projections: The Case for Nonlinear Queries


Pith reviewed 2026-05-15 12:58 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords transformer attention · query projection · nonlinear residual · bottleneck MLP · language modeling · decoder-only models

The pith

Replacing the linear query projection with identity plus a small bottleneck MLP improves validation log-loss by 2.4 percent in GPT-style models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the query projection in transformer attention can be made nonlinear while preserving the property that basis changes are absorbed by adjacent layers. It does this by defining the query as the input plus a residual bottleneck MLP with d² + O(d) parameters. Experiments on small decoder-only models show this change yields lower loss and perplexity than the standard linear query and also beats a wider baseline that adds more parameters. A reader would care because the result indicates that attention need not stay fully linear to function, opening a low-cost route to greater expressivity inside the attention block.
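
For concreteness, here is one way the stated count arises, under an assumption the abstract leaves open (it does not fix the MLP architecture; see the referee's first minor comment): a one-hidden-layer bottleneck MLP of width b, with biases, has

    |θ| = (db + b) + (bd + d) = 2bd + b + d,

so the stated d² + O(d) budget corresponds to choosing b = d/2. That width is an inference, not something the paper states.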

Core claim

The central claim is that setting the query to Q(X) = X + f_θ(X), where f_θ is a bottleneck MLP, allows the attention mechanism to benefit from nonlinearity while the identity term keeps the algebraic absorption of basis transformations intact. On GPT-3-small-style models this replacement produces 2.40 percent lower validation log-loss and 6.81 percent lower perplexity, and the gain exceeds what is obtained by increasing non-embedding parameters by 12.5 percent.

What carries the argument

The nonlinear residual query Q(X) = X + f_θ(X), where f_θ is a bottleneck MLP; the identity term anchors the function so that linear basis changes can still be absorbed by neighboring layers.
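
A minimal PyTorch sketch of this construction, under illustrative assumptions the abstract does not fix (one hidden layer, GELU activation, bottleneck width d/2; the referee's first minor comment flags exactly these omissions):

    import torch
    import torch.nn as nn

    class NonlinearResidualQuery(nn.Module):
        """Q(X) = X + f_theta(X), with f_theta a bottleneck MLP.

        Hidden width, depth, and activation are assumptions, not values
        from the paper; width d//2 is chosen so the parameter count
        lands at d^2 + O(d), matching the stated budget.
        """
        def __init__(self, d: int):
            super().__init__()
            self.f_theta = nn.Sequential(
                nn.Linear(d, d // 2),  # d*(d/2) + d/2 parameters
                nn.GELU(),
                nn.Linear(d // 2, d),  # (d/2)*d + d parameters
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # The identity term anchors the query to the known-good linear case.
            return x + self.f_theta(x)

    d = 768  # GPT-3-small-style model width
    q = NonlinearResidualQuery(d)
    print(sum(p.numel() for p in q.parameters()), d * d)  # 590976 vs 589824

Keys and values would keep their usual linear projections; only the query path changes.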

If this is right

  • The performance advantage holds against a linear model that uses 12.5 percent more non-embedding parameters.
  • The identity anchor allows the rest of the network to continue absorbing linear changes without retraining the entire stack.
  • Only the query projection needs the nonlinearity; keys and values can stay linear.
  • The added parameter count remains modest because the MLP is bottlenecked.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same residual construction could be tested on the key or value projections to see if further gains appear.
  • If the absorption property generalizes, the approach might be combined with other low-rank or sparse attention variants.
  • At very large scales the optimal bottleneck width inside the MLP may need to grow, changing the parameter-efficiency trade-off.

Load-bearing premise

The absorption of basis transformations by adjacent layers remains valid once the query projection is replaced by the nonlinear residual, and the added MLP does not create optimization instabilities at the tested model scale.
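
The linear half of this premise can be checked directly. A minimal sketch, assuming single-head dot-product logits of the form X W_Q W_Kᵀ Xᵀ: any linear W_Q folds exactly into a redefined key projection, which is the sense in which W_Q can be set to identity.

    import torch

    torch.manual_seed(0)
    n, d = 5, 16
    X = torch.randn(n, d)
    W_Q, W_K = torch.randn(d, d), torch.randn(d, d)

    # Logits depend on W_Q only through the product W_Q @ W_K.T, so the
    # query projection can be absorbed into the adjacent key projection.
    logits = (X @ W_Q) @ (X @ W_K).T
    W_K_absorbed = W_K @ W_Q.T
    logits_identity_query = X @ (X @ W_K_absorbed).T

    assert torch.allclose(logits, logits_identity_query, atol=1e-4)

With Q(X) = X + f_θ(X), the terms involving f_θ cannot be moved into W_K this way; whether the anchor keeps the network close to this absorbable regime is exactly what the ||f_θ(X)|| / ||X|| measurements promised in the rebuttal would test.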

What would settle it

Train the same nonlinear-query model at substantially larger scale or on a different modality and check whether the reported loss and perplexity gains disappear.

Figures

Figures reproduced from arXiv: 2603.13381 by Marko Karbevski.

Figure 1. Training dynamics (steps 5k–60k). Solid: validation; dashed: training curve for the best con…

Figure 2. Relative improvement over baseline (steps 1k to 59k). Nonlinear configurations: 84.97M param…
Original abstract

Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the Query projection $W_Q$ may be set to identity without noticeable performance deterioration. This is possible because attention depends on $X$ only through the products $XW_Q, XW_K, XW_V$, allowing basis transformations to be absorbed by adjacent layers and propagated through the network. We replace $W_Q \in \mathbb{R}^{d \times d}$ with a nonlinear residual of the form $Q(X) = X + f_\theta(X)$, where $f_\theta$ is a bottleneck MLP with $d^2 + O(d)$ parameters. The identity term anchors the nonlinearity to a known-good prior. Experiments on GPT-3 small style models show consistent improvement over the baseline ($2.40\%$ lower validation log-loss, $6.81\%$ lower perplexity), comfortably outperforming a model with 12.5\% more non-embedding parameters. These results motivate investigation at larger scales and across modalities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that the linear query projection W_Q in decoder-only transformers can be replaced by a nonlinear residual Q(X) = X + f_θ(X) (bottleneck MLP with d² + O(d) parameters) without loss of the algebraic absorption property that justifies the identity anchor. Experiments on GPT-3-small-style models report 2.40% lower validation log-loss and 6.81% lower perplexity versus the linear baseline, while also outperforming a model with 12.5% more non-embedding parameters.

Significance. If the improvement is shown to arise from the controlled nonlinearity rather than added capacity, the result would indicate that modest, identity-anchored nonlinearities in attention projections can be beneficial at small scale and motivate scaling studies. The identity residual is a constructive design choice that reduces optimization risk relative to an unconstrained nonlinear projection.

major comments (3)
  1. [Abstract] The algebraic argument that any basis change in W_Q can be absorbed into adjacent linear layers holds only for the linear case. Replacing W_Q by the nonlinear residual Q(X) = X + f_θ(X) introduces terms that cannot be absorbed by the same mechanism; the manuscript does not demonstrate that the learned f_θ remains small enough for the identity anchor to dominate or that the effective query stays approximately linear.
  2. [Experiments] The reported 2.40% log-loss gain is compared against a baseline and against a model with 12.5% more non-embedding parameters, but no control is described that adds an equivalent number of linear parameters (e.g., a wider linear projection or extra linear layer) while keeping the query strictly linear. Without this control it is impossible to separate the effect of nonlinearity from the effect of extra capacity; one possible matched-capacity construction is sketched after this report.
  3. [Experiments] The abstract states concrete percentage improvements, yet no information is given on whether the baseline and the nonlinear model were trained with identical optimizer settings, learning-rate schedules, number of seeds, or data order. These details are required to assess whether the observed gap is statistically reliable.
minor comments (2)
  1. [Abstract] The precise bottleneck width, depth, and activation function of f_θ are not stated in the abstract; these hyperparameters should be reported explicitly so that the added parameter count can be verified.
  2. [Method] Notation: the symbol f_θ is introduced without an explicit equation for the MLP architecture (e.g., hidden dimension, residual connections inside the MLP). Adding this equation would improve reproducibility.
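
One concrete reading of the control requested in major comment 2: keep the bottleneck shapes of f_θ but delete the activation, so the query stays strictly linear at an identical parameter count. A sketch under the same illustrative width assumption as above; this is not a construction the paper describes.

    import torch.nn as nn

    def count_params(m: nn.Module) -> int:
        return sum(p.numel() for p in m.parameters())

    d = 768  # illustrative GPT-3-small-style width

    # Nonlinear residual branch (assumed architecture).
    nonlinear = nn.Sequential(nn.Linear(d, d // 2), nn.GELU(), nn.Linear(d // 2, d))

    # Matched-capacity *linear* control: identical shapes, activation removed.
    # Q(X) = X + (X A + a) B + b contains no elementwise nonlinearity, so any
    # remaining gap against this control isolates the nonlinearity itself
    # rather than the extra parameters.
    linear_control = nn.Sequential(nn.Linear(d, d // 2), nn.Linear(d // 2, d))

    assert count_params(nonlinear) == count_params(linear_control)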

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each point below and will revise the manuscript to incorporate additional analysis and controls.

Point-by-point responses
  1. Referee: [Abstract] The algebraic argument that any basis change in W_Q can be absorbed into adjacent linear layers holds only for the linear case. Replacing W_Q by the nonlinear residual Q(X) = X + f_θ(X) introduces terms that cannot be absorbed by the same mechanism; the manuscript does not demonstrate that the learned f_θ remains small enough for the identity anchor to dominate or that the effective query stays approximately linear.

    Authors: We agree that the strict algebraic absorption property applies only to linear projections. The identity residual is intended to keep the nonlinearity small and close to the linear case. In the revision we will add measurements of ||f_θ(X)|| relative to ||X|| across layers and training steps to quantify how closely the effective query remains linear; a measurement sketch follows these responses. revision: partial

  2. Referee: [Experiments] The reported 2.40% log-loss gain is compared against a baseline and against a model with 12.5% more non-embedding parameters, but no control is described that adds an equivalent number of linear parameters (e.g., a wider linear projection or extra linear layer) while keeping the query strictly linear. Without this control it is impossible to separate the effect of nonlinearity from the effect of extra capacity.

    Authors: The referee is correct that a matched-capacity linear control is needed to isolate the nonlinearity. We will add this experiment in the revision by training a model with an expanded linear W_Q whose parameter count matches the nonlinear residual. revision: yes

  3. Referee: [Experiments] The abstract states concrete percentage improvements, yet no information is given on whether the baseline and the nonlinear model were trained with identical optimizer settings, learning-rate schedules, number of seeds, or data order. These details are required to assess whether the observed gap is statistically reliable.

    Authors: We will expand the experimental details section to report identical optimizer and learning-rate schedules, the use of three random seeds, and identical data ordering for all compared models. revision: yes
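
The measurement promised in response 1 is cheap to instrument. A minimal sketch; the module and attribute names here are hypothetical, not the paper's code. The per-token ratio ||f_θ(x)|| / ||x|| tracks how far the effective query drifts from the identity anchor.

    import torch

    @torch.no_grad()
    def residual_ratio(f_theta: torch.nn.Module, x: torch.Tensor) -> float:
        """Mean per-token ||f_theta(x)|| / ||x|| for one layer's input.

        Ratios well below 1 mean the identity anchor dominates and the
        effective query stays near-linear; ratios near or above 1 would
        undercut the absorption argument.
        """
        return (f_theta(x).norm(dim=-1) / x.norm(dim=-1)).mean().item()

    # Logged per layer and per checkpoint, e.g. (hypothetical names):
    # for i, layer in enumerate(model.layers):
    #     log(step, i, residual_ratio(layer.attn.query.f_theta, inputs[i]))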

Circularity Check

0 steps flagged

No circularity: empirical result independent of algebraic justification

full rationale

The paper's core claim is an empirical performance gain (2.40% lower validation log-loss) from replacing linear W_Q with the nonlinear residual Q(X) = X + f_θ(X). The algebraic justification for the identity anchor is presented as background from 'recent algebraic analysis' rather than a derivation internal to this work that reduces to fitted parameters by construction. No equation in the provided text equates the reported improvement to the input capacity or to a self-citation chain; the experiments include a control with 12.5% more parameters, supplying an external benchmark. The result therefore does not collapse to a tautology or renamed fit.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that linear basis changes can still be absorbed when the query projection is replaced by a nonlinear residual, plus the empirical observation that the added MLP improves optimization. No new physical constants or particles are introduced.

free parameters (1)
  • MLP bottleneck width and depth
    The architecture of f_θ is chosen by hand; its exact hidden dimension and number of layers are free parameters that affect the reported gains (see the width sweep after this ledger).
axioms (1)
  • domain assumption: Any linear transformation applied to queries can be absorbed into subsequent layers without changing the overall function.
    Invoked to justify setting the linear part of Q to identity.
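
To make the free parameter concrete, a quick sweep over candidate widths for a hypothetical one-hidden-layer f_θ (the same assumed shape as the sketches above) shows how the added parameter count scales against the d² cost of the linear W_Q it replaces:

    d = 768  # illustrative model width

    for b in (d // 8, d // 4, d // 2, d):
        added = 2 * d * b + b + d  # (d*b + b) + (b*d + d), with biases
        print(f"width {b:4d}: {added:9,d} params ({added / d**2:.2f} x d^2)")

Only b = d/2 reproduces the stated d² + O(d) budget; at larger widths the parameter-efficiency argument weakens, which is the trade-off flagged under "reading between the lines".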

pith-pipeline@v0.9.0 · 5466 in / 1310 out tokens · 30714 ms · 2026-05-15T12:58:43.278825+00:00 · methodology


Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 5 internal anchors

  1. [1] Maksym Andriushchenko, Francesco D’Angelo, Aditya Varre, and Nicolas Flammarion. Why do we need weight decay in modern deep learning? In Advances in Neural Information Processing Systems, 2024.
  2. [2] Stella Biderman, Sid Black, Charles Foster, Leo Gao, Eric Hallahan, Horace He, Ben Wang, and Phil Wang. Rotary embeddings: A relative revolution. EleutherAI Blog, April 2021. https://blog.eleuther.ai/rotary-embeddings/
  3. [3] Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT-NeoX-20B: An open-source autoregressive language model. In Proceedings of the ACL Workshop on C…
  4. [4] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with Performers. In International Conference on Learning Representations, 2021.
  5. [5] Haonan Dong, Wenhao Zhu, Guojie Song, and Liang Wang. AuroRA: Breaking low-rank bottleneck of LoRA with nonlinear mapping. In Advances in Neural Information Processing Systems, 2025.
  6. [6] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2021.
  7. [7] Nils Graef. Transformer tricks: Removing weights for skipless transformers. arXiv preprint arXiv:2404.12362, 2024.
  8. [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  9. [9] Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andrew Brock, Samuel L. Smith, and Yee Whye Teh. Deep transformers without shortcuts: Modifying self-attention for faithful signal propagation. In International Conference on Learning Representations, 2023.
  10. [10] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
  11. [11] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
  12. [12] Yiping Ji, Hemanth Saratchandran, Cameron Gordon, Zeyu Zhang, and Simon Lucey. Efficient learning with sine-activated low-rank matrices. arXiv preprint arXiv:2403.19243, 2024.
  13. [13] Yiping Ji, Hemanth Saratchandran, Peyman Moghadam, and Simon Lucey. Always skip attention. In ICCV, 2025.
  14. [14] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  15. [15] Marko Karbevski and Antonij Mijoski. Key and value weights are probably all you need: On the necessity of the query, key, value weight triplet in self-attention transformers. arXiv preprint arXiv:2510.23912, 2025.
  16. [16] Andrej Karpathy. NanoGPT. https://github.com/karpathy/nanoGPT, 2023.
  17. [17] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165, 2020.
  18. [18] Yuchen Li, Yuanzhi Li, and Andrej Risteski. How do transformers learn topic structure: Towards a mechanistic understanding. In International Conference on Machine Learning, pages 19689–19729, 2023.
  19. [19] Yinqiao Li, Linqi Song, and Hanxu Hou. LoRAN: Improved low-rank adaptation by a non-linear transformation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3134–3143, 2024.
  20. [20] Clare Lyle, Zeyu Zheng, Evgenii Nikishin, Bernardo Avila Pires, Razvan Pascanu, and Will Dabney. Normalization and effective learning rates in reinforcement learning. In Advances in Neural Information Processing Systems, 2024.
  21. [21] Alireza Morsali, Moein Heidari, Samin Heydarian, and Tohid Abedini. MLP-Attention: Improving transformer architectures with MLP attention weights. Tiny Papers @ ICLR, 2023.
  22. [22] Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. In NeurIPS, 2025.
  23. [23] David So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V. Le. Primer: Searching for efficient transformers for language modeling. In NeurIPS, 2021.
  24. [24] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
  25. [25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
  26. [26] Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Kuai Yu, et al. mHC: Manifold-constrained hyper-connections. arXiv preprint arXiv:2512.24880, 2025.
  27. [27] Muhan Zhang. Neural attention: Enhancing QKV calculation in self-attention mechanism with neural networks. arXiv preprint arXiv:2310.11398, 2023.
  28. [28] Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, and Xun Zhou. Hyper-Connections. In International Conference on Learning Representations, 2025.