Beyond Linearity in Attention Projections: The Case for Nonlinear Queries

Marko Karbevski

arxiv: 2603.13381 · v3 · pith:I3CPQ63Nnew · submitted 2026-03-11 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-15 12:58 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords transformer attention · query projection · nonlinear residual · bottleneck MLP · language modeling · decoder-only models

The pith

Replacing the linear query projection with identity plus a small bottleneck MLP improves validation log-loss by 2.4 percent in GPT-style models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the query projection in transformer attention can be made nonlinear while preserving the property that basis changes are absorbed by adjacent layers. It does this by defining the query as the input plus a residual bottleneck MLP with roughly d² + O(d) parameters. Experiments on small decoder-only models show that this change yields lower loss and perplexity than the standard linear query and also beats a wider baseline that adds more parameters. A reader would care because the result indicates that attention need not stay fully linear to function, opening a low-cost route to greater expressivity inside the attention block.

Core claim

The central claim is that setting the query to Q(X) = X + f_θ(X), where f_θ is a bottleneck MLP, lets the attention mechanism benefit from nonlinearity while the identity term keeps the algebraic absorption of basis transformations intact. On GPT-3-small-style models this replacement produces 2.40 percent lower validation log-loss and 6.81 percent lower perplexity, and the gain exceeds what is obtained by increasing non-embedding parameters by 12.5 percent.
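The two headline percentages can be cross-checked against each other. The sketch below assumes that perplexity is the exponential of the mean per-token log-loss in nats and that both figures come from the same pair of checkpoints; neither assumption is stated in the abstract, and the function name is illustrative. Under those assumptions a relative loss drop a and a relative perplexity drop b satisfy exp(-a·L) = 1 - b, which pins down the baseline loss L.

```python
import math

def implied_baseline_loss(rel_loss_drop, rel_ppl_drop):
    """Solve exp(-rel_loss_drop * L) = 1 - rel_ppl_drop for L, in nats.

    Assumes perplexity = exp(mean log-loss); this is a consistency check,
    not a number reported by the paper.
    """
    return -math.log(1.0 - rel_ppl_drop) / rel_loss_drop

baseline = implied_baseline_loss(0.0240, 0.0681)   # reported 2.40% and 6.81%
improved = baseline * (1.0 - 0.0240)               # loss after the 2.40% drop
print(round(baseline, 2), round(improved, 2))
```

The implied baseline is roughly 2.94 nats per token. Because the reported percentages are rounded, the value is only indicative.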

What carries the argument

The nonlinear residual query Q(X) = X + f_θ(X), where f_θ is a bottleneck MLP; the identity term anchors the function so that linear basis changes can still be absorbed by neighboring layers.
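A minimal, dependency-free sketch of this query for a single token vector follows. Several details are assumptions for illustration rather than the paper's choices: the two-layer shape d → r → d, the GELU activation, the zero-initialized up-projection (which makes Q(x) = x at initialization), and all function names.

```python
import math
import random

def gelu(x):
    """tanh approximation of GELU; the activation is an assumption."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def init_bottleneck(d, r, seed=0):
    """Hypothetical init: random down-projection, zero up-projection."""
    rng = random.Random(seed)
    return {
        "W_down": [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(r)],
        "b_down": [0.0] * r,
        "W_up": [[0.0] * r for _ in range(d)],
        "b_up": [0.0] * d,
    }

def f_theta(params, x):
    """Bottleneck MLP: d -> r -> d with one nonlinearity in between."""
    h = [gelu(a + b) for a, b in zip(matvec(params["W_down"], x), params["b_down"])]
    return [a + b for a, b in zip(matvec(params["W_up"], h), params["b_up"])]

def nonlinear_query(params, x):
    """Q(x) = x + f_theta(x): identity anchor plus nonlinear residual."""
    return [xi + fi for xi, fi in zip(x, f_theta(params, x))]
```

With the zero up-projection the query equals the input before any training, which is one simple way to realize the "known-good prior" the abstract mentions; whether the paper initializes this way is not stated.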

If this is right

The performance advantage holds against a linear model that uses 12.5 percent more non-embedding parameters.
The identity anchor allows the rest of the network to continue absorbing linear changes without retraining the entire stack.
Only the query projection needs the nonlinearity; keys and values can stay linear.
The added parameter count remains modest because the MLP is bottlenecked.
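The parameter accounting behind the last point can be sketched as follows. The two-layer shape with biases and the bottleneck width r = d/2 are assumptions chosen so that the count lands on d² + O(d); the abstract does not state the actual width. The value d = 768 is the model width of GPT-3 small.

```python
def linear_query_params(d):
    """Parameters of a bias-free linear query projection W_Q (d x d)."""
    return d * d

def bottleneck_mlp_params(d, r):
    """Parameters of an assumed d -> r -> d MLP with biases."""
    down = d * r + r   # down-projection weights and bias
    up = r * d + d     # up-projection weights and bias
    return down + up

d = 768        # GPT-3 small model width
r = d // 2     # assumed bottleneck width: 2 * d * r == d * d
extra = bottleneck_mlp_params(d, r) - linear_query_params(d)
print(linear_query_params(d), bottleneck_mlp_params(d, r), extra)
```

Under these assumptions the MLP costs only the bias terms (r + d per layer) more than the linear W_Q it replaces, so the bottleneck width is the knob that sets the parameter budget.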

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same residual construction could be tested on the key or value projections to see if further gains appear.
If the absorption property generalizes, the approach might be combined with other low-rank or sparse attention variants.
At very large scales the optimal bottleneck width inside the MLP may need to grow, changing the parameter-efficiency trade-off.

Load-bearing premise

Basis transformations absorbed by adjacent layers remain stable once the query projection is replaced by the nonlinear residual, and the added MLP does not create optimization instabilities at the tested model scale.

What would settle it

Train the same nonlinear-query model at substantially larger scale or on a different modality and check whether the reported loss and perplexity gains disappear.

Figures

Figures reproduced from arXiv: 2603.13381 by Marko Karbevski.

**Figure 2.** Relative improvement over baseline (steps 1k to 59k). Nonlinear configurations: 84.97M param

Original abstract

Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the Query projection $W_Q$ may be set to identity without noticeable performance deterioration. This is possible because attention depends on $X$ only through the products $XW_Q, XW_K, XW_V$, allowing basis transformations to be absorbed by adjacent layers and propagated through the network. We replace $W_Q \in \mathbb{R}^{d \times d}$ with a nonlinear residual of the form $Q(X) = X + f_\theta(X)$, where $f_\theta$ is a bottleneck MLP with $d^2 + O(d)$ parameters. The identity term anchors the nonlinearity to a known-good prior. Experiments on GPT-3 small style models show consistent improvement over the baseline ($2.40\%$ lower validation log-loss, $6.81\%$ lower perplexity), comfortably outperforming a model with 12.5\% more non-embedding parameters. These results motivate investigation at larger scales and across modalities.
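The absorption idea in the abstract can be illustrated on the attention scores of a single head. This is only a sketch: it omits the softmax and the $1/\sqrt{d_k}$ scaling, the folded matrix $W_K'$ is notation introduced here, and it does not reproduce the cited analysis of how basis changes propagate through the whole network. The last line shows the extra term that appears once the query is nonlinear.

```latex
% Single head; softmax and 1/\sqrt{d_k} scaling omitted.
\begin{align*}
S &= (X W_Q)(X W_K)^\top = X \, W_Q W_K^\top X^\top, \\
W_K' &:= W_K W_Q^\top
  \;\Longrightarrow\;
  (X I)(X W_K')^\top = X \, W_Q W_K^\top X^\top = S, \\
\text{with } Q(X) = X + f_\theta(X):\quad
S_{\mathrm{nl}} &= \bigl(X + f_\theta(X)\bigr)(X W_K)^\top
  = X W_K^\top X^\top + f_\theta(X)\, W_K^\top X^\top .
\end{align*}
```

In the linear case the scores depend on $W_Q$ and $W_K$ only through the product $W_Q W_K^\top$, so the query matrix can be folded into the key matrix. The term $f_\theta(X)\, W_K^\top X^\top$ is in general not of the form $X M X^\top$ for a fixed matrix $M$, so it cannot be folded in the same way.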

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Nonlinear query residuals give small consistent gains on tiny models but the benefit could be extra capacity rather than nonlinearity, since the linear absorption argument does not carry over.

The main thing to know is that this paper replaces the linear query projection with Q(X) = X + bottleneck-MLP(X) and reports a 2.4% drop in validation log-loss plus 6.8% lower perplexity on GPT-3-small-style models, beating a baseline that has 12.5% more non-embedding parameters. The change adds only d² + O(d) parameters and keeps training stable via the identity anchor. That is the concrete result worth noting.

The construction itself is new in the sense that prior algebraic work on absorbing linear transformations did not test this exact residual form. The paper does a clean job of stating the motivation from the identity-query result and then showing a minimal empirical test. The numbers are presented directly and the comparison to a heavier model is useful for context.

The soft spots are straightforward. The algebraic justification for setting the query to identity relies on linear products that adjacent layers can absorb; once the nonlinear term is added, that mechanism no longer holds, yet the experiments do not check whether the residual stays small or whether the network converges to something equivalent to a linear query. Without ablations that vary the MLP width or test if the gain disappears when the nonlinearity is removed, it remains possible that the improvement is simply from the added parameters. The abstract also gives no information on random seeds, exact training parity with the larger baseline, or scaling behavior beyond the small model.

This is the kind of targeted tweak that people working on attention variants or parameter-efficient expressivity would want to see. A reader who cares about low-cost ways to adjust capacity inside the attention block will find it worth reading. It deserves peer review because the idea is simple, the reported result is positive, and the open questions about mechanism and scaling are clear enough that referees can address them directly.

Referee Report

3 major / 2 minor

Summary. The paper claims that the linear query projection W_Q in decoder-only transformers can be replaced by a nonlinear residual Q(X) = X + f_θ(X) (bottleneck MLP with d² + O(d) parameters) without loss of the algebraic absorption property that justifies the identity anchor. Experiments on GPT-3-small-style models report 2.40% lower validation log-loss and 6.81% lower perplexity versus the linear baseline, while also outperforming a model with 12.5% more non-embedding parameters.

Significance. If the improvement is shown to arise from the controlled nonlinearity rather than added capacity, the result would indicate that modest, identity-anchored nonlinearities in attention projections can be beneficial at small scale and motivate scaling studies. The identity residual is a constructive design choice that reduces optimization risk relative to an unconstrained nonlinear projection.

major comments (3)

[Abstract] Abstract: the algebraic argument that any basis change in W_Q can be absorbed into adjacent linear layers holds only for the linear case. Replacing W_Q by the nonlinear residual Q(X) = X + f_θ(X) introduces terms that cannot be absorbed by the same mechanism; the manuscript does not demonstrate that the learned f_θ remains small enough for the identity anchor to dominate or that the effective query stays approximately linear.
[Experiments] Experiments section: the reported 2.40% log-loss gain is compared against a baseline and against a model with 12.5% more non-embedding parameters, but no control is described that adds an equivalent number of linear parameters (e.g., a wider linear projection or extra linear layer) while keeping the query strictly linear. Without this control it is impossible to separate the effect of nonlinearity from the effect of extra capacity.
[Experiments] Experiments section: the abstract states concrete percentage improvements, yet no information is given on whether the baseline and the nonlinear model were trained with identical optimizer settings, learning-rate schedules, number of seeds, or data order. These details are required to assess whether the observed gap is statistically reliable.

minor comments (2)

[Abstract] The precise bottleneck width, depth, and activation function of f_θ are not stated in the abstract; these hyperparameters should be reported explicitly so that the added parameter count can be verified.
[Method] Notation: the symbol f_θ is introduced without an explicit equation for the MLP architecture (e.g., hidden dimension, residual connections inside the MLP). Adding this equation would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each point below and will revise the manuscript to incorporate additional analysis and controls.

Referee: [Abstract] Abstract: the algebraic argument that any basis change in W_Q can be absorbed into adjacent linear layers holds only for the linear case. Replacing W_Q by the nonlinear residual Q(X) = X + f_θ(X) introduces terms that cannot be absorbed by the same mechanism; the manuscript does not demonstrate that the learned f_θ remains small enough for the identity anchor to dominate or that the effective query stays approximately linear.

Authors: We agree that the strict algebraic absorption property applies only to linear projections. The identity residual is intended to keep the nonlinearity small and close to the linear case. In the revision we will add measurements of ||f_θ(X)|| relative to ||X|| across layers and training steps to quantify how closely the effective query remains linear.

revision: partial

Referee: [Experiments] Experiments section: the reported 2.40% log-loss gain is compared against a baseline and against a model with 12.5% more non-embedding parameters, but no control is described that adds an equivalent number of linear parameters (e.g., a wider linear projection or extra linear layer) while keeping the query strictly linear. Without this control it is impossible to separate the effect of nonlinearity from the effect of extra capacity.

Authors: The referee is correct that a matched-capacity linear control is needed to isolate the nonlinearity. We will add this experiment in the revision by training a model with an expanded linear W_Q whose parameter count matches the nonlinear residual.

revision: yes

Referee: [Experiments] Experiments section: the abstract states concrete percentage improvements, yet no information is given on whether the baseline and the nonlinear model were trained with identical optimizer settings, learning-rate schedules, number of seeds, or data order. These details are required to assess whether the observed gap is statistically reliable.

Authors: We will expand the experimental details section to report identical optimizer and learning-rate schedules, the use of three random seeds, and identical data ordering for all compared models.

revision: yes
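The diagnostic promised in the first response, the size of f_θ(X) relative to X, could be computed along the following lines. This is a sketch with hypothetical names: it takes per-token input vectors and the matching residual outputs as plain lists and averages the per-token ratio; how the rebuttal's actual measurement would aggregate over tokens, layers, and training steps is not specified.

```python
import math

def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def residual_ratio(xs, fs, eps=1e-12):
    """Mean over tokens of ||f(x)|| / ||x||.

    xs: per-token inputs; fs: matching residual outputs f_theta(x).
    eps guards against zero-norm inputs. Expects at least one token.
    """
    ratios = [l2_norm(f) / max(l2_norm(x), eps) for x, f in zip(xs, fs)]
    return sum(ratios) / len(ratios)

# Toy usage: a residual one tenth the size of its input.
print(residual_ratio([[3.0, 4.0]], [[0.3, 0.4]]))
```

A ratio well below 1 throughout training would suggest that the identity term dominates; a ratio near or above 1 would mean the effective query is far from the linear case.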

Circularity Check

0 steps flagged

No circularity: empirical result independent of algebraic justification

The paper's core claim is an empirical performance gain (2.40% lower validation log-loss) from replacing linear W_Q with the nonlinear residual Q(X) = X + f_θ(X). The algebraic justification for the identity anchor is presented as background from 'recent algebraic analysis' rather than a derivation internal to this work that reduces to fitted parameters by construction. No equation in the provided text equates the reported improvement to the input capacity or to a self-citation chain; the experiments include a control with 12.5% more parameters, supplying an external benchmark. The result therefore does not collapse to a tautology or renamed fit.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that linear basis changes can still be absorbed when the query projection is replaced by a nonlinear residual, plus the empirical observation that the added MLP improves optimization. No new physical constants or particles are introduced.

free parameters (1)

MLP bottleneck width and depth
The architecture of f_θ is chosen by hand; its exact hidden dimension and number of layers are free parameters that affect the reported gains.

axioms (1)

Domain assumption: Any linear transformation applied to queries can be absorbed into subsequent layers without changing the overall function.
Invoked to justify setting the linear part of Q to identity.

pith-pipeline@v0.9.0 · 5466 in / 1310 out tokens · 30714 ms · 2026-05-15T12:58:43.278825+00:00 · methodology
