Beyond Linearity in Attention Projections: The Case for Nonlinear Queries
Pith reviewed 2026-05-15 12:58 UTC · model grok-4.3
The pith
Replacing the linear query projection with identity plus a small bottleneck MLP improves validation log-loss by 2.4 percent in GPT-style models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that setting the query to Q(X) = X + f_θ(X), where f_θ is a bottleneck MLP, lets the attention mechanism benefit from nonlinearity while the identity term keeps the algebraic absorption of basis transformations intact. On GPT-3-small-style models this replacement yields 2.40 percent lower validation log-loss and 6.81 percent lower perplexity, and the gain exceeds what is obtained by increasing non-embedding parameters by 12.5 percent.
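The two reported percentages can be cross-checked against each other. The sketch below assumes that perplexity is exp(loss) with the loss measured in nats per token; the source does not state absolute losses, so the implied baseline value is an inference under that assumption, not a reported number.

```python
import math

# Relative improvements reported in the abstract.
loss_drop = 0.0240   # 2.40% lower validation log-loss
ppl_drop = 0.0681    # 6.81% lower perplexity

# Assumption: perplexity = exp(loss), loss in nats per token.
# Then ppl_new / ppl_old = exp(-loss_drop * baseline_loss), so
# baseline_loss = -ln(1 - ppl_drop) / loss_drop.
implied_baseline_loss = -math.log(1.0 - ppl_drop) / loss_drop
implied_new_loss = (1.0 - loss_drop) * implied_baseline_loss

print(f"implied baseline loss: {implied_baseline_loss:.2f} nats")
print(f"implied new loss:      {implied_new_loss:.2f} nats")
```

Under this assumption the two figures are mutually consistent for a baseline loss of roughly 2.94 nats per token.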
What carries the argument
The nonlinear residual query Q(X) = X + f_θ(X), where f_θ is a bottleneck MLP; the identity term anchors the function so that linear basis changes can still be absorbed by neighboring layers.
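A minimal NumPy sketch of the residual query may make the construction concrete. The source does not give the MLP's width, depth, activation, or initialization, so the single hidden layer, the tanh activation, the zero-initialized output matrix, and the names `init_bottleneck_mlp` and `nonlinear_query` are all illustrative assumptions.

```python
import numpy as np

def init_bottleneck_mlp(d, h, rng):
    """Parameters of a hypothetical f_theta mapping d -> h -> d."""
    return {
        "W1": rng.normal(0.0, d ** -0.5, size=(d, h)),
        "b1": np.zeros(h),
        # Assumption: zero-init the output layer so Q(X) = X at step 0.
        "W2": np.zeros((h, d)),
        "b2": np.zeros(d),
    }

def nonlinear_query(X, params):
    """Q(X) = X + f_theta(X); tanh stands in for the unstated activation."""
    hidden = np.tanh(X @ params["W1"] + params["b1"])
    return X + hidden @ params["W2"] + params["b2"]

rng = np.random.default_rng(0)
d, h = 8, 4                      # toy sizes; h < d makes it a bottleneck
X = rng.normal(size=(5, d))      # 5 tokens, width d
params = init_bottleneck_mlp(d, h, rng)
Q0 = nonlinear_query(X, params)  # equals X under the zero-init assumption
```

With the output layer at zero, the query starts exactly at the identity and moves away from it only as training updates `W2`, which is one way to read the "anchor" described above.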
If this is right
- The performance advantage holds against a linear model that uses 12.5 percent more non-embedding parameters.
- The identity anchor allows the rest of the network to continue absorbing linear changes without retraining the entire stack.
- Only the query projection needs the nonlinearity; keys and values can stay linear.
- The added parameter count remains modest because the MLP is bottlenecked.
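The parameter bookkeeping behind the last point can be sketched as follows. The hidden width h = d/2 is an assumption chosen because it reproduces the d² + O(d) count quoted in the abstract for a single-hidden-layer MLP with biases; the source does not state the width. The value d = 768 is the GPT-3 Small model width, assumed here to match the "GPT-3 small style" setup.

```python
def linear_query_params(d):
    """Parameters of a bias-free linear W_Q of shape (d, d)."""
    return d * d

def bottleneck_mlp_params(d, h):
    """Parameters of a d -> h -> d MLP with biases on both layers."""
    return d * h + h + h * d + d

d = 768        # GPT-3 Small width (assumed to apply here)
h = d // 2     # assumption: the width that yields d^2 + O(d)

linear = linear_query_params(d)
mlp = bottleneck_mlp_params(d, h)
extra = mlp - linear   # net change when the MLP replaces W_Q

print(linear, mlp, extra)
```

Under these assumptions the MLP has 590,976 parameters against 589,824 for the linear projection, a net increase of 1.5·d = 1,152 per layer when f_θ takes the place of W_Q.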
Where Pith is reading between the lines
- The same residual construction could be tested on the key or value projections to see if further gains appear.
- If the absorption property generalizes, the approach might be combined with other low-rank or sparse attention variants.
- At very large scales the optimal bottleneck width inside the MLP may need to grow, changing the parameter-efficiency trade-off.
Load-bearing premise
Basis transformations absorbed by adjacent layers remain stable once the query projection is replaced by the nonlinear residual, and the added MLP does not create optimization instabilities at the tested model scale.
What would settle it
Train the same nonlinear-query model at substantially larger scale or on a different modality and check whether the reported loss and perplexity gains disappear.
Original abstract
Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the Query projection $W_Q$ may be set to identity without noticeable performance deterioration. This is possible because attention depends on $X$ only through the products $XW_Q, XW_K, XW_V$, allowing basis transformations to be absorbed by adjacent layers and propagated through the network. We replace $W_Q \in \mathbb{R}^{d \times d}$ with a nonlinear residual of the form $Q(X) = X + f_\theta(X)$, where $f_\theta$ is a bottleneck MLP with $d^2 + O(d)$ parameters. The identity term anchors the nonlinearity to a known-good prior. Experiments on GPT-3 small style models show consistent improvement over the baseline ($2.40\%$ lower validation log-loss, $6.81\%$ lower perplexity), comfortably outperforming a model with 12.5\% more non-embedding parameters. These results motivate investigation at larger scales and across modalities.
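The absorption argument for the linear case can be checked numerically on a single attention layer. The sketch below is a simplification under stated assumptions: one head, no causal mask, no residual stream or layer norm, and an invertible W_Q. It changes basis to X' = X W_Q and folds W_Q⁻¹ into the key and value projections; the helper names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d = 6, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

# Original layer with an explicit query projection.
out_original = attention(X @ W_Q, X @ W_K, X @ W_V)

# Same layer after the basis change X' = X W_Q: the query projection is
# the identity, and W_Q^{-1} is absorbed into the key and value maps.
X_new = X @ W_Q
W_Q_inv = np.linalg.inv(W_Q)
out_absorbed = attention(X_new, X_new @ (W_Q_inv @ W_K), X_new @ (W_Q_inv @ W_V))

print(np.allclose(out_original, out_absorbed))
```

The check covers only the single-layer identity; propagating the basis change through a full network, as the abstract describes, involves further steps that this sketch does not model.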
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that the linear query projection W_Q in decoder-only transformers can be replaced by a nonlinear residual Q(X) = X + f_θ(X) (bottleneck MLP with d² + O(d) parameters) without loss of the algebraic absorption property that justifies the identity anchor. Experiments on GPT-3-small-style models report 2.40% lower validation log-loss and 6.81% lower perplexity versus the linear baseline, while also outperforming a model with 12.5% more non-embedding parameters.
Significance. If the improvement is shown to arise from the controlled nonlinearity rather than added capacity, the result would indicate that modest, identity-anchored nonlinearities in attention projections can be beneficial at small scale and motivate scaling studies. The identity residual is a constructive design choice that reduces optimization risk relative to an unconstrained nonlinear projection.
major comments (3)
- [Abstract] The algebraic argument that any basis change in W_Q can be absorbed into adjacent linear layers holds only for the linear case. Replacing W_Q by the nonlinear residual Q(X) = X + f_θ(X) introduces terms that cannot be absorbed by the same mechanism; the manuscript does not demonstrate that the learned f_θ remains small enough for the identity anchor to dominate or that the effective query stays approximately linear.
- [Experiments] The reported 2.40% log-loss gain is compared against a baseline and against a model with 12.5% more non-embedding parameters, but no control is described that adds an equivalent number of linear parameters (e.g., a wider linear projection or extra linear layer) while keeping the query strictly linear. Without this control it is impossible to separate the effect of nonlinearity from the effect of extra capacity.
- [Experiments] The abstract states concrete percentage improvements, yet no information is given on whether the baseline and the nonlinear model were trained with identical optimizer settings, learning-rate schedules, number of seeds, or data order. These details are required to assess whether the observed gap is statistically reliable.
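One way to picture the control requested in the second major comment is to keep the residual architecture and remove only the activation. This is a hypothetical design, not one described in the source: with an identity activation the two matrices collapse into a single linear map X(I + W1 W2), so the parameter count matches the nonlinear variant while the query stays strictly linear.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 5, 8, 4
X = rng.normal(size=(n, d))
W1 = rng.normal(size=(d, h)) / np.sqrt(d)
W2 = rng.normal(size=(h, d)) / np.sqrt(h)

def residual_query(X, W1, W2, activation):
    """Q(X) = X + activation(X W1) W2 (biases omitted for brevity)."""
    return X + activation(X @ W1) @ W2

# Hypothetical matched-parameter control: same shapes, identity activation.
linear_control = residual_query(X, W1, W2, activation=lambda z: z)
# Nonlinear variant; tanh is a stand-in for the unstated activation.
nonlinear = residual_query(X, W1, W2, activation=np.tanh)

# The control is equivalent to one d x d matrix whose update has rank <= h.
W_eff = np.eye(d) + W1 @ W2
collapsed = X @ W_eff
update_rank = np.linalg.matrix_rank(W1 @ W2)
```

Because the control collapses to a single matrix that adjacent layers can absorb, any gap between it and the nonlinear variant would be attributable to the activation rather than to the parameter count.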
minor comments (2)
- [Abstract] The precise bottleneck width, depth, and activation function of f_θ are not stated in the abstract; these hyperparameters should be reported explicitly so that the added parameter count can be verified.
- [Method] Notation: the symbol f_θ is introduced without an explicit equation for the MLP architecture (e.g., hidden dimension, residual connections inside the MLP). Adding this equation would improve reproducibility.
Simulated Authors' Rebuttal
We thank the referee for the constructive comments. We address each point below and will revise the manuscript to incorporate additional analysis and controls.
- Referee: [Abstract] The algebraic argument that any basis change in W_Q can be absorbed into adjacent linear layers holds only for the linear case. Replacing W_Q by the nonlinear residual Q(X) = X + f_θ(X) introduces terms that cannot be absorbed by the same mechanism; the manuscript does not demonstrate that the learned f_θ remains small enough for the identity anchor to dominate or that the effective query stays approximately linear.
  Authors: We agree that the strict algebraic absorption property applies only to linear projections. The identity residual is intended to keep the nonlinearity small and close to the linear case. In the revision we will add measurements of ||f_θ(X)|| relative to ||X|| across layers and training steps to quantify how closely the effective query remains linear.
  Revision: partial
- Referee: [Experiments] The reported 2.40% log-loss gain is compared against a baseline and against a model with 12.5% more non-embedding parameters, but no control is described that adds an equivalent number of linear parameters (e.g., a wider linear projection or extra linear layer) while keeping the query strictly linear. Without this control it is impossible to separate the effect of nonlinearity from the effect of extra capacity.
  Authors: The referee is correct that a matched-capacity linear control is needed to isolate the nonlinearity. We will add this experiment in the revision by training a model with an expanded linear W_Q whose parameter count matches the nonlinear residual.
  Revision: yes
- Referee: [Experiments] The abstract states concrete percentage improvements, yet no information is given on whether the baseline and the nonlinear model were trained with identical optimizer settings, learning-rate schedules, number of seeds, or data order. These details are required to assess whether the observed gap is statistically reliable.
  Authors: We will expand the experimental details section to report identical optimizer and learning-rate schedules, the use of three random seeds, and identical data ordering for all compared models.
  Revision: yes
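The norm measurement proposed in the first response could be computed per token along the following lines. The function name `relative_update_norm` and the stand-in for the MLP output are illustrative; in practice `f_out` would be the output of the trained f_θ at a given layer and training step.

```python
import numpy as np

def relative_update_norm(X, f_out, eps=1e-12):
    """Per-token ratio ||f_theta(x)|| / ||x|| for inputs of shape (tokens, d)."""
    num = np.linalg.norm(f_out, axis=-1)
    den = np.linalg.norm(X, axis=-1)
    return num / np.maximum(den, eps)   # eps guards against zero-norm tokens

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
f_out = 0.1 * X          # stand-in for f_theta(X), for illustration only
ratios = relative_update_norm(X, f_out)

print(ratios.mean(), ratios.max())
```

A small ratio indicates that the identity term dominates in magnitude. It does not by itself show that the effective query is close to linear, since a small f_θ can still be strongly nonlinear in X, so a complementary linearity test may be needed.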
Circularity Check
No circularity: empirical result independent of algebraic justification
The paper's core claim is an empirical performance gain (2.40% lower validation log-loss) from replacing linear W_Q with the nonlinear residual Q(X) = X + f_θ(X). The algebraic justification for the identity anchor is presented as background from 'recent algebraic analysis' rather than a derivation internal to this work that reduces to fitted parameters by construction. No equation in the provided text equates the reported improvement to the input capacity or to a self-citation chain; the experiments include a control with 12.5% more parameters, supplying an external benchmark. The result therefore does not collapse to a tautology or renamed fit.
Axiom & Free-Parameter Ledger
free parameters (1)
- MLP bottleneck width and depth
axioms (1)
- domain assumption: Any linear transformation applied to queries can be absorbed into subsequent layers without changing the overall function.