Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Self-Attention Transformers
Pith reviewed 2026-05-18 03:34 UTC · model grok-4.3
The pith
One of the three QKV weights in self-attention can be replaced by the identity matrix under mild assumptions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under mild assumptions, we prove that one of the Query, Key or Value weights is redundant and can be replaced with the identity matrix, reducing attention parameters by 25%. If applied to the Query or Key weights, this also simplifies optimization: attention logits depend on a single learned weight matrix rather than on a product of two. Validating the Query weight removal on decoder-only GPT-style small models trained from scratch, we find that reduced models match baseline performance despite fewer parameters, and outperform baselines when saved parameters are reallocated. Our analysis has also led us to a structural expressivity boundary: in the mathematically tractable ReLU setting, skip connections push MLPs into a generically disjoint function class at fixed width.
What carries the argument
The identity-matrix substitution for one member of the query-key-value triplet, which leaves the overall attention computation invariant under the paper's mild assumptions.
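A minimal numerical sketch of this invariance for an isolated attention map follows. It assumes single-head attention without a mask or output projection, row-vector inputs (Q = X W_Q), and a square, invertible W_Q. The key form W_Q^{-1} W_K and the change of input X W_Q follow the Reparametrization Lemma quoted later on this page; the matching value form W_Q^{-1} W_V is inferred by the same argument and is an assumption here, as are all function and variable names.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, W_q, W_k, W_v):
    """Single-head self-attention over the rows of X (no mask, no output projection)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
n, d = 5, 8  # hypothetical sequence length and width
X = rng.standard_normal((n, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

# Assumption: W_q is invertible (a random Gaussian matrix almost surely is).
W_q_inv = np.linalg.inv(W_q)
X_tilde = X @ W_q  # input expressed in the new basis

out_full = attention(X, W_q, W_k, W_v)
out_reduced = attention(X_tilde, np.eye(d), W_q_inv @ W_k, W_q_inv @ W_v)

# Largest elementwise difference; expected to be at floating-point noise level.
max_gap = np.abs(out_full - out_reduced).max()
print(max_gap)
```

The check covers only the isolated attention map. Whether the same equality carries over to a full block with residual connections and normalization is the question the referee report raises.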
If this is right
- Attention layers can be implemented using only two learned weight matrices instead of three while preserving the same output function.
- When the query or key matrix is replaced, attention logit computation depends on a single weight matrix rather than a product of two.
- Small decoder-only models with one weight removed match or exceed the performance of full models once the freed parameters are reassigned elsewhere.
- The reduction applies equally to encoder-only and decoder-only transformer architectures.
- In the ReLU case, skip connections place multilayer perceptrons into function classes that are generically disjoint at fixed width.
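The first two bullets can be made concrete with a short calculation. It is a sketch under stated assumptions: single-head attention, square d × d projections, row-vector inputs (Q = X W_Q), and an attention block counted as four d × d matrices (Query, Key, Value, and an output projection). The output projection is not mentioned on this page; it is assumed here because it is the count under which dropping one matrix gives 25%.

```latex
\begin{align*}
\text{full model:}\quad
  & QK^\top = (X W_Q)(X W_K)^\top = X \, W_Q W_K^\top X^\top
  && \text{(logits depend on the product } W_Q W_K^\top\text{)} \\
\text{reduced model } (W_Q := I_d):\quad
  & QK^\top = X \,(X W_K)^\top = X \, W_K^\top X^\top
  && \text{(logits depend on } W_K \text{ alone)} \\
\text{attention parameters:}\quad
  & \frac{4d^2 - 3d^2}{4d^2} = \frac{1}{4} = 25\%
\end{align*}
```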
Where Pith is reading between the lines
- At larger scales the same reduction could produce meaningful savings in both memory footprint and training time.
- The redundancy argument may extend to cross-attention or grouped-query attention variants.
- The identified expressivity boundary could inform choices of width or residual structure in new architectures.
- Many existing transformer implementations may carry more attention parameters than are strictly required for the operation.
Load-bearing premise
The proof relies on unspecified mild assumptions about the attention mechanism or model architecture that must hold for the redundancy to apply.
What would settle it
Train a small decoder-only model with the query projection replaced by the identity matrix and the saved parameters moved to the feed-forward layers, then compare its validation loss to an otherwise identical full-QKV baseline on the same dataset; a substantially higher loss for the reduced model would falsify the practical redundancy.
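The parameter bookkeeping for such a test can be sketched as follows. The block layout is an assumption, not taken from the paper: four square attention matrices in the baseline, three in the reduced model, and a two-matrix feed-forward layer, with biases and normalization parameters ignored. The sizes and function names are hypothetical.

```python
def block_params(d_model, d_ff, n_attn_matrices):
    """Weights in one block: square attention matrices plus a two-matrix FFN
    (biases and normalization parameters ignored)."""
    return n_attn_matrices * d_model * d_model + 2 * d_model * d_ff

def reallocated_ffn_width(d_model, d_ff):
    """FFN hidden width after moving one d_model x d_model matrix's worth of
    weights into the feed-forward layer."""
    saved = d_model * d_model
    # Each extra hidden unit costs 2 * d_model weights (one row in, one column out).
    return d_ff + saved // (2 * d_model)

d_model, d_ff = 768, 3072  # hypothetical sizes, not the paper's configuration
wide_ff = reallocated_ffn_width(d_model, d_ff)
baseline = block_params(d_model, d_ff, n_attn_matrices=4)
reduced = block_params(d_model, wide_ff, n_attn_matrices=3)
print(wide_ff, baseline == reduced)  # prints "3456 True"
```

Under these assumptions the feed-forward width grows by d_model / 2 per block, so the two models being compared have the same weight count.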
Original abstract
We theoretically investigate whether the Query, Key, Value weight triplet can be reduced in encoder-only and decoder-only transformers. Under mild assumptions, we prove that one of the Query, Key or Value weights is redundant and can be replaced with the identity matrix, reducing attention parameters by 25%. If applied to the Query or Key weights, this also simplifies optimization: attention logits depend on a single learned weight matrix rather than on a product of two. Validating the Query weight removal on decoder-only GPT-style small models trained from scratch, we find that reduced models match baseline performance despite fewer parameters, and outperform baselines when saved parameters are reallocated. Our analysis has also led us to a structural expressivity boundary: in the mathematically tractable ReLU setting, skip connections push MLPs into a generically disjoint function class at fixed width. These findings motivate investigation across modalities and at scale, where the observed stability and efficiency gains may prove most consequential.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that, under mild assumptions, one of the Query, Key, or Value weight matrices in self-attention is redundant and can be replaced by the identity matrix, reducing attention parameters by 25%. If applied to Query or Key, this also simplifies optimization, since attention logits then depend on a single matrix rather than a product of two. The claim is supported by a theoretical proof and by empirical validation on small decoder-only GPT-style models trained from scratch, where Query-removed models match baseline performance and outperform the baseline when the saved parameters are reallocated. The analysis additionally identifies a structural expressivity boundary: in the ReLU setting, skip connections place MLPs in generically disjoint function classes at fixed width.
Significance. If the reduction holds for complete transformer blocks, the result would provide a parameter-efficient simplification of attention with potential optimization benefits. The small-model experiments offer direct evidence that the reduced models can match or exceed baselines, and the expressivity boundary for residual ReLU networks is a useful structural observation. These elements, combined with the parameter-free nature of the proposed identity replacement, strengthen the case for further investigation at scale across modalities.
major comments (2)
- §3 (theoretical proof): The claim that one of Q/K/V can be replaced by the identity while preserving attention output exactly proceeds via absorption into a neighboring projection. However, standard encoder/decoder blocks include residual connections and pre-/post-layer-norm; such absorption alters which matrix multiplies the residual stream and may change the function class realized by the full block even if the isolated attention map is identical. The manuscript's own ReLU + skip-connection expressivity result indicates awareness of these structural effects, yet it is unclear whether the proof accounts for them or assumes an isolated attention layer.
- Assumptions paragraph and Theorem statement: The proof is conditioned on 'mild assumptions' that are not enumerated. Without an explicit list of these assumptions, any conditions for exact equivalence, or an error analysis when the assumptions are mildly violated, it is impossible to determine the scope or robustness of the 25% reduction claim.
minor comments (2)
- Abstract and §5 (experiments): Model sizes, training hyperparameters, and exact baseline comparisons are only summarized; adding a table with parameter counts, FLOPs, and validation metrics for the reduced vs. baseline models would improve clarity.
- Notation: Ensure W_q, W_k, W_v and the reparameterized forms (e.g., W_k') are defined consistently in the proof and empirical sections to avoid ambiguity in the absorption argument.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important points about the scope of our theoretical results and the need for greater clarity. We address each major comment below and will incorporate revisions to improve the manuscript.
- Referee: §3 (theoretical proof): The claim that one of Q/K/V can be replaced by the identity while preserving attention output exactly proceeds via absorption into a neighboring projection. However, standard encoder/decoder blocks include residual connections and pre-/post-layer-norm; such absorption alters which matrix multiplies the residual stream and may change the function class realized by the full block even if the isolated attention map is identical. The manuscript's own ReLU + skip-connection expressivity result indicates awareness of these structural effects, yet it is unclear whether the proof accounts for them or assumes an isolated attention layer.
  Authors: Our theoretical proof in §3 establishes that the output of the self-attention computation can be preserved exactly under the given conditions by replacing one of the Q, K, or V matrices with the identity and absorbing the change into a neighboring linear projection. We acknowledge that full encoder/decoder blocks include residual connections and layer normalization, which means the absorption affects how the modified projection interacts with the residual stream. The proof itself targets equivalence at the level of the attention sublayer output rather than claiming identical function classes for the entire block. Our empirical results on complete decoder-only GPT-style models (trained from scratch) show that the reduced models match or exceed baseline performance, indicating that any differences in the realized function class do not harm practical performance. The separate ReLU + skip-connection expressivity result concerns MLP behavior under residuals and is not intended to apply directly to the attention reduction. We will revise §3 to explicitly discuss the interaction with residuals and norms and to clarify the precise scope of the equivalence claim.
  Revision: yes
- Referee: Assumptions paragraph and Theorem statement: The proof is conditioned on 'mild assumptions' that are not enumerated. Without an explicit list of these assumptions, any conditions for exact equivalence, or an error analysis when the assumptions are mildly violated, it is impossible to determine the scope or robustness of the 25% reduction claim.
  Authors: We agree that the assumptions must be stated explicitly to allow readers to assess the scope and robustness of the result. The current manuscript refers to 'mild assumptions' without a dedicated enumeration or discussion of boundary cases. In the revised version we will insert a new paragraph immediately preceding the theorem statement that lists all assumptions (including any requirements on matrix dimensions, linearity of projections, or other conditions needed for exact equivalence). We will also add a short robustness discussion, noting that our small-model experiments demonstrate stable performance under the practical conditions encountered during training, which provides indirect evidence that mild violations do not materially degrade the 25% reduction benefit.
  Revision: yes
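The first exchange above, on residual connections, can be illustrated numerically. The sketch below is not the paper's construction: it deliberately adds a skip connection around the reparameterized sublayer without any further adjustment, whereas the Theorem 4.1 passage quoted further down this page mentions propagating basis transformations through residual pathways. It assumes single-head attention, no normalization, row-vector inputs, and an invertible W_Q; all names are hypothetical.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, W_q, W_k, W_v):
    """Single-head self-attention over the rows of X (no mask, no output projection)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(1)
n, d = 4, 6  # hypothetical sequence length and width
X = rng.standard_normal((n, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
W_q_inv = np.linalg.inv(W_q)  # assumption: W_q is invertible
X_tilde = X @ W_q

# Sublayer outputs: these agree up to floating-point error.
sub_full = attention(X, W_q, W_k, W_v)
sub_reduced = attention(X_tilde, np.eye(d), W_q_inv @ W_k, W_q_inv @ W_v)

# Naive blocks with a skip connection: the skip term is X in one case
# and X @ W_q in the other, so the block outputs differ.
block_full = X + sub_full
block_reduced = X_tilde + sub_reduced

sublayer_gap = np.abs(sub_full - sub_reduced).max()
block_gap = np.abs(block_full - block_reduced).max()
skip_gap = np.abs(X - X_tilde).max()
print(sublayer_gap, block_gap, skip_gap)
```

The block-level gap equals the gap between the two skip terms up to rounding, which is the sense in which equivalence of the isolated attention map does not by itself settle equivalence of the full block.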
Circularity Check
No significant circularity; the theoretical proof is a self-contained reparameterization.
The paper's central result is a mathematical proof under mild assumptions that one of the Query/Key/Value matrices is redundant and replaceable by the identity, reducing parameters by 25%. This is a direct reparameterization argument on the attention computation (Q = X W_q, etc.) rather than a fit to data or a self-referential construction. No equations or steps reduce the claimed redundancy to a fitted parameter or prior self-citation by construction. The separate empirical validation on small GPT-style models is presented as confirmation after the proof, not as input to it. The derivation chain is independent of the target result and does not invoke load-bearing self-citations or uniqueness theorems from the authors' prior work.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: mild assumptions on the transformer architecture or attention computation under which one weight matrix becomes redundant.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Lemma 3.1 (Reparametrization Lemma): ... $f(X, W_Q, W_K, W_V) = f(X\Theta, I_d, \widetilde{W}_K, \widetilde{W}_V)$ ... $\Theta = W_Q$, $\widetilde{W}_K = W_Q^{-1} W_K$
- IndisputableMonolith/Foundation/BranchSelection.lean, branch_selection (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Theorem 4.1 (Attention-Skip-Only Query Weight Elimination): ... set $W_Q := I_d$ ... propagate basis transformations through residual pathways
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- Can an MLP Absorb Its Own Skip Connection?
  Skip-connected MLPs and residual-free MLPs of equal width represent generically disjoint function classes for common activations, with explicit impossibility proofs and a non-generic absorption condition for ReLU and GELU.
- Perceptrons and localization of attention's mean-field landscape
  In the mean-field limit of attention with perceptron blocks, critical points of the energy landscape are generically atomic and localized on subsets of the unit sphere.
- Beyond Linearity in Attention Projections: The Case for Nonlinear Queries
  Nonlinear query projections of the form X + MLP(X) improve transformer performance on small models with only d² + O(d) added parameters.