Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Self-Attention Transformers

Marko Karbevski, Antonij Mijoski

classification: cs.LG, cs.AI

keywords: query, value, weight, parameters, weights, attention, decoder-only, matrix

We theoretically investigate whether the Query, Key, Value weight triplet can be reduced in encoder-only and decoder-only transformers. Under mild assumptions, we prove that one of the Query, Key, or Value weights is redundant and can be replaced with the identity matrix, reducing attention parameters by 25%. If applied to the Query or Key weights, this also simplifies optimization: attention logits depend on a single learned weight matrix rather than on a product of two. Validating the Query weight removal on small decoder-only GPT-style models trained from scratch, we find that reduced models match baseline performance despite having fewer parameters, and outperform baselines when the saved parameters are reallocated. Our analysis has also led us to a structural expressivity boundary: in the mathematically tractable ReLU setting, skip connections push MLPs into a generically disjoint function class at fixed width. These findings motivate investigation across modalities and at scale, where the observed stability and efficiency gains may prove most consequential.
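The claim that attention logits can depend on a single learned matrix rests on a simple algebraic identity, which the following minimal NumPy sketch illustrates for one attention head. It is not the paper's proof or code: the function names, the single-head setup, and the assumption of square, bias-free `d_model × d_model` weights (including an output projection, for the parameter count) are all introduced here for illustration only.

```python
import numpy as np


def attention_logits(x, w_q, w_k):
    """Scaled dot-product logits for one head: (x W_Q)(x W_K)^T / sqrt(d)."""
    d = w_q.shape[1]
    return (x @ w_q) @ (x @ w_k).T / np.sqrt(d)


def merge_query_into_key(w_q, w_k):
    """Hypothetical helper: W_K' = W_K W_Q^T, so that an identity
    query weight paired with W_K' reproduces the original logits."""
    return w_k @ w_q.T


rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8  # illustrative sizes, not taken from the paper

x = rng.standard_normal((n_tokens, d_model))
w_q = rng.standard_normal((d_model, d_model))
w_k = rng.standard_normal((d_model, d_model))

# Baseline: logits depend on the product W_Q W_K^T of two learned matrices.
baseline = attention_logits(x, w_q, w_k)

# Reduced: the query weight is the identity, and a single matrix remains.
reduced = attention_logits(x, np.eye(d_model), merge_query_into_key(w_q, w_k))

logits_match = np.allclose(baseline, reduced)

# Parameter count, assuming four square bias-free matrices (Q, K, V, output).
full_params = 4 * d_model * d_model
reduced_params = 3 * d_model * d_model
saving = 1 - reduced_params / full_params

print(logits_match, saving)
```

The sketch only covers the logits of a single layer in isolation. Whether the same substitution is valid inside a full transformer, where the same input also feeds the value path and the residual stream, is what the paper's assumptions and proof address.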

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Can an MLP Absorb Its Own Skip Connection?
cs.LG 2026-04 accept novelty 7.0

Skip-connected MLPs and residual-free MLPs of equal width represent generically disjoint function classes for common activations, with explicit impossibility proofs and a non-generic absorption condition for ReLU and GELU.