Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Self-Attention Transformers

Antonij Mijoski; Marko Karbevski

arxiv: 2510.23912 · v7 · submitted 2025-10-27 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-18 03:34 UTC · model grok-4.3

keywords self-attention · transformers · QKV weights · parameter reduction · identity matrix · expressivity boundary · ReLU networks · skip connections

The pith

One of the three QKV weights in self-attention can be replaced by the identity matrix under mild assumptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to determine whether all three learned weight matrices in the standard self-attention mechanism are necessary. It establishes that, under mild assumptions, any one of the query, key, or value matrices can be replaced by the identity matrix while leaving the attention output unchanged, immediately lowering the parameter count in attention layers by 25 percent. When the replacement targets the query or key matrix, the attention scores simplify because they now depend on a single learned matrix rather than the product of two. Experiments on small decoder-only language models confirm that removing the query weights produces models whose performance matches the full baseline, and exceeds it once the saved parameters are reassigned to other layers. The analysis additionally identifies a structural limit on the functions representable by ReLU multilayer perceptrons once skip connections are introduced at fixed width.

Core claim

Under mild assumptions, we prove that one of the Query, Key or Value weights is redundant and can be replaced with the identity matrix, reducing attention parameters by 25%. If applied to the Query or Key weights, this also simplifies optimization: attention logits depend on a single learned weight matrix rather than on a product of two. Validating the Query weight removal on decoder-only GPT-style small models trained from scratch, we find that reduced models match baseline performance despite fewer parameters, and outperform baselines when saved parameters are reallocated. Our analysis has also led us to a structural expressivity boundary: in the mathematically tractable ReLU setting, skip connections push MLPs into a generically disjoint function class at fixed width.

What carries the argument

The identity-matrix substitution for one member of the query-key-value triplet, which leaves the overall attention computation invariant under the paper's mild assumptions.
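A minimal numerical sketch of this substitution follows. It assumes a single attention head with square, invertible projections and no biases, masks, or positional terms; these are simplifications chosen for illustration, since the paper's own assumptions are not enumerated on this page. The helper names `softmax` and `attention` are hypothetical. The change of variables uses Θ = W_Q and rescales the key and value weights by W_Q^{-1}, following the form of the paper's Lemma 3.1.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax, shifted for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, w_q, w_k, w_v):
    """Single-head self-attention without biases, masks, or positional terms."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    logits = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(logits) @ v

rng = np.random.default_rng(0)
n, d = 5, 8  # sequence length and width, illustrative values
x = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))

# Change of basis: feed X @ W_Q to the layer and absorb W_Q^{-1}
# into the key and value weights (assumes W_Q is invertible).
theta = w_q
w_k_tilde = np.linalg.solve(w_q, w_k)  # W_Q^{-1} W_K
w_v_tilde = np.linalg.solve(w_q, w_v)  # W_Q^{-1} W_V

full = attention(x, w_q, w_k, w_v)
reduced = attention(x @ theta, np.eye(d), w_k_tilde, w_v_tilde)
max_abs_diff = float(np.max(np.abs(full - reduced)))
```

In this toy setting `max_abs_diff` should sit at floating-point noise, because the reduced layer recomputes exactly the same queries, keys, and values. The sketch says nothing about what happens once residual connections or normalization surround the layer.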

If this is right

Attention layers can be implemented using only two learned weight matrices instead of three while preserving the same output function.
When the query or key matrix is replaced, attention logit computation depends on a single weight matrix rather than a product of two.
Small decoder-only models with the query weight removed match the full baseline despite having fewer parameters, and exceed it once the freed parameters are reassigned elsewhere.
The reduction applies equally to encoder-only and decoder-only transformer architectures.
In the ReLU case, skip connections place multilayer perceptrons into function classes that are generically disjoint at fixed width.
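The 25 percent figure can be checked with a short count. The sketch below assumes each attention layer holds four square `d_model × d_model` matrices (query, key, value, and an output projection) and no biases. That counting convention is an inference from the stated percentage, not something this page confirms, and `attention_params` is a hypothetical helper.

```python
def attention_params(d_model, n_identity=0):
    """Count learned weights across Q, K, V, and output projections.

    Assumes four square d_model x d_model matrices and no biases;
    n_identity of them are fixed to the identity and hold no parameters.
    """
    n_learned = 4 - n_identity
    return n_learned * d_model * d_model

d_model = 768  # illustrative width, not taken from the paper
full = attention_params(d_model)
reduced = attention_params(d_model, n_identity=1)
saving = 1 - reduced / full  # 0.25 under the four-matrix convention
```

If only the three QKV matrices were counted, the same substitution would read as a one-third reduction, so the convention matters when comparing numbers.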

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

At larger scales the same reduction could produce meaningful savings in both memory footprint and training time.
The redundancy argument may extend to cross-attention or grouped-query attention variants.
The identified expressivity boundary could inform choices of width or residual structure in new architectures.
Many existing transformer implementations may carry more attention parameters than are strictly required for the operation.

Load-bearing premise

The proof relies on unspecified mild assumptions about the attention mechanism or model architecture that must hold for the redundancy to apply.

What would settle it

Train a small decoder-only model with the query projection replaced by the identity matrix and the saved parameters moved to the feed-forward layers, then compare its validation loss to an otherwise identical full-QKV baseline on the same dataset; a substantially higher loss for the reduced model would falsify the practical redundancy.
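To make the parameter-matched comparison concrete, the following sketch computes how far the feed-forward hidden width would grow if the `d_model²` weights saved per layer were moved there. It assumes a two-matrix feed-forward block without biases and four square attention matrices per layer. The function names and sizes are hypothetical, and the paper's actual reallocation scheme may differ.

```python
def ffn_params(d_model, d_ff):
    # Up-projection (d_model x d_ff) plus down-projection (d_ff x d_model), no biases.
    return 2 * d_model * d_ff

def widened_d_ff(d_model, d_ff):
    """Hidden width after moving one d_model x d_model matrix into the FFN."""
    saved = d_model * d_model
    extra_units = saved // (2 * d_model)  # each hidden unit costs 2 * d_model weights
    return d_ff + extra_units

d_model, d_ff = 512, 2048  # illustrative sizes, not taken from the paper
new_d_ff = widened_d_ff(d_model, d_ff)  # 2048 + 256 = 2304

baseline_total = 4 * d_model**2 + ffn_params(d_model, d_ff)
reduced_total = 3 * d_model**2 + ffn_params(d_model, new_d_ff)
```

With these sizes the two per-layer totals coincide, so any difference in validation loss would not be attributable to parameter count alone.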

Figures

Figures reproduced from arXiv: 2510.23912 by Antonij Mijoski, Marko Karbevski.

**Figure 2.** Mean per-sample cosine similarity between predicted and target outputs. The trained MLP …

**Figure 3.** Training and validation loss for tied Embedding/LMHead weights configuration. The reduced model (No WQ, red) closely tracks the standard baseline (blue) throughout training, achieving comparable final performance with fewer parameters.

**Figure 4.** Training and validation loss for untied weights configuration. Both models converge smoothly, with the reduced variant (No WQ, blue) achieving slightly better final validation loss than the standard model (red).

Our main findings are the following: 1. Query weights are redundant. Models trained with WQ = Id achieve validation loss competitive with or better than standard baselines (…

Original abstract

We theoretically investigate whether the Query, Key, Value weight triplet can be reduced in encoder-only and decoder-only transformers. Under mild assumptions, we prove that one of the Query, Key or Value weights are redundant and can be replaced with the identity matrix, reducing attention parameters by 25%. If applied to the Query or Key weights, this also simplifies optimization: attention logits depend on a single learned weight matrix rather than on a product of two. Validating the Query weight removal on decoder-only GPT-style small models trained from scratch, we find that reduced models match baseline performance despite fewer parameters, and outperform baselines when saved parameters are reallocated. Our analysis has also led us to a structural expressivity boundary: in the mathematically tractable ReLU setting, skip connections push MLPs into a generically disjoint function class at fixed width. These findings motivate investigation across modalities and at scale, where the observed stability and efficiency gains may prove most consequential.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that under mild assumptions one of the QKV matrices can be replaced by identity to cut attention parameters by 25 percent, with small-model experiments that hold up and some optimization simplification for query or key removal.

The main thing to know is that this work claims a theoretical redundancy result: under mild assumptions, any one of the query, key, or value weight matrices in self-attention can be swapped for the identity without changing the attention output, which trims 25 percent of those parameters. When query or key is the one removed, the logits depend on a single matrix instead of a product, which should simplify optimization. They validate the query removal on small decoder-only GPT-style models trained from scratch and report that the reduced models match baseline performance, and do better when the saved parameters get reallocated elsewhere. They also note a side result on ReLU plus skip connections producing disjoint function classes at fixed width.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that under mild assumptions, one of the Query, Key, or Value weight matrices in self-attention is redundant and can be replaced by the identity matrix, reducing attention parameters by 25%. If applied to Query or Key, this also simplifies optimization since attention logits depend on a single matrix rather than a product. The claim is supported by a theoretical proof and empirical validation on small decoder-only GPT-style models trained from scratch, where Query-removed models match baseline performance and outperform when saved parameters are reallocated. The analysis additionally identifies a structural expressivity boundary: in the ReLU setting, skip connections place MLPs in generically disjoint function classes at fixed width.

Significance. If the reduction holds for complete transformer blocks, the result would provide a parameter-efficient simplification of attention with potential optimization benefits. The small-model experiments offer direct evidence that the reduced models can match or exceed baselines, and the expressivity boundary for residual ReLU networks is a useful structural observation. These elements, combined with the parameter-free nature of the proposed identity replacement, strengthen the case for further investigation at scale across modalities.

major comments (2)

§3 (theoretical proof): The claim that one of Q/K/V can be replaced by the identity while preserving attention output exactly proceeds via absorption into a neighboring projection. However, standard encoder/decoder blocks include residual connections and pre-/post-layer-norm; such absorption alters which matrix multiplies the residual stream and may change the function class realized by the full block even if the isolated attention map is identical. The manuscript's own ReLU + skip-connection expressivity result indicates awareness of these structural effects, yet it is unclear whether the proof accounts for them or assumes an isolated attention layer.
Assumptions paragraph and Theorem statement: The proof is conditioned on 'mild assumptions' that are not enumerated. Without an explicit list of these assumptions, any conditions for exact equivalence, or an error analysis when the assumptions are mildly violated, it is impossible to determine the scope or robustness of the 25% reduction claim.
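The residual-stream concern in the first major comment can be illustrated with a toy block `x + Attention(x) W_O`. The sketch assumes a single head, square invertible weights, and no layer norm, biases, or masks. `attn_block` is a hypothetical helper, and this is an illustration of the issue rather than the paper's proof. It compares a naive substitution that leaves `W_O` untouched with one that also propagates the basis change into `W_O`.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax, shifted for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attn_block(x, w_q, w_k, w_v, w_o):
    """x + Attention(x) @ W_O: single head, no layer norm, biases, or mask."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    a = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    return x + a @ w_o

rng = np.random.default_rng(1)
n, d = 4, 6  # illustrative sizes
x = rng.standard_normal((n, d))
w_q, w_k, w_v, w_o = (rng.standard_normal((d, d)) for _ in range(4))

y = attn_block(x, w_q, w_k, w_v, w_o)  # original block output

# Reduced block: the stream is expressed in the basis x @ W_Q and W_Q := I.
x_new = x @ w_q
w_k_t = np.linalg.solve(w_q, w_k)  # W_Q^{-1} W_K
w_v_t = np.linalg.solve(w_q, w_v)  # W_Q^{-1} W_V

# Naive variant: W_O unchanged, so the skip path adds x @ W_Q instead of x.
y_naive = attn_block(x_new, np.eye(d), w_k_t, w_v_t, w_o)
# Propagated variant: W_O is also mapped into the new basis (W_O @ W_Q).
y_prop = attn_block(x_new, np.eye(d), w_k_t, w_v_t, w_o @ w_q)

naive_matches = bool(np.allclose(y_naive, y) or np.allclose(y_naive, y @ w_q))
prop_matches = bool(np.allclose(y_prop, y @ w_q))
```

Only the propagated variant reproduces the original block output, and it does so in the transformed basis `y @ W_Q`. An elementwise nonlinearity or a layer norm on the stream would not commute with a general `W_Q`, which is where the referee's question about scope becomes relevant.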

minor comments (2)

Abstract and §5 (experiments): Model sizes, training hyperparameters, and exact baseline comparisons are only summarized; adding a table with parameter counts, FLOPs, and validation metrics for the reduced vs. baseline models would improve clarity.
Notation: Ensure W_q, W_k, W_v and the reparameterized forms (e.g., W_k') are defined consistently in the proof and empirical sections to avoid ambiguity in the absorption argument.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important points about the scope of our theoretical results and the need for greater clarity. We address each major comment below and will incorporate revisions to improve the manuscript.

Referee: §3 (theoretical proof): The claim that one of Q/K/V can be replaced by the identity while preserving attention output exactly proceeds via absorption into a neighboring projection. However, standard encoder/decoder blocks include residual connections and pre-/post-layer-norm; such absorption alters which matrix multiplies the residual stream and may change the function class realized by the full block even if the isolated attention map is identical. The manuscript's own ReLU + skip-connection expressivity result indicates awareness of these structural effects, yet it is unclear whether the proof accounts for them or assumes an isolated attention layer.

Authors: Our theoretical proof in §3 establishes that the output of the self-attention computation can be preserved exactly under the given conditions by replacing one of the Q, K, or V matrices with the identity and absorbing the change into a neighboring linear projection. We acknowledge that full encoder/decoder blocks include residual connections and layer normalization, which means the absorption affects how the modified projection interacts with the residual stream. The proof itself targets equivalence at the level of the attention sublayer output rather than claiming identical function classes for the entire block. Our empirical results on complete decoder-only GPT-style models (trained from scratch) show that the reduced models match or exceed baseline performance, indicating that any differences in the realized function class do not harm practical performance. The separate ReLU + skip-connection expressivity result concerns MLP behavior under residuals and is not intended to apply directly to the attention reduction. We will revise §3 to explicitly discuss the interaction with residuals and norms and to clarify the precise scope of the equivalence claim.
Revision: yes.

Referee: Assumptions paragraph and Theorem statement: The proof is conditioned on 'mild assumptions' that are not enumerated. Without an explicit list of these assumptions, any conditions for exact equivalence, or an error analysis when the assumptions are mildly violated, it is impossible to determine the scope or robustness of the 25% reduction claim.

Authors: We agree that the assumptions must be stated explicitly to allow readers to assess the scope and robustness of the result. The current manuscript refers to 'mild assumptions' without a dedicated enumeration or discussion of boundary cases. In the revised version we will insert a new paragraph immediately preceding the theorem statement that lists all assumptions (including any requirements on matrix dimensions, linearity of projections, or other conditions needed for exact equivalence). We will also add a short robustness discussion, noting that our small-model experiments demonstrate stable performance under the practical conditions encountered during training, which provides indirect evidence that mild violations do not materially degrade the 25% reduction benefit.
Revision: yes.

Circularity Check

0 steps flagged

No significant circularity; theoretical proof is self-contained reparameterization

full rationale

The paper's central result is a mathematical proof under mild assumptions that one of the Query/Key/Value matrices is redundant and replaceable by the identity, reducing parameters by 25%. This is a direct reparameterization argument on the attention computation (Q = X W_q, etc.) rather than a fit to data or a self-referential construction. No equations or steps reduce the claimed redundancy to a fitted parameter or prior self-citation by construction. The separate empirical validation on small GPT-style models is presented as confirmation after the proof, not as input to it. The derivation chain is independent of the target result and does not invoke load-bearing self-citations or uniqueness theorems from the authors' prior work.
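The reparameterization described in this rationale can be written out for the attention logits. The derivation below is a sketch under stated assumptions: a single head, a square invertible W_Q, no biases, and the softmax scaling factor omitted. The symbols X' and \widetilde{W}_K follow the form of the paper's Lemma 3.1, with Θ = W_Q.

```latex
\begin{aligned}
QK^\top &= (X W_Q)(X W_K)^\top = X\, W_Q W_K^\top X^\top, \\
X' &:= X W_Q, \qquad \widetilde{W}_K := W_Q^{-1} W_K, \\
X' \,(X' \widetilde{W}_K)^\top
  &= X W_Q \,\bigl(X W_Q W_Q^{-1} W_K\bigr)^\top
   = X\, W_Q W_K^\top X^\top .
\end{aligned}
```

In the new variables the logits are a function of the single learned matrix \widetilde{W}_K, which is the sense in which the product of two weight matrices disappears.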

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the existence of mild assumptions that make one weight matrix redundant; these assumptions are invoked but not enumerated in the abstract, and no free parameters or new entities are introduced.

axioms (1)

Domain assumption: mild assumptions on the transformer architecture or attention computation under which one weight matrix becomes redundant.
The proof of redundancy is stated to hold only under these assumptions; their content determines whether the 25 percent reduction applies.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.
Lemma 3.1 (Reparametrization Lemma): ... $f(X, W_Q, W_K, W_V) = f(X\Theta, I_d, \widetilde{W}_K, \widetilde{W}_V)$ ... $\Theta = W_Q$, $\widetilde{W}_K = W_Q^{-1} W_K$

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Relation between the paper passage and the cited Recognition theorem.
Theorem 4.1 (Attention-Skip-Only Query Weight Elimination): ... set $W_Q := I_d$ ... propagate basis transformations through residual pathways

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Can an MLP Absorb Its Own Skip Connection?
cs.LG · 2026-04 · accept · novelty 7.0
Skip-connected MLPs and residual-free MLPs of equal width represent generically disjoint function classes for common activations, with explicit impossibility proofs and a non-generic absorption condition for ReLU and GELU.

Perceptrons and localization of attention's mean-field landscape
cs.LG · 2026-01 · unverdicted · novelty 7.0
In the mean-field limit of attention with perceptron blocks, critical points of the energy landscape are generically atomic and localized on subsets of the unit sphere.

Beyond Linearity in Attention Projections: The Case for Nonlinear Queries
cs.LG · 2026-03 · conditional · novelty 6.0
Nonlinear query projections of the form X + MLP(X) improve transformer performance on small models with only d² + O(d) added parameters.

