Do Transformers Need Three Projections? Systematic Study of QKV Variants

Ali Kayyam; Anusha Madan Gopal; M Anthony Lewis

arxiv: 2606.04032 · v2 · pith:XSMHMOPTnew · submitted 2026-06-01 · 💻 cs.LG · cs.AI· cs.CL· cs.PF

Do Transformers Need Three Projections? Systematic Study of QKV Variants

Ali Kayyam , Anusha Madan Gopal , M Anthony Lewis This is my paper

Pith reviewed 2026-06-28 15:12 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CLcs.PF

keywords QKV projectionsattention mechanismKV cacheweight sharingtransformer efficiencylanguage modelinginference optimizationgrouped query attention

0 comments

The pith

Sharing key and value projections halves KV cache size with 3.1 percent perplexity degradation in language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests three ways to tie together the query, key, and value projections inside transformer attention and measures the effect on accuracy and memory. It finds that forcing the key and value projections to be identical (Q-K=V) keeps performance close to the standard three-projection baseline on synthetic tasks, image classification, and language modeling up to 1.2 billion parameters. The same change cuts the size of the KV cache in half during inference. When the same sharing is stacked with grouped-query or multi-query attention, total cache savings reach 87.5 percent or 96.9 percent. The authors attribute the result to the low-rank nature of attention, which allows keys and values to live in similar spaces without breaking the attention map.

Core claim

Q-K=V projection sharing achieves 50 percent KV cache reduction with only 3.1 percent perplexity degradation in language modeling, while Q=K-V and Q=K=V produce symmetric attention maps that degrade performance unless corrected by asymmetric positional encodings; the sharing is complementary to head-sharing methods such as GQA and MQA.

What carries the argument

The Q-K=V constraint that forces the key and value projection matrices to be identical.

If this is right

Q-K=V combined with GQA-4 yields 87.5 percent KV cache reduction.
Q-K=V combined with MQA yields 96.9 percent KV cache reduction.
Projection sharing works on par with or better than standard QKV on vision and synthetic tasks.
The technique is complementary to existing head-sharing methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sharing could be tested during training to reduce optimizer state memory.
If the low-rank assumption holds, the method may extend to decoder-only models in other modalities.
Alternative positional encodings might rescue the symmetric variants on additional tasks.

Load-bearing premise

Keys and values can occupy similar representational spaces in the low-rank attention regime across the tested scales and domains.

What would settle it

Significant perplexity rise or accuracy drop when Q-K=V is applied to models larger than 1.2B parameters or to non-language domains such as audio or code generation.

Figures

Figures reproduced from arXiv: 2606.04032 by Ali Kayyam, Anusha Madan Gopal, M Anthony Lewis.

**Figure 1.** Figure 1: Our proposed Projection-Shared Attention Variants. Attention mechanism with 2D positional encoding is denoted as (X)+. A = Softmax(αKKT )V. (2) This formulation produces a symmetric attention matrix KKT . Symmetric attention has been explored in prior work on graph neural nets (Velickovi ˇ c et al. ´ , 2018) and relational reasoning (Santoro et al., 2017), where the lack of directional bias can be benef… view at source ↗

**Figure 2.** Figure 2: Training loss and validation accuracy of attention variants for image classification on the TinyImageNet dataset. utilize the Adam optimizer with the MultiStepLR scheduler for optimization. In the case of 2D positional encoding, we set pos dim to 50. As indicated in [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: shows the loss over time for the synthetics tasks [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗

**Figure 4.** Figure 4: Attention maps over synthetic tasks. Rows from top to bottom: Reverse, Sort, Swap, Sub, and Copy. Columns from left to right: QKV, Q=K-V, and (Q=K-V)+. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

**Figure 5.** Figure 5: Top) Code to compute and normalize the self attention map. Bottom) un-normalized and normalized (right) attention maps. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗

**Figure 6.** Figure 6: Two sets of samples from the anomaly detection dataset, with the first image in each set representing the anomaly. A.3.2. SET ANOMALY DETECTION We aim to apply transformers to sets (i.e. unordered inputs). A model is trained to find the odd one out in a set of ten images, using CIFAR-100. Nine images are from one class, and one is different. Two sample sets are shown in [PITH_FULL_IMAGE:figures/full_fig_p… view at source ↗

**Figure 7.** Figure 7: Left: The standard SETR architecture (Zheng et al., 2021). Right: The SETR-PUP decoder. It is modified to also reduce feature dimensions during upsampling. refer to these models as SETR-QKV-CE and SETR-KV-CE, respectively. Finally, they developed an additional hybrid model using a Convolutional Vision Transformer (CvT) (Wu et al., 2021) as the encoder. The models SETR-QKV-CVT and SETR-KV-CVT utilize a CvT-… view at source ↗

**Figure 8.** Figure 8: Projection sharing variants on 300M parameter LLMs trained on 10B tokens. Left: Validation perplexity (lower is better). Right: KV cache reduction (higher is better). Q-K=V achieves 50% cache reduction with only 3.1% perplexity degradation. KV (Q=K-V) provides no cache benefit despite 4.8% degradation due to still requiring separate K and V caches. K (Q=K=V) causes catastrophic 25.4% degradation, making it… view at source ↗

**Figure 9.** Figure 9: Head sharing and combined approaches on 300M parameter LLMs. Left: Validation perplexity. Right: KV cache reduction. Orange bars: head sharing only (GQA-4, MQA). Green bars: combined projection + head sharing (Q-GQA-4, Q-MQA). Combined approaches achieve up to 96.9% cache reduction while maintaining less than 5% perplexity degradation, demonstrating that projection sharing and head sharing are complementar… view at source ↗

**Figure 10.** Figure 10: Efficiency-quality Pareto frontier for attention variants. Projection sharing (blue circles) and head sharing (orange triangles) occupy complementary regions. Combined approaches (green diamonds) achieve the highest cache reductions. The shaded region indicates practical deployment zone (<5% perplexity degradation). Q-K=V fills the gap between QKV baseline and head-sharing methods, providing 50% cache red… view at source ↗

**Figure 11.** Figure 11: Validation curves for 300M parameter models. Left: Validation loss. Right: Validation perplexity over 10B training tokens. Q-K=V (dark teal) matches baseline QKV (olive) closely on held-out data, achieving 50% cache reduction with only 3.1% perplexity degradation. Q=K-V (light pink) shows higher validation loss, confirming suboptimal generalization. All head-sharing and combined variants converge to pract… view at source ↗

**Figure 12.** Figure 12: Validation curves for 1.2B parameter models. Left: Validation loss. Right: Validation perplexity over 10B training tokens. Rankings on held-out data remain consistent with 300M scale. Q-K=V (green) and head-sharing variants track baseline QKV (gray/brown) closely, while combined approaches (Q-GQA-8, Q-MQA) maintain < 5% degradation with 88-98.5% cache reduction, confirming scalability of our findings [PI… view at source ↗

read the original abstract

Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections and the impact of omitting some remain poorly understood. We systematically evaluate three projection sharing constraints: a) Q-K=V (shared key-value), b) Q=K-V (shared query-key), and c) Q=K=V (single projection). The last two variants produce symmetric attention maps; to address this, we also explore asymmetric attention via 2D positional encodings. Through experiments spanning synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly), and language modeling (300M and 1.2B parameter models on 10B tokens), we discovered that our transformers perform on par or occasionally better than the QKV transformer. In language modeling, Q-K=V projection sharing achieves 50% KV cache reduction with only 3.1% perplexity degradation. Crucially, projection sharing is complementary to head sharing (GQA/MQA): combining Q-K=V with GQA-4 yields 87.5% cache reduction, while Q-K=V + MQA achieves 96.9%, enabling practical on-device inference. We show that Q-K=V preserves quality because keys and values can occupy similar representational spaces and attention operates in a low-rank regime, whereas Q=K-V breaks attention directionality. Our results systematically characterize projection sharing as an underexplored instance of weight tying in attention, with direct, quantifiable inference memory benefits, particularly valuable for edge deployment. The code is publicly available at https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Q-K=V sharing cuts KV cache in half with 3.1% perplexity hit in their 1.2B runs and stacks with GQA, but the low-rank explanation stays untested beyond that scale.

read the letter

The useful part is the clean ablation on three projection-tying schemes plus the 2D positional encoding fix for symmetry. Q-K=V comes out ahead in language modeling: 50% KV cache reduction at 3.1% perplexity cost on 300M and 1.2B models trained on 10B tokens, and it adds to GQA-4 for 87.5% total reduction or MQA for 96.9%. The vision and synthetic results line up with the language ones, and the code is public.

They also give a direct reason why Q-K=V holds up while Q=K-V does not: keys and values can share space in the low-rank attention regime, but forcing query and key together breaks directionality. That distinction shows up consistently in their numbers.

The soft spot is the supporting story for why sharing works. The low-rank and representational-overlap claim is invoked to explain the results, yet the paper does not report direct measurements of attention rank or K/V similarity, and all runs stop at 1.2B parameters. Whether the overlap survives at larger scale or on different data is left open. Error bars are also missing from the headline numbers.

This is aimed at people who care about inference memory on edge devices. The experiments are broad enough and the code is there, so the work is worth a referee's time even if the generalization question needs follow-up.

Referee Report

2 major / 2 minor

Summary. The paper systematically ablates three QKV projection-sharing constraints (Q-K=V, Q=K-V, Q=K=V) plus asymmetric variants with 2D positional encodings. Across synthetic tasks, vision benchmarks (MNIST, CIFAR, TinyImageNet, anomaly detection), and language modeling (300M and 1.2B models on 10B tokens), it reports that Q-K=V matches or occasionally exceeds standard QKV performance while halving KV cache; combining it with GQA-4 or MQA yields 87.5–96.9% cache reduction. The authors attribute success to a low-rank attention regime in which keys and values occupy similar spaces, while Q=K-V breaks directionality. Public code is released.

Significance. If the reported numbers hold, the work supplies a simple, complementary weight-tying technique that delivers large, quantifiable KV-cache savings for on-device inference with negligible quality loss. The public implementation and breadth of tasks (synthetic through 1.2B-scale LM) are concrete strengths that facilitate direct verification and extension.

major comments (2)

[Abstract] Abstract and language-modeling results: the headline 3.1% perplexity degradation for Q-K=V (and the 87.5–96.9% combined reductions) is stated without error bars, number of random seeds, or training-hyperparameter tables, preventing assessment of whether the small gap is statistically reliable or sensitive to optimization details.
[Abstract] The mechanistic claim that 'Q-K=V preserves quality because keys and values can occupy similar representational spaces and attention operates in a low-rank regime' (Abstract) is presented as an explanation yet is unsupported by any reported measurement of attention-matrix rank, singular-value spectra, or K/V cosine similarity; without these diagnostics the low-rank justification remains an untested post-hoc interpretation rather than a verified finding.

minor comments (2)

Vision and synthetic sections would benefit from explicit statement of whether the same hyper-parameters were used across all QKV variants or whether per-variant tuning occurred.
The paper notes that Q=K-V and Q=K=V produce symmetric attention maps; a brief equation or diagram showing how the 2D positional encoding breaks this symmetry would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address the two major comments below and will make the indicated changes to strengthen the presentation of results and claims.

read point-by-point responses

Referee: [Abstract] Abstract and language-modeling results: the headline 3.1% perplexity degradation for Q-K=V (and the 87.5–96.9% combined reductions) is stated without error bars, number of random seeds, or training-hyperparameter tables, preventing assessment of whether the small gap is statistically reliable or sensitive to optimization details.

Authors: We agree that reporting variability across seeds and providing hyperparameter details would allow better assessment of reliability. In the revised manuscript we will add results from three independent random seeds for the language-modeling experiments (reporting mean and standard deviation of perplexity), along with a supplementary table listing the key training hyperparameters for the 300M and 1.2B models. revision: yes
Referee: [Abstract] The mechanistic claim that 'Q-K=V preserves quality because keys and values can occupy similar representational spaces and attention operates in a low-rank regime' (Abstract) is presented as an explanation yet is unsupported by any reported measurement of attention-matrix rank, singular-value spectra, or K/V cosine similarity; without these diagnostics the low-rank justification remains an untested post-hoc interpretation rather than a verified finding.

Authors: The referee is correct that the low-rank-regime explanation is interpretive and not backed by direct measurements such as rank or cosine similarity in the current manuscript. We will revise the abstract (and the corresponding sentence in the conclusion) to present the statement as a hypothesis consistent with the empirical results rather than a verified mechanistic finding. If space permits, we will also add a brief appendix note on average K/V cosine similarity computed from the trained models. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical ablation study with no derivations or self-referential predictions.

full rationale

The paper reports direct experimental measurements of perplexity, accuracy, and cache size across QKV sharing variants on synthetic, vision, and language tasks (300M/1.2B models, 10B tokens). No equations, first-principles derivations, or predictions are presented that reduce to fitted inputs by construction. The interpretive claim that Q-K=V works due to low-rank regime and K/V overlap is post-hoc explanation of results, not a load-bearing step that defines or predicts its own inputs. No self-citations are used to justify uniqueness or ansatzes. The study is self-contained against external benchmarks via reported ablations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is an empirical study of existing transformer components with no new free parameters, axioms beyond standard attention mechanics, or invented entities.

axioms (1)

standard math Standard transformer self-attention formulation with separate Q, K, V projections
The variants are defined relative to the conventional QKV attention mechanism.

pith-pipeline@v0.9.1-grok · 5857 in / 1311 out tokens · 30471 ms · 2026-06-28T15:12:44.687124+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

98 extracted references · 4 canonical work pages · 2 internal anchors

[1]

arXiv preprint arXiv:1409.0473 , year=

Neural machine translation by jointly learning to align and translate , author=. arXiv preprint arXiv:1409.0473 , year=

Pith/arXiv arXiv
[2]

The handbook of brain theory and neural networks , volume=

Convolutional networks for images, speech, and time series , author=. The handbook of brain theory and neural networks , volume=. 1995 , publisher=

1995
[3]

Advances in neural information processing systems , volume=

Attention is all you need , author=. Advances in neural information processing systems , volume=
[4]

ACM Computing Surveys , volume=

Efficient transformers: A survey , author=. ACM Computing Surveys , volume=. 2022 , publisher=

2022
[5]

2009 IEEE conference on computer vision and pattern recognition , pages=

Imagenet: A large-scale hierarchical image database , author=. 2009 IEEE conference on computer vision and pattern recognition , pages=. 2009 , organization=

2009
[6]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Deep residual learning for image recognition , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
[7]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil , title =. International Conference on Learning Representations , year=. doi:10.48550/ARXIV.2010.11929 , shorttitle=

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2010.11929 2010
[8]

and Kaiser,

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser,. Attention is all you need , year=. Proceedings of the 31st International Conference on Neural Information Processing Systems , pages =
[9]

Otaduy, and Dan Casas

Zheng, Sixiao and Lu, Jiachen and Zhao, Hengshuang and Zhu, Xiatian and Luo, Zekun and Wang, Yabiao and Fu, Yanwei and Feng, Jianfeng and Xiang, Tao and Torr, Philip H.S. and Zhang, Li , booktitle =. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers , year=. doi:10.1109/CVPR46437.2021.00681 , archiveprefix =. 2012....

work page doi:10.1109/cvpr46437.2021.00681 2021
[10]

doi:10.48550/arXiv.2102.04306 , adsurl=

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2102.04306
[11]

Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers , year =

Using the Output Embedding to Improve Language Models , author =. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers , year =
[12]

Findings of the Association for Computational Linguistics: EMNLP 2020 , pages=

Improve transformer models with better relative position embeddings , author=. Findings of the Association for Computational Linguistics: EMNLP 2020 , pages=

2020
[13]

2023 , url=

SlimPajama: A 627B token cleaned and deduplicated version of RedPajama , author=. 2023 , url=

2023
[14]

arXiv preprint arXiv:2204.14198 , year=

Flamingo: A Visual Language Model for Few-Shot Learning , author=. arXiv preprint arXiv:2204.14198 , year=

Pith/arXiv arXiv
[15]

Proceedings of the 38th International Conference on Machine Learning (ICML) , year=

Learning Transferable Visual Models From Natural Language Supervision , author=. Proceedings of the 38th International Conference on Machine Learning (ICML) , year=
[16]

Proc IEEE Int Conf Comput Vis , year=

CvT: Introducing Convolutions to Vision Transformers , author=. Proc IEEE Int Conf Comput Vis , year=
[17]

2022 , url =

happyharrycn and Maggie and Culliton, Phil and Yadav, Poonam and Lee, Sangjune Laurence , title =. 2022 , url =

2022
[18]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Big Bird: Transformers for Longer Sequences , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=
[19]

Proceedings of machine learning and systems , volume=

Efficiently scaling transformer inference , author=. Proceedings of machine learning and systems , volume=
[20]

Advances in neural information processing systems , volume=

Flashattention: Fast and memory-efficient exact attention with io-awareness , author=. Advances in neural information processing systems , volume=
[21]

IEEE Transactions on Visualization and Computer Graphics , volume=

Attention flows: Analyzing and comparing attention mechanisms in language models , author=. IEEE Transactions on Visualization and Computer Graphics , volume=. 2020 , publisher=

2020
[22]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

An empirical study of spatial attention mechanisms in deep networks , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=
[23]

International conference on machine learning , pages=

Scaling vision transformers to 22 billion parameters , author=. International conference on machine learning , pages=. 2023 , organization=

2023
[24]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=
[25]

arXiv preprint arXiv:2503.10622 , year=

Transformers without Normalization , author=. arXiv preprint arXiv:2503.10622 , year=

arXiv
[26]

arXiv preprint arXiv:1607.06450 , year=

Layer normalization , author=. arXiv preprint arXiv:1607.06450 , year=

Pith/arXiv arXiv
[27]

arXiv preprint arXiv:1708.07747 , year=

Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms , author=. arXiv preprint arXiv:1708.07747 , year=

Pith/arXiv arXiv
[28]

Proceedings of the IEEE , volume=

Gradient-based learning applied to document recognition , author=. Proceedings of the IEEE , volume=. 1998 , publisher=

1998
[29]

2009 , publisher=

Learning multiple layers of features from tiny images , author=. 2009 , publisher=

2009
[30]

Advances in neural information processing systems , volume=

Imagenet classification with deep convolutional neural networks , author=. Advances in neural information processing systems , volume=
[31]

arXiv preprint arXiv:2503.01743 , year=

Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras , author=. arXiv preprint arXiv:2503.01743 , year=

Pith/arXiv arXiv
[32]

International Conference on Machine Learning , pages=

Hyena hierarchy: Towards larger convolutional language models , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023
[33]

Nature Machine Intelligence , pages=

Transformers and genome language models , author=. Nature Machine Intelligence , pages=. 2025 , publisher=

2025
[34]

5 technical report , author=

Qwen2. 5 technical report , author=. arXiv preprint arXiv:2412.15115 , year=

Pith/arXiv arXiv
[35]

IEEE transactions on pattern analysis and machine intelligence , volume=

A survey on vision transformer , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2022 , publisher=

2022
[36]

arXiv preprint arXiv:2312.00752 , year=

Mamba: Linear-time sequence modeling with selective state spaces , author=. arXiv preprint arXiv:2312.00752 , year=

Pith/arXiv arXiv
[37]

Advances in Neural Information Processing Systems , volume=

Soft: Softmax-free transformer with linear complexity , author=. Advances in Neural Information Processing Systems , volume=
[38]

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

Sima: Simple softmax-free attention for vision transformers , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=
[39]

arXiv preprint arXiv:1605.00459 , year=

Multi30k: Multilingual english-german image descriptions , author=. arXiv preprint arXiv:1605.00459 , year=

Pith/arXiv arXiv
[40]

arXiv preprint arXiv:1906.11024 , year=

Sharing attention weights for fast transformer , author=. arXiv preprint arXiv:1906.11024 , year=

Pith/arXiv arXiv 1906
[41]

ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Residualtransformer: Residual low-rank learning with weight-sharing for transformer layers , author=. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2024 , organization=

2024
[42]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Minivit: Compressing vision transformers with weight multiplexing , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[43]

arXiv preprint arXiv:2101.00234 , year=

Subformer: Exploring weight sharing for parameter efficiency in generative transformers , author=. arXiv preprint arXiv:2101.00234 , year=

arXiv
[44]

arXiv preprint arXiv:2004.11886 , year=

Lite transformer with long-short range attention , author=. arXiv preprint arXiv:2004.11886 , year=

arXiv 2004
[45]

arXiv preprint arXiv:1909.11942 , year=

Albert: A lite bert for self-supervised learning of language representations , author=. arXiv preprint arXiv:1909.11942 , year=

Pith/arXiv arXiv 1909
[46]

arXiv preprint arXiv:1807.03819 , year=

Universal transformers , author=. arXiv preprint arXiv:1807.03819 , year=

Pith/arXiv arXiv
[47]

arXiv preprint arXiv:1806.03280 , year=

Multilingual neural machine translation with task-specific attention , author=. arXiv preprint arXiv:1806.03280 , year=

Pith/arXiv arXiv
[48]

arXiv preprint arXiv:2001.04451 , year=

Reformer: The efficient transformer , author=. arXiv preprint arXiv:2001.04451 , year=

Pith/arXiv arXiv 2001
[49]

arXiv preprint arXiv:1909.10351 , year=

Tinybert: Distilling bert for natural language understanding , author=. arXiv preprint arXiv:1909.10351 , year=

arXiv 1909
[50]

arXiv preprint arXiv:2301.00774 , year=

SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot , author=. arXiv preprint arXiv:2301.00774 , year=

arXiv
[51]

arXiv preprint arXiv:2005.07683 , year=

Movement Pruning: Adaptive Sparsity by Fine-Tuning , author=. arXiv preprint arXiv:2005.07683 , year=

arXiv 2005
[52]

arXiv preprint arXiv:2109.14642 , year=

Block Pruning For Faster Transformers , author=. arXiv preprint arXiv:2109.14642 , year=

arXiv
[53]

arXiv preprint arXiv:2011.00623 , year=

GOBO: Quantizing Attention-Based NLP Models for Reduced Size and Latency , author=. arXiv preprint arXiv:2011.00623 , year=

arXiv 2011
[54]

International conference on machine learning , pages=

Smoothquant: Accurate and efficient post-training quantization for large language models , author=. International conference on machine learning , pages=. 2023 , organization=

2023
[55]

arXiv preprint arXiv:2305.13253 , year=

OmniQuant: Weight and Activation Post-Training Quantization for Transformers , author=. arXiv preprint arXiv:2305.13253 , year=

arXiv
[56]

ICLR Workshop Track , year=

To prune, or not to prune: exploring the efficacy of pruning for model compression , author=. ICLR Workshop Track , year=
[57]

SparseML: Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models , author=
[58]

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

HAQ: Hardware-Aware Automated Quantization with Mixed Precision , author=. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[59]

Joint Pruning, Quantization and Distillation for Efficient Inference of Transformers , author=
[60]

arXiv preprint arXiv:2402.05964 , year=

A Survey on Transformer Compression , author=. arXiv preprint arXiv:2402.05964 , year=

arXiv
[61]

Model Compression Techniques for Transformer Models: A Survey , author=
[62]

Companion Proceedings of the ACM on Web Conference 2025 , pages=

Don't do rag: When cache-augmented generation is all you need for knowledge tasks , author=. Companion Proceedings of the ACM on Web Conference 2025 , pages=

2025
[63]

Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages=

Bleu: a method for automatic evaluation of machine translation , author=. Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages=
[64]

National Science Review , volume=

A survey on multimodal large language models , author=. National Science Review , volume=. 2024 , publisher=

2024
[65]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints , author =. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

2023
[66]

Computer Vision and Image Understanding , volume=

Visual question answering: A survey of methods and datasets , author=. Computer Vision and Image Understanding , volume=. 2017 , publisher=

2017
[67]

Expert Systems with Applications , volume=

Abstractive summarization: An overview of the state of the art , author=. Expert Systems with Applications , volume=. 2019 , publisher=

2019
[68]

arXiv preprint arXiv:2406.06567 , year =

DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion , author =. arXiv preprint arXiv:2406.06567 , year =

arXiv
[69]

International conference on machine learning , pages=

Transformers are rnns: Fast autoregressive transformers with linear attention , author=. International conference on machine learning , pages=. 2020 , organization=

2020
[70]

arXiv preprint arXiv:2105.14103 , year=

An Attention Free Transformer , author=. arXiv preprint arXiv:2105.14103 , year=

arXiv
[71]

International Conference on Learning Representations (ICLR) , year=

Graph Attention Networks , author=. International Conference on Learning Representations (ICLR) , year=
[72]

arXiv preprint arXiv:2204.02311 , year=

PaLM: Scaling Language Modeling with Pathways , author=. arXiv preprint arXiv:2204.02311 , year=

Pith/arXiv arXiv
[73]

arXiv preprint arXiv:2310.06825 , year=

Mistral 7B , author=. arXiv preprint arXiv:2310.06825 , year=

Pith/arXiv arXiv
[74]

Advances in Neural Information Processing Systems , volume=

Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time , author=. Advances in Neural Information Processing Systems , volume=
[75]

arXiv preprint arXiv:2306.03078 , year=

SPQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression , author=. arXiv preprint arXiv:2306.03078 , year=

arXiv
[76]

arXiv preprint arXiv:2303.06865 , year=

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU , author=. arXiv preprint arXiv:2303.06865 , year=

arXiv
[77]

arXiv preprint arXiv:1904.10509 , year=

Generating Long Sequences with Sparse Transformers , author=. arXiv preprint arXiv:1904.10509 , year=

Pith/arXiv arXiv 1904
[78]

arXiv preprint arXiv:2004.05150 , year=

Longformer: The Long-Document Transformer , author=. arXiv preprint arXiv:2004.05150 , year=

Pith/arXiv arXiv 2004
[79]

arXiv preprint arXiv:2311.16867 , year=

The Falcon Series of Open Language Models , author=. arXiv preprint arXiv:2311.16867 , year=

Pith/arXiv arXiv
[80]

arXiv preprint arXiv:2307.09288 , year=

LLaMA 2: Open Foundation and Fine-Tuned Chat Models , author=. arXiv preprint arXiv:2307.09288 , year=

Pith/arXiv arXiv

Showing first 80 references.

[1] [1]

arXiv preprint arXiv:1409.0473 , year=

Neural machine translation by jointly learning to align and translate , author=. arXiv preprint arXiv:1409.0473 , year=

Pith/arXiv arXiv

[2] [2]

The handbook of brain theory and neural networks , volume=

Convolutional networks for images, speech, and time series , author=. The handbook of brain theory and neural networks , volume=. 1995 , publisher=

1995

[3] [3]

Advances in neural information processing systems , volume=

Attention is all you need , author=. Advances in neural information processing systems , volume=

[4] [4]

ACM Computing Surveys , volume=

Efficient transformers: A survey , author=. ACM Computing Surveys , volume=. 2022 , publisher=

2022

[5] [5]

2009 IEEE conference on computer vision and pattern recognition , pages=

Imagenet: A large-scale hierarchical image database , author=. 2009 IEEE conference on computer vision and pattern recognition , pages=. 2009 , organization=

2009

[6] [6]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Deep residual learning for image recognition , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

[7] [7]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil , title =. International Conference on Learning Representations , year=. doi:10.48550/ARXIV.2010.11929 , shorttitle=

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2010.11929 2010

[8] [8]

and Kaiser,

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser,. Attention is all you need , year=. Proceedings of the 31st International Conference on Neural Information Processing Systems , pages =

[9] [9]

Otaduy, and Dan Casas

Zheng, Sixiao and Lu, Jiachen and Zhao, Hengshuang and Zhu, Xiatian and Luo, Zekun and Wang, Yabiao and Fu, Yanwei and Feng, Jianfeng and Xiang, Tao and Torr, Philip H.S. and Zhang, Li , booktitle =. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers , year=. doi:10.1109/CVPR46437.2021.00681 , archiveprefix =. 2012....

work page doi:10.1109/cvpr46437.2021.00681 2021

[10] [10]

doi:10.48550/arXiv.2102.04306 , adsurl=

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2102.04306

[11] [11]

Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers , year =

Using the Output Embedding to Improve Language Models , author =. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers , year =

[12] [12]

Findings of the Association for Computational Linguistics: EMNLP 2020 , pages=

Improve transformer models with better relative position embeddings , author=. Findings of the Association for Computational Linguistics: EMNLP 2020 , pages=

2020

[13] [13]

2023 , url=

SlimPajama: A 627B token cleaned and deduplicated version of RedPajama , author=. 2023 , url=

2023

[14] [14]

arXiv preprint arXiv:2204.14198 , year=

Flamingo: A Visual Language Model for Few-Shot Learning , author=. arXiv preprint arXiv:2204.14198 , year=

Pith/arXiv arXiv

[15] [15]

Proceedings of the 38th International Conference on Machine Learning (ICML) , year=

Learning Transferable Visual Models From Natural Language Supervision , author=. Proceedings of the 38th International Conference on Machine Learning (ICML) , year=

[16] [16]

Proc IEEE Int Conf Comput Vis , year=

CvT: Introducing Convolutions to Vision Transformers , author=. Proc IEEE Int Conf Comput Vis , year=

[17] [17]

2022 , url =

happyharrycn and Maggie and Culliton, Phil and Yadav, Poonam and Lee, Sangjune Laurence , title =. 2022 , url =

2022

[18] [18]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Big Bird: Transformers for Longer Sequences , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

[19] [19]

Proceedings of machine learning and systems , volume=

Efficiently scaling transformer inference , author=. Proceedings of machine learning and systems , volume=

[20] [20]

Advances in neural information processing systems , volume=

Flashattention: Fast and memory-efficient exact attention with io-awareness , author=. Advances in neural information processing systems , volume=

[21] [21]

IEEE Transactions on Visualization and Computer Graphics , volume=

Attention flows: Analyzing and comparing attention mechanisms in language models , author=. IEEE Transactions on Visualization and Computer Graphics , volume=. 2020 , publisher=

2020

[22] [22]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

An empirical study of spatial attention mechanisms in deep networks , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

[23] [23]

International conference on machine learning , pages=

Scaling vision transformers to 22 billion parameters , author=. International conference on machine learning , pages=. 2023 , organization=

2023

[24] [24]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

[25] [25]

arXiv preprint arXiv:2503.10622 , year=

Transformers without Normalization , author=. arXiv preprint arXiv:2503.10622 , year=

arXiv

[26] [26]

arXiv preprint arXiv:1607.06450 , year=

Layer normalization , author=. arXiv preprint arXiv:1607.06450 , year=

Pith/arXiv arXiv

[27] [27]

arXiv preprint arXiv:1708.07747 , year=

Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms , author=. arXiv preprint arXiv:1708.07747 , year=

Pith/arXiv arXiv

[28] [28]

Proceedings of the IEEE , volume=

Gradient-based learning applied to document recognition , author=. Proceedings of the IEEE , volume=. 1998 , publisher=

1998

[29] [29]

2009 , publisher=

Learning multiple layers of features from tiny images , author=. 2009 , publisher=

2009

[30] [30]

Advances in neural information processing systems , volume=

Imagenet classification with deep convolutional neural networks , author=. Advances in neural information processing systems , volume=

[31] [31]

arXiv preprint arXiv:2503.01743 , year=

Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras , author=. arXiv preprint arXiv:2503.01743 , year=

Pith/arXiv arXiv

[32] [32]

International Conference on Machine Learning , pages=

Hyena hierarchy: Towards larger convolutional language models , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023

[33] [33]

Nature Machine Intelligence , pages=

Transformers and genome language models , author=. Nature Machine Intelligence , pages=. 2025 , publisher=

2025

[34] [34]

5 technical report , author=

Qwen2. 5 technical report , author=. arXiv preprint arXiv:2412.15115 , year=

Pith/arXiv arXiv

[35] [35]

IEEE transactions on pattern analysis and machine intelligence , volume=

A survey on vision transformer , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2022 , publisher=

2022

[36] [36]

arXiv preprint arXiv:2312.00752 , year=

Mamba: Linear-time sequence modeling with selective state spaces , author=. arXiv preprint arXiv:2312.00752 , year=

Pith/arXiv arXiv

[37] [37]

Advances in Neural Information Processing Systems , volume=

Soft: Softmax-free transformer with linear complexity , author=. Advances in Neural Information Processing Systems , volume=

[38] [38]

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

Sima: Simple softmax-free attention for vision transformers , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

[39] [39]

arXiv preprint arXiv:1605.00459 , year=

Multi30k: Multilingual english-german image descriptions , author=. arXiv preprint arXiv:1605.00459 , year=

Pith/arXiv arXiv

[40] [40]

arXiv preprint arXiv:1906.11024 , year=

Sharing attention weights for fast transformer , author=. arXiv preprint arXiv:1906.11024 , year=

Pith/arXiv arXiv 1906

[41] [41]

ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Residualtransformer: Residual low-rank learning with weight-sharing for transformer layers , author=. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2024 , organization=

2024

[42] [42]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Minivit: Compressing vision transformers with weight multiplexing , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[43] [43]

arXiv preprint arXiv:2101.00234 , year=

Subformer: Exploring weight sharing for parameter efficiency in generative transformers , author=. arXiv preprint arXiv:2101.00234 , year=

arXiv

[44] [44]

arXiv preprint arXiv:2004.11886 , year=

Lite transformer with long-short range attention , author=. arXiv preprint arXiv:2004.11886 , year=

arXiv 2004

[45] [45]

arXiv preprint arXiv:1909.11942 , year=

Albert: A lite bert for self-supervised learning of language representations , author=. arXiv preprint arXiv:1909.11942 , year=

Pith/arXiv arXiv 1909

[46] [46]

arXiv preprint arXiv:1807.03819 , year=

Universal transformers , author=. arXiv preprint arXiv:1807.03819 , year=

Pith/arXiv arXiv

[47] [47]

arXiv preprint arXiv:1806.03280 , year=

Multilingual neural machine translation with task-specific attention , author=. arXiv preprint arXiv:1806.03280 , year=

Pith/arXiv arXiv

[48] [48]

arXiv preprint arXiv:2001.04451 , year=

Reformer: The efficient transformer , author=. arXiv preprint arXiv:2001.04451 , year=

Pith/arXiv arXiv 2001

[49] [49]

arXiv preprint arXiv:1909.10351 , year=

Tinybert: Distilling bert for natural language understanding , author=. arXiv preprint arXiv:1909.10351 , year=

arXiv 1909

[50] [50]

arXiv preprint arXiv:2301.00774 , year=

SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot , author=. arXiv preprint arXiv:2301.00774 , year=

arXiv

[51] [51]

arXiv preprint arXiv:2005.07683 , year=

Movement Pruning: Adaptive Sparsity by Fine-Tuning , author=. arXiv preprint arXiv:2005.07683 , year=

arXiv 2005

[52] [52]

arXiv preprint arXiv:2109.14642 , year=

Block Pruning For Faster Transformers , author=. arXiv preprint arXiv:2109.14642 , year=

arXiv

[53] [53]

arXiv preprint arXiv:2011.00623 , year=

GOBO: Quantizing Attention-Based NLP Models for Reduced Size and Latency , author=. arXiv preprint arXiv:2011.00623 , year=

arXiv 2011

[54] [54]

International conference on machine learning , pages=

Smoothquant: Accurate and efficient post-training quantization for large language models , author=. International conference on machine learning , pages=. 2023 , organization=

2023

[55] [55]

arXiv preprint arXiv:2305.13253 , year=

OmniQuant: Weight and Activation Post-Training Quantization for Transformers , author=. arXiv preprint arXiv:2305.13253 , year=

arXiv

[56] [56]

ICLR Workshop Track , year=

To prune, or not to prune: exploring the efficacy of pruning for model compression , author=. ICLR Workshop Track , year=

[57] [57]

SparseML: Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models , author=

[58] [58]

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

HAQ: Hardware-Aware Automated Quantization with Mixed Precision , author=. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[59] [59]

Joint Pruning, Quantization and Distillation for Efficient Inference of Transformers , author=

[60] [60]

arXiv preprint arXiv:2402.05964 , year=

A Survey on Transformer Compression , author=. arXiv preprint arXiv:2402.05964 , year=

arXiv

[61] [61]

Model Compression Techniques for Transformer Models: A Survey , author=

[62] [62]

Companion Proceedings of the ACM on Web Conference 2025 , pages=

Don't do rag: When cache-augmented generation is all you need for knowledge tasks , author=. Companion Proceedings of the ACM on Web Conference 2025 , pages=

2025

[63] [63]

Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages=

Bleu: a method for automatic evaluation of machine translation , author=. Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages=

[64] [64]

National Science Review , volume=

A survey on multimodal large language models , author=. National Science Review , volume=. 2024 , publisher=

2024

[65] [65]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints , author =. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

2023

[66] [66]

Computer Vision and Image Understanding , volume=

Visual question answering: A survey of methods and datasets , author=. Computer Vision and Image Understanding , volume=. 2017 , publisher=

2017

[67] [67]

Expert Systems with Applications , volume=

Abstractive summarization: An overview of the state of the art , author=. Expert Systems with Applications , volume=. 2019 , publisher=

2019

[68] [68]

arXiv preprint arXiv:2406.06567 , year =

DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion , author =. arXiv preprint arXiv:2406.06567 , year =

arXiv

[69] [69]

International conference on machine learning , pages=

Transformers are rnns: Fast autoregressive transformers with linear attention , author=. International conference on machine learning , pages=. 2020 , organization=

2020

[70] [70]

arXiv preprint arXiv:2105.14103 , year=

An Attention Free Transformer , author=. arXiv preprint arXiv:2105.14103 , year=

arXiv

[71] [71]

International Conference on Learning Representations (ICLR) , year=

Graph Attention Networks , author=. International Conference on Learning Representations (ICLR) , year=

[72] [72]

arXiv preprint arXiv:2204.02311 , year=

PaLM: Scaling Language Modeling with Pathways , author=. arXiv preprint arXiv:2204.02311 , year=

Pith/arXiv arXiv

[73] [73]

arXiv preprint arXiv:2310.06825 , year=

Mistral 7B , author=. arXiv preprint arXiv:2310.06825 , year=

Pith/arXiv arXiv

[74] [74]

Advances in Neural Information Processing Systems , volume=

Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time , author=. Advances in Neural Information Processing Systems , volume=

[75] [75]

arXiv preprint arXiv:2306.03078 , year=

SPQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression , author=. arXiv preprint arXiv:2306.03078 , year=

arXiv

[76] [76]

arXiv preprint arXiv:2303.06865 , year=

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU , author=. arXiv preprint arXiv:2303.06865 , year=

arXiv

[77] [77]

arXiv preprint arXiv:1904.10509 , year=

Generating Long Sequences with Sparse Transformers , author=. arXiv preprint arXiv:1904.10509 , year=

Pith/arXiv arXiv 1904

[78] [78]

arXiv preprint arXiv:2004.05150 , year=

Longformer: The Long-Document Transformer , author=. arXiv preprint arXiv:2004.05150 , year=

Pith/arXiv arXiv 2004

[79] [79]

arXiv preprint arXiv:2311.16867 , year=

The Falcon Series of Open Language Models , author=. arXiv preprint arXiv:2311.16867 , year=

Pith/arXiv arXiv

[80] [80]

arXiv preprint arXiv:2307.09288 , year=

LLaMA 2: Open Foundation and Fine-Tuned Chat Models , author=. arXiv preprint arXiv:2307.09288 , year=

Pith/arXiv arXiv