A³: an Analytical Low-Rank Approximation Framework for Attention

Cheng Zhang; Christos-Savvas Bouganis; George A. Constantinides; Jeffrey T. H. Wong; Pedro Gimenes; Wayne Luk; Xinye Cao; Yiren Zhao

arxiv: 2505.12942 · v4 · pith:NMRMJ4HMnew · submitted 2025-05-19 · cs.CL · cs.AI · cs.LG

Jeffrey T. H. Wong, Cheng Zhang, Xinye Cao, Pedro Gimenes, Christos-Savvas Bouganis, George A. Constantinides, Wayne Luk, Yiren Zhao

Pith reviewed 2026-05-22 14:56 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.LG

keywords low-rank approximation · transformer compression · attention mechanism · large language models · model compression · KV cache compression · post-training · analytical solution

The pith

A³ splits Transformer layers into QK, OV and MLP components and derives analytical low-rank reductions for each to cut size and compute with less accuracy loss than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes A³, a post-training framework that divides each Transformer layer into three functional components (QK, OV, and MLP) and supplies closed-form analytical solutions that shrink the hidden dimensions inside each while minimizing the component-specific functional loss. This produces direct reductions in total parameters, KV cache footprint, and FLOPs without the extra GEMM launches or memory traffic that come from conventional low-rank factorizations. Experiments on LLaMA 3.1-70B demonstrate the practical payoff: under an identical compute-and-memory budget the approximated model reaches 4.69 perplexity on WikiText-2, against 7.87 for the previous best reported result, an improvement of 3.18. Readers should care because the approach respects the modular structure of attention and feed-forward blocks rather than treating weight matrices in isolation.

Core claim

A³ splits a Transformer layer into three functional components, namely QK, OV, and MLP, and provides analytical solutions that reduce the hidden dimension size inside each component while minimizing the component's functional loss. This approach directly reduces model size, KV cache size, and FLOPs without introducing any runtime overhead, and yields superior performance compared with prior low-rank techniques under the same reduction budget.

What carries the argument

Analytical low-rank dimension-reduction solutions for the QK, OV, and MLP functional components that minimize each component's functional loss.

If this is right

Model parameter count drops in direct proportion to the chosen hidden-dimension reductions.
KV cache memory requirement shrinks because the reduced dimensions apply to key and value projections.
Inference FLOPs decrease because all matrix multiplies operate on the smaller internal dimensions.
No additional kernel launches or temporary buffers are introduced at runtime.
The same analytical reductions can be combined with quantization or continued fine-tuning.
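
The first three consequences are shape arithmetic, which the following minimal sketch makes concrete. It assumes a plain multi-head attention layer (no grouped-query attention, no biases), and every size in it is hypothetical, chosen only for illustration rather than taken from the paper.

```python
def attention_costs(d_model, n_heads, qk_dim, vo_dim):
    """Per-layer attention parameter count and per-token KV cache entries.

    Assumption: standard multi-head attention without GQA or biases.
    qk_dim and vo_dim are the per-head dimensions of the QK and OV components.
    """
    qk_params = 2 * d_model * n_heads * qk_dim        # W_q and W_k
    vo_params = 2 * d_model * n_heads * vo_dim        # W_v and W_o
    kv_cache_per_token = n_heads * (qk_dim + vo_dim)  # cached keys and values
    return qk_params + vo_params, kv_cache_per_token


# Hypothetical layer sizes, before and after shrinking the head dimensions.
full_params, full_kv = attention_costs(d_model=4096, n_heads=32, qk_dim=128, vo_dim=128)
small_params, small_kv = attention_costs(d_model=4096, n_heads=32, qk_dim=96, vo_dim=96)

param_ratio = small_params / full_params  # fraction of attention parameters kept
kv_ratio = small_kv / full_kv             # fraction of KV cache kept
print(param_ratio, kv_ratio)
```

Under these assumptions, shrinking both head dimensions from 128 to 96 leaves three quarters of the attention parameters and three quarters of the KV cache, and the matrix multiplies shrink by the same factor because they run on the smaller shapes directly.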

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The component-wise analytical treatment could be ported to other attention-based architectures that share the same QK/OV/MLP split.
Layer-wise mixed-rank schedules derived from the same functional-loss criterion might further improve the accuracy–compression trade-off.
Because the method is post-training and analytical, it offers a lightweight starting point for subsequent hardware-specific optimizations.

Load-bearing premise

Independently minimizing functional loss for the QK, OV, and MLP components produces near-optimal overall model performance without needing to account for cross-component interactions or downstream fine-tuning effects.

What would settle it

Apply the analytical reductions to LLaMA 3.1-70B under the reported budget and measure WikiText-2 perplexity; a result near or above 7.87 would falsify the claim of consistent superiority over prior low-rank methods.
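
The settling experiment hinges on how perplexity is computed. A minimal sketch of the standard definition (the exponential of the mean negative log-likelihood per token) is given below; the evaluation protocol actually used in the paper (context length, stride, tokenizer handling) is not stated in this review, so `perplexity` and its input are illustrative assumptions.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from the natural-log probabilities a model assigns to each target token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy check: a model that gives every target token probability 0.25
# is as uncertain as a uniform choice among four tokens.
uniform_log_probs = [math.log(0.25)] * 8
ppl_uniform = perplexity(uniform_log_probs)
print(ppl_uniform)
```

Because perplexity is exponential in the mean loss, comparisons such as 4.69 versus 7.87 are only meaningful when both models are scored on the same tokens with the same protocol.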

Figures

Figures reproduced from arXiv: 2505.12942 by Cheng Zhang, Christos-Savvas Bouganis, George A. Constantinides, Jeffrey T. H. Wong, Pedro Gimenes, Wayne Luk, Xinye Cao, Yiren Zhao.

**Figure 1.** High-level overview of A³. A³ performs a low-rank approximation on each QK, OV, and MLP component, reducing the head dimensions in QK and OV, and the intermediate dimension in MLP.

The classic MLP in a Transformer has two linear layers with a ReLU activation function in between: $X_d = \mathrm{ReLU}(X_{mlp} W_u)$, $Y_{mlp} = X_d W_d$ (Eq. 3). $W_u$ and $W_d$ scale the input dimension $d_m$ to the intermediate dimension $d_{inter}$ and back to d…

**Figure 2.** LLaMA-7b PPL on C4, compared to SVD, FWSVD and SVD-LLM.

**Figure 3.** Performance comparisons in Tokens per Second (TPS) of A³ and SVD-LLM (LLaMA-2-13b, A100 40GB, batch size=2, sequence length=2048, attention backend=Eager/SDPA).

We conduct ablation studies to evaluate A³'s impact on individual components (QK, OV, MLP). We also include baselines that can be applied to the target components to show A³'s advantage. Attention without RoPE: Theorem 2 (A³-QK) and Theorem…

**Figure 4.** Ablation study of A³ components. (a) QK and OV on MPT-7B. (b) QK-RoPE and MLP on LLaMA-2-7B.

Attention with RoPE: In Section 3.4 we propose using CUR approximation to solve Problem 1 for attention with RoPE, which follows a similar approach as A³-MLP in Section 3.3. Here we compare against structured pruning baselines that can be adapted for this problem, including abs(w) and Wanda [Sun et al., 2023]. …

**Figure 5.** The $x_q$ and $x_{kv}$ covariance matrix across all decoders with a sample (2048 sequence length) from SlimPajama.

**Figure 6.** A comparison of perplexity (↓) on WikiText-2 using quantization (HQQ, 4 bits) for both the original and A³-applied LLaMA-3.1-8B.

Quantization compatibility: Here we show that A³ can be combined with quantization…
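
The Figure 1 excerpt describes the classic two-layer ReLU MLP, and the Figure 4 excerpt says the MLP and RoPE cases are handled by a CUR-style approach, that is, by keeping a subset of existing dimensions. The sketch below shows only the structural effect of such a selection on a ReLU MLP. The scoring rule inside `select_mlp_channels` is a placeholder assumption (mean squared activation times output-row energy), not the criterion derived in the paper, and all sizes are hypothetical.

```python
import numpy as np


def select_mlp_channels(W_u, W_d, X, k):
    """Keep k intermediate channels of Y = ReLU(X W_u) W_d.

    Assumption: the score below is an illustrative heuristic, not the paper's rule.
    """
    H = np.maximum(X @ W_u, 0.0)                            # intermediate activations
    score = (H ** 2).mean(axis=0) * (W_d ** 2).sum(axis=1)  # per-channel output energy proxy
    keep = np.sort(np.argsort(score)[-k:])                  # indices of the k highest scores
    return W_u[:, keep], W_d[keep, :], keep


rng = np.random.default_rng(0)
d_m, d_inter, k = 16, 64, 32                 # hypothetical sizes
W_u = rng.standard_normal((d_m, d_inter))
W_d = rng.standard_normal((d_inter, d_m))
X = rng.standard_normal((128, d_m))          # stand-in calibration activations

W_u_small, W_d_small, keep = select_mlp_channels(W_u, W_d, X, k)
Y_small = np.maximum(X @ W_u_small, 0.0) @ W_d_small

# Reference: the full MLP with the dropped channels zeroed out.
mask = np.zeros(d_inter)
mask[keep] = 1.0
Y_masked = (np.maximum(X @ W_u, 0.0) * mask) @ W_d
print(W_u_small.shape, W_d_small.shape, np.allclose(Y_small, Y_masked))
```

Because ReLU acts elementwise, dropping an intermediate channel is exactly equivalent to deleting one column of $W_u$ and one row of $W_d$, so the reduced MLP is still two dense matrix multiplies on smaller shapes.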

Original abstract

Large language models have demonstrated remarkable performance; however, their massive parameter counts make deployment highly expensive. Low-rank approximation offers a promising compression solution, yet existing approaches have two main limitations: (1) They focus on minimizing the output error of individual linear layers, without considering the architectural characteristics of Transformers, and (2) they decompose a large weight matrix into two small low-rank matrices. Consequently, these methods often fall short compared to other compression techniques like pruning and quantization, and introduce runtime overhead such as the extra GEMM kernel launches and memory operations for decomposed small matrices. To address these limitations, we propose $A^3$, a post-training low-rank approximation framework. $A^3$ splits a Transformer layer into three functional components, namely $\texttt{QK}$, $\texttt{OV}$, and $\texttt{MLP}$ and provides analytical solutions that reduces the hidden dimension size inside each component while minimizing the component's functional loss. This approach directly reduces model sizes, KV cache sizes, and FLOPs without introducing any runtime overheads. Through extensive experiments, we show that $A^3$ maintains superior performance compared to SoTAs. For example, under the same reduction budget in computation and memory, our low-rank approximated LLaMA 3.1-70B achieves a perplexity of 4.69 on WikiText-2, outperforming the previous SoTA's 7.87 by 3.18. We also show versatile applications of $A^3$ in KV cache compression, integration with quantization, fine-tuning and mixed-rank assignments. We open-sourced our framework and code at https://github.com/DeepWok/a3.
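
To illustrate the abstract's contrast between factorizing one weight matrix into two and shrinking the dimension inside a component, here is a minimal sketch for a single attention head without positional encoding. It uses a plain, unweighted truncated SVD as a stand-in for the paper's analytical solution, and `shrink_qk_head` is a hypothetical helper. With RoPE the rotation sits between the query and key projections, so this fusion does not apply directly.

```python
import numpy as np


def shrink_qk_head(W_q, W_k, r):
    """Return new per-head projections with head dimension r.

    The attention score x_q W_q W_k^T x_kv^T depends only on the product
    W_qk = W_q W_k^T, so a rank-r approximation of W_qk can be split back
    into a smaller W_q and W_k. Assumption: unweighted SVD, no RoPE.
    """
    W_qk = W_q @ W_k.T
    U, S, Vt = np.linalg.svd(W_qk, full_matrices=False)
    root = np.sqrt(S[:r])
    return U[:, :r] * root, Vt[:r].T * root


rng = np.random.default_rng(0)
d_m, d_head = 32, 8                           # hypothetical sizes
W_q = rng.standard_normal((d_m, d_head))
W_k = rng.standard_normal((d_m, d_head))

W_q_same, W_k_same = shrink_qk_head(W_q, W_k, d_head)  # no reduction: exact
W_q_small, W_k_small = shrink_qk_head(W_q, W_k, 4)     # head dimension 8 -> 4
print(W_q_small.shape, W_k_small.shape)
```

The reduced head still uses exactly one query projection and one key projection, only narrower, which is why this style of reduction adds no extra GEMM launches and also shrinks the cached keys.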

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents A³, a post-training low-rank approximation framework for Transformers. It splits each layer into QK, OV, and MLP functional components and derives analytical low-rank reductions for each by minimizing a component-specific functional loss. The method claims to reduce model size, KV cache, and FLOPs without runtime overhead. Key empirical result: under equivalent compute/memory budgets, the approximated LLaMA 3.1-70B achieves 4.69 perplexity on WikiText-2, outperforming prior SoTA by 3.18. Additional uses in KV cache compression, quantization integration, and mixed-rank assignments are shown, with code released.

Significance. If the per-component analytical solutions compose effectively, the framework could advance efficient LLM compression by avoiding runtime overheads common in decomposed low-rank methods and offering a more architecture-aware alternative to generic layer-wise approximations. The open-sourcing supports reproducibility. Significance depends on validating that isolated functional-loss minimization yields near-optimal end-to-end behavior.

major comments (3)

[§3] §3 (Analytical Solutions for QK/OV/MLP): The functional losses are minimized independently per component. However, QK outputs modulate attention scores that are then scaled by OV, and MLP follows the attention residual; no analysis or joint objective addresses potential error amplification or compensation across blocks. This assumption is load-bearing for the central claim that the reported 4.69 perplexity is attributable to the analytical per-component procedure rather than unaccounted interactions.
[§4] §4 (Derivations of closed-form solutions): The manuscript claims analytical solutions, yet the provided text does not include explicit step-by-step derivations or the precise definition of each component's functional loss (e.g., whether it uses original-model activations). Without these, it is impossible to confirm the solutions are truly closed-form and parameter-free beyond the target rank choice, undermining verification of the performance claims.
[Experiments] Experiments section / Table reporting LLaMA 3.1-70B results: The 4.69 vs. 7.87 perplexity comparison is presented under a fixed reduction budget, but lacks ablations on cross-component interactions, sensitivity to per-component rank allocation, or controls for whether baselines received equivalent hyper-parameter tuning. This weakens attribution of the 3.18 gain specifically to the A³ analytical method.

minor comments (2)

[Abstract] Abstract: The phrase 'analytical solutions that reduces the hidden dimension size' contains a grammatical error ('reduces' should be 'reduce').
[Method] The manuscript should clarify in the method section how the overall compute/memory budget is exactly partitioned across QK, OV, and MLP to enable direct reproduction of the reported comparisons.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment point by point below, providing clarifications and indicating revisions made to strengthen the paper.

Point-by-point responses

Referee: [§3] §3 (Analytical Solutions for QK/OV/MLP): The functional losses are minimized independently per component. However, QK outputs modulate attention scores that are then scaled by OV, and MLP follows the attention residual; no analysis or joint objective addresses potential error amplification or compensation across blocks. This assumption is load-bearing for the central claim that the reported 4.69 perplexity is attributable to the analytical per-component procedure rather than unaccounted interactions.

Authors: We acknowledge that the per-component functional losses are minimized independently, which is a deliberate design choice to enable analytical closed-form solutions tailored to each architectural component. While a joint objective could in principle capture interactions, our extensive empirical evaluations across multiple models and tasks demonstrate that the independent minimizations compose effectively, yielding the reported performance without notable error amplification. To directly address this point, we have added a new discussion subsection analyzing error propagation across components and included an ablation measuring cumulative effects, which supports that the gains are attributable to the A³ procedure.
Revision: yes

Referee: [§4] §4 (Derivations of closed-form solutions): The manuscript claims analytical solutions, yet the provided text does not include explicit step-by-step derivations or the precise definition of each component's functional loss (e.g., whether it uses original-model activations). Without these, it is impossible to confirm the solutions are truly closed-form and parameter-free beyond the target rank choice, undermining verification of the performance claims.

Authors: We agree that explicit derivations improve verifiability. The original submission included the derivations in the appendix with definitions of the functional losses (using original-model activations for each component), but we have now expanded §4 in the main text with full step-by-step derivations. This clarifies that the solutions are closed-form and depend only on the target rank, with no additional learned parameters.
Revision: yes

Referee: [Experiments] Experiments section / Table reporting LLaMA 3.1-70B results: The 4.69 vs. 7.87 perplexity comparison is presented under a fixed reduction budget, but lacks ablations on cross-component interactions, sensitivity to per-component rank allocation, or controls for whether baselines received equivalent hyper-parameter tuning. This weakens attribution of the 3.18 gain specifically to the A³ analytical method.

Authors: We appreciate this observation on strengthening attribution. We have incorporated additional ablation studies on cross-component interactions and per-component rank sensitivity in the revised experiments section. For baseline comparisons, we strictly followed the hyper-parameter configurations and reduction budgets reported in the original baseline papers to maintain fairness; no extra tuning was applied to A³ beyond the analytical rank selection.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; analytical derivations are self-contained and performance claims are empirically measured

full rationale

The paper defines functional losses for the QK, OV, and MLP components separately and derives closed-form low-rank factors that minimize those component-wise losses on the original model's activations. This is a standard post-training compression procedure and does not reduce the final perplexity result to the input by construction; the reported 4.69 WikiText-2 perplexity is an external measurement after applying the approximations. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are present in the provided text. The method does not rename known empirical patterns and the central claim rests on experimental comparison rather than definitional equivalence or fitted-parameter renaming.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the premise that the Transformer layer can be cleanly partitioned into three independent functional blocks whose losses can be minimized separately; no new physical entities are postulated and the only free parameters appear to be the target ranks chosen per component.

free parameters (1)

target rank per component
Chosen to meet a given compute/memory budget; the paper does not state whether these ranks are swept or fixed by a closed-form rule.

axioms (1)

Domain assumption: The functional loss of each component (QK, OV, MLP) can be defined and minimized independently without significant error from ignoring inter-component dependencies.
Invoked when the paper states that analytical solutions are derived for each component separately.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: A³ splits a Transformer layer into three functional components, namely QK, OV, and MLP and provides analytical solutions that reduces the hidden dimension size inside each component while minimizing the component's functional loss.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: The optimal solution to Problem 1 is $\widetilde{W}_{qk,i} = \left(R^{1/2}_{X_q X_q}\right)^{-1} \mathrm{SVD}_r\!\left(R^{1/2}_{X_q X_q}\, W_{qk,i}\, R^{1/2}_{X_{kv} X_{kv}}\right) \left(R^{1/2}_{X_{kv} X_{kv}}\right)^{-1}$
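
The quoted closed form for the QK component can be checked numerically. The sketch below is a minimal NumPy rendering of that formula, reading $\mathrm{SVD}_r$ as the rank-$r$ truncated SVD and taking the objective to be the weighted Frobenius norm that the formula implies. The function names, the random calibration data, and the eigenvalue floor `eps` are illustrative assumptions, not details from the paper.

```python
import numpy as np


def sqrt_and_inv_sqrt(R, eps=1e-10):
    """Symmetric square root and inverse square root of a symmetric PSD matrix."""
    w, V = np.linalg.eigh(R)
    w = np.clip(w, eps, None)  # guard against tiny or negative eigenvalues
    return (V * np.sqrt(w)) @ V.T, (V / np.sqrt(w)) @ V.T


def svd_r(M, r):
    """Best rank-r approximation of M in Frobenius norm (truncated SVD)."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * S[:r]) @ Vt[:r]


def a3_qk(W_qk, R_q, R_kv, r):
    """Whiten on both sides, truncate, then undo the whitening."""
    Rq_half, Rq_inv_half = sqrt_and_inv_sqrt(R_q)
    Rkv_half, Rkv_inv_half = sqrt_and_inv_sqrt(R_kv)
    return Rq_inv_half @ svd_r(Rq_half @ W_qk @ Rkv_half, r) @ Rkv_inv_half


def weighted_loss(W, W_tilde, R_q, R_kv):
    """|| R_q^{1/2} (W - W_tilde) R_kv^{1/2} ||_F^2"""
    Rq_half, _ = sqrt_and_inv_sqrt(R_q)
    Rkv_half, _ = sqrt_and_inv_sqrt(R_kv)
    return np.linalg.norm(Rq_half @ (W - W_tilde) @ Rkv_half) ** 2


# Hypothetical calibration data with deliberately uneven channel scales.
rng = np.random.default_rng(0)
d, n, r = 16, 512, 4
X_q = rng.standard_normal((n, d)) * np.linspace(0.1, 3.0, d)
X_kv = rng.standard_normal((n, d)) * np.linspace(3.0, 0.1, d)
R_q, R_kv = X_q.T @ X_q / n, X_kv.T @ X_kv / n
W_qk = rng.standard_normal((d, d))

W_a3 = a3_qk(W_qk, R_q, R_kv, r)
loss_a3 = weighted_loss(W_qk, W_a3, R_q, R_kv)
loss_plain = weighted_loss(W_qk, svd_r(W_qk, r), R_q, R_kv)
print(loss_a3, loss_plain)
```

On this objective the whitened solution cannot do worse than a plain truncated SVD of the weights, which is the sense in which the formula is optimal. How much better it does depends on how anisotropic the autocorrelation matrices are.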

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 9 internal anchors

[1] Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245.

[2] Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos, Razvan Pascanu, and Petar Veličković. Round and round we go! What makes rotary positional encodings useful? arXiv preprint arXiv:2410.06205.

[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

[4] Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed S Abdelfattah, and Kai-Chiang Wu. Palu: Compressing KV-cache with low-rank projection. arXiv preprint arXiv:2407.21118.

[5] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021a. Patrick Chen, Hsiang-Fu Yu, Inderjit Dhillon, and Cho-Jui Hsieh. Drone: Data-aware low-rank compression ...

[6] Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. URL https://zenodo.org/records/12608602. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[7] Yen-Chang Hsu, Ting Hua, Sungen Chang, Qian Lou, Yilin Shen, and Hongxia Jin. Language model compression with weighted low-rank factorization. arXiv preprint arXiv:2207.00112.

[8] Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, and Tao Gui. Towards economical inference: Enabling DeepSeek's multi-head latent attention in any transformer-based LLMs. arXiv preprint arXiv:2502.14837.

[9] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.

[10] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

[11] Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150.

[12] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202.

[13] Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Zhengzhong Liu, Hongyi Wang, Bowen Tan, Joel Hestness, Natalia Vassilieva, Daria Soboleva, et al. SlimPajama-DC: Understanding data combinations for LLM training. arXiv preprint arXiv:2309.10818.

[14] Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695.

[15] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay B...

[16] Xin Wang, Yu Zheng, Zhongwei Wan, and Mi Zhang. SVD-LLM: Truncation-aware singular value decomposition for large language model compression. arXiv preprint arXiv:2403.07378.

[17] Xin Wang, Samiul Alam, Zhongwei Wan, Hui Shen, and Mi Zhang. SVD-LLM V2: Optimizing singular value truncation for large language model compression. arXiv preprint arXiv:2503.12340.

[18] Zhihang Yuan, Yuzhang Shang, Yue Song, Qiang Wu, Yan Yan, and Guangyu Sun. ASVD: Activation-aware singular value decomposition for compressing large language models. arXiv preprint arXiv:2312.05821.

[19] Cheng Zhang, Jianyi Cheng, George A Constantinides, and Yiren Zhao. LQER: Low-rank quantization error reconstruction for LLMs. arXiv preprint arXiv:2402.02446, 2024a. Cheng Zhang, Jeffrey TH Wong, Can Xiao, George A Constantinides, and Yiren Zhao. QERA: an analytical framework for quantization error reconstruction. arXiv preprint arXiv:2410.06040, 2024b....
[20] Proof. We continue with Equation 35:

$$\operatorname*{argmin}_{\widetilde{W}_{vo,i}} \; \mathbb{E}_{p_i \sim P_i}\left\{ \left\| p_i \left(W_{vo,i} - \widetilde{W}_{vo,i}\right) \right\|_F^2 \right\} \quad \text{s.t.} \quad \operatorname{rank}(\widetilde{W}_{vo,i}) = r \;\;\Rightarrow\;\; \operatorname*{argmin}_{\widetilde{W}_{vo,i}} \; \left\| R^{1/2}_{X_{p_i} X_{p_i}} \left(W_{vo,i} - \widetilde{W}_{vo,i}\right) \right\|_F^2 . \tag{36}$$

Note that multiplication by the invertible matrix $R_{X_{p_i} X_{p_i}}$ does not change the rank of the matrix $W_{vo,i}$. According to the Eckart-Young-Mirsky theorem [Eckart and Young, 1936b], the optimal rank ra...
[21] SlimPajama is a pretraining dataset of high-quality corpus, better capturing the statistics of auto-correlation than WikiText2. We calibrate the auto-correlation matrix using BF16 models, but accumulate the outer product in FP64. Approximation: Since the autocorrelation matrix is symmetric and positive semi-definite, we used SVD to calculate its inverse and...
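
The two excerpts above ([20] and [21]) describe a calibration recipe and a weighted rank-constrained problem for the OV component. The sketch below strings them together under stated assumptions: NumPy has no BF16 type, so `float16` stands in for the low-precision activations, and the closed form in `weighted_rank_r` is inferred from the Eckart-Young-Mirsky argument because the excerpt is truncated before the final expression. All function names and sizes are hypothetical.

```python
import numpy as np


def accumulate_autocorrelation(batches):
    """Mean of x^T x over all rows, with the outer products accumulated in float64."""
    total, count = None, 0
    for X in batches:
        X64 = X.astype(np.float64)           # upcast before forming the outer product
        outer = X64.T @ X64
        total = outer if total is None else total + outer
        count += X64.shape[0]
    return total / count


def sqrt_and_inverse_sqrt(R, eps=1e-10):
    """For symmetric PSD R the SVD has the form U diag(S) U^T, so powers act on S."""
    U, S, _ = np.linalg.svd(R)
    S = np.maximum(S, eps)                    # floor tiny singular values before inverting
    return (U * np.sqrt(S)) @ U.T, (U / np.sqrt(S)) @ U.T


def weighted_rank_r(W, R, r):
    """Inferred minimizer of || R^{1/2} (W - W_tilde) ||_F^2 over rank-r W_tilde."""
    root, inv_root = sqrt_and_inverse_sqrt(R)
    U, S, Vt = np.linalg.svd(root @ W, full_matrices=False)
    return inv_root @ ((U[:, :r] * S[:r]) @ Vt[:r])


rng = np.random.default_rng(0)
d, r = 12, 3                                  # hypothetical sizes
scales = np.linspace(0.2, 2.0, d)
batches = [(rng.standard_normal((64, d)) * scales).astype(np.float16) for _ in range(3)]

R = accumulate_autocorrelation(batches)
root, inv_root = sqrt_and_inverse_sqrt(R)
W_vo = rng.standard_normal((d, d))
W_tilde = weighted_rank_r(W_vo, R, r)

U, S, Vt = np.linalg.svd(W_vo)
W_plain = (U[:, :r] * S[:r]) @ Vt[:r]         # unweighted truncated SVD for comparison
loss_weighted = np.linalg.norm(root @ (W_vo - W_tilde)) ** 2
loss_plain = np.linalg.norm(root @ (W_vo - W_plain)) ** 2
print(loss_weighted, loss_plain)
```

Accumulating in float64 matters because the sum runs over many tokens and the result is later inverted, so rounding error in the small eigenvalues would be amplified by the inverse square root.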
