arxiv: 2405.16406 · v4 · submitted 2024-05-26 · 💻 cs.LG · cs.AI· cs.CL· cs.CV

Recognition: 3 theorem links

· Lean Theorem

SpinQuant: LLM quantization with learned rotations

Zechun Liu , Changsheng Zhao , Igor Fedorov , Bilge Soran , Dhruv Choudhary , Raghuraman Krishnamoorthi , Vikas Chandra , Yuandong Tian

show 1 more author

Tijmen Blankevoort

Authors on Pith no claims yet

Pith reviewed 2026-05-15 15:47 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CLcs.CV

keywords LLM quantizationpost-training quantizationrotation matricesoutlier removal4-bit quantizationKV cachezero-shot reasoningLLaMA models

0 comments

The pith

SpinQuant learns rotation matrices to quantize LLM weights, activations, and KV cache to 4 bits while keeping outputs identical in full precision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that certain rotations of activation and weight matrices leave Transformer outputs unchanged in full precision yet remove outliers that cause large quantization errors. SpinQuant learns the best such rotations from calibration data instead of relying on random or fixed rotations, yielding substantially higher accuracy after 4-bit quantization. This approach matters because it makes memory-efficient inference practical for large models with only small losses in zero-shot reasoning performance.

Core claim

SpinQuant identifies rotation parameterizations that produce identical full-precision outputs but improve quantization accuracy, then learns optimal rotation matrices on calibration data; with 4-bit weights, activations, and KV cache this narrows the zero-shot accuracy gap to full precision to 2.9 points on LLaMA-2 7B, outperforming LLM-QAT by 19.1 points, SmoothQuant by 25.0 points, and random-rotation methods like QuaRot by up to 45.1 percent relative gap reduction on LLaMA-3 8B.

What carries the argument

Learned rotation matrices that preserve exact full-precision Transformer outputs while minimizing quantization error through outlier removal.

If this is right

4-bit KV-cache quantization becomes viable with only modest accuracy degradation on LLaMA models.
Zero-shot reasoning on LLaMA-2 7B loses just 2.9 points relative to full precision.
The method outperforms concurrent random-rotation baselines, closing up to 45.1 percent more of the accuracy gap on LLaMA-3 8B.
Some random rotations already improve quantization by up to 13 points over others, but learned rotations exceed all random choices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The existence of superior learned rotations suggests that outlier directions in activation statistics are structured and can be systematically aligned away from quantization axes.
Precomputed rotation matrices might be transferable across similar model families, reducing the need for per-model calibration at deployment time.
The same rotation-learning idea could be tested on other bit-widths or combined with existing calibration-free quantization techniques for further error reduction.

Load-bearing premise

Rotation matrices optimized on calibration data will generalize to preserve accuracy on diverse downstream tasks without introducing new errors.

What would settle it

If the learned rotations cause greater accuracy loss than random rotations on a held-out zero-shot task or new model, the generalization benefit would be falsified.

read the original abstract

Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when outliers are present. Rotating activation or weight matrices helps remove outliers and benefits quantization. In this work, we identify a collection of applicable rotation parameterizations that lead to identical outputs in full-precision Transformer architectures while enhancing quantization accuracy. In addition, we find that some random rotations lead to much better quantization than others, with an up to 13 points difference in downstream zero-shot reasoning performance. As a result, we propose SpinQuant, a novel approach that incorporates learned rotation matrices for optimal quantized network accuracy. With 4-bit quantization of weight, activation, and KV-cache, SpinQuant narrows the accuracy gap on zero-shot reasoning tasks with full precision to merely 2.9 points on the LLaMA-2 7B model, surpassing LLM-QAT by 19.1 points and SmoothQuant by 25.0 points. Furthermore, SpinQuant also outperforms concurrent work QuaRot, which applies random rotations to remove outliers. In particular, for LLaMA-3 8B models that are hard to quantize, SpinQuant reduces the gap to full precision by up to 45.1% relative to QuaRot. Code is available at https://github.com/facebookresearch/SpinQuant.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SpinQuant learns output-preserving rotations to improve 4-bit LLM quantization over random baselines, with solid reported gains but open questions on generalization from calibration data.

read the letter

SpinQuant learns rotation matrices to make 4-bit quantization of weights, activations, and KV cache work better on LLMs. The central idea is to find parameterizations of rotations that leave the full-precision Transformer output exactly the same, then optimize the rotation entries to reduce quantization error on a calibration set. This extends the random-rotation approach in concurrent QuaRot work by making the choice data-driven rather than fixed at random. Some random rotations already vary by up to 13 points on downstream tasks, so learning them is a reasonable next step. On LLaMA-2 7B the method closes the zero-shot gap to full precision to 2.9 points and beats SmoothQuant and LLM-QAT by large margins; similar relative gains appear on LLaMA-3 8B. Code is released, which is useful for checking the implementation. The algebraic claim that certain rotations preserve exact outputs is straightforward once the parameterization is written down. The main soft spot is whether the learned rotations stay effective when activation statistics shift away from the calibration corpus. The optimization is non-convex and the rotation is applied uniformly, so a mismatch in covariance could reintroduce outliers that 4-bit quantization cannot handle. The abstract and reported numbers do not include ablations that freeze the rotation and swap the calibration data or evaluate on clearly out-of-distribution prompts, so it is not yet clear how much of the gain is robust versus tied to the specific training distribution. This work is aimed at people building efficient inference pipelines for large models. The results are concrete, the idea is cleanly motivated, and the code is public, so it is worth sending to a serious referee even if the generalization checks need strengthening.

Referee Report

1 major / 2 minor

Summary. The paper introduces SpinQuant, a post-training quantization technique for LLMs that learns rotation matrices to mitigate outliers in weights, activations, and KV-cache while preserving exact full-precision outputs in the unquantized Transformer. It reports that 4-bit quantization of all three components narrows the zero-shot accuracy gap to full precision to 2.9 points on LLaMA-2 7B, outperforming LLM-QAT by 19.1 points and SmoothQuant by 25.0 points, and also improves upon concurrent random-rotation work QuaRot (up to 45.1% relative gap reduction on LLaMA-3 8B).

Significance. If the learned rotations prove robust, the method offers a practical route to aggressive 4-bit quantization of both weights and activations without task-specific fine-tuning, which could materially lower memory and latency for LLM inference. The algebraic invariance of the rotations and the empirical gains over strong baselines are the primary strengths.

major comments (1)

[§4.2] §4.2 and Table 1: the headline claim that SpinQuant narrows the gap to 2.9 points rests on rotations learned from a fixed calibration corpus; the manuscript provides no ablation that freezes the learned R and substitutes a different calibration set (e.g., C4 vs. WikiText) or evaluates on out-of-distribution prompts. Without this, it remains possible that the reported 19-point and 25-point margins are partly artifacts of calibration-test alignment rather than a general property of the rotation parameterization.

minor comments (2)

[§3.1] §3.1: the enumeration of rotation parameterizations that preserve full-precision outputs would be clearer if accompanied by a short proof sketch or reference to the relevant algebraic property (orthogonal matrices with determinant 1).
[§4.1] §4.1: the optimization procedure for learning the rotation matrices (learning rate schedule, number of steps, batch size) is described only at high level; explicit hyperparameters would aid reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed review and the constructive comment regarding calibration robustness. We address the point below and will incorporate the requested ablation in the revised manuscript.

read point-by-point responses

Referee: [§4.2] §4.2 and Table 1: the headline claim that SpinQuant narrows the gap to 2.9 points rests on rotations learned from a fixed calibration corpus; the manuscript provides no ablation that freezes the learned R and substitutes a different calibration set (e.g., C4 vs. WikiText) or evaluates on out-of-distribution prompts. Without this, it remains possible that the reported 19-point and 25-point margins are partly artifacts of calibration-test alignment rather than a general property of the rotation parameterization.

Authors: We agree that an explicit cross-calibration ablation would strengthen the claim that the learned rotations capture general properties of the model rather than calibration-specific statistics. The current experiments follow the standard PTQ protocol (128 samples drawn from the C4 corpus) used by SmoothQuant, LLM-QAT, and QuaRot to ensure fair comparison. In the revised manuscript we will add a dedicated ablation that (i) learns R on C4 and freezes it for evaluation on WikiText, (ii) learns R on WikiText and evaluates on C4, and (iii) tests the frozen rotations on out-of-distribution prompts drawn from a held-out domain. These results will be reported alongside the original Table 1 numbers so readers can directly assess sensitivity to calibration choice. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in SpinQuant derivation

full rationale

The paper's core chain begins with the algebraic observation that orthogonal rotation matrices preserve exact full-precision Transformer outputs (via R^T R = I), which is a standard linear-algebra fact independent of any learned parameters or data. Rotations are then optimized on a fixed calibration corpus to reduce quantization error; the resulting matrices are applied to the model and evaluated empirically on separate zero-shot benchmarks. This evaluation measures actual downstream accuracy rather than deriving it by construction from the calibration objective. No self-citation chain, definitional loop, or renaming of a fitted quantity as a prediction appears in the load-bearing steps. The method therefore remains self-contained against external baselines (SmoothQuant, LLM-QAT, QuaRot) and does not reduce to its inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the algebraic fact that certain rotations leave Transformer outputs unchanged and on the empirical assumption that data-driven optimization of those rotations yields better quantization than random choices.

free parameters (1)

learned rotation matrix entries
Parameters optimized on calibration data to minimize quantization error while preserving full-precision behavior.

axioms (1)

domain assumption There exist rotation parameterizations that produce identical full-precision Transformer outputs.
Stated directly in the abstract as the basis for applying rotations without changing model behavior.

pith-pipeline@v0.9.0 · 5597 in / 1183 out tokens · 47914 ms · 2026-05-15T15:47:00.210125+00:00 · methodology

discussion (0)

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
cs.PF 2026-05 unverdicted novelty 7.0

A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.
Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling
cs.AI 2026-04 unverdicted novelty 7.0

Unstructured pruning augments test-time scaling reasoning performance in LLMs and can outperform the unpruned model on benchmarks, contrary to expectations from structured pruning studies.
Coverage-Based Calibration for Post-Training Quantization via Weighted Set Cover over Outlier Channels
cs.LG 2026-04 conditional novelty 7.0

COVERCAL selects PTQ calibration samples via weighted set cover over outlier channels, with a stylized clipping model showing missed coverage upper-bounds surrogate loss, yielding gains over random and other baselines...
Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales
cs.LG 2026-04 unverdicted novelty 7.0

High-variance activation directions are uncorrelated with predictions, transformer blocks grow more linear with depth, and single-block linear replacement yields 34x compression on Mistral's final block at a 1.71 perp...
Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
cs.CL 2026-05 unverdicted novelty 6.0

Extremely quantized LLMs degrade in smoothness, sparsifying the decoding tree and hurting generation quality; a smoothness-preserving principle delivers gains beyond numerical fitting.
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
cs.LG 2026-05 unverdicted novelty 6.0

OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
cs.LG 2026-05 unverdicted novelty 6.0

OSAQ uses the low-rank structure of the Hessian to construct a closed-form additive weight transformation that suppresses outliers without changing task loss, enabling better low-bit LLM quantization.
Statistically-Lossless Quantization of Large Language Models
cs.LG 2026-05 unverdicted novelty 6.0

SLQ achieves task-lossless LLM quantization below 4 bits per parameter and distribution-lossless at 5-6 bits on average, with 1.7-3.6x speedups over FP16.
Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization
cs.LG 2026-04 unverdicted novelty 6.0

ARHQ isolates error-sensitive weight directions in LLMs via truncated SVD on the scaled matrix W G_x^{1/2} from activation residuals, improving SNR and preserving performance under aggressive low-bit quantization.
CoQuant: Joint Weight-Activation Subspace Projection for Mixed-Precision LLMs
cs.LG 2026-04 unverdicted novelty 6.0

CoQuant selects optimal high-precision subspaces for mixed-precision LLM quantization via a closed-form weighted PCA that balances weight and activation covariances derived from expected output error.
QuantClaw: Precision Where It Matters for OpenClaw
cs.AI 2026-04 unverdicted novelty 6.0

QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.
MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
cs.LG 2026-04 unverdicted novelty 6.0

MCAP uses load-time Monte Carlo profiling to estimate layer importance, enabling dynamic quantization (W4A8 vs W4A16) and memory tiering (GPU/RAM/SSD) that delivers 1.5-1.8x higher decode throughput than llama-cpp Q4_...
From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization
cs.CL 2026-04 unverdicted novelty 6.0

LLM 2-bit quantization fails via either cumulative signal degradation or early computation collapse in key components.
GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling
cs.CL 2026-04 unverdicted novelty 6.0

GSQ applies a Gumbel-Softmax relaxation to learn discrete grid assignments in scalar quantization, closing most of the accuracy gap to vector methods like QTIP on Llama-3.1 models at 2-3 bits while using only symmetri...
Rethinking Residual Errors in Compensation-based LLM Quantization
cs.LG 2026-04 conditional novelty 6.0

Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.
AdaHOP: Fast and Accurate Low-Precision Training via Outlier-Pattern-Aware Rotation
cs.LG 2026-04 unverdicted novelty 6.0

AdaHOP applies pattern-aware Hadamard transforms and selective outlier extraction to enable from-scratch MXFP4 training of LLMs at BF16 quality with up to 3.6X memory compression and 1.46X speedup.
31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding
cs.AR 2026-05 unverdicted novelty 5.0

A ReRAM-on-logic stacked chip delivers 14.08-135.69 tokens/s LLM inference with block-clustered compression and adaptive parallel speculative decoding, yielding 4.46-7.17x speedup over standard methods.
ConFu: Contemplate the Future for Better Speculative Sampling
cs.CL 2026-03 unverdicted novelty 5.0

ConFu boosts speculative decoding acceptance rates 8-20% over EAGLE-3 by letting draft models use contemplate tokens and MoE to anticipate future generation direction.
On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks
cs.LG 2026-04 unverdicted novelty 4.0

Diffusion coding model CoDA shows smaller accuracy drops than Qwen3-1.7B under 2-4 bit quantization on HumanEval and MBPP.
A Survey on Efficient Inference for Large Language Models
cs.CL 2024-04 accept novelty 3.0

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · cited by 19 Pith papers · 14 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044,

work page internal anchor Pith review Pith/arXiv arXiv 1905
[3]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Extreme compression of large language models via additive quantization

Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. Extreme compression of large language models via additive quantization. arXiv preprint arXiv:2401.06118,

work page arXiv
[5]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Slim-llm: Salience-driven mixed-precision quantization for large language models

Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Xianglong Liu, Luca Benini, Michele Magno, and Xiaojuan Qi. Slim-llm: Salience-driven mixed-precision quantization for large language models. arXiv preprint arXiv:2405.14917,

work page arXiv
[7]

Mistral 7B

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Squeezellm: Dense-and-sparse quantization

Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W Mahoney, and Kurt Keutzer. Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629,

work page arXiv
[9]

Quantizing deep convolutional networks for efficient inference: A whitepaper

11 Published as a conference paper at ICLR 2025 Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342,

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Efficient riemannian optimization on the stiefel manifold via the cayley transform

Jun Li, Li Fuxin, and Sinisa Todorovic. Efficient riemannian optimization on the stiefel manifold via the cayley transform. arXiv preprint arXiv:2002.01113,

work page arXiv 2002
[11]

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Qllm: Accurate and efficient low-bitwidth quantization for large language models

Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, and Bohan Zhuang. Qllm: Accurate and efficient low-bitwidth quantization for large language models. arXiv preprint arXiv:2310.08041, 2023a. Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, and Kwang-Ting Cheng. Llm-fp4: 4-bit floating- point quantized transformers. arXiv preprint arXiv:2...

work page arXiv
[13]

Pointer Sentinel Mixture Models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Code Llama: Open Foundation Models for Code

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

SocialIQA: Commonsense Reasoning about Social Interactions

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728,

work page internal anchor Pith review Pith/arXiv arXiv 1904
[17]

Omniquant: Omnidirectionally calibrated quantization for large language models

Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137,

work page arXiv
[18]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Large language models in medicine

12 Published as a conference paper at ICLR 2025 Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature medicine, 29(8):1930–1940,

work page 2025
[20]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay ...

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Outlier suppression: Pushing the limit of low-bit transformer language models

Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, and Xianglong Liu. Outlier suppression: Pushing the limit of low-bit transformer language models. arXiv preprint arXiv:2209.13325,

work page arXiv
[22]

Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling

Xiuying Wei, Yunchen Zhang, Yuhang Li, Xiangguo Zhang, Ruihao Gong, Jinyang Guo, and Xianglong Liu. Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling. arXiv preprint arXiv:2304.09145,

work page arXiv
[23]

HellaSwag: Can a Machine Really Finish Your Sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830,

work page internal anchor Pith review Pith/arXiv arXiv 1905
[24]

Atom: Low-bit quantization for efficient and accurate llm serving

Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. Atom: Low-bit quantization for efficient and accurate llm serving. arXiv preprint arXiv:2310.19102,

work page arXiv
[25]

13 Published as a conference paper at ICLR 2025 A A PPENDIX / SUPPLEMENTAL MATERIAL A.1 C OMPLETE RESULTS OF MAIN RESULT TABLE In Tables 7, 8 and 9, we show the complete results of Table

work page 2025
[26]

We compare the accuracy on eight zero-shot commonsense reasoning tasks including ARC-easy, ARC-challenge (Clark et al., 2018), BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), OBQA (Mihaylov et al., 2018), and WinoGrande (Sakaguchi et al.,

work page 2018
[27]

as well as the perplexity score on WikiText2 testset (Merity et al., 2016). We compare our results with previous works including SmoothQuant(Xiao et al., 2022), LLM-QAT(Liu et al., 2023c), GPTQ (Frantar et al., 2022), OmniQuant (Shao et al., 2023), QuIP# (Tseng et al., 2024). A.2 R ESULTS ON 3-BIT WEIGHT QUANTIZATION We present the 3-bit weight and 8-bit ...

work page 2016
[28]

A.3 Cayley OPTIMIZATION CHOICE In Table 11, we evaluate the impact of varying the number of samples and iterations used in Cay- ley optimization

Our method, SpinQuant, successfully reduces the gap to the full-precision network from the previous 9.0 −28.0 points to 1.2 −5.3 points, demonstrating its effectiveness for low-bit quantization. A.3 Cayley OPTIMIZATION CHOICE In Table 11, we evaluate the impact of varying the number of samples and iterations used in Cay- ley optimization. Given the limite...

work page 2023
[29]

The results in Table 13 reflect that using C4 datasets yields consistent results with utilizing the Wiki dataset, showing that SpinQuant is robust to calibration data choice

as calibration data and performe experiments on the LLaMA-2 7B model. The results in Table 13 reflect that using C4 datasets yields consistent results with utilizing the Wiki dataset, showing that SpinQuant is robust to calibration data choice. A.6 L ATENCY MEASUREMENT ON GPU In light of the available Tensor cores in NVIDIA’s Hopper (H100) architecture, w...

work page 2025
[30]

When implemented meticulously, SpinQuant with Hadamard rotation sees marginal difference in the latency compared to without Hadamard rotation. 16 Published as a conference paper at ICLR 2025 Table 9: Complete comparison of the perplexity score on WikiText2 and averaged accuracy on Zero-shot Common Sense Reasoning tasks on Mistral-7B-v0.3. #Bits Method ARC...

work page 2025
[31]

SpinQuant W4A8 quantized models demonstrate significant improvements in 5-shot accuracy on the MMLU benchmark and 1-shot rouge score on the TLDR9 summarization benchmark

We present the results for few-shot learning scenarios. SpinQuant W4A8 quantized models demonstrate significant improvements in 5-shot accuracy on the MMLU benchmark and 1-shot rouge score on the TLDR9 summarization benchmark. It significantly closed the gap to the BF16 baseline. B A NALYSIS B.1 G RADIENT ANALYSIS On the one hand, we have shown that the c...

work page 2025
[32]

19 Published as a conference paper at ICLR 2025 Table 16: Ablation study on SpinQuant combined with RTN or GPTQ in the W4A4KV16 quantization scenario

The process of optimizing R rotates the residual stream basis such as to prioritize improving the SNR of such layers, possibly at the cost of hurting less important layers. 19 Published as a conference paper at ICLR 2025 Table 16: Ablation study on SpinQuant combined with RTN or GPTQ in the W4A4KV16 quantization scenario. LLaMA-3 8B LLaMA-2 7B LLaMA-2 13B...

work page 2025
[33]

Additionally, we make an interesting observation: in several activation layers, the first token displays substantial values in multiple channels

Overall, after rotation, the extreme values are attenuated, and the distribution exhibits no noteworthy outliers across the token dimension. Additionally, we make an interesting observation: in several activation layers, the first token displays substantial values in multiple channels. After rotation, this outlier is distributed across all channels of the...

work page 2025