SpinQuant: LLM quantization with learned rotations
Pith reviewed 2026-05-15 15:47 UTC · model grok-4.3
The pith
SpinQuant learns rotation matrices that let LLM weights, activations, and KV cache be quantized to 4 bits; the rotations themselves leave the full-precision network's outputs unchanged.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SpinQuant identifies rotation parameterizations that leave full-precision outputs unchanged but improve quantization accuracy, then learns the rotation matrices on calibration data. With 4-bit weights, activations, and KV cache, this narrows the zero-shot accuracy gap to full precision to 2.9 points on LLaMA-2 7B, surpassing LLM-QAT by 19.1 points and SmoothQuant by 25.0 points. On LLaMA-3 8B, it reduces the gap to full precision by up to 45.1 percent relative to the random-rotation method QuaRot.
What carries the argument
Learned rotation matrices that preserve exact full-precision Transformer outputs while minimizing quantization error through outlier removal.
If this is right
- 4-bit KV-cache quantization becomes viable with only modest accuracy degradation on LLaMA models.
- Zero-shot reasoning on LLaMA-2 7B loses just 2.9 points relative to full precision.
- The method outperforms the concurrent random-rotation baseline QuaRot, reducing the gap to full precision by up to 45.1 percent relative to QuaRot on LLaMA-3 8B.
- Random rotations differ from one another by up to 13 points in downstream zero-shot accuracy, which motivates learning the rotation instead of sampling it.
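The 45.1 percent figure is a relative quantity and is easy to misread as percentage points. One reading consistent with the abstract's wording is sketched below; it is an assumption of this note, not a definition quoted from the paper. The symbols A_FP, A_QuaRot, and A_SpinQuant (accuracy of the full-precision, QuaRot-quantized, and SpinQuant-quantized models) are introduced here only for illustration.

```latex
\[
g_{\text{QuaRot}} = A_{\text{FP}} - A_{\text{QuaRot}}, \qquad
g_{\text{SpinQuant}} = A_{\text{FP}} - A_{\text{SpinQuant}}, \qquad
\text{relative reduction} = \frac{g_{\text{QuaRot}} - g_{\text{SpinQuant}}}{g_{\text{QuaRot}}}
\]
```

Under this reading, a relative reduction of 0.451 means that SpinQuant's remaining gap is 54.9 percent of QuaRot's remaining gap on the same model.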
Where Pith is reading between the lines
- The existence of superior learned rotations suggests that outlier directions in activation statistics are structured and can be systematically aligned away from quantization axes.
- Precomputed rotation matrices might be transferable across similar model families, reducing the need for per-model calibration at deployment time.
- The same rotation-learning idea could be tested on other bit-widths or combined with existing calibration-free quantization techniques for further error reduction.
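The first inference above, that rotation moves outlier energy away from the quantization axes, can be illustrated with a toy experiment. The sketch below is not the paper's pipeline. The activation matrix, the single outlier channel, the per-tensor symmetric round-to-nearest quantizer `quantize_int4`, and the helper `random_orthogonal` are all assumptions made for illustration, and the rotation is random, not learned.

```python
import numpy as np

def quantize_int4(x):
    """Symmetric per-tensor 4-bit round-to-nearest, returned as floats (fake quantization)."""
    scale = np.abs(x).max() / 7.0          # map the largest magnitude to level 7
    return np.clip(np.round(x / scale), -8, 7) * scale

def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))         # flip column signs; q stays orthogonal

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 256))         # toy activations: 64 tokens, 256 channels
x[:, 3] += 50.0                            # one hypothetical outlier channel

rot = random_orthogonal(256, rng)
err_plain = np.linalg.norm(quantize_int4(x) - x)
# Quantize in the rotated basis, then rotate back to compare in the original basis.
err_rotated = np.linalg.norm(quantize_int4(x @ rot) @ rot.T - x)

print(f"error without rotation: {err_plain:.2f}")
print(f"error with rotation:    {err_rotated:.2f}")
```

In this toy setting the outlier channel forces a coarse quantization step on every other channel, while the rotated tensor has a much smaller dynamic range, so its round-trip error is lower. Whether the same holds for a given real layer is an empirical question the sketch does not answer.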
Load-bearing premise
Rotation matrices optimized on calibration data will generalize to preserve accuracy on diverse downstream tasks without introducing new errors.
What would settle it
If the learned rotations cause greater accuracy loss than random rotations on a held-out zero-shot task or new model, the generalization benefit would be falsified.
Original abstract
Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when outliers are present. Rotating activation or weight matrices helps remove outliers and benefits quantization. In this work, we identify a collection of applicable rotation parameterizations that lead to identical outputs in full-precision Transformer architectures while enhancing quantization accuracy. In addition, we find that some random rotations lead to much better quantization than others, with an up to 13 points difference in downstream zero-shot reasoning performance. As a result, we propose SpinQuant, a novel approach that incorporates learned rotation matrices for optimal quantized network accuracy. With 4-bit quantization of weight, activation, and KV-cache, SpinQuant narrows the accuracy gap on zero-shot reasoning tasks with full precision to merely 2.9 points on the LLaMA-2 7B model, surpassing LLM-QAT by 19.1 points and SmoothQuant by 25.0 points. Furthermore, SpinQuant also outperforms concurrent work QuaRot, which applies random rotations to remove outliers. In particular, for LLaMA-3 8B models that are hard to quantize, SpinQuant reduces the gap to full precision by up to 45.1% relative to QuaRot. Code is available at https://github.com/facebookresearch/SpinQuant.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SpinQuant, a post-training quantization technique for LLMs that learns rotation matrices to mitigate outliers in weights, activations, and KV-cache while preserving exact full-precision outputs in the unquantized Transformer. It reports that 4-bit quantization of all three components narrows the zero-shot accuracy gap to full precision to 2.9 points on LLaMA-2 7B, outperforming LLM-QAT by 19.1 points and SmoothQuant by 25.0 points, and also improves upon concurrent random-rotation work QuaRot (up to 45.1% relative gap reduction on LLaMA-3 8B).
Significance. If the learned rotations prove robust, the method offers a practical route to aggressive 4-bit quantization of both weights and activations without task-specific fine-tuning, which could materially lower memory and latency for LLM inference. The algebraic invariance of the rotations and the empirical gains over strong baselines are the primary strengths.
major comments (1)
- [§4.2] §4.2 and Table 1: the headline claim that SpinQuant narrows the gap to 2.9 points rests on rotations learned from a fixed calibration corpus; the manuscript provides no ablation that freezes the learned R and substitutes a different calibration set (e.g., C4 vs. WikiText) or evaluates on out-of-distribution prompts. Without this, it remains possible that the reported 19-point and 25-point margins are partly artifacts of calibration-test alignment rather than a general property of the rotation parameterization.
minor comments (2)
- [§3.1] §3.1: the enumeration of rotation parameterizations that preserve full-precision outputs would be clearer if accompanied by a short proof sketch or reference to the relevant algebraic property (orthogonality, i.e. R^T R = I).
- [§4.1] §4.1: the optimization procedure for learning the rotation matrices (learning rate schedule, number of steps, batch size) is described only at a high level; explicit hyperparameters would aid reproducibility.
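For readers unfamiliar with orthogonality-preserving updates of the kind the second minor comment asks about, the following is a minimal sketch of a Cayley-transform descent step. It is an illustration under stated assumptions, not the paper's procedure: the surrogate loss (mean fourth power of the rotated activations), the normalized step, the step size 0.02, the 200 iterations, and the names `cayley_step` and `loss_and_grad` are all hypothetical choices made here.

```python
import numpy as np

def cayley_step(rot, grad, step):
    """One descent step that keeps `rot` orthogonal (Cayley-transform update)."""
    eye = np.eye(rot.shape[0])
    skew = grad @ rot.T - rot @ grad.T              # skew-symmetric descent direction
    skew /= np.linalg.norm(skew) + 1e-12            # normalized step: a choice for this toy
    # Solve (I + step/2 * skew) @ new_rot = (I - step/2 * skew) @ rot
    return np.linalg.solve(eye + 0.5 * step * skew, (eye - 0.5 * step * skew) @ rot)

def loss_and_grad(rot, x):
    """Hypothetical surrogate: mean fourth power of rotated activations."""
    y = x @ rot
    return np.mean(y ** 4), x.T @ (4.0 * y ** 3) / y.size

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 16))                  # toy calibration activations
x[:, 0] *= 10.0                                     # hypothetical outlier channel

rot, _ = np.linalg.qr(rng.standard_normal((16, 16)))  # random orthogonal start
initial_loss, _ = loss_and_grad(rot, x)
for _ in range(200):
    _, grad = loss_and_grad(rot, x)
    rot = cayley_step(rot, grad, step=0.02)
final_loss, _ = loss_and_grad(rot, x)

orth_error = np.abs(rot.T @ rot - np.eye(16)).max()
print(initial_loss, final_loss, orth_error)
```

The point of the construction is that the Cayley transform of a skew-symmetric matrix is itself orthogonal, so every iterate stays orthogonal without a separate re-projection step. The hyperparameters the referee asks for would replace the placeholder values used here.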
Simulated Author's Rebuttal
We thank the referee for the detailed review and the constructive comment regarding calibration robustness. We address the point below and will incorporate the requested ablation in the revised manuscript.
Point-by-point responses
-
Referee: [§4.2] §4.2 and Table 1: the headline claim that SpinQuant narrows the gap to 2.9 points rests on rotations learned from a fixed calibration corpus; the manuscript provides no ablation that freezes the learned R and substitutes a different calibration set (e.g., C4 vs. WikiText) or evaluates on out-of-distribution prompts. Without this, it remains possible that the reported 19-point and 25-point margins are partly artifacts of calibration-test alignment rather than a general property of the rotation parameterization.
Authors: We agree that an explicit cross-calibration ablation would strengthen the claim that the learned rotations capture general properties of the model rather than calibration-specific statistics. The current experiments follow the standard PTQ protocol (128 samples drawn from the C4 corpus) used by SmoothQuant, LLM-QAT, and QuaRot to ensure fair comparison. In the revised manuscript we will add a dedicated ablation that (i) learns R on C4 and freezes it for evaluation on WikiText, (ii) learns R on WikiText and evaluates on C4, and (iii) tests the frozen rotations on out-of-distribution prompts drawn from a held-out domain. These results will be reported alongside the original Table 1 numbers so readers can directly assess sensitivity to calibration choice.
revision: yes
Circularity Check
No significant circularity detected in SpinQuant derivation
full rationale
The paper's core chain begins with the algebraic observation that orthogonal rotation matrices preserve exact full-precision Transformer outputs (via R^T R = I), which is a standard linear-algebra fact independent of any learned parameters or data. Rotations are then optimized on a fixed calibration corpus to reduce quantization error; the resulting matrices are applied to the model and evaluated empirically on separate zero-shot benchmarks. This evaluation measures actual downstream accuracy rather than deriving it by construction from the calibration objective. No self-citation chain, definitional loop, or renaming of a fitted quantity as a prediction appears in the load-bearing steps. The method therefore remains self-contained against external baselines (SmoothQuant, LLM-QAT, QuaRot) and does not reduce to its inputs.
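The algebraic fact cited in the rationale, that R^T R = I makes an inserted rotation cancel in full precision, can be checked numerically. This is a minimal sketch for a single linear map; the shapes and variable names are assumptions, and it does not cover where rotations can be placed relative to normalization or nonlinear layers in a real Transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))       # toy activations
w = rng.standard_normal((8, 5))       # toy weight of a linear layer

# Orthogonal matrix from a QR decomposition: rot.T @ rot = I.
rot, _ = np.linalg.qr(rng.standard_normal((8, 8)))

y_ref = x @ w                         # original full-precision output
y_rot = (x @ rot) @ (rot.T @ w)       # rotate activations, counter-rotate weights

max_diff = np.abs(y_ref - y_rot).max()
print(max_diff)
```

The difference is at the level of floating-point rounding. Only orthogonality is used in the cancellation; the sign of the determinant plays no role.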
Axiom & Free-Parameter Ledger
free parameters (1)
- learned rotation matrix entries
axioms (1)
- domain assumption: There exist rotation parameterizations that produce identical full-precision Transformer outputs.
Forward citations
Cited by 20 Pith papers
-
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.
-
Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling
Unstructured pruning augments test-time scaling reasoning performance in LLMs and can outperform the unpruned model on benchmarks, contrary to expectations from structured pruning studies.
-
Coverage-Based Calibration for Post-Training Quantization via Weighted Set Cover over Outlier Channels
COVERCAL selects PTQ calibration samples via weighted set cover over outlier channels, with a stylized clipping model showing missed coverage upper-bounds surrogate loss, yielding gains over random and other baselines...
-
Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales
High-variance activation directions are uncorrelated with predictions, transformer blocks grow more linear with depth, and single-block linear replacement yields 34x compression on Mistral's final block at a 1.71 perp...
-
Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
Extremely quantized LLMs degrade in smoothness, sparsifying the decoding tree and hurting generation quality; a smoothness-preserving principle delivers gains beyond numerical fitting.
-
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.
-
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
OSAQ uses the low-rank structure of the Hessian to construct a closed-form additive weight transformation that suppresses outliers without changing task loss, enabling better low-bit LLM quantization.
-
Statistically-Lossless Quantization of Large Language Models
SLQ achieves task-lossless LLM quantization below 4 bits per parameter and distribution-lossless at 5-6 bits on average, with 1.7-3.6x speedups over FP16.
-
Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization
ARHQ isolates error-sensitive weight directions in LLMs via truncated SVD on the scaled matrix W G_x^{1/2} from activation residuals, improving SNR and preserving performance under aggressive low-bit quantization.
-
CoQuant: Joint Weight-Activation Subspace Projection for Mixed-Precision LLMs
CoQuant selects optimal high-precision subspaces for mixed-precision LLM quantization via a closed-form weighted PCA that balances weight and activation covariances derived from expected output error.
-
QuantClaw: Precision Where It Matters for OpenClaw
QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.
-
MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
MCAP uses load-time Monte Carlo profiling to estimate layer importance, enabling dynamic quantization (W4A8 vs W4A16) and memory tiering (GPU/RAM/SSD) that delivers 1.5-1.8x higher decode throughput than llama-cpp Q4_...
-
From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization
LLM 2-bit quantization fails via either cumulative signal degradation or early computation collapse in key components.
-
GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling
GSQ applies a Gumbel-Softmax relaxation to learn discrete grid assignments in scalar quantization, closing most of the accuracy gap to vector methods like QTIP on Llama-3.1 models at 2-3 bits while using only symmetri...
-
Rethinking Residual Errors in Compensation-based LLM Quantization
Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.
-
AdaHOP: Fast and Accurate Low-Precision Training via Outlier-Pattern-Aware Rotation
AdaHOP applies pattern-aware Hadamard transforms and selective outlier extraction to enable from-scratch MXFP4 training of LLMs at BF16 quality with up to 3.6X memory compression and 1.46X speedup.
-
31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding
A ReRAM-on-logic stacked chip delivers 14.08-135.69 tokens/s LLM inference with block-clustered compression and adaptive parallel speculative decoding, yielding 4.46-7.17x speedup over standard methods.
-
ConFu: Contemplate the Future for Better Speculative Sampling
ConFu boosts speculative decoding acceptance rates 8-20% over EAGLE-3 by letting draft models use contemplate tokens and MoE to anticipate future generation direction.
-
On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks
Diffusion coding model CoDA shows smaller accuracy drops than Qwen3-1.7B under 2-4 bit quantization on HumanEval and MBPP.
-
A Survey on Efficient Inference for Large Language Models
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,
-
[2]
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044,
-
[3]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,
-
[4]
Extreme compression of large language models via additive quantization
Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. Extreme compression of large language models via additive quantization. arXiv preprint arXiv:2401.06118,
-
[5]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323,
-
[6]
Slim-llm: Salience-driven mixed-precision quantization for large language models
Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Xianglong Liu, Luca Benini, Michele Magno, and Xiaojuan Qi. Slim-llm: Salience-driven mixed-precision quantization for large language models. arXiv preprint arXiv:2405.14917,
-
[7]
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825,
-
[8]
Squeezellm: Dense-and-sparse quantization
Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W Mahoney, and Kurt Keutzer. Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629,
-
[9]
Quantizing deep convolutional networks for efficient inference: A whitepaper
11 Published as a conference paper at ICLR 2025 Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342,
-
[10]
Efficient riemannian optimization on the stiefel manifold via the cayley transform
Jun Li, Li Fuxin, and Sinisa Todorovic. Efficient riemannian optimization on the stiefel manifold via the cayley transform. arXiv preprint arXiv:2002.01113,
-
[11]
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978,
-
[12]
Qllm: Accurate and efficient low-bitwidth quantization for large language models
Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, and Bohan Zhuang. Qllm: Accurate and efficient low-bitwidth quantization for large language models. arXiv preprint arXiv:2310.08041, 2023a. Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, and Kwang-Ting Cheng. Llm-fp4: 4-bit floating-point quantized transformers. arXiv preprint arXiv:2...
-
[13]
Pointer Sentinel Mixture Models
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843,
-
[14]
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789,
-
[15]
Code Llama: Open Foundation Models for Code
Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950,
-
[16]
SocialIQA: Commonsense Reasoning about Social Interactions
Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728,
-
[17]
Omniquant: Omnidirectionally calibrated quantization for large language models
Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137,
-
[18]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,
-
[19]
Large language models in medicine
Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature medicine, 29(8):1930–1940,
-
[20]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay ...
-
[21]
Outlier suppression: Pushing the limit of low-bit transformer language models
Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, and Xianglong Liu. Outlier suppression: Pushing the limit of low-bit transformer language models. arXiv preprint arXiv:2209.13325,
-
[22]
Xiuying Wei, Yunchen Zhang, Yuhang Li, Xiangguo Zhang, Ruihao Gong, Jinyang Guo, and Xianglong Liu. Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling. arXiv preprint arXiv:2304.09145,
-
[23]
HellaSwag: Can a Machine Really Finish Your Sentence?
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830,
-
[24]
Atom: Low-bit quantization for efficient and accurate llm serving
Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. Atom: Low-bit quantization for efficient and accurate llm serving. arXiv preprint arXiv:2310.19102,
-
[25]
A Appendix / Supplemental Material. A.1 Complete results of main result table. In Tables 7, 8 and 9, we show the complete results of Table
-
[26]
We compare the accuracy on eight zero-shot commonsense reasoning tasks including ARC-easy, ARC-challenge (Clark et al., 2018), BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), OBQA (Mihaylov et al., 2018), and WinoGrande (Sakaguchi et al.,
-
[27]
as well as the perplexity score on the WikiText2 test set (Merity et al., 2016). We compare our results with previous works including SmoothQuant (Xiao et al., 2022), LLM-QAT (Liu et al., 2023c), GPTQ (Frantar et al., 2022), OmniQuant (Shao et al., 2023), QuIP# (Tseng et al., 2024). A.2 Results on 3-bit weight quantization. We present the 3-bit weight and 8-bit ...
-
[28]
Our method, SpinQuant, successfully reduces the gap to the full-precision network from the previous 9.0–28.0 points to 1.2–5.3 points, demonstrating its effectiveness for low-bit quantization. A.3 Cayley optimization choice. In Table 11, we evaluate the impact of varying the number of samples and iterations used in Cayley optimization. Given the limite...
-
[29]
as calibration data and perform experiments on the LLaMA-2 7B model. The results in Table 13 show that using the C4 dataset yields results consistent with using the Wiki dataset, indicating that SpinQuant is robust to the choice of calibration data. A.6 Latency measurement on GPU. In light of the available Tensor cores in NVIDIA’s Hopper (H100) architecture, w...
-
[30]
When implemented meticulously, SpinQuant with Hadamard rotation sees marginal difference in the latency compared to without Hadamard rotation. Table 9: Complete comparison of the perplexity score on WikiText2 and averaged accuracy on Zero-shot Common Sense Reasoning tasks on Mistral-7B-v0.3. #Bits Method ARC...
-
[31]
We present the results for few-shot learning scenarios. SpinQuant W4A8 quantized models demonstrate significant improvements in 5-shot accuracy on the MMLU benchmark and 1-shot ROUGE score on the TLDR9 summarization benchmark. It significantly closed the gap to the BF16 baseline. B Analysis. B.1 Gradient analysis. On the one hand, we have shown that the c...
-
[32]
The process of optimizing R rotates the residual stream basis so as to prioritize improving the SNR of such layers, possibly at the cost of hurting less important layers. Table 16: Ablation study on SpinQuant combined with RTN or GPTQ in the W4A4KV16 quantization scenario. LLaMA-3 8B LLaMA-2 7B LLaMA-2 13B...
-
[33]
Overall, after rotation, the extreme values are attenuated, and the distribution exhibits no noteworthy outliers across the token dimension. Additionally, we make an interesting observation: in several activation layers, the first token displays substantial values in multiple channels. After rotation, this outlier is distributed across all channels of the...