pith. machine review for the scientific record.

arXiv:2605.10793 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: 3 Lean theorem links

ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:57 UTC · model grok-4.3

classification 💻 cs.LG
keywords activation quantization · LLM quantization · orthogonal rotations · Procrustes problem · post-training calibration · hypercube alignment · weight-activation quantization

The pith

Orthogonal rotations align normalized LLM activations to hypercube corners via closed-form Procrustes updates, enabling low-bit quantization without end-to-end training or activation storage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a post-training calibration technique that learns orthogonal rotations to redistribute activation energy more evenly across dimensions. It achieves this by aligning normalized activations with the corners of an inscribed hypercube, which the authors argue reduces the impact of outliers during low-bit quantization. The method solves for the rotations in closed form using the orthogonal Procrustes problem and updates them online as calibration samples arrive, so no large activation corpus needs to be stored. Experiments on Llama-2 and Llama-3 models from 3B to 70B parameters indicate that the resulting quantized models match or exceed baseline performance on perplexity and reasoning tasks. If correct, the approach would lower the memory and compute cost of deploying LLMs while keeping accuracy intact.

Core claim

The central claim is that learning an orthogonal rotation to align normalized activations with hypercube corners via the closed-form orthogonal Procrustes solution, combined with an online calibration procedure that updates the rotation as samples are processed, meaningfully lowers activation quantization error. This avoids both gradient-based end-to-end training over the orthogonal group and the need to store full activation corpora, and the resulting quantized Llama models maintain competitive or improved perplexity and common-sense reasoning performance across model sizes from 3B to 70B parameters.

What carries the argument

The corner-alignment objective on normalized activations, solved via the orthogonal Procrustes problem for a closed-form rotation update, together with an online procedure that refines the rotation as calibration samples are seen.
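
Read concretely, that machinery admits a short alternating procedure: assign each normalized activation to its nearest hypercube corner, then solve the orthogonal Procrustes problem in closed form for the rotation, and repeat. The sketch below is a minimal reconstruction from the abstract's description, not the authors' code; the unit normalization, corner scaling, and iteration count are all assumptions.

    import numpy as np

    def corner_aligned_rotation(X, n_iters=10):
        # X: (n, d) calibration activations. Returns an orthogonal R that aligns
        # row-normalized activations with inscribed-hypercube corners (+-1/sqrt(d)).
        n, d = X.shape
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # project to unit sphere
        R = np.eye(d)
        for _ in range(n_iters):
            C = np.sign(Xn @ R.T) / np.sqrt(d)   # nearest corner to each rotated sample
            # Orthogonal Procrustes: argmin_R ||Xn R^T - C||_F has closed form U V^T,
            # where U, S, V^T is the SVD of the cross-covariance C^T Xn.
            U, _, Vt = np.linalg.svd(C.T @ Xn)
            R = U @ Vt                            # closed-form rotation update
        return R

Both half-steps exactly minimize the same alignment loss in their respective variables (sign(y)/√d is the nearest unit-norm corner to y, and the SVD step is the exact Procrustes minimizer), so the objective is non-increasing across iterations.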

Load-bearing premise

That the Procrustes-derived rotation aligning normalized activations to hypercube corners will reduce quantization error across the diverse layers of LLMs and that the online updates will converge stably without access to a full stored activation set.

What would settle it

Apply the learned rotations to a 7B Llama model under low-bit quantization: the claim fails if perplexity on standard benchmarks rises above the no-rotation quantized baseline, or if the online calibration produces unstable rotations after processing a few hundred samples.

Figures

Figures reproduced from arXiv:2605.10793 by Ali Abbasi, Chayne Thrash, Soheil Kolouri.

Figure 1: Overview of the proposed rotation-based calibration method.
Figure 2: Empirical CDF of normalized participation ratio (PR) of activations from Llama-2 7B.
Figure 3: Layerwise activation quantization error for different rotation methods on Llama-2 7B.
Figure 4: Calibration cost versus performance on Llama-2 7B under 4-4-16 quantization.
Original abstract

Large language models (LLMs) are costly to deploy due to their large memory footprint and high inference cost. Weight-activation quantization can reduce these costs, but low-bit activation quantization remains difficult because activation outliers induce large quantization error. Recent rotation-based methods address this by applying orthogonal transformations that redistribute activation magnitude across dimensions, but existing approaches either require expensive end-to-end rotation training or rely on stored activation corpora, introducing significant compute or storage overhead. We propose a lightweight post-training rotation calibration method for LLM activation quantization. Our method learns orthogonal rotations that align normalized activations with the corners of an inscribed hypercube, encouraging activation energy to be distributed more evenly across dimensions. This objective admits an efficient closed-form update via the orthogonal Procrustes problem, avoiding gradient-based optimization over the orthogonal group. We further introduce an online calibration procedure that updates rotations as calibration samples are processed, eliminating the need to store activations on disk and allowing rotations to adapt to quantized activation distributions during calibration. Experiments on Llama-2 and Llama-3 models from 3B to 70B parameters show that our method achieves competitive or improved performance across perplexity benchmarks and common sense reasoning tasks while avoiding both costly end-to-end training and large offline activation storage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ConQuR, a post-training method for quantizing activations in LLMs. It learns orthogonal rotations via a closed-form orthogonal Procrustes solution that aligns normalized activation vectors with the corners of an inscribed hypercube to distribute energy more evenly across dimensions. An online calibration procedure updates the rotations using streaming estimates of the cross-covariance matrix without requiring storage of full activation corpora. Experiments on Llama-2 and Llama-3 models (3B–70B parameters) claim competitive or improved results on perplexity benchmarks and common-sense reasoning tasks relative to prior rotation-based quantization approaches, while avoiding end-to-end training.

Significance. If the central empirical claim holds, the method supplies a lightweight, storage-free alternative to existing rotation-based activation quantization techniques. The closed-form Procrustes update and online streaming procedure are practical strengths that could reduce deployment overhead for low-bit LLMs. The approach is falsifiable via direct comparison of post-rotation quantization error and downstream metrics.

major comments (3)
  1. §3.2 (Orthogonal Procrustes formulation): The objective minimizes the Euclidean distance of normalized activations to hypercube corners, yet the paper provides no derivation or ablation showing that this objective reduces the actual uniform quantization error (which is governed by per-dimension max-abs range and bin occupancy after scaling). The correspondence between the Procrustes solution and the downstream quantizer loss is assumed rather than demonstrated; a minimal harness for this check is sketched after this list.
  2. §4 (Experiments): The central claim of “competitive or improved performance” is stated without quantitative tables, error bars, or a direct comparison of achieved quantization MSE / perplexity against a baseline rotation optimized for the true quantizer objective (e.g., min-max or MSE). The online calibration’s stability is asserted but not supported by convergence diagnostics or a sensitivity analysis of the running cross-covariance estimate.
  3. §3.3 (Online update): The streaming Procrustes update relies on partial estimates of the cross-covariance matrix; any bias or slow mixing in these estimates can produce rotations that are suboptimal for later layers or for the final quantized model. No analysis of estimation error or its effect on quantization error is supplied.
minor comments (2)
  1. §3.1: Notation for the target corner matrix and the assignment of activations to corners should be made explicit to avoid ambiguity in the Procrustes problem statement.
  2. Figure 2 (or an equivalent comparison of activation distributions before and after rotation) would benefit from axis labels and a quantitative measure of energy redistribution (e.g., max-abs per dimension).
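
The check requested in major comment 1 is inexpensive to run. The harness below is hypothetical (the symmetric per-tensor max-abs quantizer and all names are our assumptions, not the paper's evaluation protocol); it measures uniform quantization MSE with and without a candidate rotation.

    import numpy as np

    def uniform_quant_mse(X, R=None, bits=4):
        # Hypothetical harness: per-tensor symmetric max-abs uniform quantizer.
        # Compares reconstruction MSE of raw vs. rotated activations; since the
        # rotation is orthogonal, MSE in rotated coordinates equals MSE after
        # rotating back, so the two numbers are directly comparable.
        Y = X if R is None else X @ R.T                  # optionally rotate
        scale = np.abs(Y).max() / (2 ** (bits - 1) - 1)  # max-abs sets the scale
        Yq = np.round(Y / scale) * scale                 # quantize-dequantize
        return float(np.mean((Yq - Y) ** 2))

    # Usage sketch, with X one layer's activations and R the calibrated rotation:
    # mse_raw, mse_rot = uniform_quant_mse(X), uniform_quant_mse(X, R)
    # mse_rot < mse_raw would support the correspondence the referee asks about.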

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and indicate the revisions planned for the manuscript.

Point-by-point responses
  1. Referee: §3.2 (Orthogonal Procrustes formulation): The objective minimizes the Euclidean distance of normalized activations to hypercube corners, yet the paper provides no derivation or ablation showing that this objective reduces the actual uniform quantization error (which is governed by per-dimension max-abs range and bin occupancy after scaling). The correspondence between the Procrustes solution and the downstream quantizer loss is assumed rather than demonstrated.

    Authors: We thank the referee for this observation. The hypercube-corner alignment is chosen because it minimizes the maximum absolute value across dimensions after rotation, which directly sets the per-channel scale in uniform quantization and thereby bounds the quantization error. We will add a short derivation in §3.2 that connects the Procrustes objective to the reduction of the ℓ∞ norm of the rotated activations; this link is sketched symbolically after these responses. We will also include an ablation that compares our rotation to one obtained by directly minimizing quantization MSE (or max-abs range) on the same calibration data. revision: yes

  2. Referee: §4 (Experiments): The central claim of “competitive or improved performance” is stated without quantitative tables, error bars, or a direct comparison of achieved quantization MSE / perplexity against a baseline rotation optimized for the true quantizer objective (e.g., min-max or MSE). The online calibration’s stability is asserted but not supported by convergence diagnostics or a sensitivity analysis of the running cross-covariance estimate.

    Authors: We agree that the experimental presentation can be strengthened. In the revision we will expand the result tables to report explicit perplexity values together with standard deviations across multiple random calibration seeds. We will add a direct comparison against a rotation matrix optimized for the downstream quantization objective (both min-max and MSE variants). For the online procedure we will include convergence curves of the running cross-covariance estimate and a sensitivity study varying update frequency and mini-batch size. revision: yes

  3. Referee: §3.3 (Online update): The streaming Procrustes update relies on partial estimates of the cross-covariance matrix; any bias or slow mixing in these estimates can produce rotations that are suboptimal for later layers or for the final quantized model. No analysis of estimation error or its effect on quantization error is supplied.

    Authors: We acknowledge the importance of characterizing the streaming estimator. We will add both a theoretical bound on the bias of the running cross-covariance matrix (using standard results on online covariance estimation) and an empirical study that compares the final quantization error and perplexity obtained with the online rotations versus rotations computed from the full activation corpus; a minimal sketch of such an estimator follows these responses. This analysis will be placed in §3.3 or an appendix. revision: yes
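
On response 1: the claimed link between corner alignment and quantizer scale can be stated in two lines. This is our sketch, assuming a symmetric per-tensor max-abs quantizer; the paper's exact quantizer may differ.

    % b-bit symmetric uniform quantizer with scale set by the max-abs range:
    \[
      \Delta(x) = \frac{\max_j |x_j|}{2^{b-1} - 1},
      \qquad
      \|Q(x) - x\|_\infty \le \tfrac{1}{2}\,\Delta(x).
    \]
    % On the unit sphere, the max-abs coordinate is minimized exactly at the
    % corners of the inscribed hypercube, where every |x_j| = 1/sqrt(d):
    \[
      \min_{\|x\|_2 = 1} \; \max_j |x_j| = \frac{1}{\sqrt{d}},
    \]
    % so corner alignment minimizes the per-tensor scale, and with it the bound.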
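On response 3: a minimal version of the streaming estimator in question, with the exponential-decay schedule and batch interface as our assumptions rather than the authors' implementation:

    import numpy as np

    class StreamingProcrustes:
        # Maintains a decayed running cross-covariance between corner targets
        # and normalized activations; refreshes the rotation from its SVD
        # after each calibration batch, so no activations are stored.
        def __init__(self, d, decay=0.99):
            self.d = d
            self.M = np.zeros((d, d))   # running estimate of C^T X_n
            self.R = np.eye(d)
            self.decay = decay

        def update(self, X_batch):
            Xn = X_batch / np.linalg.norm(X_batch, axis=1, keepdims=True)
            C = np.sign(Xn @ self.R.T) / np.sqrt(self.d)  # corners under current R
            self.M = self.decay * self.M + (1 - self.decay) * (C.T @ Xn)
            U, _, Vt = np.linalg.svd(self.M)              # closed-form Procrustes step
            self.R = U @ Vt
            return self.R

Convergence diagnostics of the kind promised could then track, for example, ‖R_t − R_{t−1}‖_F across batches.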

Circularity Check

0 steps flagged

No significant circularity; the closed-form Procrustes solution to the explicitly stated objective is independent of the evaluation metrics it is later judged against.

Full rationale

The paper proposes an alignment objective (normalized activations to hypercube corners) and derives the rotation via the standard orthogonal Procrustes closed-form solution. This step is a direct mathematical reduction from the chosen objective, not equivalent to the final quantization error metric or performance numbers by construction. No fitted parameters are relabeled as predictions, no self-citations form the load-bearing premise, and no ansatz or uniqueness theorem is imported from prior author work. Empirical validation on Llama models is presented separately and does not retroactively define the derivation. The chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard linear-algebra facts about orthogonal matrices and the Procrustes problem; no new free parameters or invented entities are introduced in the abstract description.

axioms (2)
  • [standard math] Orthogonal transformations preserve vector norms and can be used to redistribute activation magnitudes across dimensions.
    Invoked when applying the learned rotation to activations before quantization.
  • [standard math] The orthogonal Procrustes problem admits an efficient closed-form solution via SVD.
    Used to obtain the rotation matrix without gradient optimization; both facts are restated in symbols below.
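
Both axioms are one-line linear-algebra facts; stated symbolically for reference (standard results, not additions to the paper):

    \[
      \|Qx\|_2^2 = x^\top Q^\top Q\, x = \|x\|_2^2
      \quad \text{whenever } Q^\top Q = I,
    \]
    \[
      \operatorname*{arg\,min}_{\Omega^\top \Omega = I} \|\Omega A - B\|_F
      = U V^\top,
      \quad \text{where } B A^\top = U \Sigma V^\top \text{ (SVD)}.
    \]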

pith-pipeline@v0.9.0 · 5524 in / 1410 out tokens · 75531 ms · 2026-05-12T03:57:51.318851+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

