Recognition: 3 Lean theorem links
ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs
Pith reviewed 2026-05-12 03:57 UTC · model grok-4.3
The pith
Orthogonal rotations align normalized LLM activations to hypercube corners via closed-form Procrustes updates, enabling low-bit quantization without end-to-end training or activation storage.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an orthogonal rotation, learned to align normalized activations with hypercube corners via the closed-form orthogonal Procrustes solution and refined by an online calibration procedure as samples are processed, meaningfully lowers activation quantization error. This avoids both gradient-based end-to-end training over the orthogonal group and the need to store full activation corpora, and the resulting quantized Llama models maintain competitive or improved perplexity and common-sense reasoning performance across model sizes from 3B to 70B.
What carries the argument
The corner-alignment objective on normalized activations, solved via the orthogonal Procrustes problem for a closed-form rotation update, together with an online procedure that refines the rotation as calibration samples are seen.
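The alternating scheme described here, assigning each normalized activation to its nearest hypercube corner and refitting the rotation in closed form, can be sketched in NumPy. This is a minimal reconstruction from the review's description, not the paper's code; the sign-based corner assignment, random initialization, and iteration count are assumptions.

```python
import numpy as np

def corner_align_rotation(X, n_iters=10, seed=0):
    """Fit an orthogonal R aligning normalized rows of X to hypercube corners.

    Alternates between (a) assigning each rotated, normalized activation to
    its nearest corner z with entries +/- 1/sqrt(d), and (b) refitting R by
    the closed-form orthogonal Procrustes solution, R = U V^T from the SVD
    of the cross-covariance C = Z^T X~.
    """
    n, d = X.shape
    X_tilde = X / np.linalg.norm(X, axis=1, keepdims=True)  # rows on the unit sphere
    rng = np.random.default_rng(seed)
    R = np.linalg.qr(rng.standard_normal((d, d)))[0]        # random orthogonal init
    for _ in range(n_iters):
        Z = np.sign(X_tilde @ R.T) / np.sqrt(d)             # nearest hypercube corners
        Z[Z == 0] = 1.0 / np.sqrt(d)                        # break ties on exact zeros
        U, _, Vt = np.linalg.svd(Z.T @ X_tilde)             # cross-covariance C = Z^T X~
        R = U @ Vt                                          # Procrustes update
    return R
```

Each update costs a single d×d SVD, which is what makes this calibration cheap relative to gradient descent over the orthogonal group.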
Load-bearing premise
That the Procrustes-derived rotation aligning normalized activations to hypercube corners will reduce quantization error across the diverse layers of LLMs and that the online updates will converge stably without access to a full stored activation set.
What would settle it
Apply the learned rotations to a 7B Llama model and observe whether perplexity on standard benchmarks rises above the no-rotation quantized baseline, or whether the online calibration produces unstable rotations after processing a few hundred samples.
Original abstract
Large language models (LLMs) are costly to deploy due to their large memory footprint and high inference cost. Weight-activation quantization can reduce these costs, but low-bit activation quantization remains difficult because activation outliers induce large quantization error. Recent rotation-based methods address this by applying orthogonal transformations that redistribute activation magnitude across dimensions, but existing approaches either require expensive end-to-end rotation training or rely on stored activation corpora, introducing significant compute or storage overhead. We propose a lightweight post-training rotation calibration method for LLM activation quantization. Our method learns orthogonal rotations that align normalized activations with the corners of an inscribed hypercube, encouraging activation energy to be distributed more evenly across dimensions. This objective admits an efficient closed-form update via the orthogonal Procrustes problem, avoiding gradient-based optimization over the orthogonal group. We further introduce an online calibration procedure that updates rotations as calibration samples are processed, eliminating the need to store activations on disk and allowing rotations to adapt to quantized activation distributions during calibration. Experiments on Llama-2 and Llama-3 models from 3B to 70B parameters show that our method achieves competitive or improved performance across perplexity benchmarks and common sense reasoning tasks while avoiding both costly end-to-end training and large offline activation storage.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ConQuR, a post-training method for quantizing activations in LLMs. It learns orthogonal rotations via a closed-form orthogonal Procrustes solution that aligns normalized activation vectors with the corners of an inscribed hypercube to distribute energy more evenly across dimensions. An online calibration procedure updates the rotations using streaming estimates of the cross-covariance matrix without requiring storage of full activation corpora. Experiments on Llama-2 and Llama-3 models (3B–70B parameters) claim competitive or improved results on perplexity benchmarks and common-sense reasoning tasks relative to prior rotation-based quantization approaches, while avoiding end-to-end training.
Significance. If the central empirical claim holds, the method supplies a lightweight, storage-free alternative to existing rotation-based activation quantization techniques. The closed-form Procrustes update and online streaming procedure are practical strengths that could reduce deployment overhead for low-bit LLMs. The approach is falsifiable via direct comparison of post-rotation quantization error and downstream metrics.
major comments (3)
- [§3.2] (Orthogonal Procrustes formulation): The objective minimizes the Euclidean distance of normalized activations to hypercube corners, yet the paper provides no derivation or ablation showing that this objective reduces the actual uniform quantization error (which is governed by the per-dimension max-abs range and bin occupancy after scaling). The correspondence between the Procrustes solution and the downstream quantizer loss is assumed rather than demonstrated.
- [§4] (Experiments): The central claim of “competitive or improved performance” is stated without quantitative tables, error bars, or a direct comparison of achieved quantization MSE / perplexity against a baseline rotation optimized for the true quantizer objective (e.g., min-max or MSE). The online calibration's stability is asserted but not supported by convergence diagnostics or a sensitivity analysis of the running cross-covariance estimate.
- [§3.3] (Online update): The streaming Procrustes update relies on partial estimates of the cross-covariance matrix; any bias or slow mixing in these estimates can produce rotations that are suboptimal for later layers or for the final quantized model. No analysis of estimation error or its effect on quantization error is supplied.
minor comments (2)
- [§3.1] Notation for the target corner matrix and the assignment of activations to corners should be made explicit in §3.1 to avoid ambiguity in the Procrustes problem statement.
- Figure 2 (or equivalent) comparing activation distributions before and after rotation would benefit from axis labels and a quantitative measure of energy redistribution (e.g., max-abs per dimension).
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below and indicate the revisions planned for the manuscript.
Point-by-point responses
-
Referee: [§3.2] (Orthogonal Procrustes formulation): The objective minimizes the Euclidean distance of normalized activations to hypercube corners, yet the paper provides no derivation or ablation showing that this objective reduces the actual uniform quantization error (which is governed by the per-dimension max-abs range and bin occupancy after scaling). The correspondence between the Procrustes solution and the downstream quantizer loss is assumed rather than demonstrated.
Authors: We thank the referee for this observation. The hypercube-corner alignment is chosen because it minimizes the maximum absolute value across dimensions after rotation, which directly sets the per-channel scale in uniform quantization and thereby bounds the quantization error. We will add a short derivation in §3.2 that connects the Procrustes objective to the reduction of the ℓ∞ norm of the rotated activations. We will also include an ablation that compares our rotation to one obtained by directly minimizing quantization MSE (or max-abs range) on the same calibration data. Revision planned: yes.
-
Referee: [§4] (Experiments): The central claim of “competitive or improved performance” is stated without quantitative tables, error bars, or a direct comparison of achieved quantization MSE / perplexity against a baseline rotation optimized for the true quantizer objective (e.g., min-max or MSE). The online calibration's stability is asserted but not supported by convergence diagnostics or a sensitivity analysis of the running cross-covariance estimate.
Authors: We agree that the experimental presentation can be strengthened. In the revision we will expand the result tables to report explicit perplexity values together with standard deviations across multiple random calibration seeds. We will add a direct comparison against a rotation matrix optimized for the downstream quantization objective (both min-max and MSE variants). For the online procedure we will include convergence curves of the running cross-covariance estimate and a sensitivity study varying update frequency and mini-batch size. Revision planned: yes.
-
Referee: [§3.3] (Online update): The streaming Procrustes update relies on partial estimates of the cross-covariance matrix; any bias or slow mixing in these estimates can produce rotations that are suboptimal for later layers or for the final quantized model. No analysis of estimation error or its effect on quantization error is supplied.
Authors: We acknowledge the importance of characterizing the streaming estimator. We will add both a theoretical bound on the bias of the running cross-covariance matrix (using standard results on online covariance estimation) and an empirical study that compares the final quantization error and perplexity obtained with the online rotations versus rotations computed from the full activation corpus. This analysis will be placed in §3.3 or an appendix. Revision planned: yes.
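The streaming estimator under discussion can be prototyped with an exponential moving average of the cross-covariance. The EMA form, decay factor, and refresh-per-batch schedule are assumptions of this sketch; the review does not pin down the paper's exact estimator.

```python
import numpy as np

def online_corner_rotation(batches, d, beta=0.9):
    """Online calibration sketch: EMA of C = Z^T X~, rotation refreshed per batch.

    No activations are stored on disk; only the d x d running estimate C
    and the current rotation R are kept between batches.
    """
    R = np.eye(d)
    C = np.zeros((d, d))
    for X in batches:
        Xt = X / np.linalg.norm(X, axis=1, keepdims=True)
        Z = np.sign(Xt @ R.T) / np.sqrt(d)                # corner targets under current R
        Z[Z == 0] = 1.0 / np.sqrt(d)
        C = beta * C + (1 - beta) * (Z.T @ Xt) / len(X)   # streaming cross-covariance
        U, _, Vt = np.linalg.svd(C)
        R = U @ Vt                                        # closed-form Procrustes refresh
    return R
```

A convergence diagnostic of the kind the referee asks for could simply track ‖R_t − R_{t−1}‖_F across batches.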
Circularity Check
No significant circularity: the closed-form Procrustes solution to an explicitly stated objective is independent of the reported quantization results.
Full rationale
The paper proposes an alignment objective (normalized activations to hypercube corners) and derives the rotation via the standard orthogonal Procrustes closed-form solution. This step is a direct mathematical reduction from the chosen objective, not equivalent to the final quantization error metric or performance numbers by construction. No fitted parameters are relabeled as predictions, no self-citations form the load-bearing premise, and no ansatz or uniqueness theorem is imported from prior author work. Empirical validation on Llama models is presented separately and does not retroactively define the derivation. The chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math Orthogonal transformations preserve vector norms and can be used to redistribute activation magnitudes across dimensions.
- standard math The orthogonal Procrustes problem admits an efficient closed-form solution via SVD.
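Both ledger entries are standard results and can be checked numerically: norm preservation under orthogonal maps, and the SVD optimality of the orthogonal Procrustes solution. A short verification sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]           # random orthogonal matrix

# Axiom 1: orthogonal transformations preserve vector norms.
x = rng.standard_normal(d)
assert np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x))

# Axiom 2: min_R ||R A - B||_F over orthogonal R is solved in closed form by
# R* = U V^T, where U S V^T is the SVD of B A^T (orthogonal Procrustes).
A, B = rng.standard_normal((d, 100)), rng.standard_normal((d, 100))
U, _, Vt = np.linalg.svd(B @ A.T)
R_star = U @ Vt
best = np.linalg.norm(R_star @ A - B)
for _ in range(50):                                        # no sampled rotation beats R*
    R = np.linalg.qr(rng.standard_normal((d, d)))[0]
    assert np.linalg.norm(R @ A - B) >= best - 1e-9
```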
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
Paper passage: min_{R ∈ O(d)} Σ_i ‖R x_i/‖x_i‖ − z_i‖₂², where z_{i,j} = (1/√d) sign((R x_i)_j) (eq. 1); equivalent to maximizing the average ‖R x̃‖₁ on the unit sphere via Cauchy–Schwarz and Hölder duality.
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery and embed_strictMono · unclear
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
Paper passage: closed-form orthogonal Procrustes update via the SVD of the cross-covariance C = Zᵀ X̃; online mini-batch calibration without a stored corpus.
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (D=3) · unclear
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
Paper passage: normalized participation-ratio (PR) analysis and ‖R x̃‖∞ minimization for the quantization MSE bound.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...
work page · Pith review · arXiv 2023
-
[2]
The llama 3 herd of models
Aaron Grattafiori, Abhimanyu Dubey, et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783
work page · Pith review · arXiv 2024
-
[3]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, S...
work page 2020
-
[4]
Chain of thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?...
work page 2022
-
[5]
PaLM: Scaling language modeling with pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sashank Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bra...
work page 2023
-
[6]
Evaluating large language models trained on code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
work page · Pith review · arXiv 2021
-
[7]
Sparsegpt: massive language models can be accurately pruned in one-shot
Elias Frantar and Dan Alistarh. Sparsegpt: massive language models can be accurately pruned in one-shot. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023
work page 2023
-
[8]
A simple and effective pruning approach for large language models
Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=PxoFut3dWW
work page 2024
-
[9]
LLM-pruner: On the structural pruning of large language models
Xinyin Ma, Gongfan Fang, and Xinchao Wang. LLM-pruner: On the structural pruning of large language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=J8Ajf9WfXP
-
[11]
SVD-LLM: Truncation-aware singular value decomposition for large language model compression
Xin Wang, Yu Zheng, Zhongwei Wan, and Mi Zhang. SVD-LLM: Truncation-aware singular value decomposition for large language model compression. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=LNYIUouhdt
work page 2025
-
[12]
ASVD: Activation-aware singular value decomposition for compressing large language models
Zhihang Yuan, Yuzhang Shang, Yue Song, Dawei Yang, Qiang Wu, Yan Yan, and Guangyu Sun. ASVD: Activation-aware singular value decomposition for compressing large language models. URL https://openreview.net/forum?id=HyPofygOCT
-
[14]
MiniLLM: Knowledge distillation of large language models
Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=5h0qf7IBZZ
work page 2024
-
[15]
OPTQ: Accurate quantization for generative pre-trained transformers
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. OPTQ: Accurate quantization for generative pre-trained transformers. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=tcbBPnfwxS
work page 2023
-
[16]
Awq: Activation-aware weight quantization for llm compression and acceleration
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration. InMLSys, 2024
work page 2024
-
[17]
SmoothQuant: Accurate and efficient post-training quantization for large language models
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volu...
work page 2023
-
[18]
Llm.int8(): 8-bit matrix multiplication for transformers at scale
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm.int8(): 8-bit matrix multiplication for transformers at scale. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22, Red Hook, NY, USA, 2022. Curran Associates Inc. ISBN 9781713871088
work page 2022
-
[19]
Zeroquant: efficient and affordable post-training quantization for large-scale transformers
Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. Zeroquant: efficient and affordable post-training quantization for large-scale transformers. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22, Red Hook, NY, USA, 2022. Curran Associates Inc. ISBN 9781713871088
work page 2022
-
[20]
Omniquant: Omnidirectionally calibrated quantization for large language models
Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantization for large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=8Wuvhh0LYW
work page 2024
-
[21]
Outlier suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling
Xiuying Wei, Yunchen Zhang, Yuhang Li, Xiangguo Zhang, Ruihao Gong, Jinyang Guo, and Xianglong Liu. Outlier suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processi...
-
[22]
Quarot: Outlier-free 4-bit inference in rotated LLMs
Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated LLMs. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=dfqsW38v1X
work page 2024
-
[23]
Spinquant: LLM quantization with learned rotations
Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. Spinquant: LLM quantization with learned rotations. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=ogO6DGE6FZ
work page 2025
-
[24]
Dartquant: Efficient rotational distribution calibration for LLM quantization
Yuantian Shao, Yuanteng Chen, Peisong Wang, Jianlin Yu, Jing Lin, Yiwu Yao, Zhihui Wei, and Jian Cheng. Dartquant: Efficient rotational distribution calibration for LLM quantization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=LfcfwlLCHM
work page 2025
-
[25]
DFRot: Achieving outlier-free and massive activation-free for rotated LLMs with refined rotation
Jingyang Xiang and Sai Qian Zhang. DFRot: Achieving outlier-free and massive activation-free for rotated LLMs with refined rotation. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=WzGypILLDb
work page 2025
-
[26]
Squeezellm: Dense-and-sparse quantization
Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, and Kurt Keutzer. Squeezellm: dense-and-sparse quantization. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024
work page 2024
-
[27]
QuIP: 2-bit quantization of large language models with guarantees
Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. QuIP: 2-bit quantization of large language models with guarantees. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=xrk9g5vcXR
work page 2023
-
[28]
QuIP#: Even better LLM quantization with hadamard incoherence and lattice codebooks
Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. QuIP#: Even better LLM quantization with hadamard incoherence and lattice codebooks. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=9BrydUVcoe
work page 2024
-
[29]
QTIP: Quantization with trellises and incoherence processing
Albert Tseng, Qingyao Sun, David Hou, and Christopher De Sa. QTIP: Quantization with trellises and incoherence processing. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=7sdkLVuYCU
work page 2024
-
[30]
Duquant: distributing outliers via dual transformation makes stronger quantized llms
Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. Duquant: distributing outliers via dual transformation makes stronger quantized llms. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS '24, Red Hook, NY, USA, 2024. Curran Associates Inc. ISBN 97...
work page 2024
-
[31]
Pointer sentinel mixture models
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Byj72udxe
work page 2017
-
[32]
The penn treebank: annotating predicate argument structure
Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. The penn treebank: annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, HLT '94, page 114–119, USA, 1994. Association for Computational Linguistics. ISBN 1558603573. doi: 10.3115...
-
[33]
Exploring the limits of transfer learning with a unified text-to-text transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1), January 2020. ISSN 1532-4435
work page 2020
-
[34]
Winogrande: An adversarial winograd schema challenge at scale
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: an adversarial winograd schema challenge at scale. Commun. ACM, 64(9):99–106, August 2021. ISSN 0001-0782. doi: 10.1145/3474381. URL https://doi.org/10.1145/3474381
-
[35]
Social IQa: Commonsense Reasoning about Social Interactions
Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social IQa: Commonsense reasoning about social interactions. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language P...
-
[36]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf
work page 2019
-
[37]
Measuring massive multitask language understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ
work page 2021
-
[38]
A systematic classification of knowledge, reasoning, and context within the ARC dataset
Michael Boratko, Harshit Padigela, Divyendra Mikkilineni, Pritish Yuvraj, Rajarshi Das, Andrew McCallum, Maria Chang, Achille Fokoue-Nkoutche, Pavan Kapanipathi, Nicholas Mattei, Ryan Musa, Kartik Talamadupula, and Michael Witbrock. A systematic classification of knowledge, reasoning, and context within the ARC dataset. In Eunsol Choi, Minjoon Seo, Da...
-
[39]
HellaSwag: Can a machine really finish your sentence?
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy, July 2019. Association for Computational Linguis...
-
[40]
Can a Suit of Armor Conduct Electricity?
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium, O...
-
[41]
Piqa: Reasoning about physical commonsense in natural language
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7432–7439, Apr. 2020. doi: 10.1609/aaai.v34i05.6239. URL https://ojs.aaai.org/index.php/AAAI/article/view/6239
-
[42]
Atomic vibrations in vitreous silica
R. J. Bell and P. Dean. Atomic vibrations in vitreous silica. Discuss. Faraday Soc., 50:55–61, 1970. doi: 10.1039/DF9705000055. URL http://dx.doi.org/10.1039/DF9705000055
-
[44]
Root mean square layer normalization
Biao Zhang and Rico Sennrich. Root mean square layer normalization. Curran Associates Inc., Red Hook, NY, USA, 2019
work page 2019
-
[45]
RoFormer: Enhanced transformer with rotary position embedding
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomput., 568(C), February 2024. ISSN 0925-2312. doi: 10.1016/j.neucom.2023.127063. URL https://doi.org/10.1016/j.neucom.2023.127063
-
[46]
Quantization and training of neural networks for efficient integer-arithmetic-only inference
Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018
work page 2018
-
[47]
(Extracted appendix text.) The zero-point quantizer maps the affine range [min_j x_j, max_j x_j] onto the full unsigned integer grid {0, …, 2^b − 1} via a scale and a shift: Q_zp(x) = Δ_zp(x)·(round(x/Δ_zp(x) + ζ(x)) − ζ(x)), with Δ_zp(x) = (max_j x_j − min_j x_j)/(2^b − 1), where ζ(x) ∈ ℤ is the integer zero-point that aligns min_j x_j with the lower end of the grid. Each coordinate again incurs a rounding error bounded by Δ_zp(x)/2, ...
-
[48]
(Extracted appendix text.) For transformer-scale hidden dimensions this gap is negligible: at d = 4096, sign-mixed corners achieve a range of 2/√4096 ≈ 0.031, recovering more than 97% of the available range reduction on the unit sphere. The factor-of-two slack in (6) is loose at axis poles, which the optimization moves away from, and asymptotically tight at the corners it converges ...
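The zero-point quantizer in the extracted appendix text under [47], and the corner-range arithmetic under [48], can both be checked in a few lines. The 8-bit setting is an arbitrary illustrative choice, and the function name is mine.

```python
import numpy as np

def zeropoint_quantize(x, bits=8):
    """Affine quantizer onto {0, ..., 2^b - 1} with an integer zero-point."""
    levels = 2 ** bits - 1
    scale = (x.max() - x.min()) / levels              # Delta_zp(x)
    zp = np.round(-x.min() / scale)                   # integer zero-point zeta(x)
    q = np.clip(np.round(x / scale + zp), 0, levels)
    return scale * (q - zp), scale

x = np.random.default_rng(0).standard_normal(1000)
xq, scale = zeropoint_quantize(x)
assert np.all(np.abs(x - xq) <= scale / 2 + 1e-12)    # rounding error <= Delta/2

# Sign-mixed corners on the unit sphere have per-coordinate range 2/sqrt(d):
assert np.isclose(2 / np.sqrt(4096), 0.03125)         # matches the ~0.031 quoted in [48]
```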
discussion (0)