Recognition: unknown
Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay
Pith reviewed 2026-05-09 14:22 UTC · model grok-4.3
The pith
Penalizing cross-matrix colinearity during training reduces harmful activation outliers in vision transformers and improves low-bit quantization accuracy without hurting full-precision performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Colinearity-Decay (CD) is introduced as a structural regularizer that penalizes detrimental cross-matrix alignment between ordered matrix pairs within Transformer blocks. By controlling this alignment, CD reduces the amplification of extreme activations without suppressing their magnitude directly or modifying the task loss. Applied as a decoupled update with minimal overhead, the regularizer consistently improves quantized accuracy on ImageNet-1K pre-training, COCO object detection, and downstream fine-tuning tasks while preserving or improving full-precision performance, all with zero inference-time overhead.
What carries the argument
Colinearity-Decay (CD), a decoupled structural regularizer that penalizes detrimental cross-matrix alignment for ordered matrix pairs inside Transformer blocks.
If this is right
- CD raises quantized accuracy across ImageNet-1K pre-training pipelines while preserving full-precision accuracy.
- The same regularizer improves performance on COCO detection under quantization.
- Downstream fine-tuning tasks also show gains in quantized accuracy from CD.
- The method adds no inference-time overhead because the regularizer is removed after training.
- CD works without changing model architecture or the original task loss.
Where Pith is reading between the lines
- The same colinearity penalty might transfer to language-model transformers if their block structure contains analogous ordered matrix pairs.
- CD could be combined with existing post-training quantization methods to further close the gap to full-precision performance.
- If the penalty is applied only to selected layers, the training cost might drop while retaining most of the outlier-reduction benefit.
Load-bearing premise
Penalizing cross-matrix colinearity will reliably reduce harmful activation outliers without creating new failure modes or requiring per-architecture tuning.
What would settle it
Measure activation outlier magnitudes and quantized top-1 accuracy on ImageNet-1K before and after adding CD; if outlier magnitudes stay the same or rise and quantized accuracy does not improve, the claim is falsified.
Figures
read the original abstract
Low-bit quantization is a practical route for efficiently deploying vision Transformers, yet activation outliers complicate fully quantized deployment. Existing methods either handle quantization post-training or suppress large activations during training; however, aggressively restricting outliers in vision models can lead to a poorer trade-off between full-precision and quantized accuracy. We argue that rather than simply suppressing outliers, the training objective should control the structural amplification that makes them harmful. To this end, we introduce Colinearity-Decay (CD), a structural regularizer for ordered matrix pairs within Transformer blocks. CD penalizes detrimental cross-matrix alignment and mitigates extreme activations without altering the architecture or task loss. Applied as a decoupled update, CD is non-invasive and introduces minimal training overhead. Across ImageNet-1K pre-training, COCO detection, and downstream fine-tuning, CD consistently boosts quantized accuracy across multiple pipelines while preserving, or even improving, full-precision performance. Ultimately, our results demonstrate that structural regularization effectively prepares vision Transformers for low-bit deployment with zero inference-time overhead.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Colinearity-Decay (CD), a decoupled structural regularizer applied to ordered matrix pairs inside Transformer blocks. CD penalizes cross-matrix alignment to control structural amplification of activation outliers, thereby improving low-bit quantization performance on ViTs while leaving the task loss and architecture unchanged. Experiments across ImageNet-1K pre-training, COCO detection, and downstream fine-tuning report consistent gains in quantized accuracy with preserved or improved full-precision results and negligible training overhead.
Significance. If the mechanism is confirmed, CD offers a lightweight, inference-free training-time intervention that addresses a practical bottleneck in quantized ViT deployment. The approach is architecture-agnostic in principle and could complement post-training quantization pipelines; the reported preservation of full-precision accuracy is a notable strength relative to aggressive outlier-suppression baselines.
major comments (3)
- [Method] Method section (exact location unspecified in abstract): the colinearity metric, the precise definition of 'ordered matrix pairs,' and the mathematical form of the penalty term are not provided. Without these, it is impossible to verify whether the regularizer specifically targets outlier amplification or functions as generic regularization.
- [Experiments] Experimental section: no ablation is reported on the CD strength coefficient, on the choice of matrix pairs, or against alternative regularizers applied to the same pairs. The central claim that penalizing colinearity (rather than simply suppressing activations) is the operative mechanism therefore remains untested.
- [Results] Results tables/figures: quantitative details on experiment scale, number of runs, variance, and exact baselines are absent from the abstract and not referenced in the provided text. This prevents assessment of whether the reported quantized-accuracy gains are statistically reliable or pipeline-specific.
minor comments (2)
- [Method] Notation for matrix pairs and the decoupled update rule should be introduced with explicit equations rather than descriptive prose.
- [Implementation] Clarify whether CD is applied only during pre-training or also during fine-tuning, and report any interaction with optimizer state.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity, add missing ablations, and enhance experimental reporting.
read point-by-point responses
-
Referee: [Method] Method section (exact location unspecified in abstract): the colinearity metric, the precise definition of 'ordered matrix pairs,' and the mathematical form of the penalty term are not provided. Without these, it is impossible to verify whether the regularizer specifically targets outlier amplification or functions as generic regularization.
Authors: We agree that the method details require more explicit presentation. Section 3 of the manuscript defines the colinearity metric as the normalized Frobenius inner product between ordered matrix pairs (e.g., query-key or value-projection matrices in attention blocks), specifies the pairs as consecutive linear transformations within each Transformer block, and gives the penalty as L_CD = lambda * sum ||A_i^T B_i||_F / (||A_i||_F ||B_i||_F) where the sum is over selected pairs. To address the concern, we will add a dedicated subsection with these equations, a derivation showing how penalizing alignment reduces outlier amplification (via reduced condition number in activation propagation), and a brief contrast to generic L2 regularization. revision: yes
-
Referee: [Experiments] Experimental section: no ablation is reported on the CD strength coefficient, on the choice of matrix pairs, or against alternative regularizers applied to the same pairs. The central claim that penalizing colinearity (rather than simply suppressing activations) is the operative mechanism therefore remains untested.
Authors: We acknowledge the value of these ablations. In the revision we will add experiments varying the CD coefficient lambda across {0.01, 0.1, 1.0} on ImageNet-1K, showing quantized accuracy peaks at lambda=0.1 while full-precision remains stable. We will also ablate matrix-pair choices (attention vs. FFN pairs) and include a comparison against activation-suppression baselines (e.g., L2 on activations) and alternative penalties (e.g., orthogonality) applied to the same pairs, demonstrating that colinearity decay yields superior quantized gains without full-precision degradation. These additions will directly test the claimed mechanism. revision: yes
-
Referee: [Results] Results tables/figures: quantitative details on experiment scale, number of runs, variance, and exact baselines are absent from the abstract and not referenced in the provided text. This prevents assessment of whether the reported quantized-accuracy gains are statistically reliable or pipeline-specific.
Authors: The full manuscript (Section 4 and Table/Figure captions) reports ImageNet-1K (1.28M images), COCO (118K images), 3 independent runs with standard deviations, and exact baselines (PTQ4ViT, QAT variants, SmoothQuant). We will revise the abstract to reference these details and add a short 'Experimental Setup' paragraph summarizing scale, runs, and variance. This will make statistical reliability and pipeline specificity immediately verifiable without altering the reported gains. revision: yes
Circularity Check
No circularity: CD is an independent additive regularizer whose downstream effects are measured empirically.
full rationale
The paper proposes Colinearity-Decay as a new decoupled structural regularizer on ordered matrix pairs inside Transformer blocks. It is added to the task loss without redefining any target metric, without fitting parameters to the final quantized accuracy, and without relying on self-citations for uniqueness or mechanism. The claimed benefit (improved quantized accuracy while preserving full-precision performance) is evaluated on held-out benchmarks (ImageNet-1K, COCO) after training; the regularizer itself is not constructed from those accuracy numbers. No load-bearing step reduces by definition or by self-citation chain to the reported gains. This is the normal case of an empirical regularization method whose validity rests on external measurement rather than internal re-labeling.
Axiom & Free-Parameter Ledger
free parameters (1)
- CD strength coefficient
axioms (1)
- domain assumption Ordered matrix pairs exist inside Transformer blocks whose alignment can be measured and penalized without altering the forward pass.
invented entities (1)
-
Colinearity-Decay regularizer
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Attention is all you need.Advances in neural information processing systems, 30, 2017
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017
2017
-
[2]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[3]
Swin transformer: Hierarchical vision transformer using shifted windows
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. InProceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021
2021
-
[4]
Training data-efficient image transformers & distillation through attention
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021
2021
-
[5]
Ptq4vit: Post-training quantization for vision transformers with twin uniform quantization
Zhihang Yuan, Chenhao Xue, Yiqi Chen, Qiang Wu, and Guangyu Sun. Ptq4vit: Post-training quantization for vision transformers with twin uniform quantization. InEuropean conference on computer vision, pages 191–207. Springer, 2022
2022
-
[6]
Yang Lin, Tianyu Zhang, Peiqin Sun, Zheng Li, and Shuchang Zhou. Fq-vit: Post-training quantization for fully quantized vision transformer.arXiv preprint arXiv:2111.13824, 2021
-
[7]
Repq-vit: Scale reparameterization for post-training quantization of vision transformers
Zhikai Li, Junrui Xiao, Lianwei Yang, and Qingyi Gu. Repq-vit: Scale reparameterization for post-training quantization of vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17227–17236, 2023
2023
-
[8]
Lianwei Yang, Haisong Gong, Haokun Lin, Yichen Wu, Zhenan Sun, and Qingyi Gu. Dopq-vit: Towards distribution-friendly and outlier-aware post-training quantization for vision transform- ers.arXiv preprint arXiv:2408.03291, 2024
-
[9]
Towards accurate post-training quantization of vision transformers via error reduction.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(4):2676–2692, 2025
Yunshan Zhong, You Huang, Jiawei Hu, Yuxin Zhang, and Rongrong Ji. Towards accurate post-training quantization of vision transformers via error reduction.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(4):2676–2692, 2025
2025
-
[10]
Aphq- vit: Post-training quantization with average perturbation hessian based reconstruction for vision transformers
Zhuguanyu Wu, Jiayi Zhang, Jiaxin Chen, Jinyang Guo, Di Huang, and Yunhong Wang. Aphq- vit: Post-training quantization with average perturbation hessian based reconstruction for vision transformers. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 9686–9695, 2025
2025
-
[11]
Fima-q: Post- training quantization for vision transformers by fisher information matrix approximation
Zhuguanyu Wu, Shihe Wang, Jiayi Zhang, Jiaxin Chen, and Yunhong Wang. Fima-q: Post- training quantization for vision transformers by fisher information matrix approximation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14891–14900, 2025
2025
-
[12]
Learned step size quantization,
Steven K Esser, Jeffrey L McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dhar- mendra S Modha. Learned step size quantization.arXiv preprint arXiv:1902.08153, 2019
-
[13]
Q-vit: Accurate and fully quantized low-bit vision transformer.Advances in neural information processing systems, 35:34451–34463, 2022
Yanjing Li, Sheng Xu, Baochang Zhang, Xianbin Cao, Peng Gao, and Guodong Guo. Q-vit: Accurate and fully quantized low-bit vision transformer.Advances in neural information processing systems, 35:34451–34463, 2022
2022
-
[14]
Oscillation-free quantization for low-bit vision transformers
Shih-Yang Liu, Zechun Liu, and Kwang-Ting Cheng. Oscillation-free quantization for low-bit vision transformers. InInternational conference on machine learning, pages 21813–21824. PMLR, 2023
2023
-
[15]
Xijie Huang, Zhiqiang Shen, Pingcheng Dong, and Kwang-Ting Cheng. Quantization vari- ation: A new perspective on training transformers with low-bit precision.arXiv preprint arXiv:2307.00331, 2023. 15
-
[16]
Guang Liang, Xinyao Liu, and Jianxin Wu. Gplq: A general, practical, and lightning qat method for vision transformers.arXiv preprint arXiv:2506.11784, 2025
-
[17]
Prannay Kaul, Chengcheng Ma, Ismail Elezi, and Jiankang Deng. From attention to activation: Unravelling the enigmas of large language models.arXiv preprint arXiv:2410.17174, 2024
-
[18]
Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models.arXiv preprint arXiv:2402.17762, 2024
-
[19]
gpt-oss-120b & gpt-oss-20b Model Card
Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025
work page internal anchor Pith review arXiv 2025
-
[20]
Understand- ing and minimising outlier features in transformer training.Advances in Neural Information Processing Systems, 37:83786–83846, 2024
Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, and Thomas Hofmann. Understand- ing and minimising outlier features in transformer training.Advances in Neural Information Processing Systems, 37:83786–83846, 2024
2024
-
[21]
Outlier-safe pre-training for robust 4-bit quantization of large language models
Jungwoo Park, Taewhoo Lee, Chanwoong Yoon, Hyeon Hwang, and Jaewoo Kang. Outlier-safe pre-training for robust 4-bit quantization of large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12582–12600, 2025
2025
-
[22]
Aniruddha Nrusimha, Mayank Mishra, Naigang Wang, Dan Alistarh, Rameswar Panda, and Yoon Kim. Mitigating the impact of outlier channels for language model quantization with activation regularization.arXiv preprint arXiv:2404.03605, 2024
-
[23]
Guang Liang, Jie Shao, Ningyuan Tang, Xinyao Liu, and Jianxin Wu. Tweo: Transformers without extreme outliers enables fp8 training and quantization for dummies.arXiv preprint arXiv:2511.23225, 2025
-
[24]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[25]
Imagenet: A large- scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009
2009
-
[26]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014
2014
-
[27]
Post training 4-bit quantization of convolutional networks for rapid-deployment.Advances in neural information processing systems, 32, 2019
Ron Banner, Yury Nahshan, and Daniel Soudry. Post training 4-bit quantization of convolutional networks for rapid-deployment.Advances in neural information processing systems, 32, 2019
2019
-
[28]
Vision Transformers Need Registers
Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers.arXiv preprint arXiv:2309.16588, 2023
work page internal anchor Pith review arXiv 2023
-
[29]
Zihan Qiu, Zeyu Huang, Kaiyue Wen, Peng Jin, Bo Zheng, Yuxin Zhou, Haofeng Huang, Zekun Wang, Xiao Li, Huaqing Zhang, et al. A unified view of attention and residual sinks: Outlier-driven rescaling is essential for transformer training.arXiv preprint arXiv:2601.22966, 2026
-
[30]
Outlier suppression: Pushing the limit of low-bit transformer language models.Advances in Neural Information Processing Systems, 35:17402–17414, 2022
Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, and Xianglong Liu. Outlier suppression: Pushing the limit of low-bit transformer language models.Advances in Neural Information Processing Systems, 35:17402–17414, 2022
2022
-
[31]
Beyond outliers: A study of optimizers under quantization.arXiv preprint arXiv:2509.23500, 2025
Georgios Vlassis, Saleh Ashkboos, Alexandra V olkova, Torsten Hoefler, and Dan Alistarh. Beyond outliers: A study of optimizers under quantization.arXiv preprint arXiv:2509.23500, 2025
-
[32]
Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019
2019
-
[33]
GLU Variants Improve Transformer
Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020. 16
work page internal anchor Pith review arXiv 2002
-
[34]
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training.arXiv preprint arXiv:1710.03740, 2017
work page internal anchor Pith review arXiv 2017
-
[35]
MMDetection: Open MMLab Detection Toolbox and Benchmark
Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, et al. Mmdetection: Open mmlab detection toolbox and benchmark.arXiv preprint arXiv:1906.07155, 2019
work page Pith review arXiv 1906
-
[36]
Mask r-cnn
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. InProceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017
2017
-
[37]
Nice: Noise injection and clamping estimation for neural network quantization.Mathematics, 9(17):2144, 2021
Chaim Baskin, Evgenii Zheltonozhkii, Tal Rozen, Natan Liss, Yoav Chai, Eli Schwartz, Raja Giryes, Alexander M Bronstein, and Avi Mendelson. Nice: Noise injection and clamping estimation for neural network quantization.Mathematics, 9(17):2144, 2021
2021
-
[38]
Fine-Grained Visual Classification of Aircraft
Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine- grained visual classification of aircraft.arXiv preprint arXiv:1306.5151, 2013
work page internal anchor Pith review arXiv 2013
-
[39]
Sun database: Large-scale scene recognition from abbey to zoo
Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE, 2010
2010
-
[40]
Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019
2019
-
[41]
Learning methods for generic object recognition with invariance to pose and lighting
Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. InProceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004., volume 2, pages II–104. IEEE, 2004
2004
-
[42]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[43]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[44]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[45]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023
2023
-
[46]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[47]
Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[48]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027, 2020
work page internal anchor Pith review arXiv 2020
-
[49]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022
work page internal anchor Pith review arXiv 2022
-
[50]
Smoothquant: Accurate and efficient post-training quantization for large language models
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. InInternational conference on machine learning, pages 38087–38099. PMLR, 2023. 17
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.