pith. machine review for the scientific record.

arxiv: 2604.10091 · v1 · submitted 2026-04-11 · 💻 cs.CL

Recognition: unknown

SEPTQ: A Simple and Effective Post-Training Quantization Paradigm for Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:30 UTC · model grok-4.3

classification 💻 cs.CL
keywords post-training quantization · large language models · importance scoring · mask matrix · low-bit quantization · weight update · model compression

The pith

SEPTQ quantizes LLMs by scoring each weight element globally, then updating column by column under a mask.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to create a simpler post-training quantization approach for large language models that avoids retraining and the intricate per-layer steps common in existing methods. It computes an importance score for every element in each weight matrix to fix the quantization locations once, in a static global pass, then applies a mask to drive quantization and weight adjustment column by column until a usable quantized matrix results. This reduces the entire procedure to two steps while targeting better preservation of generative performance, especially at low bit-widths. A reader would care because the result points toward running billion-parameter models on ordinary hardware with less accuracy loss and no extra training cost.

Core claim

SEPTQ first calculates an importance score for each element in the weight matrix and determines the quantization locations in a single static, global pass. It then uses the mask matrix, which marks the important locations, to quantize and update the associated weights column by column until the final quantized weight matrix is obtained. This reduces post-training quantization to two steps and addresses effectiveness and efficiency simultaneously.
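
The abstract specifies neither the importance-score formula nor the column update rule, so the following is only a sketch of the two-step control flow under explicit assumptions: importance is taken as absolute weight magnitude, and the "update" folds each column's quantization error at important positions into the next column. It should not be read as the authors' actual algorithm.

    import numpy as np

    def rtn(col, bits=3):
        # Round-to-nearest symmetric uniform quantization of one column.
        qmax = 2 ** (bits - 1) - 1
        scale = max(float(np.abs(col).max()), 1e-12) / qmax
        return np.clip(np.round(col / scale), -qmax, qmax) * scale

    def septq_sketch(W, bits=3, important_frac=0.05):
        # Step 1: score every element once, in a single static global pass
        # (assumed score: absolute magnitude).
        scores = np.abs(W)
        k = max(1, int(important_frac * W.size))
        thresh = np.partition(scores.ravel(), -k)[-k]
        mask = scores >= thresh                      # important locations

        # Step 2: mask-guided quantization and weight update, column by column.
        W_work = W.astype(np.float64).copy()
        Q = np.empty_like(W_work)
        for j in range(W_work.shape[1]):
            Q[:, j] = rtn(W_work[:, j], bits)
            # Assumed update: carry the error made at important positions of
            # this column into the next column (not the paper's stated rule).
            err = np.where(mask[:, j], W_work[:, j] - Q[:, j], 0.0)
            if j + 1 < W_work.shape[1]:
                W_work[:, j + 1] += err
        return Q, mask

The sketch runs end to end on a random matrix, but it only fixes the shape of the procedure; the paper's effectiveness claim hinges on the specific score and update rule it actually defines.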

What carries the argument

The mask matrix derived from static global importance scores, which guides quantization and incremental weight updates performed column by column.

If this is right

  • LLMs ranging from millions to billions of parameters achieve usable performance after quantization at multiple bit widths.
  • Low-bit settings suffer less degradation than with prior complex PTQ techniques.
  • The full quantization process collapses to two straightforward steps without per-layer tuning loops.
  • Generative quality holds across diverse datasets without any retraining pass.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same importance-plus-mask pattern might transfer to other compression operations such as structured pruning.
  • Hardware accelerators could exploit the column-wise update order to reduce memory traffic during the quantization step itself.
  • Scaling the method to models beyond current billion-parameter sizes would test whether the static global scoring remains stable.

Load-bearing premise

A single static global ranking of weight importance plus sequential column-by-column mask updates can recover a quantized matrix whose generative behavior stays close to the original model.
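
To make "stays close" concrete, weight-only PTQ is usually framed as a per-layer reconstruction problem over calibration activations; GPTQ, for instance, minimizes exactly this kind of objective. The abstract does not state what SEPTQ optimizes, so the following is an assumed formalization for reference, not a quotation of the method:

    \min_{\widehat{W} \in \mathcal{Q}_b} \; \bigl\| W X - \widehat{W} X \bigr\|_F^2

Here W is the original layer weight, X a batch of calibration activations, and \mathcal{Q}_b the set of weight matrices representable at the target bit-width b. The load-bearing premise amounts to claiming that a static global mask plus sequential column updates lands near a minimizer of such an objective without per-layer Hessian machinery.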

What would settle it

Quantize several LLMs to 2-bit or 3-bit with SEPTQ, run them on standard benchmarks such as WikiText perplexity or zero-shot tasks, and check whether accuracy or fluency falls below that of strong prior PTQ baselines.
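
A minimal version of that check, assuming a quantized checkpoint that loads through Hugging Face transformers (the path below is a placeholder, not an artifact released with the paper), is the usual fixed-window WikiText-2 perplexity loop:

    import torch
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "path/to/quantized-checkpoint"   # hypothetical local path
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).eval()

    # Concatenate the WikiText-2 test split and score it in non-overlapping windows.
    text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
    ids = tok(text, return_tensors="pt").input_ids
    seqlen, nlls = 2048, []
    with torch.no_grad():
        for i in range(0, ids.shape[1] - seqlen + 1, seqlen):
            chunk = ids[:, i:i + seqlen]
            loss = model(chunk, labels=chunk).loss   # mean token NLL over the window
            nlls.append(loss.float() * seqlen)
    ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
    print(f"WikiText-2 perplexity: {ppl.item():.2f}")

Running the same loop on GPTQ- and AWQ-quantized copies of the same base model at the same bit-width is what would make the comparison decisive.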

Figures

Figures reproduced from arXiv: 2604.10091 by Changya Li, Fenglong Ma, Feng Zhang, Han Liu, Haotian Gao, Hong Yu, Wei Wang, Xiaotong Zhang.

Figure 1: The perplexity results of widely-used quantization …
Figure 2: The frequency and proportion distributions of the importance scores in different quantization settings. In 4-bit …
Figure 3: The visualization of determined quantization locations …
Figure 4: The visualization of quantizing the model weights. Given the original weight matrix …
Figure 5: The perplexity results of the OPT-13B, LLaMA-13B and LLaMA2-13B models on C4 and WikiText2 datasets (the lower the better) …
Figure 7: The effectiveness results by using the global strategy …
Figure 8: Output projection matrices of consecutive linear layers (4th to 7th layers) showing consistent trends across layers
Original abstract

Large language models (LLMs) have shown remarkable performance in various domains, but they are constrained by massive computational and storage costs. Quantization, an effective technique for compressing models to fit resource-limited devices while preserving generative quality, encompasses two primary methods: quantization aware training (QAT) and post-training quantization (PTQ). QAT involves additional retraining or fine-tuning, thus inevitably resulting in high training cost and making it unsuitable for LLMs. Consequently, PTQ has become the research hotspot in recent quantization methods. However, existing PTQ methods usually rely on various complex computation procedures and suffer from considerable performance degradation under low-bit quantization settings. To alleviate the above issues, we propose a simple and effective post-training quantization paradigm for LLMs, named SEPTQ. Specifically, SEPTQ first calculates the importance score for each element in the weight matrix and determines the quantization locations in a static global manner. Then it utilizes the mask matrix which represents the important locations to quantize and update the associated weights column-by-column until the appropriate quantized weight matrix is obtained. Compared with previous methods, SEPTQ simplifies the post-training quantization procedure into only two steps, and considers the effectiveness and efficiency simultaneously. Experimental results on various datasets across a suite of models ranging from millions to billions in different quantization bit-levels demonstrate that SEPTQ significantly outperforms other strong baselines, especially in low-bit quantization scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes SEPTQ, a simplified post-training quantization (PTQ) paradigm for LLMs. It first computes a static global importance score for each element in the weight matrix to determine quantization locations via a mask, then performs mask-guided quantization and weight updates column-by-column until a suitable quantized matrix is obtained. The authors claim this reduces PTQ to two steps while preserving generative quality better than prior methods, with experimental superiority shown across models from millions to billions of parameters, various datasets, and especially low-bit settings.

Significance. If the central empirical claims hold with full reproducibility, SEPTQ would offer a practically significant simplification of PTQ for LLMs by avoiding retraining and complex per-layer procedures common in methods like GPTQ or AWQ. This could facilitate broader deployment of quantized models on resource-constrained hardware. The work does not provide parameter-free derivations or machine-checked proofs, but its emphasis on a lightweight two-step recipe could be impactful if the low-bit results prove robust.

major comments (3)
  1. [Abstract] Abstract and method description: the exact formula for the element-wise importance score and the column-by-column update rule (including any objective minimized during updates) are unspecified. This is load-bearing for the central claim, as the skeptic correctly notes that a purely static mask without calibration activations or Hessian-based compensation risks unaddressed activation-scale mismatches that cause low-bit degradation in prior weight-only PTQ.
  2. [Abstract] Abstract: the assertion of significant outperformance over strong baselines lacks any quantitative metrics, model sizes, bit-widths, datasets, or error bars. Without these, it is impossible to assess whether the two-step procedure truly outperforms GPTQ/AWQ-style methods or whether results reflect post-hoc selection.
  3. [Experiments] The weakest assumption (static global mask + sequential updates suffice without per-layer sensitivity analysis) is not tested via ablation; §4 experiments should include controls showing that removing calibration data or layer-wise error minimization does not degrade perplexity or downstream accuracy at 2-4 bits.
minor comments (1)
  1. [Abstract] The abstract repeats the two-step description; a single concise sentence would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the emphasis on clarity and experimental rigor. We address each major comment point by point below and will make the necessary revisions to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract and method description: the exact formula for the element-wise importance score and the column-by-column update rule (including any objective minimized during updates) are unspecified. This is load-bearing for the central claim, as the skeptic correctly notes that a purely static mask without calibration activations or Hessian-based compensation risks unaddressed activation-scale mismatches that cause low-bit degradation in prior weight-only PTQ.

    Authors: We agree that the abstract, due to its brevity, does not include the precise mathematical formulations. The manuscript describes the overall procedure but to strengthen the presentation we will add the exact formula for the element-wise importance score (computed statically from the weight matrix) and the column-by-column update rule (which minimizes a local reconstruction error objective on masked positions) to the revised abstract and method section. This will also allow us to explicitly discuss how the mask-guided updates mitigate activation-scale issues without requiring per-layer Hessian computations or extensive calibration data, consistent with the empirical robustness shown in low-bit regimes. revision: yes

  2. Referee: [Abstract] Abstract: the assertion of significant outperformance over strong baselines lacks any quantitative metrics, model sizes, bit-widths, datasets, or error bars. Without these, it is impossible to assess whether the two-step procedure truly outperforms GPTQ/AWQ-style methods or whether results reflect post-hoc selection.

    Authors: We acknowledge that the abstract summarizes results at a high level without specific numbers. The full paper reports detailed quantitative comparisons (perplexity and accuracy metrics, model sizes from millions to billions of parameters, 2-4 bit widths, multiple datasets, and direct comparisons to GPTQ and AWQ) in Section 4 with tables. In the revision we will incorporate a small number of key quantitative highlights (e.g., average perplexity improvement at 2-bit on representative models) into the abstract to make the performance claims more concrete and verifiable. revision: yes

  3. Referee: [Experiments] The weakest assumption (static global mask + sequential updates suffice without per-layer sensitivity analysis) is not tested via ablation; §4 experiments should include controls showing that removing calibration data or layer-wise error minimization does not degrade perplexity or downstream accuracy at 2-4 bits.

    Authors: This is a fair observation. While the existing experiments demonstrate consistent gains across diverse models and bit-widths, we agree that explicit ablations isolating the static global mask and column-wise updates (including variants with reduced or no calibration data and comparisons against per-layer sensitivity methods) would further validate the core assumption. We will add these targeted ablation studies to the revised Section 4, reporting perplexity and accuracy at 2-4 bits to confirm that the simplified two-step procedure remains effective. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic PTQ recipe with external empirical validation

full rationale

The paper defines SEPTQ as a two-step procedure (global element-wise importance scoring to produce a static mask, followed by mask-guided column-by-column quantization and weight updates). No equations, fitted parameters, or self-citations are shown that reduce the final quantized matrix or performance claim to the inputs by construction. The effectiveness assertion rests on reported experiments across models and bit-widths, which constitute independent evidence rather than a definitional loop. This is a standard empirical algorithmic contribution whose validity is falsifiable outside the method description itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unstated premise that importance scores computed once can reliably identify quantization locations and that column-wise masked updates converge to high-quality low-bit weights; no explicit free parameters, axioms, or new entities are named in the abstract.

pith-pipeline@v0.9.0 · 5568 in / 1329 out tokens · 67061 ms · 2026-05-10T16:30:16.617469+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 12 canonical work pages · 6 internal anchors

  1. [1]

    Ron Banner, Yury Nahshan, and Daniel Soudry. 2019. Post training 4-bit quantization of convolutional networks for rapid-deployment. In Conference on Neural Information Processing Systems (NeurIPS). 7948–7956

  2. [2]

    Sumithra Bhakthavatsalam, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, and Peter Clark. 2021. Think you have Solved Direct-Answer Question Answering? Try ARC-DA, the Direct-Answer AI2 Reasoning Challenge. CoRR abs/2102.03315 (2021)

  3. [3]

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Reasoning about Physical Commonsense in Natural Language. In AAAI Conference on Artificial Intelligence (AAAI). 7432–7439

  4. [4]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Conference on Neural Information Processing Systems (NeurIPS). 1877–1901

  5. [5]

    Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. 2023. QuIP: 2-Bit Quantization of Large Language Models With Guarantees. In Conference on Neural Information Processing Systems (NeurIPS)

  6. [6]

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. CoRR abs/2208.07339 (2022)

  7. [7]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics (NAACL). 4171–4186

  8. [8]

    Shaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, and Oleg Rybakov. 2022. 4-bit Conformer with Native Quantization Aware Training for Speech Recognition. In Conference of the International Speech Communication Association (INTERSPEECH). 1711–1715

  9. [9]

    Xin Ding, Xiaoyu Liu, Yun Zhang, Zhijun Tu, Wei Li, Jie Hu, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, and Yunhe Wang. 2023. CBQ: Cross-Block Quantization for Large Language Models. CoRR abs/2312.07950 (2023)

  10. [10]

    Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. 2024. Extreme Compression of Large Language Models via Additive Quantization. In International Conference on Machine Learning (ICML)

  11. [11]

    Jun Fang, Ali Shafiee, Hamzah Abdel-Aziz, David Thorsley, Georgios Georgiadis, and Joseph Hassoun. 2020. Post-training Piecewise Linear Quantization for Deep Neural Networks. In European Conference on Computer Vision (ECCV). 69–86

  12. [12]

    Elias Frantar and Dan Alistarh. 2022. Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning. In Conference on Neural Information Processing Systems (NeurIPS)

  13. [13]

    Elias Frantar and Dan Alistarh. 2024. QMoE: Sub-1-Bit Compression of Trillion Parameter Models. In Conference on Machine Learning and Systems (MLSys)

  14. [14]

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. CoRR abs/2210.17323 (2022)

  15. [15]

    Babak Hassibi and David G. Stork. 1992. Second Order Derivatives for Network Pruning: Optimal Brain Surgeon. In Conference on Neural Information Processing Systems (NeurIPS). 164–171

  16. [16]

    Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xiaojuan Qi. 2024. BiLLM: Pushing the Limit of Post-Training Quantization for LLMs. In International Conference on Machine Learning (ICML)

  17. [17]

    Itay Hubara, Yury Nahshan, Yair Hanani, Ron Banner, and Daniel Soudry. 2021. Accurate Post Training Quantization With Small Calibration Sets. In International Conference on Machine Learning (ICML). 4466–4475

  18. [18]

    Minsoo Kim, Sihwa Lee, Sukjin Hong, Du-Seong Chang, and Jungwook Choi. 2022. Understanding and Improving Knowledge Distillation for Quantization Aware Training of Large Transformer Encoders. In Conference on Empirical Methods in Natural Language Processing (EMNLP). 6713–6725

  19. [19]

    Eldar Kurtic, Daniel Campos, Tuan Nguyen, Elias Frantar, Mark Kurtz, Benjamin Fineran, Michael Goin, and Dan Alistarh. 2022. The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models. In Conference on Empirical Methods in Natural Language Processing (EMNLP)

  20. [20]

    Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. CoRR abs/2211.05100 (2022)

  21. [21]

    Dohyeok Lee, Seungyub Han, Taehyun Cho, and Jungwoo Lee. 2023. SPQR: Controlling Q-ensemble Independence with Spiked Random Model for Reinforcement Learning. In Conference on Neural Information Processing Systems (NeurIPS)

  22. [22]

    Jung Hyun Lee, Jeonghoon Kim, Se Jung Kwon, and Dongsoo Lee. 2023. FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization. In International Conference on Machine Learning (ICML). 18913–18939

  23. [23]

    Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. 2021. BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction. In International Conference on Learning Representations (ICLR)

  24. [24]

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. In Conference on Machine Learning and Systems (MLSys)

  25. [25]

    Shih-Yang Liu, Zechun Liu, and Kwang-Ting Cheng. 2023. Oscillation-free Quantization for Low-bit Vision Transformers. In International Conference on Machine Learning (ICML). 21813–21824

  26. [26]

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019)

  27. [27]

    Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. 2024. LLM-QAT: Data-Free Quantization Aware Training for Large Language Models. In Annual Meeting of the Association for Computational Linguistics (ACL). 467–484

  28. [28]

    Zhenhua Liu, Yunhe Wang, Kai Han, Wei Zhang, Siwei Ma, and Wen Gao. 2021. Post-Training Quantization for Vision Transformer. In Conference on Neural Information Processing Systems (NeurIPS). 28092–28103

  29. [29]

    Mitchell P. Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn Treebank: Annotating Predicate Argument Structure. In Human Language Technology

  30. [30]

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer Sentinel Mixture Models. In International Conference on Learning Representations (ICLR)

  31. [31]

    Markus Nagel, Rana Ali Amjad, Mart van Baalen, Christos Louizos, and Tijmen Blankevoort. 2020. Up or Down? Adaptive Rounding for Post-Training Quantization. In International Conference on Machine Learning (ICML). 7197–7206

  32. [32]

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1 (2019)

  33. [33]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research (JMLR) 21 (2020), 140:1–140:67

  34. [34]

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. WinoGrande: an adversarial winograd schema challenge at scale. Commun. ACM 64, 9 (2021), 99–106

  35. [35]

    Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. 2024. OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models. In International Conference on Learning Representations (ICLR)

  36. [36]

    Shyam Anil Tailor, Javier Fernández-Marqués, and Nicholas Donald Lane. 2021. Degree-Quant: Quantization-Aware Training for Graph Neural Networks. In International Conference on Learning Representations (ICLR)

  37. [37]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. CoRR abs/2302.13971 (2023)

  38. [38]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Conference on Neural Information Processing Systems (NeurIPS). 5998–6008

  39. [39]

    Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. 2023. BitNet: Scaling 1-bit Transformers for Large Language Models. CoRR abs/2310.11453 (2023)

  40. [40]

    Guangxuan Xiao, Ji Lin, Mickaël Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. In International Conference on Machine Learning (ICML). 38087–38099

  41. [41]

    Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, and Wanxiang Che. 2024. OneBit: Towards Extremely Low-bit Large Language Models. CoRR abs/2402.11295 (2024)

  42. [42]

    Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. 2022. ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. In Conference on Neural Information Processing Systems (NeurIPS)

  43. [43]

    Zhihang Yuan, Lin Niu, Jiawei Liu, Wenyu Liu, Xinggang Wang, Yuzhang Shang, Guangyu Sun, Qiang Wu, Jiaxiang Wu, and Bingzhe Wu. 2023. RPTQ: Reorder-based Post-training Quantization for Large Language Models. CoRR abs/2304.01089 (2023)

  44. [44]

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a Machine Really Finish Your Sentence? In Annual Meeting of the Association for Computational Linguistics (ACL). 4791–4800

  45. [45]

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: Open Pre-trained Transformer Language Models. CoRR abs/2205.01068 (2022)

  46. [46]

    Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. 2023. A Survey on Model Compression for Large Language Models. CoRR abs/2308.07633 (2023)