EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation
Pith reviewed 2026-05-22 10:04 UTC · model grok-4.3
The pith
EdgeRazor's mixed-precision distillation lets 1.88-bit LLMs outperform 2-bit and 3-bit baselines while cutting training costs 4-10x.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By combining Structural Quantization with Mixed Precision for bit-width control, Layer-Adaptive Feature Distillation to select informative features, and Entropy-Aware KL Divergence to balance forward and reverse KL on human-annotated and distilled data, the EdgeRazor framework enables effective sub-4-bit weight-activation quantization of LLMs. On the Qwen and MobileLLM families this yields higher accuracy than existing 2-bit and 3-bit baselines at lower training budgets, higher compression ratios, and a decoding speedup of up to 15.16× over the 16-bit baseline.
What carries the argument
EdgeRazor framework with its three integrated modules—Structural Quantization with Mixed Precision, Layer-Adaptive Feature Distillation, and Entropy-Aware KL Divergence—that together provide fine-grained bit control and balanced alignment during quantization-aware distillation.
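To make the second module more concrete: the paper describes a per-layer score c_l as the mean cosine similarity between adjacent teacher layers. The NumPy sketch below computes such a score. The function name, the random stand-in activations, and the rule of keeping the lowest-similarity transitions are assumptions for illustration only; the paper's actual selection procedure is not reproduced here.

```python
import numpy as np

def adjacent_layer_similarity(hidden_states):
    """hidden_states: array of shape (num_layers, num_tokens, hidden_dim).

    Returns c with shape (num_layers - 1,), where c[l] is the mean cosine
    similarity between the token features of layer l and layer l + 1.
    """
    h = np.asarray(hidden_states, dtype=np.float64)
    a, b = h[:-1], h[1:]                      # pairs of adjacent layers
    dot = np.sum(a * b, axis=-1)              # per-token dot products
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    cos = dot / np.maximum(norms, 1e-12)      # guard against zero vectors
    return cos.mean(axis=-1)                  # average over tokens

# Random stand-in for teacher activations (hypothetical sizes).
rng = np.random.default_rng(0)
teacher_states = rng.standard_normal((6, 16, 32))
c = adjacent_layer_similarity(teacher_states)

# Assumed selection rule: keep the k transitions where features change most,
# i.e. where adjacent-layer similarity is lowest.
k = 2
selected = np.argsort(c)[:k]
```

A low c_l means the representation changes substantially between layers l and l + 1, which is one plausible reading of "most informative"; other selection rules would fit the same score.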
If this is right
- Models achieve higher compression ratios at all tested bit widths and deliver measurable decoding speedups on edge hardware.
- Quantization-aware training for LLMs becomes viable with training budgets reduced by factors of 4 to 10.
- Sub-2-bit models become competitive with higher-precision baselines for practical deployment.
Where Pith is reading between the lines
- The same module combination could be tested on larger model families to check whether the efficiency advantage persists at scale.
- Similar adaptive distillation ideas might transfer to other compression techniques such as pruning or knowledge distillation without quantization.
- Further bit-width reductions below 1.58 bits could be explored by tightening the entropy-aware loss component.
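For readers unfamiliar with the entropy-aware loss component mentioned above, here is a hedged sketch of an entropy-weighted mix of forward and reverse KL. The source says only that a weight λ is derived from teacher output entropy; the specific rule used here (λ equal to the teacher entropy normalized by log of the vocabulary size, with forward KL weighted by λ and reverse KL by 1 - λ) is an assumption, as are all function names.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    """KL(p || q) per row."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def entropy_weighted_kl(teacher_logits, student_logits):
    p = softmax(np.asarray(teacher_logits, dtype=np.float64))
    q = softmax(np.asarray(student_logits, dtype=np.float64))
    vocab = p.shape[-1]
    entropy = -np.sum(p * np.log(p + 1e-12), axis=-1)
    # Assumed rule: normalized teacher entropy in [0, 1] sets the balance.
    lam = np.clip(entropy / np.log(vocab), 0.0, 1.0)
    forward = kl(p, q)    # teacher-to-student, mass-covering
    reverse = kl(q, p)    # student-to-teacher, mode-seeking
    loss = np.mean(lam * forward + (1.0 - lam) * reverse)
    return loss, lam
```

Under this assumed rule, tokens where the teacher is uncertain lean on forward KL, while confident tokens lean on reverse KL. The opposite assignment would be equally easy to write, so the direction should be checked against the paper before reuse.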
Load-bearing premise
The three modules can be combined across models to produce the reported accuracy and efficiency gains without hidden instabilities or heavy per-model retuning.
What would settle it
Direct reproduction of the 1.88-bit Qwen3-0.6B evaluation on the same benchmarks, checking whether the claimed margins over published 2-bit and 3-bit baselines are recovered.
Original abstract
Quantization has emerged as a mainstream approach for deploying Large Language Models (LLMs) on resource-constrained devices, yet compressing precision below 4-bit typically causes severe performance degradation or prohibitive retraining costs. In this paper, we propose EdgeRazor, a lightweight framework for LLMs via Mixed-Precision Quantization-Aware Distillation. It contains three modules: Structural Quantization with Mixed Precision for fine-grained control of bit-widths, Layer-Adaptive Feature Distillation that dynamically selects the most informative features for alignment, and Entropy-Aware KL Divergence for forward-reverse balance on both human-annotated and distilled datasets. Evaluations conducted on MobileLLM and Qwen families show that under weight-activation quantization, the 1.88-bit Qwen3-0.6B-EdgeRazor outperforms the state-of-the-art 2-bit baselines by 11.27 and surpasses the strongest 3-bit baselines by 4.38, while the quantized MobileLLM-350M-EdgeRazor requires a training budget 4-10$\times$ lower than the leading quantization-aware training method. In terms of efficiency, EdgeRazor achieves higher compression ratios at all bit-widths, and the 1.58-bit Qwen3-0.6B-EdgeRazor reduces storage from 1.11 GB to 0.19 GB while accelerating decoding by 15.16$\times$ over the 16-bit baseline. These results empirically validate the effectiveness and efficiency of EdgeRazor. The codes can be accessed from \href{https://github.com/zhangsq-nju/EdgeRazor}{GitHub} and \href{https://huggingface.co/collections/zhangsq-nju/edgerazor-nbit}{Huggingface}.
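The storage figures in the abstract translate into a compression ratio with simple arithmetic. The short sketch below does that calculation and compares it with the ratio one would get if every parameter were stored at exactly 1.58 bits. The comparison is an illustrative framing and not a result from the paper, and the abstract does not break down what accounts for the gap between the two numbers.

```python
# Storage figures quoted in the abstract for Qwen3-0.6B.
fp16_gb = 1.11
quant_gb = 0.19

# Measured end-to-end compression ratio.
measured_ratio = fp16_gb / quant_gb     # about 5.84

# Idealized ratio if all parameters went from 16 bits to 1.58 bits.
ideal_ratio = 16 / 1.58                 # about 10.13

print(f"measured: {measured_ratio:.2f}x, idealized: {ideal_ratio:.2f}x")
```

The measured ratio is lower than the idealized one, which is what one would expect whenever some tensors stay at higher precision or quantization metadata is stored alongside the weights.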
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes EdgeRazor, a lightweight framework for quantizing LLMs below 4 bits via mixed-precision quantization-aware distillation. It introduces three modules—Structural Quantization with Mixed Precision for per-layer bit-width control, Layer-Adaptive Feature Distillation for dynamic feature selection, and Entropy-Aware KL Divergence for balanced forward-reverse distillation—and evaluates them on MobileLLM and Qwen model families. The central claims are that the 1.88-bit Qwen3-0.6B-EdgeRazor outperforms SOTA 2-bit baselines by 11.27 and 3-bit baselines by 4.38, that MobileLLM-350M-EdgeRazor requires 4-10× lower training budget than leading QAT methods, and that 1.58-bit variants achieve substantial storage reduction (1.11 GB to 0.19 GB) and 15.16× decoding speedup over FP16.
Significance. If the performance and efficiency results prove robust, the work could meaningfully advance practical deployment of LLMs on edge devices by demonstrating competitive accuracy at sub-2-bit precision with reduced training overhead. The open release of code on GitHub and Hugging Face collections is a clear positive for reproducibility and follow-up research.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): the headline claim that EdgeRazor requires 4-10× lower training budget than leading QAT methods rests on the unstated assumption that determining the mixed-precision bit allocation and layer-adaptive feature selection incurs negligible extra search or hyperparameter cost; no quantitative breakdown of this overhead (e.g., search time, calibration steps, or scaling with model size) is provided, which directly affects whether the efficiency advantage holds.
- [§3 and §5] §3 (Method) and §5 (Ablations): the three modules are presented as jointly responsible for the reported gains, yet no ablation isolating the contribution of Structural Quantization with Mixed Precision versus Layer-Adaptive Feature Distillation versus Entropy-Aware KL Divergence is shown; without such controls it is impossible to verify that the combination avoids hidden instabilities or requires model-specific retuning that would undermine the lightweight claim.
minor comments (1)
- [Abstract] Abstract: the performance deltas (11.27 and 4.38) are stated without reference to the precise evaluation metric (e.g., perplexity, zero-shot accuracy) or the exact set of baselines and datasets, making the numbers difficult to interpret in isolation.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments in detail below. We believe these points will help improve the clarity and robustness of our presentation, and we outline the revisions we plan to make.
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): the headline claim that EdgeRazor requires 4-10× lower training budget than leading QAT methods rests on the unstated assumption that determining the mixed-precision bit allocation and layer-adaptive feature selection incurs negligible extra search or hyperparameter cost; no quantitative breakdown of this overhead (e.g., search time, calibration steps, or scaling with model size) is provided, which directly affects whether the efficiency advantage holds.
Authors: We appreciate the referee pointing out the need for a more explicit accounting of the overhead in our efficiency claims. The bit allocation and feature selection processes are indeed part of the framework, and while we designed them to be lightweight, we agree that a quantitative breakdown is necessary to fully support the 4-10× training budget reduction. In the revised manuscript, we will add a subsection or appendix detailing the computational cost of these steps, including measured search times on the evaluated models, the number of calibration steps, and observations on scaling. This will demonstrate that the overhead is small and does not undermine the reported efficiency advantages.
revision: yes
-
Referee: [§3 and §5] §3 (Method) and §5 (Ablations): the three modules are presented as jointly responsible for the reported gains, yet no ablation isolating the contribution of Structural Quantization with Mixed Precision versus Layer-Adaptive Feature Distillation versus Entropy-Aware KL Divergence is shown; without such controls it is impossible to verify that the combination avoids hidden instabilities or requires model-specific retuning that would undermine the lightweight claim.
Authors: We agree that providing ablations that isolate the effect of each module would strengthen the analysis and help readers understand the necessity of each component. Our current §5 includes ablations on various design choices, but we recognize that a more targeted study removing one module at a time is missing. We will revise the ablation section to include new experiments where we evaluate performance with each module individually ablated (e.g., using uniform precision instead of mixed, fixed feature selection, or standard KL divergence). These results will be presented to show the contribution of each and to confirm the stability of the combined approach across the tested models.
revision: yes
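The leave-one-out ablation described in this response can be laid out as a small configuration grid. The sketch below only enumerates the runs. The module abbreviations follow the paper's SQMP, LAFD, and EAKLD; the fallback labels are taken from the response's own examples; the dictionary layout and function name are hypothetical.

```python
MODULES = ("SQMP", "LAFD", "EAKLD")

# What replaces each module when it is switched off, per the rebuttal text.
FALLBACKS = {
    "SQMP": "uniform precision",
    "LAFD": "fixed feature selection",
    "EAKLD": "standard KL divergence",
}

def leave_one_out_configs():
    """Return the full configuration plus one run per ablated module."""
    configs = [{"name": "full", **{m: True for m in MODULES}}]
    for dropped in MODULES:
        cfg = {"name": f"without {dropped} ({FALLBACKS[dropped]})"}
        cfg.update({m: (m != dropped) for m in MODULES})
        configs.append(cfg)
    return configs

for cfg in leave_one_out_configs():
    print(cfg)
```

Four runs per model are the minimum for this design; repeating each with several seeds would also speak to the referee's concern about hidden instabilities.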
Circularity Check
No circularity in derivation chain; claims are empirical performance results.
The paper introduces EdgeRazor as an empirical framework with three modules (Structural Quantization with Mixed Precision, Layer-Adaptive Feature Distillation, Entropy-Aware KL Divergence) and validates it via direct evaluations on MobileLLM and Qwen models. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted inputs or self-citations. Central claims rest on reported accuracy and efficiency metrics against external baselines, with no load-bearing self-citation chains or ansatz smuggling. The derivation is self-contained against the stated empirical benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Structural Quantization with Mixed Precision (SQMP) ... every ⌊1/ρ⌉ consecutive output channels form one super-group, wherein one channel is quantized to 4-bit and the remainder to 1.58-bit.
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Layer-Adaptive Feature Distillation (LAFD) ... c_l = mean cosine similarity between adjacent teacher layers
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat.induction
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Entropy-Aware KL Divergence (EAKLD) ... λ derived from teacher output entropy
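The SQMP passage quoted above implies a simple relation between the super-group size and the average weight bit-width. The sketch below builds a per-channel bit-width pattern and averages it. Placing the 4-bit channel first in each group, the channel count, and the period of 8 are assumptions for illustration; the calculation also ignores scales and other metadata.

```python
import numpy as np

def supergroup_bits(num_channels, period, high_bits=4.0, low_bits=1.58):
    """Per-output-channel bit-widths: one high-precision channel per
    super-group of `period` consecutive channels, the rest low-precision."""
    bits = np.full(num_channels, low_bits)
    # Assumption: the first channel of each super-group is the 4-bit one.
    bits[::period] = high_bits
    return bits

bits = supergroup_bits(num_channels=1024, period=8)
avg_bits = bits.mean()                     # (4 + 7 * 1.58) / 8 = 1.8825
num_high = int(np.sum(bits == 4.0))        # 1024 / 8 = 128 channels
print(f"average bits per weight: {avg_bits:.4f}")
```

With a period of 8 the average comes to 1.8825 bits, close to the 1.88-bit configuration reported for Qwen3-0.6B. That correspondence is an inference from the arithmetic and is not stated in the quoted passage.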
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A Mixed-precision quantization
A.1 Quantization function for weights and activations
In this section, we provide the per-group symmetric quantization for both ...
- Random allocation. The $N$ high-precision rows are distributed uniformly at random, yielding $p_k \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Unif}[0,1]$. Standard empirical process bounds imply that the discrepancy satisfies
  $$D^*_N(P_{\mathrm{rand}}) = O_p(N^{-1/2}). \tag{16}$$
- Stacked allocation. All $N$ high-precision rows are clustered contiguously at one end of the output dimension, yielding
  $$P_{\mathrm{stack}} = \left\{ \frac{0.5}{d_{\mathrm{out}}}, \frac{1.5}{d_{\mathrm{out}}}, \ldots, \frac{N-0.5}{d_{\mathrm{out}}} \right\}. \tag{17}$$
  Since all points lie in a sub-interval of length $\rho$, taking $t=\rho$ in the definition of $D^*_N$ gives a deviation of $1-\rho$. Thus, the discrepancy is constant:
  $$D^*_N(P_{\mathrm{stack}}) = 1-\rho = \Theta(1). \tag{18}$$
- Super-group allocation (ours). The 4-bit rows are placed on a deterministic equidistant grid with period $\lfloor 1/\rho \rceil$ along the output dimension. Then, the normalized pattern is the midpoint grid
  $$P_{\mathrm{super}} = \left\{ \frac{2k-1}{2N} \right\}_{k=1}^{N}. \tag{19}$$
  For any $t \in [0,1]$, the number of points in $[0,t]$ is $\lfloor Nt + \tfrac{1}{2} \rfloor$, so that
  $$\left| \frac{1}{N} \sum_{k=1}^{N} \mathbf{1}\{p_k \le t\} - t \right| \le \frac{1}{2N}. \tag{20}$$
  While rounding row indices to discrete...
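The three discrepancy claims can be checked numerically with the standard closed form for the one-dimensional star discrepancy of a sorted point set, $\max_i \max(i/N - x_{(i)},\ x_{(i)} - (i-1)/N)$. The sketch below applies it to the random, stacked, and super-group patterns. The sizes d_out = 1024 and N = 128 (so ρ = 1/8), the random seed, and the function name are assumptions chosen for illustration.

```python
import numpy as np

def star_discrepancy_1d(points):
    """sup over t in [0, 1] of |(1/N) * #{p_k <= t} - t| for points in [0, 1]."""
    x = np.sort(np.asarray(points, dtype=np.float64))
    n = x.size
    i = np.arange(1, n + 1)
    return max(np.max(i / n - x), np.max(x - (i - 1) / n))

d_out, n_high = 1024, 128          # hypothetical sizes, rho = 1/8
rho = n_high / d_out
k = np.arange(1, n_high + 1)

p_super = (2 * k - 1) / (2 * n_high)                        # midpoint grid
p_stack = (k - 0.5) / d_out                                 # clustered at one end
p_rand = np.random.default_rng(0).uniform(size=n_high)      # i.i.d. uniform

d_super = star_discrepancy_1d(p_super)   # equals 1 / (2N)
d_stack = star_discrepancy_1d(p_stack)   # close to 1 - rho
d_rand = star_discrepancy_1d(p_rand)     # on the order of N ** -0.5

print(d_super, d_rand, d_stack)
```

Under these sizes the super-group value is exactly 1/(2N), and the stacked value agrees with 1 - ρ up to a discretization term of at most 0.5/d_out, which does not change the Θ(1) conclusion.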