
arxiv: 2605.04062 · v2 · pith:U6XO6XTRnew · submitted 2026-04-10 · 💻 cs.LG · cs.AI

EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

Pith reviewed 2026-05-22 10:04 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords large language models · quantization · mixed precision · distillation · model compression · edge deployment · low-bit inference

The pith

EdgeRazor's mixed-precision distillation lets 1.88-bit LLMs outperform 2-bit and 3-bit baselines while cutting training costs 4-10x.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EdgeRazor as a framework that applies mixed-precision structural quantization, layer-adaptive feature distillation, and entropy-aware KL divergence to compress large language models for edge use. It reports that this combination produces a 1.88-bit Qwen3-0.6B model that exceeds state-of-the-art 2-bit results by 11.27 points and the strongest 3-bit results by 4.38 points, while a quantized MobileLLM-350M version needs a training budget 4-10× lower than the leading quantization-aware training method. A reader would care because the work targets the practical barriers of memory, speed, and retraining expense that currently limit powerful models on phones and other constrained hardware.

Core claim

By combining Structural Quantization with Mixed Precision for bit-width control, Layer-Adaptive Feature Distillation to select informative features, and Entropy-Aware KL Divergence to balance forward and reverse KL terms on human-annotated and distilled data, the EdgeRazor framework enables effective sub-4-bit weight-activation quantization of LLMs. On the Qwen and MobileLLM families this yields higher accuracy than existing 2-bit and 3-bit baselines at lower training budgets, higher overall compression ratios, and decoding speedups of up to 15.16× over the 16-bit baseline.
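A fractional figure such as 1.88 bits is easiest to read as a parameter-weighted average over weight groups stored at different precisions. The sketch below is illustrative arithmetic only: the choice of ternary groups (log2 3, about 1.58 bits) mixed with 4-bit groups, and the resulting split, are assumptions made for the example and are not the allocation the paper uses.

```python
import math


def average_bits(groups):
    """Parameter-weighted mean bit-width.

    groups: list of (fraction_of_weights, bits_per_weight) pairs.
    """
    total = sum(frac for frac, _ in groups)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError("fractions must sum to 1")
    return sum(frac * bits for frac, bits in groups)


def fraction_high_for_target(target, low_bits, high_bits):
    """Fraction of weights kept at high_bits so that the mean equals target."""
    if not low_bits <= target <= high_bits:
        raise ValueError("target must lie between low_bits and high_bits")
    return (target - low_bits) / (high_bits - low_bits)


# Assumption for illustration: ternary weights cost log2(3) bits each.
TERNARY_BITS = math.log2(3)

# Hypothetical two-level mix: how much 4-bit is needed to average 1.88 bits?
p_high = fraction_high_for_target(1.88, TERNARY_BITS, 4.0)
mix = [(1.0 - p_high, TERNARY_BITS), (p_high, 4.0)]
print(f"4-bit fraction: {p_high:.3f}, mean bits: {average_bits(mix):.2f}")
```

Under these assumed precisions, roughly one weight in eight would sit at 4 bits. Scale factors and other per-group metadata are ignored here, so a real format's effective bit-width would be somewhat higher.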

What carries the argument

The EdgeRazor framework carries the argument: its three integrated modules (Structural Quantization with Mixed Precision, Layer-Adaptive Feature Distillation, and Entropy-Aware KL Divergence) together provide fine-grained bit control and balanced alignment during quantization-aware distillation.
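The abstract describes the third module only as a forward-reverse balance, so the following is a minimal sketch of one way such a balance could be built, not the paper's loss. It mixes forward KL (teacher to student) and reverse KL (student to teacher) with a weight taken from the teacher's normalized entropy. The weighting rule and all function names are assumptions introduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))


def normalized_entropy(p, eps=1e-12):
    """Shannon entropy divided by log(len(p)), so the value is roughly in [0, 1]."""
    if len(p) < 2:
        return 0.0
    h = -sum(pi * math.log(pi + eps) for pi in p)
    return h / math.log(len(p))


def entropy_weighted_kl(teacher_logits, student_logits):
    """Blend forward and reverse KL using the teacher's entropy.

    Assumed rule (hypothetical): an uncertain teacher (high entropy) leans on
    forward KL, which covers all modes; a confident teacher leans on reverse
    KL, which concentrates on the dominant mode.
    """
    t = softmax(teacher_logits)
    s = softmax(student_logits)
    lam = normalized_entropy(t)
    forward = kl(t, s)
    reverse = kl(s, t)
    return lam * forward + (1.0 - lam) * reverse, lam


loss, lam = entropy_weighted_kl([4.0, 1.0, 0.5], [2.0, 2.0, 0.1])
print(f"lambda={lam:.3f}, loss={loss:.4f}")
```

The point of the sketch is the shape of the trade-off: a single per-token weight decides how much mode-covering versus mode-seeking pressure the student receives. The paper's actual weighting may differ in direction and form.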

If this is right

  • Models achieve higher compression ratios at all tested bit widths and deliver measurable decoding speedups on edge hardware.
  • Quantization-aware training for LLMs becomes viable with training budgets reduced by factors of 4 to 10.
  • Sub-2-bit models become competitive with higher-precision baselines for practical deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same module combination could be tested on larger model families to check whether the efficiency advantage persists at scale.
  • Similar adaptive distillation ideas might transfer to other compression techniques such as pruning or knowledge distillation without quantization.
  • Further bit-width reductions below 1.58 bits could be explored by tightening the entropy-aware loss component.

Load-bearing premise

The three modules can be combined across models to produce the reported accuracy and efficiency gains without hidden instabilities or heavy per-model retuning.

What would settle it

Direct reproduction of the 1.88-bit Qwen3-0.6B evaluation on the same benchmarks, checking whether the claimed margins over published 2-bit and 3-bit baselines are recovered.

Figures

Figures reproduced from arXiv: 2605.04062 by Chen Wu, Le-Tong Huang, Nan Li, Shao-Qun Zhang, Shu-Hao Zhang, Xiang-Sheng Deng, Xin-Yi Zou, Zhi-Hua Zhou.

Figure 1: Overview of the EDGERAZOR framework. A 16-bit teacher guides an n-bit mixed-precision student through a joint objective of task-specific cross-entropy, AFD, and EAKLD. Adjacent text: "… sub-billion [36] to hundreds of billions of parameters [1, 52], a compelling demand has emerged for the lightweight deployment of LLMs on resource-constrained devices, where limited storage, memory, and computational capacity impose stringen…"
Figure 1: Performance comparison of EDGERAZOR and strong baselines. (arXiv:2605.04062v2 [cs.LG], 21 May 2026)
Figure 2: Average performance of quantized Qwen3 under E…
Figure 3: Performance comparison of EDGERAZOR and state-of-the-art baselines at each bit-width. Panel text: (a) Weight-only quantization, W1.88 EdgeRazor vs. W2 baselines; (b) Weight-activation quantization, W1.88 EdgeRazor vs. W2 baselines. Axes: ARC-e, ARC-c, HellaS, PIQA, BoolQ, WinoG, SIQA, OBQA, Tr.QA2, Ethics, MMLU, IFEval, GSM8K, HumanE. Legend: BF16, EdgeRazor, OmniQuan…
Figure 4: Performance comparison of 1.88-bit EDGERAZOR and 2-bit baselines on Qwen3-0.6B. Adjacent text: "… below the weight-only one, whereas the two EDGERAZOR curves nearly coincide, and the gap between EDGERAZOR and baselines is larger under weight-activation quantization; (3) In …"
Figure 5: Performance comparison of 4-bit EDGERAZOR and strong baselines on Qwen2.5-Omni-7B. Adjacent text: "… encoder surpasses AWQ by 0.44 on Video-MME and 1.42 on MLVU; (8) In …"
Figure 6: Average performance and training budgets of…
Figure 7: Efficiency comparison of EDGERAZOR and other baselines at each bit-width. Panel text: Storage, Memory (GB, ↓): 0.19, 0.51, 1.11, 1.46; Prefilling, Decoding (Tokens/s, ↑): 711.67, 317.07, 337.99, 20.91. Legend: EdgeRazor-TQ2_0, BF16.
Figure 8: Efficiency comparison of deploying 1.58-bit and 4-bit Qwen3-0.6B via…
Figure 9: Super-group allocation for weight matrices, visualized via the transposed matrix
Figure 10: Stacked allocation for weight matrices, visualized via the transposed matrix
Figure 11: Frequency of the k layers with the lowest c_l (k = 3). Axes: Layer Index (1 to 28), Frequency (0 to 1000). Legend: Mathematics, Coding.
Figure 12: Frequency of the k layers with the lowest c_l (k = 5). Axes: Layer Index (1 to 28), Frequency (0 to 1000). Legend: Mathematics, Coding.
Figure 13: Frequency of the k layers with the lowest c_l (k = 10). Adjacent text: "In Section 3.2, we adopt cosine similarity as an importance metric to quantify transformations across feature layers. In this section, we investigate how important feature patterns vary across data domains, thereby demonstrating the limitations of pre-specified layer supervision. We analyze two distinct domains, mathematics and coding, from our traini…"
Figure 14: Average rank of each layer along with its corresponding range.
Figure 15: Scatter plots of top-1 probability versus CAKLD confidence on human-annotated datasets.
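The captions for Figures 11 to 13 count how often each layer falls among the k layers with the lowest similarity score, and the text attached to Figure 13 names cosine similarity as the importance metric. A minimal sketch of that selection step follows. It assumes the score for a layer is the cosine similarity between the layer's input and output features, which the page does not confirm, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def lowest_similarity_layers(hidden_states, k):
    """Return 1-based indices of the k layers with the lowest score.

    hidden_states[0] is the input to layer 1 and hidden_states[l] is the
    output of layer l. Assumption: a low input/output similarity means the
    layer transforms its features strongly, so it is a candidate for
    feature alignment.
    """
    scores = [
        (cosine(hidden_states[l - 1], hidden_states[l]), l)
        for l in range(1, len(hidden_states))
    ]
    scores.sort()  # ascending similarity; ties fall back to layer index
    return [layer for _, layer in scores[:k]]


# Toy features for a 3-layer model (2-dimensional for readability).
states = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.5, 1.0]]
print(lowest_similarity_layers(states, k=2))
```

In the toy input, layer 2 rotates the feature almost 90 degrees and so ranks first. Running the selection per batch and tallying the chosen indices would give frequency counts of the kind the figures plot for the mathematics and coding domains.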
Original abstract

Quantization has emerged as a mainstream approach for deploying Large Language Models (LLMs) on resource-constrained devices, yet compressing precision below 4-bit typically causes severe performance degradation or prohibitive retraining costs. In this paper, we propose EdgeRazor, a lightweight framework for LLMs via Mixed-Precision Quantization-Aware Distillation. It contains three modules: Structural Quantization with Mixed Precision for fine-grained control of bit-widths, Layer-Adaptive Feature Distillation that dynamically selects the most informative features for alignment, and Entropy-Aware KL Divergence for forward-reverse balance on both human-annotated and distilled datasets. Evaluations conducted on MobileLLM and Qwen families show that under weight-activation quantization, the 1.88-bit Qwen3-0.6B-EdgeRazor outperforms the state-of-the-art 2-bit baselines by 11.27 and surpasses the strongest 3-bit baselines by 4.38, while the quantized MobileLLM-350M-EdgeRazor requires a training budget 4-10× lower than the leading quantization-aware training method. In terms of efficiency, EdgeRazor achieves higher compression ratios at all bit-widths, and the 1.58-bit Qwen3-0.6B-EdgeRazor reduces storage from 1.11 GB to 0.19 GB while accelerating decoding by 15.16× over the 16-bit baseline. These results empirically validate the effectiveness and efficiency of EdgeRazor. The codes can be accessed from GitHub (https://github.com/zhangsq-nju/EdgeRazor) and Huggingface (https://huggingface.co/collections/zhangsq-nju/edgerazor-nbit).
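The efficiency figures in the abstract can be checked with a few lines of arithmetic. The storage numbers (1.11 GB and 0.19 GB) come straight from the abstract. The two decoding throughputs (317.07 and 20.91 tokens/s) are read from the plot values attached to Figure 7, and pairing them as EdgeRazor versus the 16-bit baseline is an assumption.

```python
def compression_ratio(before_gb, after_gb):
    """How many times smaller the compressed model is."""
    return before_gb / after_gb


def reduction_percent(before_gb, after_gb):
    """Share of the original storage that was removed, in percent."""
    return 100.0 * (1.0 - after_gb / before_gb)


def speedup(fast_tokens_per_s, slow_tokens_per_s):
    """Throughput ratio between two decoders."""
    return fast_tokens_per_s / slow_tokens_per_s


# Storage figures quoted in the abstract.
storage_ratio = compression_ratio(1.11, 0.19)
storage_cut = reduction_percent(1.11, 0.19)

# Assumed pairing of the tokens/s values extracted alongside Figure 7.
decode_speedup = speedup(317.07, 20.91)

print(f"storage: {storage_ratio:.2f}x smaller ({storage_cut:.1f}% less)")
print(f"decoding: {decode_speedup:.2f}x faster")
```

Under that pairing the decoding ratio comes out at about 15.16, which matches the speedup stated in the abstract, and the storage change corresponds to a model roughly 5.84 times smaller.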

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes EdgeRazor, a lightweight framework for quantizing LLMs below 4 bits via mixed-precision quantization-aware distillation. It introduces three modules (Structural Quantization with Mixed Precision for fine-grained bit-width control, Layer-Adaptive Feature Distillation for dynamic feature selection, and Entropy-Aware KL Divergence for balanced forward-reverse distillation) and evaluates them on the MobileLLM and Qwen model families. The central claims are that the 1.88-bit Qwen3-0.6B-EdgeRazor outperforms SOTA 2-bit baselines by 11.27 and the strongest 3-bit baselines by 4.38, that MobileLLM-350M-EdgeRazor requires a 4-10× lower training budget than the leading QAT method, and that the 1.58-bit Qwen3-0.6B variant achieves a substantial storage reduction (1.11 GB to 0.19 GB) and a 15.16× decoding speedup over the 16-bit baseline.

Significance. If the performance and efficiency results prove robust, the work could meaningfully advance practical deployment of LLMs on edge devices by demonstrating competitive accuracy at sub-2-bit precision with reduced training overhead. The open release of code on GitHub and Hugging Face collections is a clear positive for reproducibility and follow-up research.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): the headline claim that EdgeRazor requires 4-10× lower training budget than leading QAT methods rests on the unstated assumption that determining the mixed-precision bit allocation and layer-adaptive feature selection incurs negligible extra search or hyperparameter cost; no quantitative breakdown of this overhead (e.g., search time, calibration steps, or scaling with model size) is provided, which directly affects whether the efficiency advantage holds.
  2. [§3 and §5] §3 (Method) and §5 (Ablations): the three modules are presented as jointly responsible for the reported gains, yet no ablation isolating the contribution of Structural Quantization with Mixed Precision versus Layer-Adaptive Feature Distillation versus Entropy-Aware KL Divergence is shown; without such controls it is impossible to verify that the combination avoids hidden instabilities or requires model-specific retuning that would undermine the lightweight claim.
minor comments (1)
  1. [Abstract] Abstract: the performance deltas (11.27 and 4.38) are stated without reference to the precise evaluation metric (e.g., perplexity, zero-shot accuracy) or the exact set of baselines and datasets, making the numbers difficult to interpret in isolation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments in detail below. We believe these points will help improve the clarity and robustness of our presentation, and we outline the revisions we plan to make.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the headline claim that EdgeRazor requires 4-10× lower training budget than leading QAT methods rests on the unstated assumption that determining the mixed-precision bit allocation and layer-adaptive feature selection incurs negligible extra search or hyperparameter cost; no quantitative breakdown of this overhead (e.g., search time, calibration steps, or scaling with model size) is provided, which directly affects whether the efficiency advantage holds.

    Authors: We appreciate the referee pointing out the need for a more explicit accounting of the overhead in our efficiency claims. The bit allocation and feature selection processes are indeed part of the framework, and while we designed them to be lightweight, we agree that a quantitative breakdown is necessary to fully support the 4-10× training budget reduction. In the revised manuscript, we will add a subsection or appendix detailing the computational cost of these steps, including measured search times on the evaluated models, the number of calibration steps, and observations on scaling. This will demonstrate that the overhead is small and does not undermine the reported efficiency advantages. revision: yes

  2. Referee: [§3 and §5] §3 (Method) and §5 (Ablations): the three modules are presented as jointly responsible for the reported gains, yet no ablation isolating the contribution of Structural Quantization with Mixed Precision versus Layer-Adaptive Feature Distillation versus Entropy-Aware KL Divergence is shown; without such controls it is impossible to verify that the combination avoids hidden instabilities or requires model-specific retuning that would undermine the lightweight claim.

    Authors: We agree that providing ablations that isolate the effect of each module would strengthen the analysis and help readers understand the necessity of each component. Our current §5 includes ablations on various design choices, but we recognize that a more targeted study removing one module at a time is missing. We will revise the ablation section to include new experiments where we evaluate performance with each module individually ablated (e.g., using uniform precision instead of mixed, fixed feature selection, or standard KL divergence). These results will be presented to show the contribution of each and to confirm the stability of the combined approach across the tested models. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain; claims are empirical performance results.

full rationale

The paper introduces EdgeRazor as an empirical framework with three modules (Structural Quantization with Mixed Precision, Layer-Adaptive Feature Distillation, Entropy-Aware KL Divergence) and validates it via direct evaluations on MobileLLM and Qwen models. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted inputs or self-citations. Central claims rest on reported accuracy and efficiency metrics against external baselines, with no load-bearing self-citation chains or ansatz smuggling. The derivation is self-contained against the stated empirical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the framework modules are presented as novel but without derivation details.




Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 12 internal anchors

  1. [1]

    QuaRot: Outlier-free 4-bit inference in rotated LLMs

    Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot: Outlier-free 4-bit inference in rotated LLMs. In Advances in Neural Information Processing Systems 37, pages 100213–100240, 2024

  2. [2]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013

  3. [3]

    PIQA: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, pages 7432–7439, 2020

  4. [4]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCand...

  5. [5]

    EfficientQAT: Efficient quantization-aware training for large language models

    Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, Yu Qiao, and Ping Luo. EfficientQAT: Efficient quantization-aware training for large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 10081–10100, 2025

  6. [6]

    Optimize weight rounding via signed gradient descent for the quantization of LLMs

    Wenhua Cheng, Weiwei Zhang, Haihao Shen, Yiyang Cai, Xin He, Lv Kaokao, and Yi Liu. Optimize weight rounding via signed gradient descent for the quantization of LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 11332–11350, 2024

  7. [7]

    BoolQ: Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 2924–2936, 2019

  8. [8]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

  9. [9]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  10. [10]

    The case for 4-bit precision: K-bit inference scaling laws

    Tim Dettmers and Luke Zettlemoyer. The case for 4-bit precision: K-bit inference scaling laws. In Proceedings of the 40th International Conference on Machine Learning, pages 7750–7774, 2023

  11. [11]

    BitDistiller: Unleashing the potential of sub-4-bit LLMs via self-distillation

    Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, and Ningyi Xu. BitDistiller: Unleashing the potential of sub-4-bit LLMs via self-distillation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 102–116, 2024

  12. [12]

    Extreme compression of large language models via additive quantization

    Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. Extreme compression of large language models via additive quantization. In Proceedings of the 41st International Conference on Machine Learning, pages 12284–12303, 2024

  13. [13]

    How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings

    Kawin Ethayarajh. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 55–65, 2019

  14. [14]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022

  15. [15]

    Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. InP...

  16. [16]

    APTQ: Attention-aware post-training mixed-precision quantization for large language models

    Ziyi Guan, Hantao Huang, Yupeng Su, Hong Huang, Ngai Wong, and Hao Yu. APTQ: Attention-aware post-training mixed-precision quantization for large language models. In Proceedings of the 61st ACM/IEEE Design Automation Conference, pages 1–6, 2024

  17. [17]

    Aligning AI With Shared Human Values

    Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning AI with shared human values. arXiv preprint arXiv:2008.02275, 2020

  18. [18]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  19. [19]

    Rethinking channel dimensions to isolate outliers for low-bit weight quantization of large language models

    Jung Hwan Heo, Jeonghoon Kim, Beomseok Kwon, Byeongwook Kim, Se Jung Kwon, and Dongsoo Lee. Rethinking channel dimensions to isolate outliers for low-bit weight quantization of large language models. In Proceedings of the 12th International Conference on Learning Representations, pages 12744–12762, 2024

  20. [21]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  21. [22]

    BiLLM: Pushing the limit of post-training quantization for LLMs

    Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xiaojuan Qi. BiLLM: Pushing the limit of post-training quantization for LLMs. In Proceedings of the 41st International Conference on Machine Learning, pages 20023–20042, 2024

  22. [23]

    SliM-LLM: Salience-driven mixed-precision quantization for large language models

    Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Qinshuo Liu, Xianglong Liu, Luca Benini, Michele Magno, Shiming Zhang, and Xiaojuan Qi. SliM-LLM: Salience-driven mixed-precision quantization for large language models. In Proceedings of the 42nd International Conference on Machine Learning, pages 25672–25692, 2025

  23. [24]

    Q-Palette: Fractional-bit quantizers toward optimal bit allocation for efficient LLM deployment

    Deokjae Lee and Hyun Oh Song. Q-Palette: Fractional-bit quantizers toward optimal bit allocation for efficient LLM deployment. arXiv preprint arXiv:2509.20214, 2025

  24. [25]

    Infinity Instruct: Scaling instruction selection and synthesis to enhance language models

    Jijie Li, Li Du, Hanyu Zhao, Bowen Zhang, Liangdong Wang, Boyan Gao, Guang Liu, and Yonghua Lin. Infinity Instruct: Scaling instruction selection and synthesis to enhance language models. arXiv preprint arXiv:2506.11116, 2025

  25. [26]

    GPTAQ: Efficient finetuning-free quantization for asymmetric calibration

    Yuhang Li, Ruokai Yin, Donghyun Lee, Shiting Xiao, and Priyadarshini Panda. GPTAQ: Efficient finetuning-free quantization for asymmetric calibration. In Proceedings of the 42nd International Conference on Machine Learning, pages 36690–36706, 2025

  26. [27]

    TGIF: A new dataset and benchmark on animated gif description

    Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. TGIF: A new dataset and benchmark on animated gif description. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 4641–4650, 2016

  27. [28]

    ARB-LLM: Alternating refined binarizations for large language models

    Zhiteng Li, Xianglong Yan, Tianao Zhang, Haotong Qin, Dong Xie, Jiang Tian, Zhongchao Shi, Linghe Kong, Yulun Zhang, and Xiaokang Yang. ARB-LLM: Alternating refined binarizations for large language models. In Proceedings of the 13th International Conference on Learning Representations, pages 93900–93912, 2025

  28. [29]

    AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. In Proceedings of the 6th Conference on Machine Learning and Systems, volume 6, pages 87–100, 2024

  29. [30]

    TruthfulQA: Measuring how models mimic human falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 3214–3252, 2022

  30. [31]

    QServe: W4A8KV4 quantization and system co-design for efficient LLM serving

    Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. QServe: W4A8KV4 quantization and system co-design for efficient LLM serving. In Proceedings of the 7th Conference on Machine Learning and Systems, 2025

  31. [32]

    VPTQ: Extreme low-bit vector post-training quantization for large language models

    Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, and Mao Yang. VPTQ: Extreme low-bit vector post-training quantization for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8181–8196, 2024

  32. [33]

    LLM-QAT: Data-free quantization aware training for large language models

    Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. LLM-QAT: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888, 2023

  33. [34]

    ParetoQ: Scaling laws in extremely low-bit LLM quantization

    Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. ParetoQ: Scaling laws in extremely low-bit LLM quantization. arXiv preprint arXiv:2502.02631, 2025

  34. [35]

    SpinQuant: LLM quantization with learned rotations

    Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. SpinQuant: LLM quantization with learned rotations. In Proceedings of the 13th International Conference on Learning Representations, pages 92009–92032, 2025

  35. [36]

    Can a suit of armor conduct electricity? A new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, 2018

  36. [37]

    WinoGrande: An adversarial Winograd schema challenge at scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021

  37. [38]

    Social IQa: Commonsense reasoning about social interactions

    Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 4463–4473, 2019

  38. [39]

    OmniQuant: Omnidirectionally calibrated quantization for large language models

    Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. OmniQuant: Omnidirectionally calibrated quantization for large language models. In Proceedings of the 12th International Conference on Learning Representations, pages 45472–45496, 2024

  39. [40]

    FlatQuant: Flatness matters for LLM quantization

    Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, and Jun Yao. FlatQuant: Flatness matters for LLM quantization. In Proceedings of the 42nd International Conference on Machine Learning, pages 57587–57613, 2025

  40. [41]

    MobileQuant: Mobile-friendly quantization for on-device language models

    Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, and Brais Martinez. MobileQuant: Mobile-friendly quantization for on-device language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9761–9771, 2024

  41. [42]

    BERT rediscovers the classical NLP pipeline

    Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, 2019

  42. [43]

    QuIP#: Even better LLM quantization with hadamard incoherence and lattice codebooks

    Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. QuIP#: Even better LLM quantization with hadamard incoherence and lattice codebooks. In Proceedings of the 41st International Conference on Machine Learning, pages 48630–48656, 2024

  43. [44]

    QTIP: Quantization with trellises and incoherence processing

    Albert Tseng, Qingyao Sun, David Hou, and Christopher M De Sa. QTIP: Quantization with trellises and incoherence processing. In Advances in Neural Information Processing Systems 37, pages 59597–59620, 2024

  44. [45]

    BitNet: 1-bit pre-training for large language models

    Hongyu Wang, Shuming Ma, Lingxiao Ma, Lei Wang, Wenhui Wang, Li Dong, Shaohan Huang, Huaijie Wang, Jilong Xue, Ruiping Wang, Jihao Bao, Conghui He, and Furu Wei. BitNet: 1-bit pre-training for large language models. Journal of Machine Learning Research, 26(125):1–29, 2025

  45. [46]

    MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers

    Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Advances in Neural Information Processing Systems 33, pages 5776–5788, 2020

  46. [47]

    Rethinking Kullback-Leibler divergence in knowledge distillation for large language models

    Taiqiang Wu, Chaofan Tao, Jiahao Wang, Runming Yang, Zhe Zhao, and Ngai Wong. Rethinking Kullback-Leibler divergence in knowledge distillation for large language models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 5737–5755, 2025

  47. [48]

    SmoothQuant: Accurate and efficient post-training quantization for large language models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning, pages 38087–38099, 2023

  48. [49]

    Qwen2.5-Omni Technical Report

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215, 2025

  49. [50]

    OneBit: Towards extremely low-bit large language models

    Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, and Wanxiang Che. OneBit: Towards extremely low-bit large language models. In Advances in Neural Information Processing Systems 37, pages 66357–66382, 2024

  50. [51]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  51. [52]

    MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

    Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MAmmoTH: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653, 2023

  52. [53]

    HellaSwag: Can a machine really finish your sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019

  53. [54]

    ABQ-LLM: Arbitrary-bit quantized inference acceleration for large language models

    Chao Zeng, Songwei Liu, Yusheng Xie, Hong Liu, Xiaojian Wang, Miao Wei, Shu Yang, Fangmin Chen, and Xing Mei. ABQ-LLM: Arbitrary-bit quantized inference acceleration for large language models. In Proceedings of the 39th AAAI Conference on Artificial Intelligence, pages 22299–22307, 2025

  54. [55]

    LQER: Low-rank quantization error reconstruction for LLMs

    Cheng Zhang, Jianyi Cheng, George A Constantinides, and Yiren Zhao. LQER: Low-rank quantization error reconstruction for LLMs. In Proceedings of the 41st International Conference on Machine Learning, pages 58763–58779, 2024

  55. [56]

    1.4 million open-source distilled reasoning dataset to empower large language model training

    Han Zhao, Haotian Wang, Yiping Peng, Sitong Zhao, Xiaoyu Tian, Shuaiting Chen, Yunjie Ji, and Xiangang Li. 1.4 million open-source distilled reasoning dataset to empower large language model training. arXiv preprint arXiv:2503.19633, 2025

  56. [57]

    A review on edge large language models: Design, execution, and applications

    Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuanchao Shu, and Jiming Chen. A review on edge large language models: Design, execution, and applications. ACM Computing Surveys, 57(8):1–35, 2025

  57. [58]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023

  58. [59]

    MLVU: Benchmarking multi-task long video understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. MLVU: Benchmarking multi-task long video understanding. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13691–13701, 2025

  59. [60]

    Zhi-Hua Zhou and Yuan Jiang. NeC4.5: Neural ensemble based C4.5. IEEE Transactions on Knowledge and Data Engineering, 16(6):770–773, 2004

  60. [61]

    A survey on model compression for large language models

    Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A survey on model compression for large language models. Transactions of the Association for Computational Linguistics, 12:1556–1577, 2024

    A Mixed-precision quantization

    A.1 Quantization function for weights and activations

    In this section, we provide the per-group symmetric quantization for both ...

  61. [62]

    Random allocation. The $N$ high-precision rows are distributed uniformly at random, yielding $p_k \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}[0,1]$. Standard empirical process bounds imply that the discrepancy satisfies

    $$D_N^*(P_{\mathrm{rand}}) = O_p\!\left(N^{-1/2}\right). \tag{16}$$

  62. [63]

    Stacked allocation. All $N$ high-precision rows are clustered contiguously at one end of the output dimension, yielding

    $$P_{\mathrm{stack}} = \left\{ \frac{0.5}{d_{\mathrm{out}}}, \frac{1.5}{d_{\mathrm{out}}}, \ldots, \frac{N-0.5}{d_{\mathrm{out}}} \right\}. \tag{17}$$

    Since all points lie in a sub-interval of length $\rho$, taking $t=\rho$ in the definition of $D_N^*$ gives a deviation of $1-\rho$. Thus, the discrepancy is constant:

    $$D_N^*(P_{\mathrm{stack}}) = 1-\rho = \Theta(1). \tag{18}$$

  63. [64]

    Super-group allocation (ours). The 4-bit rows are placed on a deterministic equidistant grid with period $\lfloor 1/\rho \rceil$ along the output dimension. Then, the normalized pattern is the midpoint grid

    $$P_{\mathrm{super}} = \left\{ \frac{2k-1}{2N} \right\}_{k=1}^{N}. \tag{19}$$

    For any $t\in[0,1]$, the number of points in $[0,t]$ is $\lfloor Nt + \tfrac{1}{2} \rfloor$, so that

    $$\left| \frac{1}{N}\sum_{k=1}^{N} \mathbf{1}\{p_k \le t\} - t \right| \le \frac{1}{2N}. \tag{20}$$

    While rounding row indices to discrete...
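The three allocation patterns above can be compared numerically. The following is a minimal sketch, not the authors' code: the function names, the sizes `N = 64` and `D_OUT = 256`, and the random seed are all illustrative assumptions. It evaluates the one-dimensional star discrepancy with the standard sorted-point formula, which is a general fact about 1-D point sets rather than something taken from the paper.

```python
import random


def star_discrepancy(points):
    """Exact 1-D star discrepancy sup_t |#{p_k <= t}/N - t| for points in [0, 1]."""
    xs = sorted(points)
    n = len(xs)
    # With sorted points, the supremum is reached at, or just below, some x_(i),
    # so comparing x_(i) against i/N and (i-1)/N is enough.
    return max(max(i / n - x, x - (i - 1) / n) for i, x in enumerate(xs, start=1))


def random_allocation(n, rng):
    """N positions drawn i.i.d. from Unif[0, 1]."""
    return [rng.random() for _ in range(n)]


def stacked_allocation(n, d_out):
    """N contiguous rows at one end of d_out rows, as normalized row centers."""
    return [(k - 0.5) / d_out for k in range(1, n + 1)]


def super_group_allocation(n):
    """Midpoint grid (2k - 1) / (2N), k = 1..N."""
    return [(2 * k - 1) / (2 * n) for k in range(1, n + 1)]


# Illustrative sizes (assumption): high-precision fraction rho = N / D_OUT = 0.25.
N, D_OUT = 64, 256
results = {
    "random": star_discrepancy(random_allocation(N, random.Random(0))),
    "stacked": star_discrepancy(stacked_allocation(N, D_OUT)),
    "super-group": star_discrepancy(super_group_allocation(N)),
}
for name, value in results.items():
    print(f"{name:12s} D* = {value:.4f}")
```

Under these assumptions the midpoint grid attains exactly $1/(2N)$, the stacked pattern stays at or above $1-\rho$ regardless of $N$, and the random draw falls between the two, in line with the ordering of the rates in (16), (18), and (20).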