EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

Chen Wu; Le-Tong Huang; Nan Li; Shao-Qun Zhang; Shu-Hao Zhang; Xiang-Sheng Deng; Xin-Yi Zou; Zhi-Hua Zhou

arxiv: 2605.04062 · v2 · pith:U6XO6XTRnew · submitted 2026-04-10 · 💻 cs.LG · cs.AI

Shu-Hao Zhang, Le-Tong Huang, Xiang-Sheng Deng, Xin-Yi Zou, Chen Wu, Nan Li, Shao-Qun Zhang, Zhi-Hua Zhou

Pith reviewed 2026-05-22 10:04 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: large language models · quantization · mixed precision · distillation · model compression · edge deployment · low-bit inference

The pith

EdgeRazor's mixed-precision distillation lets 1.88-bit LLMs outperform 2-bit and 3-bit baselines while cutting training costs 4-10x.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EdgeRazor as a framework that applies mixed-precision structural quantization, layer-adaptive feature distillation, and entropy-aware KL divergence to compress large language models for edge use. It reports that, under weight-activation quantization, this combination produces a 1.88-bit Qwen3-0.6B model that exceeds state-of-the-art 2-bit results by 11.27 points and the strongest 3-bit results by 4.38 points, while a quantized MobileLLM-350M version needs a training budget 4-10× lower than the leading quantization-aware training method. A reader would care because the work targets the practical barriers of memory, speed, and retraining expense that currently limit powerful models on phones and other constrained hardware.

Core claim

By combining Structural Quantization with Mixed Precision for bit-width control, Layer-Adaptive Feature Distillation to select informative features, and Entropy-Aware KL Divergence to balance loss on human and distilled data, the EdgeRazor framework enables effective sub-4-bit weight-activation quantization of LLMs. On the Qwen and MobileLLM families this yields higher accuracy than existing 2-bit and 3-bit baselines at lower training budgets, higher compression ratios at all bit-widths, and a decoding speedup of up to 15.16× over the 16-bit baseline.

What carries the argument

The EdgeRazor framework and its three integrated modules (Structural Quantization with Mixed Precision, Layer-Adaptive Feature Distillation, and Entropy-Aware KL Divergence), which together provide fine-grained bit control and balanced alignment during quantization-aware distillation.
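The bit-width control can be made concrete using the SQMP passage quoted in the Lean-theorem section of this page: every ⌊1/ρ⌉ consecutive output channels form a super-group, with one channel at 4-bit and the remainder at 1.58-bit. The following is a minimal sketch of the resulting average bit-width, not the authors' implementation. It counts weight bits only (scales and other metadata are ignored), and the function names and the choice of which channel is kept at 4-bit are assumptions made for illustration.

```python
def super_group_size(rho: float) -> int:
    """Nearest-integer rounding of 1/rho, i.e. the ⌊1/ρ⌉ in the quoted passage."""
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must lie in (0, 1]")
    # Note: Python's round() rounds exact halves to the nearest even integer.
    return max(1, round(1.0 / rho))


def average_bits(rho: float, high_bits: float = 4.0, low_bits: float = 1.58) -> float:
    """Mean weight bit-width: one high-precision channel per super-group, the rest low."""
    g = super_group_size(rho)
    return (high_bits + (g - 1) * low_bits) / g


def channel_bits(num_channels: int, rho: float,
                 high_bits: float = 4.0, low_bits: float = 1.58) -> list:
    """Per-output-channel bit-widths.

    Assumption: the first channel of each super-group is the 4-bit one;
    the quoted passage does not say which channel is chosen.
    """
    g = super_group_size(rho)
    return [high_bits if c % g == 0 else low_bits for c in range(num_channels)]


if __name__ == "__main__":
    print(f"{average_bits(1 / 8):.4f}")
    print(channel_bits(8, 1 / 8))
```

With ρ = 1/8 the average is (4 + 7 × 1.58) / 8 = 1.8825 bits, which is consistent with the 1.88-bit figure quoted on this page; whether the paper actually uses ρ = 1/8 for that model is an assumption here.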

If this is right

Models achieve higher compression ratios at all tested bit widths and deliver measurable decoding speedups on edge hardware.
Quantization-aware training for LLMs becomes viable with training budgets reduced by factors of 4 to 10.
Sub-2-bit models become competitive with higher-precision baselines for practical deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same module combination could be tested on larger model families to check whether the efficiency advantage persists at scale.
Similar adaptive distillation ideas might transfer to other compression techniques such as pruning or knowledge distillation without quantization.
Further bit-width reductions below 1.58 bits could be explored by tightening the entropy-aware loss component.

Load-bearing premise

The three modules can be combined across models to produce the reported accuracy and efficiency gains without hidden instabilities or heavy per-model retuning.

What would settle it

Direct reproduction of the 1.88-bit Qwen3-0.6B evaluation on the same benchmarks, checking whether the claimed margins over published 2-bit and 3-bit baselines are recovered.

Figures

Figures reproduced from arXiv: 2605.04062 by Chen Wu, Le-Tong Huang, Nan Li, Shao-Qun Zhang, Shu-Hao Zhang, Xiang-Sheng Deng, Xin-Yi Zou, Zhi-Hua Zhou.

**Figure 1.** Overview of the EdgeRazor framework. A 16-bit teacher guides an n-bit mixed-precision student through a joint objective of task-specific cross-entropy, AFD, and EAKLD.

**Figure 1.** Performance comparison of EdgeRazor and strong baselines. (arXiv:2605.04062v2 [cs.LG], 21 May 2026.) [PITH_FULL_IMAGE:figures/full_fig_p001_1.png]

**Figure 2.** Average performance of quantized Qwen3 under E [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]

**Figure 3.** Performance comparison of EdgeRazor and state-of-the-art baselines at each bit-width. Panels: (a) weight-only quantization, W1.88 EdgeRazor vs. W2 baselines; (b) weight-activation quantization, W1.88 EdgeRazor vs. W2 baselines. Axes: ARC-e, ARC-c, HellaS, PIQA, BoolQ, WinoG, SIQA, OBQA, Tr.QA2, Ethics, MMLU, IFEval, GSM8K, HumanE. Legend: BF16, EdgeRazor, OmniQuan…

**Figure 4.** Performance comparison of 1.88-bit EdgeRazor and 2-bit baselines on Qwen3-0.6B. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png]

**Figure 5.** Performance comparison of 4-bit EdgeRazor and strong baselines on Qwen2.5-Omni-7B. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png]

**Figure 6.** Average performance and training budgets of [PITH_FULL_IMAGE:figures/full_fig_p009_6.png]

**Figure 7.** Efficiency comparison of EdgeRazor and other baselines at each bit-width. Panels: Storage and Memory in GB (↓), Prefilling and Decoding in tokens/s (↑); series: EdgeRazor-TQ2_0 and BF16. Extracted bar labels, in source order: 0.19, 0.51, 1.11, 1.46 (GB); 711.67, 317.07, 337.99, 20.91 (tokens/s). [PITH_FULL_IMAGE:figures/full_fig_p010_7.png]

**Figure 8.** Efficiency comparison of deploying 1.58-bit and 4-bit Qwen3-0.6B via [PITH_FULL_IMAGE:figures/full_fig_p010_8.png]

**Figure 9.** Super-group allocation for weight matrices, visualized via the transposed matrix [PITH_FULL_IMAGE:figures/full_fig_p015_9.png]

**Figure 10.** Stacked allocation for weight matrices, visualized via the transposed matrix [PITH_FULL_IMAGE:figures/full_fig_p016_10.png]

**Figure 11.** Frequency of the k layers with the lowest c_l (k = 3). Axes: layer index (1 to 28) vs. frequency; series: Mathematics, Coding. [PITH_FULL_IMAGE:figures/full_fig_p018_11.png]

**Figure 12.** Frequency of the k layers with the lowest c_l (k = 5). Axes: layer index (1 to 28) vs. frequency; series: Mathematics, Coding. [PITH_FULL_IMAGE:figures/full_fig_p018_12.png]

**Figure 13.** Frequency of the k layers with the lowest c_l (k = 10). In Section 3.2, we adopt cosine similarity as an importance metric to quantify transformations across feature layers. In this section, we investigate how important feature patterns vary across data domains, thereby demonstrating the limitations of pre-specified layer supervision. We analyze two distinct domains, mathematics and coding, from our traini…

**Figure 14.** Average rank of each layer along with its corresponding range. [PITH_FULL_IMAGE:figures/full_fig_p019_14.png]

**Figure 15.** Scatter plots of top-1 probability versus CAKLD confidence on human-annotated datasets. [PITH_FULL_IMAGE:figures/full_fig_p020_15.png]

Original abstract

Quantization has emerged as a mainstream approach for deploying Large Language Models (LLMs) on resource-constrained devices, yet compressing precision below 4-bit typically causes severe performance degradation or prohibitive retraining costs. In this paper, we propose EdgeRazor, a lightweight framework for LLMs via Mixed-Precision Quantization-Aware Distillation. It contains three modules: Structural Quantization with Mixed Precision for fine-grained control of bit-widths, Layer-Adaptive Feature Distillation that dynamically selects the most informative features for alignment, and Entropy-Aware KL Divergence for forward-reverse balance on both human-annotated and distilled datasets. Evaluations conducted on MobileLLM and Qwen families show that under weight-activation quantization, the 1.88-bit Qwen3-0.6B-EdgeRazor outperforms the state-of-the-art 2-bit baselines by 11.27 and surpasses the strongest 3-bit baselines by 4.38, while the quantized MobileLLM-350M-EdgeRazor requires a training budget 4-10× lower than the leading quantization-aware training method. In terms of efficiency, EdgeRazor achieves higher compression ratios at all bit-widths, and the 1.58-bit Qwen3-0.6B-EdgeRazor reduces storage from 1.11 GB to 0.19 GB while accelerating decoding by 15.16× over the 16-bit baseline. These results empirically validate the effectiveness and efficiency of EdgeRazor. The codes can be accessed from GitHub (https://github.com/zhangsq-nju/EdgeRazor) and Hugging Face (https://huggingface.co/collections/zhangsq-nju/edgerazor-nbit).
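The efficiency figures in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below uses the storage sizes from the abstract and the decoding throughputs that appear as bar labels in the Figure 7 caption text on this page. Pairing 317.07 tokens/s with EdgeRazor-TQ2_0 and 20.91 tokens/s with BF16 is an assumption based on the order of the extracted labels, and the variable names are illustrative.

```python
# Storage sizes quoted in the abstract (GB).
storage_bf16_gb = 1.11
storage_quant_gb = 0.19

# Decoding throughputs taken from the Figure 7 bar labels on this page (tokens/s).
# Assumption: 317.07 belongs to EdgeRazor-TQ2_0 and 20.91 to the BF16 baseline.
decode_quant_tps = 317.07
decode_bf16_tps = 20.91

storage_ratio = storage_bf16_gb / storage_quant_gb
decode_speedup = decode_quant_tps / decode_bf16_tps

print(f"storage reduction: {storage_ratio:.2f}x")
print(f"decoding speedup: {decode_speedup:.2f}x")
```

Under that pairing the ratio of the two throughputs reproduces the 15.16× decoding speedup stated in the abstract, and the storage figures correspond to roughly a 5.84× reduction.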

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes EdgeRazor, a lightweight framework for quantizing LLMs below 4 bits via mixed-precision quantization-aware distillation. It introduces three modules (Structural Quantization with Mixed Precision for fine-grained bit-width control, Layer-Adaptive Feature Distillation for dynamic feature selection, and Entropy-Aware KL Divergence for balanced forward-reverse distillation) and evaluates them on the MobileLLM and Qwen model families. The central claims are that, under weight-activation quantization, the 1.88-bit Qwen3-0.6B-EdgeRazor outperforms SOTA 2-bit baselines by 11.27 and the strongest 3-bit baselines by 4.38; that MobileLLM-350M-EdgeRazor requires a 4-10× lower training budget than the leading QAT method; and that the 1.58-bit Qwen3-0.6B-EdgeRazor reduces storage from 1.11 GB to 0.19 GB and speeds up decoding by 15.16× over the 16-bit baseline.

Significance. If the performance and efficiency results prove robust, the work could meaningfully advance practical deployment of LLMs on edge devices by demonstrating competitive accuracy at sub-2-bit precision with reduced training overhead. The open release of code on GitHub and Hugging Face collections is a clear positive for reproducibility and follow-up research.

major comments (2)

[Abstract and §4] Abstract and §4 (Experiments): the headline claim that EdgeRazor requires 4-10× lower training budget than leading QAT methods rests on the unstated assumption that determining the mixed-precision bit allocation and layer-adaptive feature selection incurs negligible extra search or hyperparameter cost; no quantitative breakdown of this overhead (e.g., search time, calibration steps, or scaling with model size) is provided, which directly affects whether the efficiency advantage holds.
[§3 and §5] §3 (Method) and §5 (Ablations): the three modules are presented as jointly responsible for the reported gains, yet no ablation isolating the contribution of Structural Quantization with Mixed Precision versus Layer-Adaptive Feature Distillation versus Entropy-Aware KL Divergence is shown; without such controls it is impossible to verify that the combination avoids hidden instabilities or requires model-specific retuning that would undermine the lightweight claim.

minor comments (1)

[Abstract] Abstract: the performance deltas (11.27 and 4.38) are stated without reference to the precise evaluation metric (e.g., perplexity, zero-shot accuracy) or the exact set of baselines and datasets, making the numbers difficult to interpret in isolation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments in detail below. We believe these points will help improve the clarity and robustness of our presentation, and we outline the revisions we plan to make.

Point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Experiments): the headline claim that EdgeRazor requires 4-10× lower training budget than leading QAT methods rests on the unstated assumption that determining the mixed-precision bit allocation and layer-adaptive feature selection incurs negligible extra search or hyperparameter cost; no quantitative breakdown of this overhead (e.g., search time, calibration steps, or scaling with model size) is provided, which directly affects whether the efficiency advantage holds.

Authors: We appreciate the referee pointing out the need for a more explicit accounting of the overhead in our efficiency claims. The bit allocation and feature selection processes are indeed part of the framework, and while we designed them to be lightweight, we agree that a quantitative breakdown is necessary to fully support the 4-10× training budget reduction. In the revised manuscript, we will add a subsection or appendix detailing the computational cost of these steps, including measured search times on the evaluated models, the number of calibration steps, and observations on scaling. This will demonstrate that the overhead is small and does not undermine the reported efficiency advantages. revision: yes
Referee: [§3 and §5] §3 (Method) and §5 (Ablations): the three modules are presented as jointly responsible for the reported gains, yet no ablation isolating the contribution of Structural Quantization with Mixed Precision versus Layer-Adaptive Feature Distillation versus Entropy-Aware KL Divergence is shown; without such controls it is impossible to verify that the combination avoids hidden instabilities or requires model-specific retuning that would undermine the lightweight claim.

Authors: We agree that providing ablations that isolate the effect of each module would strengthen the analysis and help readers understand the necessity of each component. Our current §5 includes ablations on various design choices, but we recognize that a more targeted study removing one module at a time is missing. We will revise the ablation section to include new experiments where we evaluate performance with each module individually ablated (e.g., using uniform precision instead of mixed, fixed feature selection, or standard KL divergence). These results will be presented to show the contribution of each and to confirm the stability of the combined approach across the tested models. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain; claims are empirical performance results.

Full rationale

The paper introduces EdgeRazor as an empirical framework with three modules (Structural Quantization with Mixed Precision, Layer-Adaptive Feature Distillation, Entropy-Aware KL Divergence) and validates it via direct evaluations on MobileLLM and Qwen models. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted inputs or self-citations. Central claims rest on reported accuracy and efficiency metrics against external baselines, with no load-bearing self-citation chains or ansatz smuggling. The derivation is self-contained against the stated empirical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the framework modules are presented as novel but without derivation details.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Structural Quantization with Mixed Precision (SQMP) ... every ⌊1/ρ⌉ consecutive output channels form one super-group, wherein one channel is quantized to 4-bit and the remainder to 1.58-bit.
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

Layer-Adaptive Feature Distillation (LAFD) ... c_l = mean cosine similarity between adjacent teacher layers
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat.induction unclear

Entropy-Aware KL Divergence (EAKLD) ... λ derived from teacher output entropy
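
The LAFD passage quoted above defines the per-layer score as the mean cosine similarity between adjacent teacher layers, and the captions of Figures 11 to 13 refer to the k layers with the lowest score. A minimal pure-Python sketch of that scoring step follows. It assumes hidden states are available as per-layer lists of token vectors and that the k lowest-scoring layers are the ones picked for alignment; the selection rule, the data layout, and all function names are assumptions for illustration, not the authors' code.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def layer_scores(hidden: List[List[Vector]]) -> List[float]:
    """Score each layer by the mean token-wise cosine similarity of its input and output.

    hidden[0] is the input to layer 1 and hidden[l] the output of layer l;
    hidden[l][t] is the vector for token t. Returns one score per layer.
    """
    scores = []
    for prev, curr in zip(hidden, hidden[1:]):
        sims = [cosine(p, c) for p, c in zip(prev, curr)]
        scores.append(sum(sims) / len(sims))
    return scores


def select_layers(scores: List[float], k: int) -> List[int]:
    """1-based indices of the k layers with the lowest score, in ascending layer order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(i + 1 for i in order[:k])


if __name__ == "__main__":
    # Toy teacher: 3 layers, 2 tokens, 2-dimensional hidden states.
    hidden = [
        [[1.0, 0.0], [0.0, 1.0]],  # input to layer 1
        [[1.0, 0.0], [0.0, 1.0]],  # layer 1 leaves the features unchanged
        [[0.0, 1.0], [1.0, 0.0]],  # layer 2 rotates every token by 90 degrees
        [[1.0, 1.0], [1.0, 1.0]],  # layer 3 changes the features partially
    ]
    scores = layer_scores(hidden)
    print(scores)
    print(select_layers(scores, k=2))
```

A low score means a layer changes the representation substantially, which is the sense in which the Figure 13 caption text calls cosine similarity an importance metric for transformations across feature layers. In the toy example, layers 2 and 3 are the two lowest-scoring layers.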

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 12 internal anchors

[1]

QuaRot: Outlier-free 4-bit inference in rotated LLMs

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot: Outlier-free 4-bit inference in rotated LLMs. In Advances in Neural Information Processing Systems 37, pages 100213–100240, 2024

[2]

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013

[3]

PIQA: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, pages 7432–7439, 2020

[4]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCand...

[5]

EfficientQAT: Efficient quantization-aware training for large language models

Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, Yu Qiao, and Ping Luo. EfficientQAT: Efficient quantization-aware training for large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 10081–10100, 2025

[6]

Optimize weight rounding via signed gradient descent for the quantization of LLMs

Wenhua Cheng, Weiwei Zhang, Haihao Shen, Yiyang Cai, Xin He, Lv Kaokao, and Yi Liu. Optimize weight rounding via signed gradient descent for the quantization of LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 11332–11350, 2024

[7]

BoolQ: Exploring the surprising difficulty of natural yes/no questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 2924–2936, 2019

[8]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

[9]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

[10]

The case for 4-bit precision: K-bit inference scaling laws

Tim Dettmers and Luke Zettlemoyer. The case for 4-bit precision: K-bit inference scaling laws. In Proceedings of the 40th International Conference on Machine Learning, pages 7750–7774, 2023

[11]

BitDistiller: Unleashing the potential of sub-4-bit LLMs via self-distillation

Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, and Ningyi Xu. BitDistiller: Unleashing the potential of sub-4-bit LLMs via self-distillation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 102–116, 2024

[12]

Extreme compression of large language models via additive quantization

Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. Extreme compression of large language models via additive quantization. In Proceedings of the 41st International Conference on Machine Learning, pages 12284–12303, 2024

[13]

How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings

Kawin Ethayarajh. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 55–65, 2019

[14]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022

[15]

Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. InP...

[16]

APTQ: Attention-aware post-training mixed-precision quantization for large language models

Ziyi Guan, Hantao Huang, Yupeng Su, Hong Huang, Ngai Wong, and Hao Yu. APTQ: Attention-aware post-training mixed-precision quantization for large language models. In Proceedings of the 61st ACM/IEEE Design Automation Conference, pages 1–6, 2024

[17]

Aligning AI With Shared Human Values

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning AI with shared human values. arXiv preprint arXiv:2008.02275, 2020

[18]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

[19]

Rethinking channel dimensions to isolate outliers for low-bit weight quantization of large language models

Jung Hwan Heo, Jeonghoon Kim, Beomseok Kwon, Byeongwook Kim, Se Jung Kwon, and Dongsoo Lee. Rethinking channel dimensions to isolate outliers for low-bit weight quantization of large language models. In Proceedings of the 12th International Conference on Learning Representations, pages 12744–12762, 2024

[21]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

[22]

BiLLM: Pushing the limit of post-training quantization for LLMs

Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xiaojuan Qi. BiLLM: Pushing the limit of post-training quantization for LLMs. In Proceedings of the 41st International Conference on Machine Learning, pages 20023–20042, 2024

[23]

SliM-LLM: Salience-driven mixed-precision quantization for large language models

Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Qinshuo Liu, Xianglong Liu, Luca Benini, Michele Magno, Shiming Zhang, and Xiaojuan Qi. SliM-LLM: Salience-driven mixed-precision quantization for large language models. In Proceedings of the 42nd International Conference on Machine Learning, pages 25672–25692, 2025

[24]

Q-Palette: Fractional-bit quantizers toward optimal bit allocation for efficient LLM deployment

Deokjae Lee and Hyun Oh Song. Q-Palette: Fractional-bit quantizers toward optimal bit allocation for efficient LLM deployment. arXiv preprint arXiv:2509.20214, 2025

[25]

Infinity Instruct: Scaling instruction selection and synthesis to enhance language models

Jijie Li, Li Du, Hanyu Zhao, Bowen Zhang, Liangdong Wang, Boyan Gao, Guang Liu, and Yonghua Lin. Infinity Instruct: Scaling instruction selection and synthesis to enhance language models. arXiv preprint arXiv:2506.11116, 2025

[26]

GPTAQ: Efficient finetuning-free quantization for asymmetric calibration

Yuhang Li, Ruokai Yin, Donghyun Lee, Shiting Xiao, and Priyadarshini Panda. GPTAQ: Efficient finetuning-free quantization for asymmetric calibration. In Proceedings of the 42nd International Conference on Machine Learning, pages 36690–36706, 2025

[27]

TGIF: A new dataset and benchmark on animated gif description

Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. TGIF: A new dataset and benchmark on animated gif description. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 4641–4650, 2016

[28]

ARB-LLM: Alternating refined binarizations for large language models

Zhiteng Li, Xianglong Yan, Tianao Zhang, Haotong Qin, Dong Xie, Jiang Tian, Zhongchao Shi, Linghe Kong, Yulun Zhang, and Xiaokang Yang. ARB-LLM: Alternating refined binarizations for large language models. In Proceedings of the 13th International Conference on Learning Representations, pages 93900–93912, 2025

[29]

AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. In Proceedings of the 6th Conference on Machine Learning and Systems, volume 6, pages 87–100, 2024

[30]

TruthfulQA: Measuring how models mimic human falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 3214–3252, 2022

[31]

QServe: W4A8KV4 quantization and system co-design for efficient LLM serving

Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. QServe: W4A8KV4 quantization and system co-design for efficient LLM serving. In Proceedings of the 7th Conference on Machine Learning and Systems, 2025

[32]

VPTQ: Extreme low-bit vector post-training quantization for large language models

Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, and Mao Yang. VPTQ: Extreme low-bit vector post-training quantization for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8181–8196, 2024

[33]

LLM-QAT: Data-free quantization aware training for large language models

Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. LLM-QAT: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888, 2023

[34]

ParetoQ: Scaling laws in extremely low-bit LLM quantization

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. ParetoQ: Scaling laws in extremely low-bit LLM quantization. arXiv preprint arXiv:2502.02631, 2025

[35]

SpinQuant: LLM quantization with learned rotations

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. SpinQuant: LLM quantization with learned rotations. In Proceedings of the 13th International Conference on Learning Representations, pages 92009–92032, 2025

[36]

Can a suit of armor conduct electricity? A new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, 2018

[37]

WinoGrande: An adversarial Winograd schema challenge at scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021

[38]

Social IQa: Commonsense reasoning about social interactions

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 4463–4473, 2019

[39]

OmniQuant: Omnidirectionally calibrated quantization for large language models

Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. OmniQuant: Omnidirectionally calibrated quantization for large language models. InProceedings of the 12th International Conference on Learning Representations, pages 45472–45496, 2024

[40] Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, and Jun Yao. FlatQuant: Flatness matters for LLM quantization. In Proceedings of the 42nd International Conference on Machine Learning, pages 57587–57613, 2025.

[41] Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, and Brais Martinez. MobileQuant: Mobile-friendly quantization for on-device language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9761–9771, 2024.

[42] Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, 2019.

[43] Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. In Proceedings of the 41st International Conference on Machine Learning, pages 48630–48656, 2024.

[44] Albert Tseng, Qingyao Sun, David Hou, and Christopher M De Sa. QTIP: Quantization with trellises and incoherence processing. In Advances in Neural Information Processing Systems 37, pages 59597–59620, 2024.

[45] Hongyu Wang, Shuming Ma, Lingxiao Ma, Lei Wang, Wenhui Wang, Li Dong, Shaohan Huang, Huaijie Wang, Jilong Xue, Ruiping Wang, Jihao Bao, Conghui He, and Furu Wei. BitNet: 1-bit pre-training for large language models. Journal of Machine Learning Research, 26(125):1–29, 2025.

[46] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Advances in Neural Information Processing Systems 33, pages 5776–5788, 2020.

[47] Taiqiang Wu, Chaofan Tao, Jiahao Wang, Runming Yang, Zhe Zhao, and Ngai Wong. Rethinking Kullback-Leibler divergence in knowledge distillation for large language models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 5737–5755, 2025.

[48] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning, pages 38087–38099, 2023.

[49] Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215, 2025.

[50] Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, and Wanxiang Che. OneBit: Towards extremely low-bit large language models. In Advances in Neural Information Processing Systems 37, pages 66357–66382, 2024.

[51] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[52] Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MAmmoTH: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653, 2023.

[53] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019.

[54] Chao Zeng, Songwei Liu, Yusheng Xie, Hong Liu, Xiaojian Wang, Miao Wei, Shu Yang, Fangmin Chen, and Xing Mei. ABQ-LLM: Arbitrary-bit quantized inference acceleration for large language models. In Proceedings of the 39th AAAI Conference on Artificial Intelligence, pages 22299–22307, 2025.

[55] Cheng Zhang, Jianyi Cheng, George A Constantinides, and Yiren Zhao. LQER: Low-rank quantization error reconstruction for LLMs. In Proceedings of the 41st International Conference on Machine Learning, pages 58763–58779, 2024.

[56] Han Zhao, Haotian Wang, Yiping Peng, Sitong Zhao, Xiaoyu Tian, Shuaiting Chen, Yunjie Ji, and Xiangang Li. 1.4 million open-source distilled reasoning dataset to empower large language model training. arXiv preprint arXiv:2503.19633, 2025.

[57] Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuanchao Shu, and Jiming Chen. A review on edge large language models: Design, execution, and applications. ACM Computing Surveys, 57(8):1–35, 2025.

[58] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.

[59] Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. MLVU: Benchmarking multi-task long video understanding. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13691–13701, 2025.

[60] Zhi-Hua Zhou and Yuan Jiang. NeC4.5: Neural ensemble based C4.5. IEEE Transactions on Knowledge and Data Engineering, 16(6):770–773, 2004.

[61] Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A survey on model compression for large language models. Transactions of the Association for Computational Linguistics, 12:1556–1577, 2024.

A Mixed-precision quantization

A.1 Quantization function for weights and activations

In this section, we provide the per-group symmetric quantization for both ...

Random allocation. The $N$ high-precision rows are distributed uniformly at random, yielding $p_k \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}[0,1]$. Standard empirical process bounds imply that the discrepancy satisfies

$$D_N^*(P_{\mathrm{rand}}) = O_p\!\left(N^{-1/2}\right). \tag{16}$$

Stacked allocation. All $N$ high-precision rows are clustered contiguously at one end of the output dimension, yielding

$$P_{\mathrm{stack}} = \left\{ \frac{0.5}{d_{\mathrm{out}}}, \frac{1.5}{d_{\mathrm{out}}}, \dots, \frac{N-0.5}{d_{\mathrm{out}}} \right\}. \tag{17}$$

Since all points lie in a sub-interval of length $\rho$, taking $t=\rho$ in the definition of $D_N^*$ gives a deviation of $1-\rho$. Thus, the discrepancy is constant:

$$D_N^*(P_{\mathrm{stack}}) = 1-\rho = \Theta(1). \tag{18}$$

"""Given a string, find out how many distinct characters (regardless of case) it consists of
>>> count_distinct_characters('xyzXYZ')
3
>>> count_distinct_characters('Jerry')
4

Super-group allocation (ours). The 4-bit rows are placed on a deterministic equidistant grid with period $\lfloor 1/\rho \rceil$ along the output dimension. Then, the normalized pattern is the midpoint grid

$$P_{\mathrm{super}} = \left\{ \frac{2k-1}{2N} \right\}_{k=1}^{N}. \tag{19}$$

For any $t \in [0,1]$, the number of points in $[0,t]$ is $\lfloor Nt + \tfrac{1}{2} \rfloor$, so that

$$\left| \frac{1}{N} \sum_{k=1}^{N} \mathbf{1}\{p_k \le t\} - t \right| \le \frac{1}{2N}. \tag{20}$$

While rounding row indices to discrete...
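The three discrepancy rates can be checked numerically. The following is a minimal sketch, not the authors' code: the helper names `star_discrepancy` and `allocations` and the sizes `N = 256`, `D_OUT = 1024` are illustrative assumptions, and $\rho = N/d_{\mathrm{out}}$ is inferred from Eq. (17) rather than stated in this excerpt. It computes the exact one-dimensional star discrepancy of each normalized row pattern.

```python
import random


def star_discrepancy(points):
    """Exact 1-D star discrepancy: sup over t in [0, 1] of |F_N(t) - t|."""
    pts = sorted(points)
    n = len(pts)
    # With sorted points x_(1) <= ... <= x_(n), the supremum is reached at or
    # just below a sample point, so 2n candidate gaps are enough.
    return max(max((k + 1) / n - x, x - k / n) for k, x in enumerate(pts))


def allocations(n, d_out, seed=0):
    """Normalized row positions for the three patterns (hypothetical helper)."""
    rng = random.Random(seed)
    return {
        "random": [rng.random() for _ in range(n)],
        "stacked": [(k - 0.5) / d_out for k in range(1, n + 1)],
        "super_group": [(2 * k - 1) / (2 * n) for k in range(1, n + 1)],
    }


N, D_OUT = 256, 1024  # assumed sizes, so rho = N / D_OUT = 0.25
for name, pts in allocations(N, D_OUT).items():
    print(f"{name:12s} D* = {star_discrepancy(pts):.5f}")
```

Under these assumptions the midpoint grid attains exactly $1/(2N)$, the smallest value any $N$-point set can reach. The stacked pattern stays near $1-\rho$ no matter how large $N$ grows: its exact supremum exceeds $1-\rho$ by $0.5/d_{\mathrm{out}}$, which does not affect the $\Theta(1)$ conclusion.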

