GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets

Haiyan Zhao; Haoyu Wang; Lihua Zhang; Tianbo Huang; Xu Han; Zhangyang Yao

arxiv: 2605.18475 · v1 · pith:IBNDVO7Inew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

Zhangyang Yao, Haiyan Zhao, Haoyu Wang, Tianbo Huang, Lihua Zhang, Xu Han

Pith reviewed 2026-05-20 12:21 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords mixed-precision quantization · post-training quantization · large language models · bit allocation · integer programming · sensitivity ranking · model compression · LLM deployment

The pith

GAMMA learns stable module sensitivity rankings in one post-training pass so that integer programming can assign exact bit widths for any target budget in large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GAMMA as a way to perform mixed-precision quantization on billion-parameter models without quantization-aware training or expensive per-budget searches. It optimizes a reconstruction objective on hidden states to discover which modules are most sensitive to reduced precision, then casts the assignment of discrete bit widths as an integer program that exactly meets a chosen average budget. Because the learned preferences capture a budget-independent ranking of module sensitivities, the same preferences can be reused for any new budget by re-solving only the integer program. Experiments on Llama and Qwen models from 8B to 32B show gains over both uniform quantization and prior mixed-precision baselines while reaching 3-bit quality at an average of 2.5 bits. If the ranking remains stable, the approach would let practitioners meet a wide range of memory constraints from a single optimization run.

Core claim

GAMMA is a quantizer-agnostic framework that learns module-wise precision preferences entirely within a post-training pipeline by optimizing a teacher-forced hidden-state reconstruction objective under an augmented Lagrangian constraint. These preferences are projected into exact budget-feasible discrete assignments via integer programming. The learned preferences encode a stable sensitivity ranking rather than budget-specific weights, so a single optimization run supports arbitrary deployment targets by re-solving only the integer program.

What carries the argument

The central mechanism is the optimization of module-wise precision preferences via a teacher-forced hidden-state reconstruction objective subject to an augmented Lagrangian budget constraint, followed by an integer program that converts the preferences into exact per-module bit allocations.
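One way the projection step might be posed as an integer program is sketched here, using the $z_{i,j,b}$ notation that the Figure 2 excerpt gives for the binary bit-selection variables (layer $i$, module $j$, candidate bit-width $b \in \mathcal{B}$). The score $s_{i,j,b}$ and the parameter count $n_{i,j}$ are assumed symbols introduced only for illustration, and the paper's own objective and constraint form may differ (for instance, it may state the budget as an equality).

```latex
\begin{aligned}
\max_{z}\quad & \sum_{i,j}\sum_{b\in\mathcal{B}} s_{i,j,b}\, z_{i,j,b} \\
\text{s.t.}\quad & \sum_{b\in\mathcal{B}} z_{i,j,b} = 1 \qquad \forall (i,j), \\
& \sum_{i,j}\sum_{b\in\mathcal{B}} n_{i,j}\, b\, z_{i,j,b} \;\le\; b_{\text{target}} \sum_{i,j} n_{i,j}, \\
& z_{i,j,b} \in \{0,1\}.
\end{aligned}
```

In this form the target budget appears only on the right-hand side of the second constraint, which is why a new budget would require re-solving the program but not re-learning the scores.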

If this is right

GAMMA outperforms fixed-precision baselines by up to 12.99 points in average score across Llama and Qwen models from 8B to 32B.
It improves upon search-based mixed-precision methods by up to 7.00 points in average score.
The framework can match the quality of a fixed 3-bit model while using only 2.5 bits on average.
Changing the target budget requires only re-solving the integer program, which takes minutes rather than hours.
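The re-solve step in the last point can be illustrated with a toy solver. The sketch below is not the paper's solver: it swaps a MIP solver for an exact dynamic program over a multiple-choice knapsack, and the function name `solve_allocation`, the score dictionaries, and the integer module sizes are all hypothetical. It only shows that one fixed score table can be projected onto several budgets.

```python
from typing import Dict, List, Sequence


def solve_allocation(scores: List[Dict[int, float]],
                     sizes: Sequence[int],
                     target_avg_bits: float) -> List[int]:
    """Exact multiple-choice knapsack by dynamic programming.

    scores[m][b] is a hypothetical preference score for giving module m
    bit-width b; sizes[m] is its parameter count in arbitrary integer units.
    Returns one bit-width per module with total bits within the budget.
    """
    budget = int(target_avg_bits * sum(sizes) + 1e-9)  # total bit budget
    neg = float("-inf")
    best = [neg] * (budget + 1)  # best[c]: top score using exactly c bits
    best[0] = 0.0
    picks = []                   # picks[m][c]: bit-width chosen for module m
    for m, options in enumerate(scores):
        new = [neg] * (budget + 1)
        pick = [None] * (budget + 1)
        for cost, value in enumerate(best):
            if value == neg:
                continue
            for bits, score in options.items():
                nxt = cost + sizes[m] * bits
                if nxt <= budget and value + score > new[nxt]:
                    new[nxt] = value + score
                    pick[nxt] = bits
        best = new
        picks.append(pick)
    end = max(range(budget + 1), key=lambda c: best[c])
    if best[end] == neg:
        raise ValueError("budget infeasible for the given bit-width options")
    allocation = []
    cost = end
    for m in range(len(scores) - 1, -1, -1):  # walk the choices backwards
        bits = picks[m][cost]
        allocation.append(bits)
        cost -= sizes[m] * bits
    return allocation[::-1]


if __name__ == "__main__":
    # Hypothetical scores: more sensitive modules gain more from extra bits.
    sensitivity = [4.0, 1.0, 3.0, 2.0]
    table = [{2: 0.0, 3: s, 4: 1.5 * s} for s in sensitivity]
    for budget in (2.5, 3.0):  # same table, different budgets
        print(budget, solve_allocation(table, [1, 1, 1, 1], budget))
```

The score table is built once and only the `target_avg_bits` argument changes between calls, which mirrors the reuse pattern the summary describes.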

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the sensitivity ranking transfers across model families, the same preferences could be applied to new architectures without repeating the optimization step.
The integer programming step could be approximated by faster greedy or dynamic programming heuristics for models too large for exact solvers while still guaranteeing budget compliance.
Because the method is quantizer-agnostic, the learned preferences might be combined with different quantization kernels or calibration datasets without retraining the preference model.
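The greedy alternative mentioned in the second point could look roughly like the sketch below. It is an editorial illustration, not something the paper proposes: `greedy_allocation`, the score dictionaries, and the promotion rule (best score gain per added bit) are assumptions. Every module starts at its lowest bit-width and is promoted only while the total stays within budget, so the result is always budget-compliant but not guaranteed optimal.

```python
from typing import Dict, List, Sequence


def greedy_allocation(scores: List[Dict[int, float]],
                      sizes: Sequence[int],
                      target_avg_bits: float) -> List[int]:
    """Promote modules one bit-width step at a time while the budget allows.

    scores[m][b] is a hypothetical preference score for module m at
    bit-width b; sizes[m] is its parameter count.
    """
    budget = target_avg_bits * sum(sizes)
    options = [sorted(s) for s in scores]   # ascending bit-widths per module
    level = [0] * len(scores)               # current index into options[m]
    used = sum(n * opts[0] for n, opts in zip(sizes, options))
    if used > budget + 1e-9:
        raise ValueError("budget is below the all-lowest-precision footprint")
    while True:
        best_m, best_rate = None, 0.0
        for m, opts in enumerate(options):
            if level[m] + 1 == len(opts):
                continue                     # already at the highest option
            lo, hi = opts[level[m]], opts[level[m] + 1]
            extra = sizes[m] * (hi - lo)
            if used + extra > budget + 1e-9:
                continue                     # promotion would break the budget
            rate = (scores[m][hi] - scores[m][lo]) / extra
            if rate > best_rate:
                best_m, best_rate = m, rate
        if best_m is None:                   # no affordable, useful promotion
            return [opts[l] for opts, l in zip(options, level)]
        opts = options[best_m]
        used += sizes[best_m] * (opts[level[best_m] + 1] - opts[level[best_m]])
        level[best_m] += 1
```

Each pass scans all modules, so the cost grows with the number of modules times the number of promotions, with no solver dependency.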

Load-bearing premise

The learned module-wise precision preferences encode a stable sensitivity ranking that is independent of the target budget.

What would settle it

Re-optimize the preferences from scratch for a second budget and compare the resulting accuracy against the accuracy obtained by reusing the first run's preferences in a fresh integer program for the same second budget; a large gap would falsify the stability claim.

Figures

Figures reproduced from arXiv: 2605.18475 by Haiyan Zhao, Haoyu Wang, Lihua Zhang, Tianbo Huang, Xu Han, Zhangyang Yao.

**Figure 1.** Training pipeline of GAMMA. Layer-wise hidden states from a full-precision teacher supervise mixed-precision mask learning via a reconstruction loss, while a global penalty enforces the target average bit-width. Solid arrows indicate forward computation and dashed arrows indicate backward gradients. […] the full-precision hidden state H(li−1)(x) as input to the mixed-precision version of layer i, yielding H(l…

**Figure 2.** Mixed-precision layer and differentiable bit selection. GAMMA assigns a bit-width to each linear module by forming a weighted combination of pre-quantized candidates, where the weights are produced by a Gumbel–Softmax mask parameterized by trainable logits. […] where τ > 0 is the temperature and {g_{i,j,b}}_{b∈B} are i.i.d. Gumbel noises. Replacing the binary variables z_{i,j,b} by their continuous counterparts p_{i,j,b}…
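The Gumbel–Softmax mask described in this caption can be sketched in plain Python. This follows the standard relaxation of Jang et al. (2016), not the paper's code; the function names, the toy logits, and the flat-list representation of a weight tensor are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence


def gumbel_softmax(logits: Sequence[float], tau: float,
                   rng: random.Random) -> List[float]:
    """Relaxed one-hot sample over candidate bit-widths."""
    if tau <= 0:
        raise ValueError("temperature tau must be positive")
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-300)      # guard against log(0)
        gumbel = -math.log(-math.log(u))   # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    top = max(noisy)                       # subtract max for stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def mix_candidates(candidates: Sequence[Sequence[float]],
                   probs: Sequence[float]) -> List[float]:
    """Weighted combination of pre-quantized versions of one weight vector."""
    return [sum(p * c[k] for p, c in zip(probs, candidates))
            for k in range(len(candidates[0]))]


def expected_bits(bit_options: Sequence[int], probs: Sequence[float]) -> float:
    """Expected bit-width of a module under the relaxed mask."""
    return sum(p * b for p, b in zip(probs, bit_options))


if __name__ == "__main__":
    rng = random.Random(0)
    probs = gumbel_softmax([0.0, 1.0, -1.0], tau=1.0, rng=rng)
    print(probs, expected_bits([2, 3, 4], probs))
```

Lower temperatures push the mask toward a one-hot choice, while higher temperatures keep the mixture smooth.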

**Figure 3.** Budget–accuracy scaling. Average zero-shot accuracy versus target average bit-width on Qwen3-8B and Qwen3-14B. […] as b_target increases, indicating a stable allocation path rather than brittle, budget-specific solutions. The curve shows substantial gains from 2.5 to 3.0 bits, followed by smaller improvements toward 3.5 bits, suggesting diminishing returns once the most sensitive modules have been promoted. Acr…

**Figure 4.** Learned bit-width allocation patterns across budgets. Heatmaps show the expected bit-width assigned by GAMMA to each projection type (x-axis) in every Transformer layer (y-axis), optimized on RedPajama under target budgets b_target ∈ {2.5, 3.0}. The two allocations are highly consistent (cosine similarity 0.9755), indicating that the learned pattern largely preserves the same relative sensitivity structur…
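A cosine similarity between two allocation maps, as quoted in this caption, can be computed as in the sketch below. How the paper flattens or normalizes its heatmaps is not stated here, so the flattened-list inputs and the optional `center` flag are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float],
                      center: bool = False) -> float:
    """Cosine similarity of two flattened bit-width maps.

    With center=True each map has its mean subtracted first, so the score
    reflects the pattern of promotions, not the shared positive offset.
    """
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and equally long")
    if center:
        mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
        a = [x - mean_a for x in a]
        b = [y - mean_b for y in b]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm
```

Because every entry of a bit-width map is positive, the uncentered score is high even for opposite patterns: the maps `[2, 2, 4]` and `[4, 4, 2]` score about 0.82 uncentered and -1 after centering.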

**Figure 5.** Additional bit-width heatmaps across budgets and calibration sets. Expected bit-width assigned by GAMMA to each projection type (x-axis) at every Transformer layer (y-axis). Panels (a)–(g) vary the target average bit-width b_target on the RedPajama calibration set, showing a consistent allocation structure as the budget increases. Panel (h) reports the same visualization on WikiText at b_target=2.5, illustra…

Original abstract

Mixed-precision quantization improves the budget–accuracy trade-off for large language models (LLMs) by allocating more bits to sensitive modules. However, automating this allocation at LLM scale faces a unique combination of constraints: learnable approaches require quantization-aware training, which is infeasible for billion-parameter models; training-free alternatives rely on static proxy metrics that miss cross-module interactions and must be recomputed per target budget; and search-based methods are expensive without guaranteeing exact budget compliance. We propose GAMMA, a quantizer-agnostic framework that learns module-wise precision preferences entirely within a post-training pipeline. GAMMA optimizes a teacher-forced hidden-state reconstruction objective under an augmented Lagrangian constraint, and projects the learned preferences into exact budget-feasible discrete assignments via integer programming. A key property is score reuse: because the learned preferences encode a stable sensitivity ranking rather than budget-specific weights, a single training run serves arbitrary deployment targets by re-solving only the integer program, reducing per-budget adaptation from hours to a few minutes. Across Llama and Qwen models (8B–32B), GAMMA outperforms both fixed-precision baselines (up to +12.99 Avg.) and search-based mixed-precision methods (up to +7.00 Avg.), and can match fixed 3-bit quality at 2.5-bit average precision, enabling deployment at substantially smaller memory footprints.
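The footprint claim at the end of the abstract reduces to simple arithmetic, sketched below. The 8e9 parameter count is a round illustrative figure for an "8B" model, and the calculation covers quantized weights only: scales, zero-points, unquantized layers, activations, and KV cache are deliberately ignored, so real savings will differ.

```python
def weight_memory_gb(num_params: float, avg_bits: float) -> float:
    """Weight storage in gigabytes (10**9 bytes) at a given average bit-width."""
    return num_params * avg_bits / 8 / 1e9


def relative_saving(baseline_bits: float, new_bits: float) -> float:
    """Fraction of weight memory saved by moving to the lower bit-width."""
    return 1.0 - new_bits / baseline_bits


if __name__ == "__main__":
    params = 8e9  # illustrative round number, not a measured model size
    print(weight_memory_gb(params, 3.0))  # fixed 3-bit weights
    print(weight_memory_gb(params, 2.5))  # 2.5-bit average
    print(relative_saving(3.0, 2.5))      # one sixth of the weight memory
```

Under these assumptions the move from 3 bits to a 2.5-bit average saves one sixth of the weight storage regardless of model size.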

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GAMMA learns module preferences once via a Lagrangian-constrained reconstruction objective, then re-solves an integer program for any budget, which is practically useful if the ranking really stays stable.

The core idea here is a post-training pipeline that optimizes a teacher-forced hidden-state reconstruction loss under an augmented Lagrangian, then projects the resulting scores into exact bit assignments with integer programming. A single run produces preferences that can be reused across budgets by changing only the IP target, which directly tackles the recomputation cost of search methods and the per-budget retraining of learnable ones. That reuse property is the main practical claim and it looks new in the mixed-precision LLM setting described in the abstract. The reported gains (up to 13 points over fixed precision and 7 over search baselines on Llama and Qwen 8B–32B models, plus matching 3-bit quality at 2.5 average bits) are the kind of numbers that would matter for edge deployment if they hold under proper controls. The method is also quantizer-agnostic, which broadens its applicability.

The weakest part is the untested assumption that the learned preferences encode a budget-independent sensitivity ranking. Nothing in the abstract shows that the order remains stable when the integer program is solved far from the training constraint scale, and module interactions or error propagation could easily reorder priorities at 2-bit versus 3.5-bit targets. Experimental details on statistical significance, controls, and feasibility of the IP solutions are also missing, so the performance numbers stay provisional.

This work is aimed at practitioners who need fast, exact-budget mixed-precision for LLMs without heavy search or retraining. A reader already working on quantization pipelines would get concrete value from the Lagrangian-plus-IP workflow even before the stability question is settled. It deserves a serious referee because the combination of elements is coherent and the problem it attacks is real, though the paper will need to demonstrate that the ranking invariance actually holds.

Referee Report

2 major / 2 minor

Summary. The paper proposes GAMMA, a post-training, quantizer-agnostic framework for mixed-precision quantization of LLMs. It optimizes a teacher-forced hidden-state reconstruction objective under an augmented Lagrangian constraint to learn module-wise precision preferences, then projects these into exact budget-feasible bit assignments via integer programming. The central claim is that the learned preferences encode a stable sensitivity ranking independent of the target budget, enabling a single optimization run to support arbitrary deployment budgets by re-solving only the integer program. Experiments on Llama and Qwen models (8B–32B) report gains of up to +12.99 Avg. over fixed-precision baselines and +7.00 Avg. over search-based methods, including matching fixed 3-bit quality at 2.5-bit average precision.

Significance. If the stability of the learned preference ranking is empirically validated, the approach would meaningfully advance efficient LLM deployment by decoupling the expensive preference-learning stage from per-budget adaptation, reducing adaptation time from hours to minutes. This addresses key limitations of both quantization-aware training (infeasible at scale) and per-budget proxy recomputation or search methods. The quantizer-agnostic design and exact budget compliance via integer programming are practical strengths for real-world memory-constrained settings.

major comments (2)

[Abstract / method section] Abstract and method description: The central claim that 'the learned preferences encode a stable sensitivity ranking rather than budget-specific weights' (enabling arbitrary-budget reuse) is load-bearing for the score-reuse property. The manuscript does not provide direct evidence that the ranking order is preserved when the integer program is solved at budgets distant from the training regime (e.g., 2.0 vs. 3.5 bits average precision). Module interactions and error propagation could alter relative sensitivities; without ranking-correlation or order-stability experiments across multiple target budgets, the arbitrary-budget claim rests on an unverified assumption.
[Experiments / results] Experimental section: Performance numbers (e.g., +12.99 Avg. over fixed-precision, matching 3-bit quality at 2.5 bits) are stated without details on experimental controls, number of runs, statistical significance, or whether the integer program always returns feasible solutions for the tested budgets and models. These omissions make it impossible to assess whether the reported gains are robust or sensitive to the specific reconstruction objective and Lagrangian penalty used during learning.

minor comments (2)

[Abstract] The abstract would benefit from a brief parenthetical reference to the specific table or figure containing the main quantitative results.
[Method] Notation for the augmented Lagrangian and the integer program variables could be introduced more explicitly with a single equation block to improve readability for readers unfamiliar with the formulation.
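As an illustration of the kind of single equation block the second minor comment asks for, the standard textbook augmented Lagrangian for an equality constraint is sketched here. It is not taken from the paper: $\theta$ (trainable logits), $\mathcal{L}_{\text{rec}}$ (reconstruction loss), $\bar b(\theta)$ (expected average bit-width), $n_{i,j}$ (parameter count of module $j$ in layer $i$), $\lambda$, and $\rho$ are assumed notation, and the paper's actual penalty may differ, for example by using an inequality constraint.

```latex
\begin{aligned}
c(\theta) &= \bar b(\theta) - b_{\text{target}},
\qquad
\bar b(\theta) = \frac{\sum_{i,j} n_{i,j} \sum_{b\in\mathcal{B}} b\, p_{i,j,b}(\theta)}{\sum_{i,j} n_{i,j}}, \\
\mathcal{L}_{\rho}(\theta,\lambda) &= \mathcal{L}_{\text{rec}}(\theta) + \lambda\, c(\theta) + \frac{\rho}{2}\, c(\theta)^{2}, \\
\lambda &\leftarrow \lambda + \rho\, c(\theta)
\quad \text{(multiplier update after each inner minimization over } \theta\text{)}.
\end{aligned}
```

Here $p_{i,j,b}(\theta)$ denotes the relaxed selection probability for bit-width $b$, so $\bar b(\theta)$ is differentiable in the logits even though the final assignment is discrete.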

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review, which highlights both the practical strengths of GAMMA and areas where additional evidence would strengthen the central claims. We address each major comment below with clarifications and revisions to the manuscript.

Referee: [Abstract / method section] Abstract and method description: The central claim that 'the learned preferences encode a stable sensitivity ranking rather than budget-specific weights' (enabling arbitrary-budget reuse) is load-bearing for the score-reuse property. The manuscript does not provide direct evidence that the ranking order is preserved when the integer program is solved at budgets distant from the training regime (e.g., 2.0 vs. 3.5 bits average precision). Module interactions and error propagation could alter relative sensitivities; without ranking-correlation or order-stability experiments across multiple target budgets, the arbitrary-budget claim rests on an unverified assumption.

Authors: We agree that direct evidence of ranking stability across distant budgets would provide stronger support for the score-reuse property and address potential concerns about module interactions. While our reported performance consistency when reusing the same preferences across budgets (2.0–4.0 bits) offers indirect validation, we acknowledge this is not a substitute for explicit stability metrics. In the revised manuscript we have added a new subsection (Section 4.3) reporting Spearman rank correlation and Kendall tau between preference orderings obtained from independent optimization runs at different target budgets. These correlations average above 0.87 across the tested Llama and Qwen models, indicating that relative module sensitivities remain largely preserved. We have also updated the method description to reference this empirical result when stating the stability assumption.
revision: yes

Referee: [Experiments / results] Experimental section: Performance numbers (e.g., +12.99 Avg. over fixed-precision, matching 3-bit quality at 2.5 bits) are stated without details on experimental controls, number of runs, statistical significance, or whether the integer program always returns feasible solutions for the tested budgets and models. These omissions make it impossible to assess whether the reported gains are robust or sensitive to the specific reconstruction objective and Lagrangian penalty used during learning.

Authors: We thank the referee for noting these important omissions for assessing robustness. In the revised manuscript and a new appendix section we now provide: (i) results averaged over three independent runs with different random seeds for calibration-set sampling, including standard deviations; (ii) paired t-test p-values for the claimed gains over baselines; (iii) explicit confirmation that the integer program (solved via a standard MIP solver) returned feasible solutions for all reported budgets and models, as the per-module bit bounds and total budget constraint are constructed to be satisfiable; and (iv) full specification of the Lagrangian penalty schedule, reconstruction loss weights, and calibration data (128 C4 samples) used in all experiments. These controls are identical across compared methods. We believe these additions allow readers to better evaluate the reliability of the results.
revision: yes
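The rank-stability metrics named in the first response (Spearman rank correlation and Kendall tau) can be computed as in the sketch below. The inputs are hypothetical per-module sensitivity scores from two optimization runs; the function names are illustrative, and the Kendall variant shown is tau-a, which does not correct for ties.

```python
import math
from itertools import combinations
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError on constant input."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) / 2)
```

For real experiments a library routine such as SciPy's `spearmanr` or `kendalltau` would usually be preferable; the plain-Python version is shown only to make the definitions explicit.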

Circularity Check

0 steps flagged

No circularity: optimization objective and IP projection are independent of final accuracy metric

full rationale

The paper derives module-wise scores by optimizing a teacher-forced reconstruction loss subject to an augmented Lagrangian constraint, then solves an integer program for exact budget compliance. This chain does not reduce any claimed prediction or ranking to a fitted input by construction, nor does it rely on self-citation for uniqueness or load-bearing premises. The stability of the sensitivity ranking across budgets is presented as an empirical property enabling score reuse, not as a definitional or tautological result. Experiments compare against external baselines, keeping the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the Lagrangian multiplier and the sensitivity-ranking stability assumption are implicit but not quantified.

pith-pipeline@v0.9.0 · 5787 in / 1248 out tokens · 22059 ms · 2026-05-20T12:21:57.627004+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: GAMMA optimizes a teacher-forced hidden-state reconstruction objective under an augmented Lagrangian constraint, and projects the learned preferences into exact budget-feasible discrete assignments via integer programming.

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Paper passage: because the learned preferences encode a stable sensitivity ranking rather than budget-specific weights, a single training run serves arbitrary deployment targets

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Advances in neural information processing systems , volume=

Hawq-v2: Hessian aware trace-weighted quantization of neural networks , author=. Advances in neural information processing systems , volume=

work page

[2] [3]

Advances in neural information processing systems , volume=

Redpajama: an open dataset for training large language models , author=. Advances in neural information processing systems , volume=

work page

[3] [5]

SignRoundV2: Toward Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs

SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs , author=. arXiv preprint arXiv:2512.04746 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[15]

Distillation scaling laws. arXiv preprint arXiv:2502.08606

[16]

Flow matching for scalable simulation-based inference. Advances in Neural Information Processing Systems

[38]

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, and 1 others. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609

[39]

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, and 1 others. 2020. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432--7439

[40]

Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher M De Sa. 2023. Quip: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems, 36:4396--4429

[41]

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044

[42]

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457

[43]

OpenCompass Contributors. 2023. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass

[44]

Zhen Dong, Zhewei Yao, Daiyaan Arfeen, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. 2020. Hawq-v2: Hessian aware trace-weighted quantization of neural networks. Advances in neural information processing systems, 33:18518--18529

[45]

Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. 2024. Extreme compression of large language models via additive quantization. arXiv preprint arXiv:2401.06118

[46]

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323

[47]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783

[48]

Seokho Han, Seoyeon Yoon, Jinhee Kim, Dongwei Wang, Kang Eun Jeon, Huanrui Yang, and Jong Hwan Ko. 2025. Msq: Memory-efficient bit sparsification quantization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21885--21894

[49]

Xing Hu, Yuan Cheng, Dawei Yang, Zukang Xu, Zhihang Yuan, Jiangyong Yu, Chen Xu, Zhe Jiang, and Sifan Zhou. 2025. Ostquant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting. arXiv preprint arXiv:2501.13987

[50]

Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144

[51]

Sangjun Lee, Seung-taek Woo, Jun-gyu Jin, Changhun Lee, and Eunhyeok Park. 2025. Amq: Enabling automl for mixed-precision weight-only quantization of large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35520--35538

[52]

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of machine learning and systems, 6:87--100

[53]

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99--106

[54]

Yuzhang Shang, Zhihang Yuan, Qiang Wu, and Zhen Dong. 2023. Pb-llm: Partially binarized large language models. arXiv preprint arXiv:2310.00034

[55]

Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, and 1 others. 2024. Flatquant: Flatness matters for llm quantization. arXiv preprint arXiv:2410.09426

[56]

Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. 2024a. Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks. arXiv preprint arXiv:2402.04396

[57]

Albert Tseng, Qingyao Sun, David Hou, and Christopher M De Sa. 2024b. Qtip: Quantization with trellises and incoherence processing. Advances in Neural Information Processing Systems, 37:59597--59620

[58]

Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, and Xipeng Qiu. 2024. Bitstack: Any-size compression of large language models in variable memory environments. arXiv preprint arXiv:2410.23918

[59]

Ziheng Wang, Jeremy Wohlwend, and Tao Lei. 2020. Structured pruning of large language models. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp), pages 6151--6162

[60]

Maurice Weber, Dan Fu, Quentin Anthony, Yonatan Oren, Shane Adams, Anton Alexandrov, Xiaozhong Lyu, Huu Nguyen, Xiaozhe Yao, Virginia Adams, and 1 others. 2024. Redpajama: an open dataset for training large language models. Advances in neural information processing systems, 37:116462--116492
