GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets
Pith reviewed 2026-05-20 12:21 UTC · model grok-4.3
The pith
GAMMA learns stable module sensitivity rankings in one post-training pass so that integer programming can assign exact bit widths for any target budget in large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GAMMA is a quantizer-agnostic framework that learns module-wise precision preferences entirely within a post-training pipeline by optimizing a teacher-forced hidden-state reconstruction objective under an augmented Lagrangian constraint. These preferences are projected into exact budget-feasible discrete assignments via integer programming. The learned preferences encode a stable sensitivity ranking rather than budget-specific weights, so a single optimization run supports arbitrary deployment targets by re-solving only the integer program.
What carries the argument
The central mechanism is the optimization of module-wise precision preferences via a teacher-forced hidden-state reconstruction objective subject to an augmented Lagrangian budget constraint, followed by an integer program that converts the preferences into exact per-module bit allocations.
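One way to write the two stages down, as a sketch under assumed notation. None of the symbols below come from the paper: module i has n_i parameters, \mathcal{B} is the set of candidate bit widths, \alpha are the relaxed preferences, \bar b_i(\alpha) is the expected bit width of module i, \bar B is the target average precision, and s_{i,b} is the learned preference score for giving module i exactly b bits. The budget is treated here as an equality constraint, which is the textbook augmented Lagrangian form and may differ from the paper's exact penalty.

```latex
% Stage 1 (assumed form): preference learning with an augmented Lagrangian
\min_{\alpha}\;
  \mathcal{L}_{\mathrm{rec}}(\alpha)
  + \lambda\, c(\alpha)
  + \frac{\rho}{2}\, c(\alpha)^{2},
\qquad
c(\alpha) = \frac{\sum_i n_i\, \bar b_i(\alpha)}{\sum_i n_i} - \bar B,
\qquad
\lambda \leftarrow \lambda + \rho\, c(\alpha)

% Stage 2 (assumed form): projection to an exact discrete assignment
\max_{x}\; \sum_i \sum_{b \in \mathcal{B}} s_{i,b}\, x_{i,b}
\quad \text{s.t.} \quad
\sum_{b \in \mathcal{B}} x_{i,b} = 1 \;\; \forall i,
\qquad
\sum_i n_i \sum_{b \in \mathcal{B}} b\, x_{i,b} \;\le\; \bar B \sum_i n_i,
\qquad
x_{i,b} \in \{0, 1\}
```

In this form only \bar B appears in Stage 2, which is why a new budget would require re-solving the integer program but not re-running Stage 1, provided the scores s_{i,b} really are budget-independent.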
If this is right
- GAMMA outperforms fixed-precision baselines by up to 12.99 points in average score (Avg.) across Llama and Qwen models from 8B to 32B.
- It improves on search-based mixed-precision methods by up to 7.00 points in average score (Avg.).
- The framework can match the quality of a fixed 3-bit model while using only 2.5 bits on average.
- Changing the target budget requires only re-solving the integer program, which takes minutes rather than hours.
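The score-reuse consequence can be illustrated with a toy exact solver. The sketch below is not the paper's solver: it is a small dynamic program for the multiple-choice knapsack problem, and the function name, the score values, and the module sizes (in arbitrary units) are all hypothetical. It shows one fixed score table being re-solved for three different average-precision targets.

```python
from typing import Dict, List


def solve_allocation(
    scores: List[Dict[int, float]], sizes: List[int], total_bits: int
) -> List[int]:
    """Pick one bit width per module to maximize total score,
    subject to sum(size * bits) <= total_bits (exact DP)."""
    neg_inf = float("-inf")
    # best[c] = best total score over processed modules with cost exactly c
    best = [neg_inf] * (total_bits + 1)
    best[0] = 0.0
    choices = []  # choices[i][c] = bit width of module i on the best path to cost c
    for module_scores, size in zip(scores, sizes):
        new_best = [neg_inf] * (total_bits + 1)
        picked = [None] * (total_bits + 1)
        for cost, value in enumerate(best):
            if value == neg_inf:
                continue
            for bits, score in module_scores.items():
                new_cost = cost + size * bits
                if new_cost <= total_bits and value + score > new_best[new_cost]:
                    new_best[new_cost] = value + score
                    picked[new_cost] = bits
        best = new_best
        choices.append(picked)
    end = max(range(total_bits + 1), key=lambda c: best[c])
    if best[end] == neg_inf:
        raise ValueError("budget is infeasible")
    # Walk the back-pointers from the last module to the first.
    allocation = []
    for picked, size in zip(reversed(choices), reversed(sizes)):
        bits = picked[end]
        allocation.append(bits)
        end -= size * bits
    return allocation[::-1]


# Hypothetical preference scores: scores[i][b] = benefit of giving module i b bits.
scores = [
    {2: 0.0, 3: 5.0, 4: 6.0},
    {2: 0.0, 3: 1.2, 4: 1.5},
    {2: 0.0, 3: 3.0, 4: 7.0},
]
sizes = [1, 1, 2]  # module sizes in arbitrary units (assumption)

# Same scores, three budgets: only the solver is re-run.
for avg_bits in (2.5, 3.0, 3.5):
    budget = int(avg_bits * sum(sizes))
    print(avg_bits, solve_allocation(scores, sizes, budget))
```

The table size grows with the total bit budget, so this form only suits coarse size units; a production setup would more plausibly hand the same problem to a MIP solver.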
Where Pith is reading between the lines
- If the sensitivity ranking transfers across model families, the same preferences could be applied to new architectures without repeating the optimization step.
- The integer programming step could be approximated by faster greedy or dynamic programming heuristics for models too large for exact solvers while still guaranteeing budget compliance.
- Because the method is quantizer-agnostic, the learned preferences might be combined with different quantization kernels or calibration datasets without retraining the preference model.
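The greedy alternative mentioned in the second point can be sketched as follows. This is an illustration of the general idea, not something the paper describes: every module starts at its lowest bit width, and the upgrade with the best score gain per extra bit is applied repeatedly while it still fits. The function name, scores, and sizes are hypothetical. The result always respects the budget, but unlike an exact solver it is not guaranteed to be optimal.

```python
from typing import Dict, List


def greedy_allocation(
    scores: List[Dict[int, float]], sizes: List[int], total_bits: int
) -> List[int]:
    """Budget-compliant greedy upgrade heuristic (not guaranteed optimal)."""
    allocation = [min(module_scores) for module_scores in scores]  # lowest bit width
    used = sum(size * bits for size, bits in zip(sizes, allocation))
    if used > total_bits:
        raise ValueError("budget is below the minimum-precision footprint")
    while True:
        best = None  # (gain per extra bit, module index, new bits, extra cost)
        for i, (module_scores, size) in enumerate(zip(scores, sizes)):
            current = allocation[i]
            for bits, score in module_scores.items():
                if bits <= current:
                    continue
                extra = size * (bits - current)
                gain = score - module_scores[current]
                if gain <= 0 or used + extra > total_bits:
                    continue
                ratio = gain / extra
                if best is None or ratio > best[0]:
                    best = (ratio, i, bits, extra)
        if best is None:
            return allocation  # no affordable upgrade improves the score
        _, i, bits, extra = best
        allocation[i] = bits
        used += extra


# Hypothetical scores and sizes, for illustration only.
scores = [
    {2: 0.0, 3: 5.0, 4: 6.0},
    {2: 0.0, 3: 1.2, 4: 1.5},
    {2: 0.0, 3: 3.0, 4: 7.0},
]
sizes = [1, 1, 2]
print(greedy_allocation(scores, sizes, 10))
```

Because the loop only ever applies upgrades that fit in the remaining budget, compliance holds by construction; the open question is how much quality is lost relative to the exact integer program.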
Load-bearing premise
The learned module-wise precision preferences encode a stable sensitivity ranking that is independent of the target budget.
What would settle it
Re-optimize the preferences from scratch for a second budget and compare the resulting accuracy against the accuracy obtained by reusing the first run's preferences in a fresh integer program for the same second budget; a large gap would falsify the stability claim.
Original abstract
Mixed-precision quantization improves the budget--accuracy trade-off for large language models (LLMs) by allocating more bits to sensitive modules. However, automating this allocation at LLM scale faces a unique combination of constraints: learnable approaches require quantization-aware training, which is infeasible for billion-parameter models; training-free alternatives rely on static proxy metrics that miss cross-module interactions and must be recomputed per target budget; and search-based methods are expensive without guaranteeing exact budget compliance. We propose GAMMA, a quantizer-agnostic framework that learns module-wise precision preferences entirely within a post-training pipeline. GAMMA optimizes a teacher-forced hidden-state reconstruction objective under an augmented Lagrangian constraint, and projects the learned preferences into exact budget-feasible discrete assignments via integer programming. A key property is score reuse: because the learned preferences encode a stable sensitivity ranking rather than budget-specific weights, a single training run serves arbitrary deployment targets by re-solving only the integer program, reducing per-budget adaptation from hours to a few minutes. Across Llama and Qwen models (8B--32B), GAMMA outperforms both fixed-precision baselines (up to +12.99 Avg.) and search-based mixed-precision methods (up to +7.00 Avg.), and can match fixed 3-bit quality at 2.5-bit average precision, enabling deployment at substantially smaller memory footprints.
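To put the 2.5-bit versus 3-bit comparison in concrete terms, the following back-of-the-envelope sketch computes weight storage only. It is an approximation under stated assumptions: parameter counts are rounded to the nominal 8B and 32B, and quantization metadata (scales, zero-points), activations, and the KV cache are ignored. The function name is hypothetical.

```python
def weight_memory_gb(num_params: float, avg_bits: float) -> float:
    """Weight storage in gigabytes (10^9 bytes), ignoring quantization metadata."""
    return num_params * avg_bits / 8 / 1e9


for num_params in (8e9, 32e9):
    fixed = weight_memory_gb(num_params, 3.0)
    mixed = weight_memory_gb(num_params, 2.5)
    saving = 100 * (fixed - mixed) / fixed
    print(f"{num_params / 1e9:.0f}B params: {fixed:.1f} GB at 3 bits, "
          f"{mixed:.1f} GB at 2.5 bits ({saving:.1f}% smaller)")
```

Under these assumptions the saving is one sixth of the weight footprint regardless of model size.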
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GAMMA, a post-training, quantizer-agnostic framework for mixed-precision quantization of LLMs. It optimizes a teacher-forced hidden-state reconstruction objective under an augmented Lagrangian constraint to learn module-wise precision preferences, then projects these into exact budget-feasible bit assignments via integer programming. The central claim is that the learned preferences encode a stable sensitivity ranking independent of the target budget, enabling a single optimization run to support arbitrary deployment budgets by re-solving only the integer program. Experiments on Llama and Qwen models (8B–32B) report gains of up to +12.99 Avg. over fixed-precision baselines and +7.00 Avg. over search-based methods, including matching fixed 3-bit quality at 2.5-bit average precision.
Significance. If the stability of the learned preference ranking is empirically validated, the approach would meaningfully advance efficient LLM deployment by decoupling the expensive preference-learning stage from per-budget adaptation, reducing adaptation time from hours to minutes. This addresses key limitations of both quantization-aware training (infeasible at scale) and per-budget proxy recomputation or search methods. The quantizer-agnostic design and exact budget compliance via integer programming are practical strengths for real-world memory-constrained settings.
major comments (2)
- [Abstract / method section] Abstract and method description: The central claim that 'the learned preferences encode a stable sensitivity ranking rather than budget-specific weights' (enabling arbitrary-budget reuse) is load-bearing for the score-reuse property. The manuscript does not provide direct evidence that the ranking order is preserved when the integer program is solved at budgets distant from the training regime (e.g., 2.0 vs. 3.5 bits average precision). Module interactions and error propagation could alter relative sensitivities; without ranking-correlation or order-stability experiments across multiple target budgets, the arbitrary-budget claim rests on an unverified assumption.
- [Experiments / results] Experimental section: Performance numbers (e.g., +12.99 Avg. over fixed-precision, matching 3-bit quality at 2.5 bits) are stated without details on experimental controls, number of runs, statistical significance, or whether the integer program always returns feasible solutions for the tested budgets and models. These omissions make it impossible to assess whether the reported gains are robust or sensitive to the specific reconstruction objective and Lagrangian penalty used during learning.
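The feasibility question in the second comment has a simple necessary and sufficient check when the budget is an upper bound on average precision: the target must be at least the average obtained by giving every module its lowest allowed bit width. The sketch below illustrates that check; the function names, module sizes, and candidate bit widths are hypothetical and not taken from the manuscript.

```python
from typing import List


def min_average_bits(sizes: List[int], candidate_bits: List[List[int]]) -> float:
    """Lowest achievable size-weighted average bit width."""
    total = sum(sizes)
    return sum(size * min(bits) for size, bits in zip(sizes, candidate_bits)) / total


def is_feasible(
    sizes: List[int], candidate_bits: List[List[int]], target_avg_bits: float
) -> bool:
    """True if some assignment satisfies average bits <= target_avg_bits."""
    return target_avg_bits >= min_average_bits(sizes, candidate_bits)


sizes = [1, 1, 2]  # module sizes in arbitrary units (assumption)
candidate_bits = [[2, 3, 4], [2, 3, 4], [2, 3, 4]]
print(is_feasible(sizes, candidate_bits, 2.5))
print(is_feasible(sizes, candidate_bits, 1.5))
```

A check like this says nothing about solution quality; it only rules out budgets for which no assignment exists.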
minor comments (2)
- [Abstract] The abstract would benefit from a brief parenthetical reference to the specific table or figure containing the main quantitative results.
- [Method] Notation for the augmented Lagrangian and the integer program variables could be introduced more explicitly with a single equation block to improve readability for readers unfamiliar with the formulation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review, which highlights both the practical strengths of GAMMA and areas where additional evidence would strengthen the central claims. We address each major comment below with clarifications and revisions to the manuscript.
- Referee: [Abstract / method section] Abstract and method description: The central claim that 'the learned preferences encode a stable sensitivity ranking rather than budget-specific weights' (enabling arbitrary-budget reuse) is load-bearing for the score-reuse property. The manuscript does not provide direct evidence that the ranking order is preserved when the integer program is solved at budgets distant from the training regime (e.g., 2.0 vs. 3.5 bits average precision). Module interactions and error propagation could alter relative sensitivities; without ranking-correlation or order-stability experiments across multiple target budgets, the arbitrary-budget claim rests on an unverified assumption.
Authors: We agree that direct evidence of ranking stability across distant budgets would provide stronger support for the score-reuse property and address potential concerns about module interactions. While our reported performance consistency when reusing the same preferences across budgets (2.0–4.0 bits) offers indirect validation, we acknowledge this is not a substitute for explicit stability metrics. In the revised manuscript we have added a new subsection (Section 4.3) reporting Spearman rank correlation and Kendall tau between preference orderings obtained from independent optimization runs at different target budgets. These correlations average above 0.87 across the tested Llama and Qwen models, indicating that relative module sensitivities remain largely preserved. We have also updated the method description to reference this empirical result when stating the stability assumption. revision: yes
- Referee: [Experiments / results] Experimental section: Performance numbers (e.g., +12.99 Avg. over fixed-precision, matching 3-bit quality at 2.5 bits) are stated without details on experimental controls, number of runs, statistical significance, or whether the integer program always returns feasible solutions for the tested budgets and models. These omissions make it impossible to assess whether the reported gains are robust or sensitive to the specific reconstruction objective and Lagrangian penalty used during learning.
Authors: We thank the referee for noting these important omissions for assessing robustness. In the revised manuscript and a new appendix section we now provide: (i) results averaged over three independent runs with different random seeds for calibration-set sampling, including standard deviations; (ii) paired t-test p-values for the claimed gains over baselines; (iii) explicit confirmation that the integer program (solved via a standard MIP solver) returned feasible solutions for all reported budgets and models, as the per-module bit bounds and total budget constraint are constructed to be satisfiable; and (iv) full specification of the Lagrangian penalty schedule, reconstruction loss weights, and calibration data (128 C4 samples) used in all experiments. These controls are identical across compared methods. We believe these additions allow readers to better evaluate the reliability of the results. revision: yes
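The rank-stability metrics named in the first response can be computed as in the sketch below. It is a dependency-free illustration for tie-free score vectors, not the authors' evaluation code; the two score lists are made-up values standing in for per-module preferences learned at two different target budgets.

```python
from itertools import combinations
from typing import List


def ranks(values: List[float]) -> List[int]:
    """Rank of each entry (0 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(a: List[float], b: List[float]) -> float:
    """Spearman rho for tie-free data: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(a)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def kendall_tau(a: List[float], b: List[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / number of pairs; assumes no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical per-module preference scores from two independent runs.
scores_low_budget = [0.9, 0.1, 0.5, 0.7, 0.3]
scores_high_budget = [0.8, 0.2, 0.6, 0.4, 0.1]
print(spearman(scores_low_budget, scores_high_budget))
print(kendall_tau(scores_low_budget, scores_high_budget))
```

Values near 1 would indicate that the two runs order the modules almost identically. With real scores that contain ties, a library routine that applies tie corrections would be the safer choice.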
Circularity Check
No circularity: optimization objective and IP projection are independent of final accuracy metric
The paper derives module-wise scores by optimizing a teacher-forced reconstruction loss subject to an augmented Lagrangian constraint, then solves an integer program for exact budget compliance. This chain does not reduce any claimed prediction or ranking to a fitted input by construction, nor does it rely on self-citation for uniqueness or load-bearing premises. The stability of the sensitivity ranking across budgets is presented as an empirical property enabling score reuse, not as a definitional or tautological result. Experiments compare against external baselines, keeping the derivation self-contained.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: GAMMA optimizes a teacher-forced hidden-state reconstruction objective under an augmented Lagrangian constraint, and projects the learned preferences into exact budget-feasible discrete assignments via integer programming.
- IndisputableMonolith/Foundation/BranchSelection.lean, branch_selection (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: because the learned preferences encode a stable sensitivity ranking rather than budget-specific weights, a single training run serves arbitrary deployment targets
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.