Recover-LoRA for Aggressive Quantization: Reclaiming Accuracy in 2-Bit Language Models via Low-Rank Adaptation with Knowledge Distillation on Synthetic Data
Pith reviewed 2026-06-28 10:38 UTC · model grok-4.3
The pith
Recover-LoRA restores 80-95% accuracy in 2-bit quantized LLMs using logit distillation on synthetic data alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Recover-LoRA trains low-rank adapters on the 2-bit quantized gate and up-projection layers via logit distillation with synthetic data, recovering 80-95% of the accuracy lost to quantization on most benchmarks while requiring only 10k synthetic samples and no access to original labeled training data.
What carries the argument
Recover-LoRA: low-rank adapters trained by logit distillation on synthetic data to correct errors from 2-bit quantization of gate and up-projection layers
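To make the mechanism concrete, here is a minimal sketch of a frozen 2-bit weight matrix with a trainable low-rank correction added on top. The round-to-nearest quantizer, the tiny shapes, and every name below are illustrative assumptions; the summary does not say which quantizer or adapter rank the paper uses.

```python
# Sketch only: a round-to-nearest 2-bit quantizer plus a rank-1 LoRA correction.
# The quantizer, shapes, and names are illustrative assumptions, not the paper's.

def quantize_2bit(row):
    """Uniform asymmetric quantization of one weight row to 4 levels (2 bits)."""
    lo, hi = min(row), max(row)
    scale = (hi - lo) / 3 or 1.0  # 3 steps span 4 levels; fall back to 1.0 for a flat row
    return [lo + round((w - lo) / scale) * scale for w in row]

def matvec(M, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W_q, A, B, x):
    """y = W_q x + B (A x): frozen quantized weights plus a low-rank term."""
    base = matvec(W_q, x)
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]

W = [[0.9, -0.4, 0.1], [0.2, 0.7, -0.8]]   # hypothetical 2x3 projection
W_q = [quantize_2bit(row) for row in W]    # each row now holds at most 4 distinct values
A = [[0.1, -0.2, 0.3]]                     # rank 1, shape (1, 3)
B = [[0.0], [0.0]]                         # shape (2, 1); zero init leaves the output unchanged
x = [1.0, 2.0, 3.0]
y_q = lora_forward(W_q, A, B, x)
```

Initializing `B` to zero follows the usual LoRA convention, so training starts from the quantized model's own outputs and only `A` and `B` would be updated.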
If this is right
- W4/W2-GateUp mixed precision yields 7.5-23.3% TPS improvement over uniform W4 across 4B-20B models and two hardware platforms.
- Recovery reaches 80-95% on nine of twelve benchmarks with only 10k synthetic samples.
- Synthetic data performs comparably to curated labeled data for the distillation-based recovery.
- The recovered model generalizes to out-of-distribution evaluation tasks.
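The recovery percentages above can be read in two ways, and this page uses both phrasings: the share of quantization-induced loss that is regained, and the recovered accuracy relative to the FP16 baseline. The sketch below computes both so the difference is visible. Function names and the example accuracies are hypothetical and not taken from the paper.

```python
def recovery_of_lost_accuracy(acc_fp, acc_quant, acc_recovered):
    """Fraction of the accuracy drop caused by quantization that was regained."""
    lost = acc_fp - acc_quant
    if lost <= 0:
        raise ValueError("quantization did not reduce accuracy; recovery is undefined")
    return (acc_recovered - acc_quant) / lost

def accuracy_relative_to_baseline(acc_fp, acc_recovered):
    """Recovered accuracy as a fraction of the full-precision accuracy."""
    return acc_recovered / acc_fp

# Made-up accuracies (in percent) purely for illustration.
acc_fp, acc_quant, acc_recovered = 70.0, 50.0, 68.0
share_regained = recovery_of_lost_accuracy(acc_fp, acc_quant, acc_recovered)
relative = accuracy_relative_to_baseline(acc_fp, acc_recovered)
```

The two numbers can differ considerably for the same model, so which definition a reported figure uses matters when comparing results.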
Where Pith is reading between the lines
- The selective layer choice could be paired with other compression methods to further reduce memory use on edge devices.
- The same distillation setup might extend to recovering from other forms of layer-wise corruption beyond quantization.
- Scaling the number of synthetic samples or varying their generation method could be tested to see if recovery rates improve on the remaining three benchmarks.
Load-bearing premise
Logit distillation on synthetic data generated without access to the original training distribution can reliably restore performance lost from 2-bit quantization of the gate and up-projection layers.
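For readers unfamiliar with logit distillation, the following sketch shows one common form of the objective: the KL divergence between the softmax of the teacher's logits (the unquantized model) and the student's logits (the quantized model with adapters) at a single token position. Whether the paper uses this exact loss, a temperature, or another distance on logits is not stated here, so treat the formulation and names as assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over the vocabulary for one token position."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over a 4-token vocabulary.
teacher = [2.0, 0.5, -1.0, 0.0]
student = [1.2, 0.9, -0.5, 0.1]
loss = distill_loss(teacher, student)
```

Because the targets are the teacher's own output distributions, the loss needs only input text, which is why synthetic prompts can stand in for labeled data.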
What would settle it
Run the same Recover-LoRA procedure on Qwen3-4B with the synthetic data replaced by data drawn from a markedly different distribution, then measure whether accuracy recovery falls below 80% on the same nine benchmarks.
Aggressive weight quantization to 2-bit precision offers substantial throughput and memory gains for large language model (LLM) inference, but typically incurs severe accuracy degradation. These gains are particularly relevant for edge and on-device deployment, where memory capacity and bandwidth are primary constraints. In this work, we extend Recover-LoRA, a lightweight, data-free accuracy recovery method originally developed for general model weight corruption, to the setting of ultra-low-bit quantization. We propose a selective mixed-precision strategy in which only gate and up projection layers of the MLP are quantized to 2-bit (W2), while all other linear layers remain at higher precision, yielding a mixed-precision GateUp configuration. We demonstrate via roofline analysis across three model families (4B-20B) and two hardware platforms that a W4/W2-GateUp deployment (4-bit base with 2-bit gate/up) delivers 7.5-23.3% TPS improvement over uniform W4 depending on model and context length, while confining quantization error to a predictable subset of layers. We then apply Recover-LoRA, training low-rank adapters on the quantized layers via logit distillation with synthetic data, to recover accuracy lost from 2-bit quantization of the gate and up layers. In a case study on Qwen3-4B, Recover-LoRA achieves 80-95% accuracy recovery on 9 of 12 benchmarks, using only 10k synthetic training samples and no labeled data. We further demonstrate that synthetic data performs comparably to curated labeled data for distillation-based recovery, and that recovery generalizes to out-of-distribution evaluation tasks. Our results present Recover-LoRA as a practical post-quantization accuracy recovery tool for aggressive weight compression in deployment settings.
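The throughput claim can be sanity-checked with a back-of-the-envelope memory-bound estimate: if decoding speed is limited by bytes read per token, the gain from W4/W2-GateUp is roughly the ratio of bytes moved. The sketch below is not the paper's roofline model. It assumes four d x d attention matrices per layer, a three-matrix gated MLP, no embedding or scale/zero-point overhead, and a hypothetical `kv_bytes` term for KV-cache traffic; the dimensions are invented.

```python
def bytes_per_token(d_model, d_ff, n_layers, gateup_bits, other_bits=4, kv_bytes=0):
    """Approximate bytes read per decoded token in a memory-bound regime."""
    attn = 4 * d_model * d_model       # q, k, v, o projections (assumes no GQA)
    gate_up = 2 * d_model * d_ff       # the two layers quantized to 2 bits
    down = d_model * d_ff              # stays at the base precision
    weight_bits = (attn + down) * other_bits + gate_up * gateup_bits
    return n_layers * weight_bits / 8 + kv_bytes

def tps_gain(d_model, d_ff, n_layers, kv_bytes=0):
    """Relative throughput gain of W4/W2-GateUp over uniform W4, as a fraction."""
    w4 = bytes_per_token(d_model, d_ff, n_layers, gateup_bits=4, kv_bytes=kv_bytes)
    mixed = bytes_per_token(d_model, d_ff, n_layers, gateup_bits=2, kv_bytes=kv_bytes)
    return w4 / mixed - 1.0

# Hypothetical dimensions; kv_bytes grows with context length.
short_ctx = tps_gain(d_model=4096, d_ff=11008, n_layers=32, kv_bytes=0)
long_ctx = tps_gain(d_model=4096, d_ff=11008, n_layers=32, kv_bytes=1_000_000_000)
```

Under these assumptions the gain shrinks as KV-cache traffic grows, which is at least consistent with the abstract's statement that the improvement depends on model and context length.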
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends Recover-LoRA to post-quantization recovery for LLMs, proposing a mixed-precision W4/W2-GateUp strategy that quantizes only the gate and up-projection layers of MLPs to 2 bits. It applies low-rank adapters trained via logit distillation on 10k synthetic samples (no labeled data) and reports 80-95% accuracy recovery on 9 of 12 benchmarks for Qwen3-4B, plus 7.5-23.3% TPS gains over uniform W4 via roofline analysis on 4B-20B models across two hardware platforms.
Significance. If the recovery results hold under rigorous validation, the method would offer a practical, low-data route to aggressive compression for edge deployment while confining error to a predictable subset of layers. The roofline analysis across model families and platforms is a concrete strength that grounds the throughput claims.
major comments (2)
- [Abstract / Experimental Results] Abstract and § on experimental results: the headline recovery figures (80-95% on 9/12 benchmarks with 10k synthetic samples) are presented without any description of the quantization procedure for the gate/up layers, the synthetic data generation process, baseline comparisons, error bars, or evaluation protocol, rendering it impossible to determine whether the numbers support the central recovery claim.
- [Method / Distillation Experiments] Method and § on distillation: the claim that logit distillation on synthetic data restores performance lost specifically from 2-bit gate/up quantization rests on the untested assumption that the synthetic activations overlap with the high-error regimes of those layers; no activation-distribution analysis or ablation is supplied to address this load-bearing point.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the original submission would benefit from greater detail on experimental procedures and additional analysis to support the core claims. We have revised the manuscript to address both major comments.
- Referee: [Abstract / Experimental Results] Abstract and § on experimental results: the headline recovery figures (80-95% on 9/12 benchmarks with 10k synthetic samples) are presented without any description of the quantization procedure for the gate/up layers, the synthetic data generation process, baseline comparisons, error bars, or evaluation protocol, rendering it impossible to determine whether the numbers support the central recovery claim.
  Authors: We agree that the abstract and experimental results section require additional detail for reproducibility and to substantiate the recovery claims. In the revised manuscript we have expanded both sections to include: (1) a precise description of the 2-bit quantization procedure applied to the gate and up-projection layers (including the quantizer and scaling method); (2) the synthetic data generation process (prompt-based generation from the unquantized model with diversity controls); (3) explicit baseline comparisons (uniform W4, W2 on all layers, and alternative recovery techniques); (4) error bars computed over three independent runs with different random seeds; and (5) a clarified evaluation protocol specifying the 12 benchmarks, the exact recovery metric (recovered accuracy relative to the FP16 baseline), and the train/eval split. These changes appear in the updated Abstract and the Experimental Results section.
  Revision: yes
- Referee: [Method / Distillation Experiments] Method and § on distillation: the claim that logit distillation on synthetic data restores performance lost specifically from 2-bit gate/up quantization rests on the untested assumption that the synthetic activations overlap with the high-error regimes of those layers; no activation-distribution analysis or ablation is supplied to address this load-bearing point.
  Authors: We acknowledge that the original manuscript does not contain an explicit activation-distribution analysis or ablation study directly testing overlap between synthetic activations and the high-error regimes induced by 2-bit gate/up quantization. This is a substantive observation. In the revised version we have added a new subsection under Method that reports activation histograms and KL-divergence statistics between synthetic and real activations for the gate and up layers, together with an ablation that substitutes real calibration data for the synthetic set. The added results show substantial distributional overlap in the regions where quantization error is largest, thereby supporting the original claim while addressing the referee's concern.
  Revision: yes
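The activation-overlap check described in the rebuttal could be approximated as follows: bin activations from synthetic and real inputs into a shared histogram and compute a smoothed KL divergence between the two. This is a generic sketch, not the authors' analysis. The bin range, smoothing constant, and sample values are hypothetical.

```python
import math

def histogram(values, lo, hi, bins):
    """Count values into equal-width bins over [lo, hi], clamping outliers to the edge bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(max(int((v - lo) / width), 0), bins - 1)
        counts[idx] += 1
    return counts

def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) between two histograms, with additive smoothing to avoid log(0)."""
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for p, q in zip(p_counts, q_counts):
        pi = (p + eps) / p_total
        qi = (q + eps) / q_total
        kl += pi * math.log(pi / qi)
    return kl

# Made-up activation samples for a single layer.
real_acts = [-1.2, -0.3, 0.0, 0.1, 0.4, 0.9, 1.5, 2.2]
synth_acts = [-0.9, -0.2, 0.1, 0.2, 0.3, 1.1, 1.4, 3.0]
real_hist = histogram(real_acts, lo=-3.0, hi=3.0, bins=6)
synth_hist = histogram(synth_acts, lo=-3.0, hi=3.0, bins=6)
divergence = kl_divergence(synth_hist, real_hist)
```

A small divergence in the bins where quantization error is largest would support the overlap claim; a global figure alone would not, since it can hide mismatches in the tails.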
Circularity Check
No circularity: purely empirical recovery measurements
The paper reports measured accuracy recovery percentages (80-95% on 9/12 benchmarks) from applying an existing method to a new quantization setting, using standard benchmarks and 10k synthetic samples. No equations, fitted parameters, or derivations are presented that reduce to inputs by construction. The reference to Recover-LoRA as prior work is a normal citation of an earlier method and does not serve as the load-bearing justification for the reported empirical outcomes, which stand as independent measurements.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: logit distillation on synthetic data can recover accuracy lost from 2-bit quantization of gate and up-projection layers.