ProjQ: Project-and-Quantize for Adapter-Aware LLM Compression

Chao Zhang; Li Wang; Merouane Debbah; Samson Lasaulce; Wenya Yu

arxiv: 2606.00494 · v2 · pith:SV3ILQS5new · submitted 2026-05-30 · 💻 cs.LG

Pith reviewed 2026-06-28 19:09 UTC · model grok-4.3

keywords: post-training quantization, LoRA, LLM compression, orthogonal projection, quantization noise, model plasticity, adapter-aware compression

The pith

By projecting quantization noise onto the low-rank manifold, ProjQ allows LoRA to correct more error than standard post-training quantization permits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard PTQ leaves quantization noise spread across the weights in a way that a limited-rank LoRA adapter cannot fully correct, so part of the adapter's capacity is spent on error it cannot remove. ProjQ inserts an orthogonal projection step that reshapes the dominant quantization noise into low-rank form before the adapter is applied. This offloads more of the error to the adapter while shrinking the residual in the orthogonal directions that LoRA cannot reach. The authors' theoretical analysis claims strictly higher model plasticity for later fine-tuning, and experiments on LLaMA-2, Qwen2.5, and Qwen3 report lower compensation loss and 3-bit language-modeling performance that matches standard 4-bit baselines.
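
The split between error that a rank-r adapter can and cannot reach can be illustrated in a few lines of NumPy. This is a minimal sketch, not the authors' method: `quantize_rtn` is a hypothetical symmetric round-to-nearest quantizer, `split_error` is a hypothetical helper, the matrix is random rather than a real LLM weight, and the error is measured in weight space even though Figure 1 describes activation-space error. By the Eckart-Young theorem the truncated SVD is the best rank-r approximation in Frobenius norm, so whatever remains is out of reach for any rank-r correction.

```python
import numpy as np


def quantize_rtn(W, bits=3):
    """Symmetric per-tensor round-to-nearest quantization (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale


def split_error(E, r):
    """Split E into its best rank-r approximation and the remainder."""
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    correctable = (U[:, :r] * s[:r]) @ Vt[:r]
    return correctable, E - correctable, s


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))      # stand-in for a weight matrix (assumption)
E = W - quantize_rtn(W, bits=3)        # quantization error of plain round-to-nearest
correctable, residual, s = split_error(E, r=8)

# Share of the error energy that no rank-8 correction can remove.
share = np.linalg.norm(residual) ** 2 / np.linalg.norm(E) ** 2
print(f"share of error energy outside the best rank-8 subspace: {share:.3f}")
```

For unstructured error like this, most of the energy sits in the residual; the pitch of ProjQ, as summarized above, is to shift that balance toward the correctable part.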

Core claim

ProjQ constrains quantization noise to the low-rank manifold via orthogonal subspace projection and derives an efficient alternating algorithm that shapes the noise into low-rank structure. This offloads dominant error components to the subsequent adapter while minimizing the residual error in the orthogonal uncorrectable subspace, preserving strictly greater model plasticity for downstream tasks compared to standard PTQ.
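
The paper's update rules are not given in the text reviewed here, so the following is only a generic stand-in for an alternating scheme of this kind: quantize the weights with the low-rank term held fixed, then refit the low-rank term to the new error by truncated SVD. All names (`make_grid_quantizer`, `best_rank_r`, `alternate`) and the fixed round-to-nearest grid are assumptions made for illustration; ProjQ's orthogonal projection and any activation weighting are not modeled.

```python
import numpy as np


def make_grid_quantizer(W, bits=3):
    """Round-to-nearest onto a fixed symmetric grid whose scale is set once from W."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax

    def quantize(M):
        return np.clip(np.round(M / scale), -qmax - 1, qmax) * scale

    return quantize


def best_rank_r(M, r):
    """Best rank-r approximation in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]


def alternate(W, r=8, bits=3, iters=5):
    """Generic alternating loop: quantize W - L, then refit L to the new error."""
    quantize = make_grid_quantizer(W, bits)
    L = np.zeros_like(W)
    history = []
    for _ in range(iters):
        Q = quantize(W - L)          # quantization step, low-rank term held fixed
        L = best_rank_r(W - Q, r)    # low-rank step, quantized weights held fixed
        history.append(np.linalg.norm(W - Q - L))
    return Q, L, history


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))    # toy matrix, not a real LLM weight
Q, L, history = alternate(W)
print([round(h, 4) for h in history])
```

With a fixed grid, each step either lowers the residual norm of `W - Q - L` or leaves it unchanged, which is why `history` is non-increasing in this toy setting. That property belongs to this sketch only and says nothing about the convergence of the authors' algorithm.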

What carries the argument

orthogonal subspace projection that moves dominant quantization noise components onto the low-rank manifold for adapter correction

If this is right

Up to 2 times lower evaluation loss during quantization error compensation.
3-bit quantized models reach the language-modeling performance of standard 4-bit baselines.
Consistent gains on LLaMA-2, Qwen2.5, and Qwen3 for both compensation accuracy and task fine-tuning.
Adapter capacity goes to task improvement rather than noise cleanup.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same projection idea could be tested with other low-rank adapters to see whether the plasticity gain is specific to LoRA.
If the noise-shaping step works at even lower bit widths, it might enable viable 2-bit models on the same hardware.
Measuring the rank of the residual error after projection on new model families would test how generally the low-rank assumption holds.
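
One simple way to run such a measurement, sketched here under assumptions: treat the fraction of squared Frobenius norm carried by the top-r singular values as the share of error a rank-r adapter could absorb. `low_rank_energy_fraction` is a hypothetical helper, and the matrices below are toy data, not residuals from any model.

```python
import numpy as np


def low_rank_energy_fraction(E, r):
    """Fraction of the squared Frobenius norm of E held by its top-r singular values."""
    s = np.linalg.svd(E, compute_uv=False)
    total = np.sum(s ** 2)
    return float(np.sum(s[:r] ** 2) / total) if total > 0 else 1.0


# Toy check: singular values 4 and 3, so rank 1 holds 16 / 25 of the energy.
E_toy = np.diag([4.0, 3.0])
print(low_rank_energy_fraction(E_toy, r=1))

# Unstructured noise spreads its energy over many directions.
rng = np.random.default_rng(0)
E_noise = rng.standard_normal((256, 256))
print(low_rank_energy_fraction(E_noise, r=16))
```

A value near 1 at the adapter's rank would support the low-rank assumption for that layer; a value close to r divided by the matrix dimension would suggest the error is still diffuse.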

Load-bearing premise

Quantization noise from standard PTQ is spread out enough that LoRA cannot correct it, yet an orthogonal projection can relocate the main components onto the low-rank manifold without creating new uncorrectable residuals.

What would settle it

A side-by-side run of ProjQ and standard PTQ followed by identical LoRA fine-tuning on the same downstream tasks, checking whether the measured plasticity gap vanishes or reverses.

Figures

Figures reproduced from arXiv: 2606.00494 by Chao Zhang, Li Wang, Merouane Debbah, Samson Lasaulce, Wenya Yu.

**Figure 1.** Overview of the proposed framework ProjQ. ProjQ shapes activation-space quantization error into a low-rank subspace that LoRA can correct easily, while minimizing the uncorrectable orthogonal residual.

Original abstract

Post-Training Quantization (PTQ) and Low-Rank Adaptation (LoRA) constitute the standard pipeline for efficient Large Language Model (LLM) deployment. However, applying them sequentially poses a problem: PTQ often leaves behind random noise that is spread out (across the model's weights) in a way LoRA can't easily fix, meaning that LoRA ends up wasting its limited capacity trying to fix uncorrectable noise instead of improving task performance. In this paper, we propose ProjQ, a novel framework for constraining quantization noise to the low-rank manifold via orthogonal subspace projection. We derive an efficient alternating algorithm that shapes the quantization noise into a low-rank structure, effectively offloading dominant error components to the subsequent adapter while minimizing the residual error in the orthogonal "uncorrectable" subspace. Our theoretical analysis demonstrates that ProjQ preserves strictly greater model plasticity for downstream tasks compared to standard PTQ. Extensive experiments on LLaMA-2, Qwen2.5 and Qwen3 confirm that ProjQ consistently outperforms existing methods in both quantization error compensation and downstream task fine-tuning, achieving up to 2× lower evaluation loss for compensation and matching the performance of standard 4-bit baselines on language modeling tasks with only 3 bits. The code is available at https://github.com/yy9301/ProjQ.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ProjQ adds an orthogonal projection before quantization to steer noise into LoRA's low-rank space, with experiments showing gains, but the 'strictly greater plasticity' claim depends on an unverified noise model.

The core move is projecting quantization noise onto a low-rank subspace so that the subsequent LoRA adapter can absorb more of it instead of fighting diffuse error. They pair this with an alternating algorithm that shapes the noise during quantization.

What stands out is the explicit targeting of the PTQ-LoRA mismatch. Most pipelines quantize first and adapt later; this one tries to make the quantization step aware of the adapter's limited capacity by moving dominant error components into the correctable subspace. The experiments on LLaMA-2, Qwen2.5, and Qwen3 report up to 2x lower compensation loss and downstream performance that matches 4-bit baselines at 3 bits. Code release helps.

The soft spot is the theoretical part. The abstract asserts a rigorous demonstration that ProjQ preserves strictly greater plasticity, yet this hinges on the projection reducing the uncorrectable residual without creating new error in the orthogonal complement. No error model, bound, or derivation appears in the provided text, so it is unclear whether the inequality follows from the construction or from an implicit assumption that standard PTQ noise is sufficiently isotropic. If that assumption fails on real weight distributions, the claimed advantage shrinks.

The work is aimed at groups already running PTQ-then-LoRA pipelines and looking for incremental improvements in bit-width. A reader focused on practical compression would find the method and numbers worth checking.

Send it to review. The pipeline problem is real, the experiments use standard models, and referees can test whether the projection step delivers the stated error reduction under actual conditions.

Referee Report

2 major / 2 minor

Summary. The paper proposes ProjQ, a PTQ method that uses orthogonal subspace projection to constrain quantization noise to the low-rank manifold compatible with subsequent LoRA adapters. It introduces an efficient alternating algorithm to shape the noise, claims a theoretical result that this yields strictly greater downstream model plasticity than standard PTQ, and reports experiments on LLaMA-2, Qwen2.5 and Qwen3 showing up to 2× lower compensation loss and matching 4-bit baselines at 3 bits. Code is released.

Significance. If the plasticity theorem holds and the alternating procedure is stable, the method could meaningfully improve the PTQ+LoRA pipeline by reducing wasted adapter capacity on uncorrectable noise. The explicit code release is a clear strength for reproducibility.

major comments (2)

[Theoretical analysis (abstract and §3–4)] The claim that ProjQ 'preserves strictly greater model plasticity' is asserted without the error model, the precise definition of plasticity (e.g., bound on fine-tuning loss or effective rank in the orthogonal complement), or the inequality derivation. It is therefore impossible to confirm that the result follows from the projection construction rather than from the unstated assumption that the projection step is lossless w.r.t. the uncorrectable subspace.
[§4 (alternating algorithm)] The description of how the orthogonal projection and quantization steps are alternated does not include a convergence argument or a bound showing that the residual norm in the orthogonal complement is strictly smaller than under standard PTQ; without this, the 'strictly greater plasticity' claim remains unverified.
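
For concreteness, here is one way the requested bound could be stated. This is a sketch under assumed notation that does not appear in the provided text: $W$ is the full-precision weight, $Q$ the quantized weight, $E = W - Q$ the quantization error, and $r$ the adapter rank. The error is written in weight space for simplicity, whereas Figure 1 refers to activation-space error. The second line is the Eckart-Young theorem; the third line is the inequality the authors would need to prove, and nothing here establishes it.

```latex
\begin{align}
E &= W - Q, \qquad E = \sum_{i} \sigma_i u_i v_i^{\top}, \quad \sigma_1 \ge \sigma_2 \ge \dots \ge 0, \\
\rho_r(E) &:= \min_{\operatorname{rank}(L) \le r} \lVert E - L \rVert_F^2 = \sum_{i > r} \sigma_i^2, \\
\rho_r\bigl(E_{\mathrm{ProjQ}}\bigr) &< \rho_r\bigl(E_{\mathrm{PTQ}}\bigr) \quad \text{(the claim to be proved)}.
\end{align}
```

Here $\rho_r(E)$ is the error energy that no rank-$r$ correction can remove, so a smaller value for ProjQ than for standard PTQ would be the residual-norm statement both major comments ask for.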

minor comments (2)

[Abstract] The abstract states 'up to 2× lower evaluation loss' but does not specify the exact metric or baseline; a table or equation reference would clarify the comparison.
[§2–3] Notation for the low-rank manifold and the orthogonal complement is introduced without an explicit definition or diagram; a short notation table or figure would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important gaps in the presentation of the theoretical results, which we will address through targeted revisions to strengthen the manuscript.

Referee: Theoretical analysis (abstract and §3–4): the claim that ProjQ 'preserves strictly greater model plasticity' is asserted without the error model, the precise definition of plasticity (e.g., bound on fine-tuning loss or effective rank in the orthogonal complement), or the inequality derivation. It is therefore impossible to confirm that the result follows from the projection construction rather than from the unstated assumption that the projection step is lossless w.r.t. the uncorrectable subspace.

Authors: We agree that the current presentation of the theoretical analysis is incomplete. In the revised manuscript we will explicitly introduce the quantization error model, define model plasticity as the achievable reduction in downstream fine-tuning loss within the orthogonal complement, and provide the full step-by-step derivation of the strict inequality. The derivation will show that the improvement follows directly from the orthogonal projection step without invoking any lossless assumption on the uncorrectable subspace. (Revision: yes)
Referee: §4 (alternating algorithm): the description of how the orthogonal projection and quantization steps are alternated does not include a convergence argument or a bound showing that the residual norm in the orthogonal complement is strictly smaller than under standard PTQ; without this, the 'strictly greater plasticity' claim remains unverified.

Authors: We acknowledge that a convergence argument and a comparative residual-norm bound are necessary to substantiate the claim. The revised version will include a convergence proof for the alternating procedure (under standard Lipschitz assumptions on the quantization operator) together with an explicit bound establishing that the residual norm in the orthogonal complement is strictly smaller than the corresponding quantity under standard PTQ. (Revision: yes)

Circularity Check

0 steps flagged

No circularity: theoretical claim stated without equations reducing to input by construction

The provided abstract and description assert a theoretical analysis showing strictly greater plasticity but supply no equations, error bounds, or derivations. No self-citations, fitted parameters renamed as predictions, ansatzes smuggled via citation, or self-definitional steps are quoted. The central claim is presented as independently derived from the ProjQ construction; absent any exhibited reduction (e.g., Eq. X = Eq. Y by construction), the derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that standard quantization noise is uncorrectable by LoRA and that projection can move dominant error components without new side effects. No free parameters or invented entities are mentioned in the abstract.

axioms (1)

Domain assumption: PTQ leaves behind random noise spread across weights in a way that LoRA cannot easily fix.
Explicitly stated in the abstract as the core problem motivating the projection step.

Reference graph

Works this paper leans on

69 extracted references · 22 canonical work pages · 12 internal anchors

[2]

Piqa: Reasoning about physical commonsense in natural language

Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.\ 7432--7439, 2020

2020
[3]

Chee, J., Cai, Y., Kuleshov, V., and De Sa, C. M. Quip: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems, 36: 0 4396--4429, 2023

2023
[6]

Qlora: Efficient finetuning of quantized llms

Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36: 0 10088--10115, 2023

2023
[9]

J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al. Lora: Low-rank adaptation of large language models. ICLR, 1 0 (2): 0 3, 2022

2022
[11]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration

Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., and Han, S. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of machine learning and systems, 6: 0 87--100, 2024

2024
[13]

A., MacIntyre, R., Bies, A., Ferguson, M., Katz, K., and Schasberger, B

Marcus, M., Kim, G., Marcinkiewicz, M. A., MacIntyre, R., Bies, A., Ferguson, M., Katz, K., and Schasberger, B. The penn treebank: Annotating predicate argument structure. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994, 1994

1994
[15]

A corpus and cloze evaluation for deeper understanding of commonsense stories

Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P., and Allen, J. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.\ 839--849, 2016

2016
[16]

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21 0 (140): 0 1--67, 2020

2020
[17]

J., and Pilanci, M

Saha, R., Sagan, N., Srivastava, V., Goldsmith, A. J., and Pilanci, M. Compressing large language models using low rank and low precision decomposition. Advances in Neural Information Processing Systems, 37: 0 88981--89018, 2024

2024
[18]

Commonsenseqa: A question answering challenge targeting commonsense knowledge

Talmor, A., Herzig, J., Lourie, N., and Berant, J. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.\ 4149--4158, 2019

2019
[20]

Svd-llm: Truncation-aware singular value decomposition for large language model compression

Wang, X., Zheng, Y., Wan, Z., and Zhang, M. Svd-llm: Truncation-aware singular value decomposition for large language model compression. ICLR, 2025

2025
[21]

Qa-lora: Quantization-aware low-rank adaptation of large language models

Xu, Y., Xie, L., Gu, X., Chen, X., Chang, H., Zhang, H., Chen, Z., Zhang, X., and Tian, Q. Qa-lora: Quantization-aware low-rank adaptation of large language models. ICLR, 2024

2024
[22]

Lowra: Accurate and efficient lora fine-tuning of llms under 2 bits

Zhou, Z., Zhang, Q., Kumbong, H., and Olukotun, K. Lowra: Accurate and efficient lora fine-tuning of llms under 2 bits. ICML, 2025

2025
[23]

Large language models in 6g from standard to on-device networks

Zou, H., Zhao, Q., Lasaulce, S., Zhang, C., Tian, Y., Bariah, L., Bader, F., and Debbah, M. Large language models in 6g from standard to on-device networks. Nature Reviews Electrical Engineering, pp.\ 1--12, 2026

2026
[24]

Nature Reviews Electrical Engineering , pages=

Large language models in 6G from standard to on-device networks , author=. Nature Reviews Electrical Engineering , pages=. 2026 , publisher=

2026
[25]

Proceedings of machine learning and systems , volume=

Awq: Activation-aware weight quantization for on-device llm compression and acceleration , author=. Proceedings of machine learning and systems , volume=
[26]

ICLR , year=

Svd-llm: Truncation-aware singular value decomposition for large language model compression , author=. ICLR , year=
[27]

arXiv preprint arXiv:2410.21271 , year=

EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation , author=. arXiv preprint arXiv:2410.21271 , year=

work page arXiv
[28]

ICML , year=

LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits , author=. ICML , year=
[29]

ICLR , year=

QA-LoRA: Quantization-aware low-rank adaptation of large language models , author=. ICLR , year=
[30]

Advances in Neural Information Processing Systems , volume=

Quip: 2-bit quantization of large language models with guarantees , author=. Advances in Neural Information Processing Systems , volume=
[31]

Omniquant: Omnidirectionally calibrated quantization for large language models.arXiv preprint arXiv:2308.13137, 2023

Omniquant: Omnidirectionally calibrated quantization for large language models , author=. arXiv preprint arXiv:2308.13137 , year=

work page arXiv
[32]

arXiv preprint arXiv:2303.08302 , year=

A comprehensive study on post-training quantization for large language models , author=. arXiv preprint arXiv:2303.08302 , year=

work page arXiv
[33]

A Survey of Large Language Models

A survey of large language models , author=. arXiv preprint arXiv:2303.18223 , volume=

work page internal anchor Pith review Pith/arXiv arXiv
[34]

ACM Computing Surveys , volume=

Towards efficient generative large language model serving: A survey from algorithms to systems , author=. ACM Computing Surveys , volume=. 2025 , publisher=

2025
[35]

arXiv preprint arXiv:2411.06084 , year=

Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques , author=. arXiv preprint arXiv:2411.06084 , year=

work page arXiv
[36]

Applied Intelligence , volume=

A comprehensive review of model compression techniques in machine learning , author=. Applied Intelligence , volume=. 2024 , publisher=

2024
[37]

Advances in neural information processing systems , volume=

Post training 4-bit quantization of convolutional networks for rapid-deployment , author=. Advances in neural information processing systems , volume=
[38]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Quantization and training of neural networks for efficient integer-arithmetic-only inference , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
[39]

European conference on computer vision , pages=

Post-training piecewise linear quantization for deep neural networks , author=. European conference on computer vision , pages=. 2020 , organization=

2020
[40]

International conference on machine learning , pages=

Accurate post training quantization with small calibration sets , author=. International conference on machine learning , pages=. 2021 , organization=

2021
[41]

Advances in Neural Information Processing Systems , volume=

Optimal brain compression: A framework for accurate post-training quantization and pruning , author=. Advances in Neural Information Processing Systems , volume=
[42]

Advances in neural information processing systems , volume=

Qlora: Efficient finetuning of quantized llms , author=. Advances in neural information processing systems , volume=
[43]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Gptq: Accurate post-training quantization for generative pre-trained transformers , author=. arXiv preprint arXiv:2210.17323 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[44]

arXiv preprint arXiv:2501.18475 , year=

Cloq: Enhancing fine-tuning of quantized llms via calibrated lora initialization , author=. arXiv preprint arXiv:2501.18475 , year=

work page arXiv
[45]

arXiv preprint arXiv:2505.03802 , year=

Efficient fine-tuning of quantized models via adaptive rank and bitwidth , author=. arXiv preprint arXiv:2505.03802 , year=

work page arXiv
[46]

On-the-Fly Adaptation to Quantization: Configuration-Aware LoRA for Efficient Fine-Tuning of Quantized LLMs

On-the-Fly Adaptation to Quantization: Configuration-Aware LoRA for Efficient Fine-Tuning of Quantized LLMs , author=. arXiv preprint arXiv:2509.25214 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[47]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

RILQ: Rank-Insensitive LoRA-based Quantization Error Compensation for Boosting 2-bit Large Language Model Accuracy , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[48]

arXiv preprint arXiv:2505.21895 , year=

Compressing Sine-Activated Low-Rank Adapters through Post-Training Quantization , author=. arXiv preprint arXiv:2505.21895 , year=

work page arXiv
[49]

5-VL , author=

Efficient Fine-Tuning of Multimodal Language Models for Medical AI via LoRA and 4-bit Quantization on Qwen2. 5-VL , author=. 2025 7th International Conference on Data-driven Optimization of Complex Systems (DOCS) , pages=. 2025 , organization=

2025
[50]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Controlled low-rank adaptation with subspace regularization for continued training on large language models , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[51]

2025 62nd ACM/IEEE Design Automation Conference (DAC) , pages=

DuQTTA: Dual Quantized Tensor-Train Adaptation with Decoupling Magnitude-Direction for Efficient Fine-Tuning of LLMs , author=. 2025 62nd ACM/IEEE Design Automation Conference (DAC) , pages=. 2025 , organization=

2025
[52]

IEEE transactions on pattern analysis and machine intelligence , volume=

Robust recovery of subspace structures by low-rank representation , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2012 , publisher=

2012
[53]

arXiv preprint arXiv:2311.12023 , year=

Lq-lora: Low-rank plus quantized matrix decomposition for efficient language model finetuning , author=. arXiv preprint arXiv:2311.12023 , year=

work page arXiv
[54]

, author=

Lora: Low-rank adaptation of large language models. , author=. ICLR , volume=
[55]

Expert Systems with Applications , volume=

Imbalanced complemented subspace representation with adaptive weight learning , author=. Expert Systems with Applications , volume=. 2024 , publisher=

2024
[56]

OPT: Open Pre-trained Transformer Language Models

Opt: Open pre-trained transformer language models , author=. arXiv preprint arXiv:2205.01068 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[57]

LLaMA: Open and Efficient Foundation Language Models

Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[58]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Llama 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[59]

Qwen Technical Report

Qwen technical report , author=. arXiv preprint arXiv:2309.16609 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[60]

Journal of machine learning research , volume=

Exploring the limits of transfer learning with a unified text-to-text transformer , author=. Journal of machine learning research , volume=
[61]

Proceedings of the AAAI conference on artificial intelligence , volume=

Piqa: Reasoning about physical commonsense in natural language , author=. Proceedings of the AAAI conference on artificial intelligence , volume=
[62]

Pointer Sentinel Mixture Models

Pointer sentinel mixture models , author=. arXiv preprint arXiv:1609.07843 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[63]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Think you have solved question answering? try arc, the ai2 reasoning challenge , author=. arXiv preprint arXiv:1803.05457 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[64]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Boolq: Exploring the surprising difficulty of natural yes/no questions , author=. arXiv preprint arXiv:1905.10044 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1905
[65]

Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994 , year=

The penn treebank: Annotating predicate argument structure , author=. Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994 , year=

1994
[66]

Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages=

A corpus and cloze evaluation for deeper understanding of commonsense stories , author=. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages=

2016
[67]

GitHub repository , howpublished =

ModelCloud.ai , title =. GitHub repository , howpublished =
[68]

arXiv e-prints , pages=

The llama 3 herd of models , author=. arXiv e-prints , pages=
[69]

arXiv preprint arXiv:2310.08659 , year=

Loftq: Lora-fine-tuning-aware quantization for large language models , author=. arXiv preprint arXiv:2310.08659 , year=

work page arXiv
[70]

tinybenchmarks: Evaluating LLMs with fewer examples

tinyBenchmarks: evaluating LLMs with fewer examples , author=. arXiv preprint arXiv:2402.14992 , year=

work page arXiv
[71]

2023 , publisher=

Stanford alpaca: An instruction-following llama model , author=. 2023 , publisher=

2023
[72]

Commonsense\_170k Dataset , author =
[73]

Training Verifiers to Solve Math Word Problems

Training verifiers to solve math word problems , author=. arXiv preprint arXiv:2110.14168 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[74]

Measuring Massive Multitask Language Understanding

Measuring massive multitask language understanding , author=. arXiv preprint arXiv:2009.03300 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2009
[75]

Advances in neural information processing systems , volume=

Pytorch: An imperative style, high-performance deep learning library , author=. Advances in neural information processing systems , volume=
[76]

LoRDQ: activation-aware Low-Rank Decomposition and Quantization for Large Language Model Compression , author=
[77]

Commonsenseqa: A question answering challenge targeting commonsense knowledge , author=. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages=

2019
[78]

Advances in Neural Information Processing Systems , volume=

Compressing large language models using low rank and low precision decomposition , author=. Advances in Neural Information Processing Systems , volume=

[1] [2]

Piqa: Reasoning about physical commonsense in natural language

Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.\ 7432--7439, 2020

2020

[2] [3]

Chee, J., Cai, Y., Kuleshov, V., and De Sa, C. M. Quip: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems, 36: 0 4396--4429, 2023

2023

[3] [6]

Qlora: Efficient finetuning of quantized llms

Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36: 0 10088--10115, 2023

2023

[4] [9]

J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al. Lora: Low-rank adaptation of large language models. ICLR, 1 0 (2): 0 3, 2022

2022

[5] [11]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration

Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., and Han, S. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of machine learning and systems, 6: 0 87--100, 2024

2024

[6] [13]

A., MacIntyre, R., Bies, A., Ferguson, M., Katz, K., and Schasberger, B

Marcus, M., Kim, G., Marcinkiewicz, M. A., MacIntyre, R., Bies, A., Ferguson, M., Katz, K., and Schasberger, B. The penn treebank: Annotating predicate argument structure. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994, 1994

1994

[7] [15]

A corpus and cloze evaluation for deeper understanding of commonsense stories

Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P., and Allen, J. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.\ 839--849, 2016

2016

[8] [16]

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21 0 (140): 0 1--67, 2020

2020

[9] [17]

J., and Pilanci, M

Saha, R., Sagan, N., Srivastava, V., Goldsmith, A. J., and Pilanci, M. Compressing large language models using low rank and low precision decomposition. Advances in Neural Information Processing Systems, 37: 0 88981--89018, 2024

2024

[10] [18]

Talmor, A., Herzig, J., Lourie, N., and Berant, J. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158, 2019.

[11] [20]

Wang, X., Zheng, Y., Wan, Z., and Zhang, M. SVD-LLM: Truncation-aware singular value decomposition for large language model compression. ICLR, 2025.

[12] [21]

Xu, Y., Xie, L., Gu, X., Chen, X., Chang, H., Zhang, H., Chen, Z., Zhang, X., and Tian, Q. QA-LoRA: Quantization-aware low-rank adaptation of large language models. ICLR, 2024.

[13] [22]

Zhou, Z., Zhang, Q., Kumbong, H., and Olukotun, K. LowRA: Accurate and efficient LoRA fine-tuning of LLMs under 2 bits. ICML, 2025.

[14] [23]

Zou, H., Zhao, Q., Lasaulce, S., Zhang, C., Tian, Y., Bariah, L., Bader, F., and Debbah, M. Large language models in 6G from standard to on-device networks. Nature Reviews Electrical Engineering, pp. 1–12, 2026.

[15] [24]

Large language models in 6G from standard to on-device networks. Nature Reviews Electrical Engineering, 2026.

[16] [25]

AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. Proceedings of Machine Learning and Systems.

[17] [26]

SVD-LLM: Truncation-aware singular value decomposition for large language model compression. ICLR.

[18] [27]

EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation. arXiv preprint arXiv:2410.21271.

[19] [28]

LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits. ICML.

[20] [29]

QA-LoRA: Quantization-aware low-rank adaptation of large language models. ICLR.

[21] [30]

QuIP: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems.

[22] [31]

OmniQuant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137, 2023.

[23] [32]

A comprehensive study on post-training quantization for large language models. arXiv preprint arXiv:2303.08302.

[24] [33]

A survey of large language models. arXiv preprint arXiv:2303.18223.

[25] [34]

Towards efficient generative large language model serving: A survey from algorithms to systems. ACM Computing Surveys, 2025.

[26] [35]

Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques. arXiv preprint arXiv:2411.06084.

[27] [36]

A comprehensive review of model compression techniques in machine learning. Applied Intelligence, 2024.

[28] [37]

Post training 4-bit quantization of convolutional networks for rapid-deployment. Advances in Neural Information Processing Systems.

[29] [38]

Quantization and training of neural networks for efficient integer-arithmetic-only inference. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[30] [39]

Post-training piecewise linear quantization for deep neural networks. European Conference on Computer Vision, 2020.

[31] [40]

Accurate post training quantization with small calibration sets. International Conference on Machine Learning, 2021.

[32] [41]

Optimal brain compression: A framework for accurate post-training quantization and pruning. Advances in Neural Information Processing Systems.

[33] [42]

QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems.

[34] [43]

GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.

[35] [44]

CLoQ: Enhancing fine-tuning of quantized LLMs via calibrated LoRA initialization. arXiv preprint arXiv:2501.18475.

[36] [45]

Efficient fine-tuning of quantized models via adaptive rank and bitwidth. arXiv preprint arXiv:2505.03802.

[37] [46]

On-the-Fly Adaptation to Quantization: Configuration-Aware LoRA for Efficient Fine-Tuning of Quantized LLMs. arXiv preprint arXiv:2509.25214.

[38] [47]

RILQ: Rank-Insensitive LoRA-based Quantization Error Compensation for Boosting 2-bit Large Language Model Accuracy. Proceedings of the AAAI Conference on Artificial Intelligence.

[39] [48]

Compressing Sine-Activated Low-Rank Adapters through Post-Training Quantization. arXiv preprint arXiv:2505.21895.

[40] [49]

Efficient Fine-Tuning of Multimodal Language Models for Medical AI via LoRA and 4-bit Quantization on Qwen2.5-VL. 2025 7th International Conference on Data-driven Optimization of Complex Systems (DOCS), 2025.

[41] [50]

Controlled low-rank adaptation with subspace regularization for continued training on large language models. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[42] [51]

DuQTTA: Dual Quantized Tensor-Train Adaptation with Decoupling Magnitude-Direction for Efficient Fine-Tuning of LLMs. 2025 62nd ACM/IEEE Design Automation Conference (DAC), 2025.

[43] [52]

Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.

[44] [53]

LQ-LoRA: Low-rank plus quantized matrix decomposition for efficient language model finetuning. arXiv preprint arXiv:2311.12023.

[45] [54]

LoRA: Low-rank adaptation of large language models. ICLR.

[46] [55]

Imbalanced complemented subspace representation with adaptive weight learning. Expert Systems with Applications, 2024.

[47] [56]

OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

[48] [57]

LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

[49] [58]

Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

[50] [59]

Qwen technical report. arXiv preprint arXiv:2309.16609.

[51] [60]

Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research.

[52] [61]

PIQA: Reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence.

[53] [62]

Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

[54] [63]

Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.

[55] [64]

BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044.

[56] [65]

The Penn Treebank: Annotating predicate argument structure. Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994, 1994.

[57] [66]

A corpus and cloze evaluation for deeper understanding of commonsense stories. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.

[58] [67]

ModelCloud.ai. GitHub repository.

[59] [68]

The Llama 3 herd of models. arXiv e-prints.

[60] [69]

LoftQ: LoRA-fine-tuning-aware quantization for large language models. arXiv preprint arXiv:2310.08659.

[61] [70]

tinyBenchmarks: evaluating LLMs with fewer examples. arXiv preprint arXiv:2402.14992.

[62] [71]

Stanford Alpaca: An instruction-following LLaMA model. 2023.

[63] [72]

Commonsense_170k Dataset.

[64] [73]

Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

[65] [74]

Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.

[66] [75]

PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems.

[67] [76]

LoRDQ: activation-aware Low-Rank Decomposition and Quantization for Large Language Model Compression.

[68] [77]

CommonsenseQA: A question answering challenge targeting commonsense knowledge. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.

[69] [78]

Compressing large language models using low rank and low precision decomposition. Advances in Neural Information Processing Systems.