Multi-Bitwidth Quantization for LLMs Using Additive Codebooks

Ashish Khisti; Liza Babaoglu; Shuangyi Chen

arXiv: 2606.12876 · v1 · pith:WONGK2HLnew · submitted 2026-06-11 · cs.LG · cs.CL · cs.IT · math.IT

Pith reviewed 2026-06-27 07:27 UTC · model grok-4.3

classification cs.LG · cs.CL · cs.IT · math.IT

keywords multi-bitwidth quantization, additive codebooks, post-training quantization, LLM quantization, successive refinement, Matryoshka supervision, Gaussian weights, weighted MSE

The pith

LLM weights following a Gaussian distribution can be reconstructed with increasing fidelity at multiple bitwidths from one model via additive codebooks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that a single trained LLM checkpoint can deliver inference at several different precisions without retraining or storing separate models. It argues that because typical LLM weights are Gaussian, successive addition of bits yields steadily better reconstructions when measured by a weighted mean squared error tied to actual LLM loss. The approach trains additive codebooks under Matryoshka-style supervision so that ordered subsets of those codebooks produce usable partial reconstructions at each target bitwidth. If correct, this removes the need to keep multiple quantized versions of the same model for different hardware constraints.

Core claim

The paper establishes that LLM weights, which commonly follow a Gaussian distribution, can be optimally reconstructed with increasing fidelity as additional bits are incorporated under a weighted mean squared error distortion motivated by LLM loss functions. This is realized in practice through additive codebooks and Matryoshka-style supervision in the loss function, producing a single model where ordered subsets of codebooks yield accurate partial reconstructions at each precision level.

What carries the argument

Additive codebooks trained with Matryoshka-style supervision to realize successive refinement of Gaussian weight reconstructions under weighted MSE.
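
To make the "ordered subsets of codebooks" idea concrete, here is a minimal NumPy sketch of additive reconstruction from a prefix of codebooks. The array layout, the function name `reconstruct`, and the sizes are assumptions chosen for illustration; the page does not specify the paper's actual data layout.

```python
import numpy as np

def reconstruct(codebooks, codes, num_used):
    """Sum one selected codeword from each of the first `num_used` codebooks.

    codebooks: (M, K, g) array, M codebooks with K codewords of length g
    codes:     (N, M) integer array, one codeword index per weight group and codebook
    returns:   (N, g) array of reconstructed weight groups
    """
    out = np.zeros((codes.shape[0], codebooks.shape[2]))
    for m in range(num_used):
        # Fancy indexing picks, for every group, the codeword chosen in codebook m.
        out += codebooks[m, codes[:, m]]
    return out

# Hypothetical sizes and random data, for illustration only.
rng = np.random.default_rng(0)
num_codebooks, num_codewords, group_size, num_groups = 3, 16, 8, 5
codebooks = rng.normal(size=(num_codebooks, num_codewords, group_size))
codes = rng.integers(0, num_codewords, size=(num_groups, num_codebooks))

coarse = reconstruct(codebooks, codes, 1)  # lowest precision: first codebook only
full = reconstruct(codebooks, codes, 3)    # highest precision: all codebooks
```

Dropping a codebook at inference corresponds to lowering `num_used`; the stored codes and codebooks are unchanged.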

If this is right

A single checkpoint can serve multiple bitwidths, eliminating the storage cost of separate quantized models.
Inference-time selection of precision becomes possible without any retraining step.
Competitive perplexity and downstream accuracy are retained on Qwen, LLaMA, Gemma, and Mistral architectures.
Memory overhead during deployment drops because only one set of codebooks is stored.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same additive structure might support on-the-fly bitwidth changes inside a single forward pass on heterogeneous hardware.
If weight distributions deviate from Gaussian, the successive-refinement guarantee would require a different codebook design or distortion measure.
Deployment pipelines could replace multiple quantized checkpoints with one model plus a small set of additive codebooks.
The method opens a route to unified training that targets several hardware tiers simultaneously.

Load-bearing premise

LLM weights follow a Gaussian distribution and weighted mean squared error is the right distortion measure for measuring reconstruction quality.
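
As a rough illustration of what an output-weighted MSE can look like, the sketch below measures the error of a layer's output WX rather than of W itself, following the relation Y = WX quoted in the Figure 10 caption. The normalization by the number of calibration columns and the name `weighted_mse` are assumptions; the paper's exact definition is not reproduced on this page.

```python
import numpy as np

def weighted_mse(W, W_hat, X):
    """Squared Frobenius error of the layer output, averaged over the columns of X."""
    diff = (W - W_hat) @ X
    return float(np.sum(diff ** 2) / X.shape[1])

# Hypothetical shapes and random data, for illustration only.
rng = np.random.default_rng(0)
d_out, d_in, n = 4, 6, 32
W = rng.normal(size=(d_out, d_in))
W_hat = W + 0.01 * rng.normal(size=(d_out, d_in))
X = rng.normal(size=(d_in, n))

err = weighted_mse(W, W_hat, X)

# Equivalent trace form: the data enter only through H = X X^T / n.
H = X @ X.T / n
trace_form = float(np.trace((W - W_hat) @ H @ (W - W_hat).T))
```

The trace form shows why this distortion couples the entries of W: errors are weighted by the second-moment matrix of the inputs rather than treated independently.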

What would settle it

Measuring whether adding successive codebooks produces a strictly decreasing reconstruction error that matches the information-theoretic bound for successive refinement of a Gaussian source under the chosen weighted MSE.
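
One way to run such a check, sketched under simplifying assumptions: compare measured errors against the textbook distortion-rate function of a scalar Gaussian under plain MSE, D(R) = σ² · 2^(-2R). The paper's bound concerns the weighted-MSE matrix setting, so this scalar version is only a stand-in, and the numbers passed in below are hypothetical, not results from the paper.

```python
import numpy as np

def gaussian_distortion_bound(variance, rate_bits):
    """D(R) = variance * 2^(-2R) for a scalar Gaussian source under MSE."""
    return variance * 2.0 ** (-2.0 * np.asarray(rate_bits, dtype=float))

def refinement_report(measured_mse, cumulative_rates, variance):
    """Check monotone improvement and report the gap to the bound in dB."""
    measured = np.asarray(measured_mse, dtype=float)
    bound = gaussian_distortion_bound(variance, cumulative_rates)
    return {
        "strictly_decreasing": bool(np.all(np.diff(measured) < 0)),
        "gap_db": 10.0 * np.log10(measured / bound),
    }

# Hypothetical per-weight errors at 1, 2, and 3 bits for a unit-variance source.
report = refinement_report([0.30, 0.09, 0.03], [1.0, 2.0, 3.0], variance=1.0)
```

Successive refinability means the bound at the cumulative rate is reached at every stage, so a gap that stays small and roughly constant across stages would support the claim, while a gap that grows at the lower bitwidths would count against it.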

Figures

Figures reproduced from arXiv: 2606.12876 by Ashish Khisti, Liza Babaoglu, Shuangyi Chen.

**Figure 1.** LLM deployments. Modern LLMs often contain tens or hundreds of billions of parameters, imposing substantial memory and computational demands. Deployment is particularly challenging across heterogeneous platforms that not only include large cloud servers but also personal devices and edge hardware with strict memory, latency, or power constraints (Yan & Ding, 2025; Zheng et al., 2024). Quantization address…

**Figure 2.** Illustration of additive codebook-based reconstruction. Additive quantization (AQ) removes the strict ordering imposed by residuals by representing weights as sums of multiple independently selected vectors drawn from learned codebooks. These components are optimized jointly rather than sequentially, resulting in a more flexible representation with higher expressive capacity at extreme compression ratios…

**Figure 3.** Qwen-7B layer 0 self_attn.v_proj weights fitted to a Gaussian distribution. We now characterize the successive refinability of Gaussian sources under weighted MSE, culminating in Theorem 1. Empirically, layer weights are well-approximated by Gaussian distributions, as seen in the figure.

**Figure 4.** Rate-distortion equivalence. The central challenge is that the WMSE distortion couples all entries of W through the fixed data matrix X, making it non-obvious that successive refinement is achievable without a rate penalty. The proof resolves this by reducing to a standard Gaussian MSE problem, then revealing the nested structure of reverse water-filling. The reverse water-filling method is an optimizat…

**Figure 5.** Visual of $L_{\mathrm{DbyD}}$. However, it is our Drop-by-Drop loss that ensures this droppable property holds in practice, where codebooks can be dropped at inference to provide a smooth tradeoff between compression rate and reconstruction quality. We introduce a novel training objective, the Drop-by-Drop loss, that enables a single quantized model to be evaluated using different numbers of codebooks, without retrainin…

**Figure 6.** Perplexity on WikiText2 (↓). Extreme outliers for AQLM (Drop) are clipped. Evaluation Benchmarks. We evaluate Drop-by-Drop across a diverse set of LLMs spanning multiple architectures and scales, including GEMMA-2B, META-LLAMA-3-8B, MISTRAL-7B-V0.3, and the QWEN2.5 family (Team et al., 2024; Grattafiori et al., 2024; Jiang et al., 2023; Team, 2024), including 0.5B-INSTRUCT, 3B, 7B-INSTRUCT, 14B-INSTRUCT, a…

**Figure 7.** Average zero-shot accuracy (↑) across 5 aforementioned tasks. Section 7 (Discussion and Future Work): Drop-by-Drop demonstrates that a single quantized model can flexibly serve multiple resource budgets without retraining or maintaining separate checkpoints. In conventional methods, supporting multiple precisions requires training and storing a separate model for each target configuration. In our setup, this correspo…

**Figure 8.** Training time and disk space required to produce a suite of quantized models.

**Figure 9.** Qwen-7B layer weights fitted to Gaussian distributions.

**Figure 10.** Reverse water-filling method. … confirming that the distortion equivalence holds in matrix form as well. The optimal codebook for $W$ is then recovered from the optimal $\hat{Y}$ by inverting the relation $Y = WX$, whose precise form depends on the structure of $X$ and is treated separately in each of the three cases below. Case 1: $X$ is square and invertible ($d_{\text{in}} = n$). Set $\hat{W} = \hat{Y} X^{-1}$. Then $\hat{W} X = \hat{Y}$, and therefore E[…

**Figure 11.** Successive refinement under reverse water-filling: case (a).

**Figure 12.** Successive refinement under reverse water-filling: case (b) and case (c).
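
The captions for Figures 4 and 10 to 12 refer to reverse water-filling. As background, the following is a sketch of the standard textbook procedure for independent Gaussian components under a total MSE budget (as in Cover and Thomas); it is not the paper's weighted-MSE derivation, and the function name, the bisection solver, and the example variances are assumptions made for illustration.

```python
import numpy as np

def reverse_water_filling(variances, total_distortion, iters=100):
    """Return (water level, per-component distortions, per-component rates in bits)."""
    var = np.asarray(variances, dtype=float)
    if not 0.0 < total_distortion <= var.sum():
        raise ValueError("total_distortion must lie in (0, sum of variances]")
    lo, hi = 0.0, var.max()
    for _ in range(iters):
        # Bisection on the water level: allotted distortion grows with theta.
        theta = 0.5 * (lo + hi)
        if np.minimum(theta, var).sum() < total_distortion:
            lo = theta
        else:
            hi = theta
    theta = 0.5 * (lo + hi)
    dist = np.minimum(theta, var)          # components below the level get no bits
    rates = 0.5 * np.log2(var / dist)      # zero rate where dist equals the variance
    return theta, dist, rates

# Hypothetical component variances; a coarse and a refined distortion target.
variances = [4.0, 1.0, 0.25]
theta_hi, dist_hi, rates_hi = reverse_water_filling(variances, 1.5)
theta_lo, dist_lo, rates_lo = reverse_water_filling(variances, 0.75)
```

Lowering the distortion target lowers the water level, so every component's distortion at the refined stage is no larger than at the coarse stage. That nesting is the property the Figure 4 caption points to.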

Original abstract

As large language models (LLMs) are increasingly deployed across heterogeneous hardware with varying resource constraints, the ability to adaptively manage the trade-off between performance and efficiency without retraining is critical. We propose Drop-by-Drop, a novel multi-bitwidth post-training quantization framework that enables inference-time precision control over LLM weights from a single trained model. Our method is theoretically grounded in information theory and successive refinement. We establish that LLM weights, which commonly follow a Gaussian distribution, can be optimally reconstructed with increasing fidelity as additional bits are incorporated, under a weighted mean squared error distortion motivated by LLM loss functions. To realize this in practice, Drop-by-Drop incorporates Matryoshka-style supervision into the loss function, exploiting the structure of additive codebooks. Drop-by-Drop produces a single model where ordered subsets of codebooks yield accurate partial reconstructions at each precision level. This approach significantly reduces storage and memory overhead by allowing a single checkpoint to serve multiple bitwidths, while maintaining competitive perplexity and accuracy across major architectures, such as Qwen, LLaMA, Gemma, and Mistral.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a workable single-checkpoint method for multi-bitwidth LLM weights via ordered additive codebooks and Matryoshka supervision, but the successive-refinement optimality claim hinges on unshown derivations for the Gaussian and weighted-MSE assumptions.

The main thing to know is that Drop-by-Drop trains additive codebooks once, then lets you pick ordered subsets at inference to hit different bitwidths from the same stored model. This directly cuts storage when you need to serve the same weights at 2-bit, 3-bit, or 4-bit on different hardware.
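
The storage argument can be made concrete with simple arithmetic. The sketch below assumes an AQLM-style accounting in which each group of weights stores one fixed-width index per codebook, and it ignores the storage of the codebooks themselves and of any scales. The configuration (8 weights per group, 8-bit indices) is hypothetical and not taken from the paper.

```python
def bits_per_weight(num_codebooks, index_bits, group_size):
    """Index bits spent per weight when every group of `group_size` weights
    stores one `index_bits`-bit code per codebook."""
    return num_codebooks * index_bits / group_size

# Hypothetical configuration: each extra codebook adds one bit per weight.
ladder = {m: bits_per_weight(m, index_bits=8, group_size=8) for m in range(1, 5)}

# Serving 2-, 3-, and 4-bit variants from separate checkpoints versus one shared checkpoint.
separate_bits = sum(ladder[m] for m in (2, 3, 4))
shared_bits = ladder[4]
```

Under these assumptions the three separate checkpoints cost 9 index bits per weight in total, while the shared checkpoint costs 4, because the 2-bit and 3-bit models are prefixes of the 4-bit one.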

It does the practical part cleanly. The Matryoshka supervision is a straightforward way to force the codebooks to improve reconstruction as more bits are added, and the additive structure matches the successive-refinement goal without needing separate trainings. Coverage across Qwen, LLaMA, Gemma, and Mistral is reasonable for a quantization paper.

The soft spot is the theory. The abstract states that LLM weights are Gaussian and can be optimally reconstructed under a weighted MSE motivated by LLM loss, but it gives no derivation, no check on whether the source meets the conditions for successive refinability, and no argument that the chosen distortion actually tracks downstream loss. Judged on the abstract alone, the concern stands: if either the distribution or the distortion assumption is off, the optimality guarantee does not transfer to the codebook construction. The full paper needs to show the math or drop the strong claim and treat the weighting as a heuristic. The free parameters in the codebooks and MSE weights also make it worth checking whether the reported gains depend on post-hoc tuning.

This is for people working on post-training quantization and multi-precision deployment. A reader who wants a concrete way to avoid multiple checkpoints will get usable ideas even if the theory section needs tightening. It deserves peer review because the engineering framing is distinct from prior additive quantization work and the experiments are on real models.

Referee Report

3 major / 1 minor

Summary. The paper introduces Drop-by-Drop, a post-training quantization framework for LLMs that uses additive codebooks combined with Matryoshka-style supervision to produce a single model supporting inference at multiple bitwidths. It claims this realizes information-theoretic successive refinement, allowing optimal reconstruction of Gaussian-distributed LLM weights with increasing fidelity under a weighted MSE distortion motivated by LLM loss functions, thereby reducing storage overhead while maintaining competitive perplexity and accuracy on models such as Qwen, LLaMA, Gemma, and Mistral.

Significance. If the successive-refinement optimality result holds and the empirical results generalize, the method would enable flexible multi-precision deployment from one checkpoint, addressing a practical deployment bottleneck for heterogeneous hardware without requiring separate models per bitwidth.

major comments (3)

[Abstract] The claim that LLM weights 'can be optimally reconstructed with increasing fidelity as additional bits are incorporated' under weighted MSE is presented as theoretically grounded in successive refinement, yet no derivation, rate-distortion analysis, or verification of the successive-refinability conditions for the chosen distortion is supplied; this is load-bearing for the central optimality assertion.
[Abstract] The Gaussian assumption on LLM weight distributions is invoked without any cited empirical verification, moment analysis, or reference to a specific section establishing its validity across the evaluated architectures; the optimality transfer from information theory to the additive-codebook construction depends on this premise.
[Abstract] The weighted MSE is described as 'motivated by LLM loss functions' but is not defined explicitly nor shown to align with actual downstream loss; without this link, the optimality claim risks circularity with post-hoc weighting choices.

minor comments (1)

[Abstract] The abstract mentions 'ordered subsets of codebooks' but does not clarify how the additive structure interacts with the Matryoshka supervision in the loss; a brief equation or pseudocode would improve clarity.
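
As a sketch of the kind of pseudocode this comment asks for, the snippet below shows a generic Matryoshka-style objective: the output-space error is evaluated for every prefix of the codebooks and the per-prefix errors are summed. This is an illustrative reading, not the paper's Drop-by-Drop loss; the function names, the per-prefix weights, and the row-major grouping of weights are all assumptions.

```python
import numpy as np

def prefix_losses(W, X, codebooks, codes):
    """Output-space squared error of each prefix reconstruction, m = 1..M."""
    d_out, d_in = W.shape
    num_codebooks = codebooks.shape[0]
    W_hat = np.zeros_like(W)
    losses = []
    for m in range(num_codebooks):
        # Add codebook m's contribution, reshaping groups back into the weight matrix.
        W_hat = W_hat + codebooks[m, codes[:, m]].reshape(d_out, d_in)
        losses.append(float(np.sum(((W - W_hat) @ X) ** 2)))
    return losses

def matryoshka_loss(W, X, codebooks, codes, prefix_weights=None):
    """Weighted sum of the prefix losses, so every prefix is supervised."""
    losses = prefix_losses(W, X, codebooks, codes)
    if prefix_weights is None:
        prefix_weights = [1.0] * len(losses)
    return sum(w * l for w, l in zip(prefix_weights, losses))

# Hypothetical sizes and random data, for illustration only.
rng = np.random.default_rng(0)
d_out, d_in, n, g, K, M = 4, 8, 16, 4, 8, 3
W = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(d_in, n))
codebooks = rng.normal(size=(M, K, g))
codes = rng.integers(0, K, size=(d_out * d_in // g, M))

per_prefix = prefix_losses(W, X, codebooks, codes)
total = matryoshka_loss(W, X, codebooks, codes)
```

Setting all prefix weights to zero except the last recovers an ordinary additive-quantization objective that supervises only the full reconstruction, which gives no incentive for the earlier prefixes to be usable on their own.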

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for these precise comments on the abstract. Each point identifies a place where the presentation of the theoretical grounding can be strengthened. We will revise the manuscript to address them directly.

Referee: [Abstract] The claim that LLM weights 'can be optimally reconstructed with increasing fidelity as additional bits are incorporated' under weighted MSE is presented as theoretically grounded in successive refinement, yet no derivation, rate-distortion analysis, or verification of the successive-refinability conditions for the chosen distortion is supplied; this is load-bearing for the central optimality assertion.

Authors: We agree the abstract states the optimality result without supplying the supporting derivation. The full manuscript develops the successive-refinement argument for additive codebooks under the weighted MSE, but the abstract itself does not reference the specific conditions or rate-distortion steps. In revision we will either (a) add a concise parenthetical outline of the key conditions or (b) point explicitly to the theorem establishing successive refinability, so the claim is no longer unsupported in the abstract. revision: yes

Referee: [Abstract] The Gaussian assumption on LLM weight distributions is invoked without any cited empirical verification, moment analysis, or reference to a specific section establishing its validity across the evaluated architectures; the optimality transfer from information theory to the additive-codebook construction depends on this premise.

Authors: The manuscript states that LLM weights 'commonly follow a Gaussian distribution' but does not include the requested empirical checks or section reference in the abstract. We will add a short clause citing the relevant figure or appendix that reports weight histograms and moment statistics for the evaluated models (Qwen, LLaMA, Gemma, Mistral), thereby grounding the assumption. revision: yes

Referee: [Abstract] The weighted MSE is described as 'motivated by LLM loss functions' but is not defined explicitly nor shown to align with actual downstream loss; without this link, the optimality claim risks circularity with post-hoc weighting choices.

Authors: The abstract uses the phrase 'motivated by LLM loss functions' without defining the weighting or demonstrating the link. We will revise the abstract to give the explicit form of the weighted MSE and add a one-sentence pointer to the section that derives the weighting from the LLM training objective, removing any appearance of circularity. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained in information-theoretic claims

The abstract presents the optimality result as established from information theory and successive refinement under a stated Gaussian assumption and a weighted MSE chosen to be motivated by (but not defined as identical to) LLM loss functions. No equations, self-citations, or fitted parameters are quoted that reduce the claimed reconstruction guarantee to a tautology or to the input data by construction. The Matryoshka supervision is described as a practical realization step rather than a definitional loop. The paper therefore remains self-contained against external benchmarks for its theoretical grounding.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

Only abstract available; ledger populated from stated claims. Gaussian distribution of weights and suitability of weighted MSE are taken as given without independent verification in the provided text.

free parameters (1)

codebook parameters and weighting in MSE
Not numerically specified; the method relies on learned codebooks whose values are fitted during training.

axioms (2)

domain assumption: LLM weights commonly follow a Gaussian distribution
Explicitly stated in abstract as the basis for optimal reconstruction.
domain assumption: Weighted mean squared error is the appropriate distortion measure motivated by LLM loss functions
Invoked to justify successive refinement optimality.
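
The first axiom is straightforward to probe on a given weight tensor. The sketch below computes two simple moment diagnostics, skewness and excess kurtosis, both of which are zero for an exact Gaussian. It runs on synthetic samples rather than real model weights, and the function name is hypothetical; the paper's own evidence is the fitted histograms in Figures 3 and 9.

```python
import numpy as np

def gaussian_moment_check(weights):
    """Skewness and excess kurtosis of a flattened tensor (both 0 for a Gaussian)."""
    w = np.asarray(weights, dtype=float).ravel()
    z = (w - w.mean()) / w.std()
    return {
        "skewness": float(np.mean(z ** 3)),
        "excess_kurtosis": float(np.mean(z ** 4) - 3.0),
    }

# Synthetic stand-ins for a weight tensor: one Gaussian, one heavy-tailed (Laplace).
rng = np.random.default_rng(0)
gaussian_like = gaussian_moment_check(rng.normal(0.0, 0.02, size=200_000))
heavy_tailed = gaussian_moment_check(rng.laplace(0.0, 0.02, size=200_000))
```

A layer whose excess kurtosis sits well above zero, as the Laplace sample does here, would be a case where the Gaussian premise needs a closer look before the optimality result is invoked.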

Reference graph

Works this paper leans on

84 extracted references · 25 canonical work pages · 12 internal anchors

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.

[3] Matryoshka representation learning. Advances in Neural Information Processing Systems.

[4] LSAQ: Layer-specific adaptive quantization for large language model deployment. In 2025 International Joint Conference on Neural Networks (IJCNN), 2025.

[8] A comprehensive overview of large language models. ACM Transactions on Intelligent Systems and Technology, 2025.

[13] GPT-4 technical report. arXiv preprint arXiv:2303.08774.

[16] Google, 2024.

[17] A comprehensive survey of compression algorithms for language models. arXiv preprint arXiv:2401.15347.

[18] A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695.

[19] DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

[20] LoRA: Low-rank adaptation of large language models. ICLR.

[21] Compressing large language models using low rank and low precision decomposition. Advances in Neural Information Processing Systems.

[22] ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

[23] GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems.

[24] AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. Proceedings of Machine Learning and Systems.

[26] Additive quantization for extreme vector compression. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[27] Jiahui Yu and Thomas Huang. Universally slimmable networks and improved training techniques.

[28] DynamicViT: Efficient vision transformers with dynamic token sparsification. Advances in Neural Information Processing Systems.

[29] FastBERT: a self-distilling BERT with adaptive inference time. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

[30] CacheGen: KV cache compression and streaming for fast large language model serving. Proceedings of the ACM SIGCOMM 2024 Conference, 2024.

[32] AdaBits: Neural network quantization with adaptive bit-widths. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33] AdaQuantLM: LLM quantization with adaptive bit-widths. Workshop on Machine Learning and Compression, NeurIPS 2024, 2024.

[34] Rethinking lossy compression: The rate-distortion-perception tradeoff. International Conference on Machine Learning (ICML).

[35] Approximate nearest neighbor search by residual vector quantization. Sensors, 2010.

[36] Successive refinement of information. IEEE Transactions on Information Theory, 1991.

[37] Elements of information theory. 2006.

[38] All sources are nearly successively refinable. IEEE Transactions on Information Theory, 2001.

[39] Rate distortion theory. Elements of Information Theory.

[40] Successive Refinement of Information. 1989.

[41] David J. Sakrison. IEEE Transactions on Information Theory.

[43] Mistral 7B. 2023.

[46] Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research.

[47] WinoGrande: An adversarial Winograd schema challenge at scale. Proceedings of the AAAI Conference on Artificial Intelligence.

[48] PIQA: Reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence.

[49] HellaSwag: Can a machine really finish your sentence? Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL).

[51] Learning multi-rate vector quantization for remote deep inference. 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), 2023.

[53] Tianxiang Chu. 2024.

[54] Matryoshka query transformer for large vision-language models. Advances in Neural Information Processing Systems.

[55] VPTQ: Extreme low-bit vector post-training quantization for large language models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.

[56] QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. Proceedings of Machine Learning Research.

[59] Artem Babenko and Victor Lempitsky. Additive quantization for extreme vector compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 931-938, 2014.

[60] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 7432-7439, 2020.

[61] Shuangyi Chen and Ashish J Khisti. AdaQuantLM: LLM quantization with adaptive bit-widths. In Workshop on Machine Learning and Compression, NeurIPS 2024, 2024.

[62] Yongjian Chen, Tao Guan, and Cheng Wang. Approximate nearest neighbor search by residual vector quantization. Sensors, 10(12):11259-11273, 2010.

[63] Tianxiang Chu. QuIP-for-all: Unified QuIP implementation for LLM quantization. https://github.com/chu-tianxiang/QuIP-for-all, 2024. Accessed: 2026.

[64] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

[65] Thomas M. Cover and Joy A. Thomas. Elements of information theory. John Wiley & Sons, 2nd edition, 2006.

[66] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35:30318-30332, 2022.

[67] Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. Extreme compression of large language models via additive quantization. arXiv preprint arXiv:2401.06118, 2024.

[68] William Howard Robinson Equitz. Successive Refinement of Information. PhD thesis, Stanford University, 1989.

[69] William HR Equitz and Thomas M Cover. Successive refinement of information. IEEE Transactions on Information Theory, 37(2):269-275, 1991.

[70] Eyal Fishel, May Malka, Shai Ginzach, and Nir Shlezinger. Remote inference over dynamic links via adaptive rate deep task-oriented vector quantization. arXiv preprint arXiv:2501.02521, 2025.

[71] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.

[72]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[73]

Matryoshka query transformer for large vision-language models

Wenbo Hu, Zi-Yi Dou, Liunian Li, Amita Kamath, Nanyun Peng, and Kai-Wei Chang. Matryoshka query transformer for large vision-language models. Advances in Neural Information Processing Systems, 37: 0 50168--50188, 2024

2024
[74]

Residual quantization with implicit neural codebooks

Iris AM Huijben, Matthijs Douze, Matthew Muckley, Ruud JG Van Sloun, and Jakob Verbeek. Residual quantization with implicit neural codebooks. arXiv preprint arXiv:2401.14732, 2024

work page arXiv 2024
[75]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. URL https://arxi...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[76]

Adabits: Neural network quantization with adaptive bit-widths

Qing Jin, Linjie Yang, and Zhenyu Liao. Adabits: Neural network quantization with adaptive bit-widths. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 2146--2156, 2020

2020
[77]

Matryoshka representation learning

Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, et al. Matryoshka representation learning. Advances in Neural Information Processing Systems, 35: 0 30233--30249, 2022

2022
[78]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of machine learning and systems, 6: 0 87--100, 2024

2024
[79]

Fastbert: a self-distilling bert with adaptive inference time

Weijie Liu, Peng Zhou, Zhiruo Wang, Zhe Zhao, Haotang Deng, and Qi Ju. Fastbert: a self-distilling bert with adaptive inference time. In Proceedings of the 58th annual meeting of the association for computational linguistics, pp.\ 6035--6044, 2020

2020
[80]

Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, and Mao Yang. VPTQ: Extreme low-bit vector post-training quantization for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 8181–8196, 2024a.
[81]

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, et al. CacheGen: KV cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference, pp. 38–56, 2024b.
[82]

May Malka, Shai Ginzach, and Nir Shlezinger. Learning multi-rate vector quantization for remote deep inference. In 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), pp. 1–5. IEEE, 2023.
[83]

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
[84]

Pranav Nair, Puranjay Datta, Jeff Dean, Prateek Jain, and Aditya Kusupati. Matryoshka quantization. arXiv preprint arXiv:2502.06786, 2025.
[85]

Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A comprehensive overview of large language models. ACM Transactions on Intelligent Systems and Technology, 16(5):1–72, 2025.
[86]

Gunho Park, Jeongin Bae, Beomseok Kwon, Byeongwook Kim, Se Jung Kwon, and Dongsoo Lee. AnyBCQ: Hardware efficient flexible binary-coded quantization for multi-precision LLMs. arXiv preprint arXiv:2510.10467, 2025.
[87]

Yeonhong Park, Jake Hyun, SangLyul Cho, Bonggeun Sim, and Jae W Lee. Any-precision LLM: Low-cost deployment of multiple, different-sized LLMs. arXiv preprint arXiv:2402.10517, 2024.
[88]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
[89]

Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. DynamicViT: Efficient vision transformers with dynamic token sparsification. Advances in Neural Information Processing Systems, 34:13937–13949, 2021.
[90]

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 8732–8740, 2020.
[91]

David J. Sakrison. The rate distortion function for a Gaussian process with a weighted square error criterion. IEEE Transactions on Information Theory, 14(5):506–508, 1968.
[92]

Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, and Xin Wang. ResQ: Mixed-precision quantization of large language models with low-rank residuals. arXiv preprint arXiv:2412.14363, 2024.
[93]

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
[94]

Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
[95]

Nicholas Thompson, Enric Carr, Christopher Chiang, Johannes Erner, Maarten Hobbhahn, Sara Hooker, William Mann, David Wallace, and Jason Wei. Algorithmic progress in language models: A decomposition of perplexity. arXiv preprint arXiv:2505.04075, 2024.
[96]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothee Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[97]

Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. Proceedings of Machine Learning Research, 235:48630, 2024.
[98]

Théophane Vallaeys, Matthew Muckley, Jakob Verbeek, and Matthijs Douze. QINCo2: Vector compression and search with improved implicit neural codebooks. arXiv preprint arXiv:2501.03078, 2025.
[99]

Xiao Yan and Yi Ding. Are we there yet? A measurement study of efficiency for LLM applications on mobile devices. arXiv preprint arXiv:2504.00002, 2025.
