pith. sign in

arxiv: 2606.12876 · v1 · pith:WONGK2HLnew · submitted 2026-06-11 · 💻 cs.LG · cs.CL· cs.IT· math.IT

Multi-Bitwidth Quantization for LLMs Using Additive Codebooks

Pith reviewed 2026-06-27 07:27 UTC · model grok-4.3

classification 💻 cs.LG cs.CLcs.ITmath.IT
keywords multi-bitwidth quantizationadditive codebookspost-training quantizationLLM quantizationsuccessive refinementMatryoshka supervisionGaussian weightsweighted MSE
0
0 comments X

The pith

LLM weights following a Gaussian distribution can be reconstructed with increasing fidelity at multiple bitwidths from one model via additive codebooks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that a single trained LLM checkpoint can deliver inference at several different precisions without retraining or storing separate models. It argues that because typical LLM weights are Gaussian, successive addition of bits yields steadily better reconstructions when measured by a weighted mean squared error tied to actual LLM loss. The approach trains additive codebooks under Matryoshka-style supervision so that ordered subsets of those codebooks produce usable partial reconstructions at each target bitwidth. If correct, this removes the need to keep multiple quantized versions of the same model for different hardware constraints.

Core claim

The paper establishes that LLM weights, which commonly follow a Gaussian distribution, can be optimally reconstructed with increasing fidelity as additional bits are incorporated under a weighted mean squared error distortion motivated by LLM loss functions. This is realized in practice through additive codebooks and Matryoshka-style supervision in the loss function, producing a single model where ordered subsets of codebooks yield accurate partial reconstructions at each precision level.

What carries the argument

Additive codebooks trained with Matryoshka-style supervision to realize successive refinement of Gaussian weight reconstructions under weighted MSE.

If this is right

  • A single checkpoint can serve multiple bitwidths, eliminating the storage cost of separate quantized models.
  • Inference-time selection of precision becomes possible without any retraining step.
  • Competitive perplexity and downstream accuracy are retained on Qwen, LLaMA, Gemma, and Mistral architectures.
  • Memory overhead during deployment drops because only one set of codebooks is stored.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same additive structure might support on-the-fly bitwidth changes inside a single forward pass on heterogeneous hardware.
  • If weight distributions deviate from Gaussian, the successive-refinement guarantee would require a different codebook design or distortion measure.
  • Deployment pipelines could replace multiple quantized checkpoints with one model plus a small set of additive codebooks.
  • The method opens a route to unified training that targets several hardware tiers simultaneously.

Load-bearing premise

LLM weights follow a Gaussian distribution and weighted mean squared error is the right distortion measure for measuring reconstruction quality.

What would settle it

Measuring whether adding successive codebooks produces a strictly decreasing reconstruction error that matches the information-theoretic bound for successive refinement of a Gaussian source under the chosen weighted MSE.

Figures

Figures reproduced from arXiv: 2606.12876 by Ashish Khisti, Liza Babaoglu, Shuangyi Chen.

Figure 1
Figure 1. Figure 1: LLM deployments Modern LLMs often contain tens or hundreds of billions of parameters, imposing substantial mem￾ory and computational demands. Deployment is particularly challenging across heterogeneous plat￾forms that not only include large cloud servers but also personal devices and edge hardware with strict memory, latency, or power constraints (Yan & Ding, 2025; Zheng et al., 2024). Quantization address… view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of additive codebook-based reconstruction Additive quantization (AQ) removes the strict order￾ing imposed by residuals by representing weights as sums of multiple independently selected vectors drawn from learned codebooks. These components are optimized jointly rather than sequentially, result￾ing in a more flexible representation with higher ex￾pressive capacity at extreme compression ratios… view at source ↗
Figure 3
Figure 3. Figure 3: Qwen-7B layer 0 self attn.v proj weights fitted to a Gaussian distribution We now characterize the successive refinability of Gaussian sources under weighted MSE, culminat￾ing in Theorem 1. Empirically, layer weights are well-approximated by Gaussian distributions, as seen in [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Rate-distortion equivalence The central challenge is that the WMSE distortion couples all entries of W through the fixed data matrix X, mak￾ing it non-obvious that successive re￾finement is achievable without a rate penalty. The proof resolves this by re￾ducing to a standard Gaussian MSE problem, then revealing the nested structure of reverse water-filling. The reverse water-filling method is an op￾timizat… view at source ↗
Figure 5
Figure 5. Figure 5: Visual of LDbyD However, it is our Drop-by-Drop loss that ensures this drop￾pable property holds in practice, where codebooks can be dropped at inference to provide a smooth tradeoff between compression rate and reconstruction quality. We introduce a novel training objective, the Drop-by-Drop loss, that enables a single quantized model to be evaluated using different numbers of codebooks, without retrainin… view at source ↗
Figure 6
Figure 6. Figure 6: Perplexity on WikiText2 (↓). Extreme outliers for AQLM (Drop) are clipped. Evaluation Benchmarks. We evaluate Drop-by-Drop across a diverse set of LLMs spanning multiple architectures and scales, including GEMMA-2B, META-LLAMA-3-8B, MISTRAL-7B-V0.3, and the QWEN2.5 family (Team et al., 2024; Grattafiori et al., 2024; Jiang et al., 2023; Team, 2024), including 0.5B-INSTRUCT, 3B, 7B-INSTRUCT, 14B-INSTRUCT, a… view at source ↗
Figure 7
Figure 7. Figure 7: Average zero-shot accuracy (↑) across 5 aforementioned tasks 7 Discussion and Future Work Drop-by-Drop demonstrates that a single quantized model can flexibly serve multiple re￾source budgets without retraining or maintaining separate checkpoints. In conventional methods, supporting multiple precisions requires training and storing a separate model for each target configuration. In our setup, this correspo… view at source ↗
Figure 8
Figure 8. Figure 8: Training time and disk space required to produce a suite of quantized models [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Qwen-7B layer weights fitted to Gaussian distributions [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Reverse water-filling method confirming that the distortion equivalence holds in matrix form as well. The optimal codebook for W is then recovered from the optimal Yb by inverting the relation Y = WX, whose precise form depends on the structure of X and is treated separately in each of the three cases below. Case 1: X is square and invertible (din = n). Set Wc = Yb X −1 . Then, WcX = Yb, and therefore E h… view at source ↗
Figure 11
Figure 11. Figure 11: Successive refinement under reverse water-filling: case (a) [PITH_FULL_IMAGE:figures/full_fig_p026_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Successive refinement under reverse water-filling: case (b) and case (c) [PITH_FULL_IMAGE:figures/full_fig_p027_12.png] view at source ↗
read the original abstract

As large language models (LLMs) are increasingly deployed across heterogeneous hardware with varying resource constraints, the ability to adaptively manage the trade-off between performance and efficiency without retraining is critical. We propose Drop-by-Drop, a novel multi-bitwidth post-training quantization framework that enables inference-time precision control over LLM weights from a single trained model. Our method is theoretically grounded in information theory and successive refinement. We establish that LLM weights, which commonly follow a Gaussian distribution, can be optimally reconstructed with increasing fidelity as additional bits are incorporated, under a weighted mean squared error distortion motivated by LLM loss functions. To realize this in practice, Drop-by-Drop incorporates Matryoshka-style supervision into the loss function, exploiting the structure of additive codebooks. Drop-by-Drop produces a single model where ordered subsets of codebooks yield accurate partial reconstructions at each precision level. This approach significantly reduces storage and memory overhead by allowing a single checkpoint to serve multiple bitwidths, while maintaining competitive perplexity and accuracy across major architectures, such as Qwen, LLaMA, Gemma, and Mistral.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces Drop-by-Drop, a post-training quantization framework for LLMs that uses additive codebooks combined with Matryoshka-style supervision to produce a single model supporting inference at multiple bitwidths. It claims this realizes information-theoretic successive refinement, allowing optimal reconstruction of Gaussian-distributed LLM weights with increasing fidelity under a weighted MSE distortion motivated by LLM loss functions, thereby reducing storage overhead while maintaining competitive perplexity and accuracy on models such as Qwen, LLaMA, Gemma, and Mistral.

Significance. If the successive-refinement optimality result holds and the empirical results generalize, the method would enable flexible multi-precision deployment from one checkpoint, addressing a practical deployment bottleneck for heterogeneous hardware without requiring separate models per bitwidth.

major comments (3)
  1. [Abstract] Abstract: The claim that LLM weights 'can be optimally reconstructed with increasing fidelity as additional bits are incorporated' under weighted MSE is presented as theoretically grounded in successive refinement, yet no derivation, rate-distortion analysis, or verification of the successive-refinability conditions for the chosen distortion is supplied; this is load-bearing for the central optimality assertion.
  2. [Abstract] Abstract: The Gaussian assumption on LLM weight distributions is invoked without any cited empirical verification, moment analysis, or reference to a specific section establishing its validity across the evaluated architectures; the optimality transfer from information theory to the additive-codebook construction depends on this premise.
  3. [Abstract] Abstract: The weighted MSE is described as 'motivated by LLM loss functions' but is not defined explicitly nor shown to align with actual downstream loss; without this link, the optimality claim risks circularity with post-hoc weighting choices.
minor comments (1)
  1. [Abstract] The abstract mentions 'ordered subsets of codebooks' but does not clarify how the additive structure interacts with the Matryoshka supervision in the loss; a brief equation or pseudocode would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for these precise comments on the abstract. Each point identifies a place where the presentation of the theoretical grounding can be strengthened. We will revise the manuscript to address them directly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that LLM weights 'can be optimally reconstructed with increasing fidelity as additional bits are incorporated' under weighted MSE is presented as theoretically grounded in successive refinement, yet no derivation, rate-distortion analysis, or verification of the successive-refinability conditions for the chosen distortion is supplied; this is load-bearing for the central optimality assertion.

    Authors: We agree the abstract states the optimality result without supplying the supporting derivation. The full manuscript develops the successive-refinement argument for additive codebooks under the weighted MSE, but the abstract itself does not reference the specific conditions or rate-distortion steps. In revision we will either (a) add a concise parenthetical outline of the key conditions or (b) point explicitly to the theorem establishing successive refinability, so the claim is no longer unsupported in the abstract. revision: yes

  2. Referee: [Abstract] Abstract: The Gaussian assumption on LLM weight distributions is invoked without any cited empirical verification, moment analysis, or reference to a specific section establishing its validity across the evaluated architectures; the optimality transfer from information theory to the additive-codebook construction depends on this premise.

    Authors: The manuscript states that LLM weights 'commonly follow a Gaussian distribution' but does not include the requested empirical checks or section reference in the abstract. We will add a short clause citing the relevant figure or appendix that reports weight histograms and moment statistics for the evaluated models (Qwen, LLaMA, Gemma, Mistral), thereby grounding the assumption. revision: yes

  3. Referee: [Abstract] Abstract: The weighted MSE is described as 'motivated by LLM loss functions' but is not defined explicitly nor shown to align with actual downstream loss; without this link, the optimality claim risks circularity with post-hoc weighting choices.

    Authors: The abstract uses the phrase 'motivated by LLM loss functions' without defining the weighting or demonstrating the link. We will revise the abstract to give the explicit form of the weighted MSE and add a one-sentence pointer to the section that derives the weighting from the LLM training objective, removing any appearance of circularity. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained in information-theoretic claims

full rationale

The abstract presents the optimality result as established from information theory and successive refinement under a stated Gaussian assumption and a weighted MSE chosen to be motivated by (but not defined as identical to) LLM loss functions. No equations, self-citations, or fitted parameters are quoted that reduce the claimed reconstruction guarantee to a tautology or to the input data by construction. The Matryoshka supervision is described as a practical realization step rather than a definitional loop. The paper therefore remains self-contained against external benchmarks for its theoretical grounding.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

Only abstract available; ledger populated from stated claims. Gaussian distribution of weights and suitability of weighted MSE are taken as given without independent verification in the provided text.

free parameters (1)
  • codebook parameters and weighting in MSE
    Not numerically specified; the method relies on learned codebooks whose values are fitted during training.
axioms (2)
  • domain assumption LLM weights commonly follow a Gaussian distribution
    Explicitly stated in abstract as the basis for optimal reconstruction.
  • domain assumption Weighted mean squared error is the appropriate distortion measure motivated by LLM loss functions
    Invoked to justify successive refinement optimality.

pith-pipeline@v0.9.1-grok · 5731 in / 1454 out tokens · 16165 ms · 2026-06-27T07:27:27.374689+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

84 extracted references · 25 canonical work pages · 12 internal anchors

  1. [1]

    Attention is All you Need , url =

    Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =

  2. [3]

    Advances in Neural Information Processing Systems , volume=

    Matryoshka representation learning , author=. Advances in Neural Information Processing Systems , volume=

  3. [4]

    2025 International Joint Conference on Neural Networks (IJCNN) , pages=

    Lsaq: Layer-specific adaptive quantization for large language model deployment , author=. 2025 International Joint Conference on Neural Networks (IJCNN) , pages=. 2025 , organization=

  4. [8]

    ACM Transactions on Intelligent Systems and Technology , volume=

    A comprehensive overview of large language models , author=. ACM Transactions on Intelligent Systems and Technology , volume=. 2025 , publisher=

  5. [13]

    GPT-4 Technical Report

    Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

  6. [16]

    2024 , howpublished =

    Google , title =. 2024 , howpublished =

  7. [17]

    arXiv preprint arXiv:2401.15347 , year=

    A comprehensive survey of compression algorithms for language models , author=. arXiv preprint arXiv:2401.15347 , year=

  8. [18]

    A Simple and Effective Pruning Approach for Large Language Models

    A simple and effective pruning approach for large language models , author=. arXiv preprint arXiv:2306.11695 , year=

  9. [19]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter , author=. arXiv preprint arXiv:1910.01108 , year=

  10. [20]

    , author=

    Lora: Low-rank adaptation of large language models. , author=. ICLR , volume=

  11. [21]

    Advances in Neural Information Processing Systems , volume=

    Compressing large language models using low rank and low precision decomposition , author=. Advances in Neural Information Processing Systems , volume=

  12. [22]

    ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

    Albert: A lite bert for self-supervised learning of language representations , author=. arXiv preprint arXiv:1909.11942 , year=

  13. [23]

    int8 (): 8-bit matrix multiplication for transformers at scale , author=

    Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale , author=. Advances in neural information processing systems , volume=

  14. [24]

    Proceedings of machine learning and systems , volume=

    Awq: Activation-aware weight quantization for on-device llm compression and acceleration , author=. Proceedings of machine learning and systems , volume=

  15. [26]

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

    Additive quantization for extreme vector compression , author=. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

  16. [27]

    Universally Slimmable Networks and Improved Training Techniques , year=

    Yu, Jiahui and Huang, Thomas , booktitle=. Universally Slimmable Networks and Improved Training Techniques , year=

  17. [28]

    Advances in neural information processing systems , volume=

    Dynamicvit: Efficient vision transformers with dynamic token sparsification , author=. Advances in neural information processing systems , volume=

  18. [29]

    Proceedings of the 58th annual meeting of the association for computational linguistics , pages=

    Fastbert: a self-distilling bert with adaptive inference time , author=. Proceedings of the 58th annual meeting of the association for computational linguistics , pages=

  19. [30]

    Proceedings of the ACM SIGCOMM 2024 Conference , pages=

    Cachegen: Kv cache compression and streaming for fast large language model serving , author=. Proceedings of the ACM SIGCOMM 2024 Conference , pages=

  20. [32]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Adabits: Neural network quantization with adaptive bit-widths , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  21. [33]

    Workshop on Machine Learning and Compression, NeurIPS 2024 , year =

    AdaQuantLM: LLM Quantization with Adaptive Bit-Widths , author=. Workshop on Machine Learning and Compression, NeurIPS 2024 , year =

  22. [34]

    International Conference on Machine Learning (ICML) , year=

    Rethinking Lossy Compression: The Rate-Distortion-Perception Tradeoff , author=. International Conference on Machine Learning (ICML) , year=

  23. [35]

    Sensors , volume=

    Approximate nearest neighbor search by residual vector quantization , author=. Sensors , volume=. 2010 , publisher=

  24. [36]

    IEEE Transactions on information theory , volume=

    Successive refinement of information , author=. IEEE Transactions on information theory , volume=. 1991 , publisher=

  25. [37]

    2006 , publisher=

    Elements of information theory , author=. 2006 , publisher=

  26. [38]

    IEEE Transactions on Information Theory , volume=

    All Sources Are Nearly Successively Refinable , author=. IEEE Transactions on Information Theory , volume=. 2001 , publisher=

  27. [39]

    Elements of Information Theory , pages=

    Rate distortion theory , author=. Elements of Information Theory , pages=

  28. [40]

    1989 , school=

    Successive Refinement of Information , author=. 1989 , school=

  29. [41]

    , title =

    Sakrison, David J. , title =. IEEE Transactions on Information Theory , year =

  30. [43]

    2023 , eprint=

    Mistral 7B , author=. 2023 , eprint=

  31. [46]

    Journal of Machine Learning Research , volume=

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , author=. Journal of Machine Learning Research , volume=

  32. [47]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Winogrande: An adversarial winograd schema challenge at scale , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  33. [48]

    Proceedings of the AAAI conference on artificial intelligence , volume=

    Piqa: Reasoning about physical commonsense in natural language , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

  34. [49]

    Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL) , year=

    HellaSwag: Can a Machine Really Finish Your Sentence? , author=. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL) , year=

  35. [51]

    2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW) , pages=

    Learning multi-rate vector quantization for remote deep inference , author=. 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW) , pages=. 2023 , organization=

  36. [53]

    2024 , howpublished =

    Tianxiang Chu , title =. 2024 , howpublished =

  37. [54]

    Advances in Neural Information Processing Systems , volume=

    Matryoshka query transformer for large vision-language models , author=. Advances in Neural Information Processing Systems , volume=

  38. [55]

    Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

    Vptq: Extreme low-bit vector post-training quantization for large language models , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

  39. [56]

    Proceedings of machine learning research , volume=

    Quip\#: Even better llm quantization with hadamard incoherence and lattice codebooks , author=. Proceedings of machine learning research , volume=

  40. [59]

    Additive quantization for extreme vector compression

    Artem Babenko and Victor Lempitsky. Additive quantization for extreme vector compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 931--938, 2014

  41. [60]

    Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.\ 7432--7439, 2020

  42. [61]

    Adaquantlm: Llm quantization with adaptive bit-widths

    Shuangyi Chen and Ashish J Khisti. Adaquantlm: Llm quantization with adaptive bit-widths. In Workshop on Machine Learning and Compression, NeurIPS 2024, 2024

  43. [62]

    Approximate nearest neighbor search by residual vector quantization

    Yongjian Chen, Tao Guan, and Cheng Wang. Approximate nearest neighbor search by residual vector quantization. Sensors, 10 0 (12): 0 11259--11273, 2010

  44. [63]

    Quip‑for‑all: Unified quip implementation for llm quantization

    Tianxiang Chu. Quip‑for‑all: Unified quip implementation for llm quantization. https://github.com/chu‑tianxiang/QuIP‑for‑all, 2024. Accessed: 2026

  45. [64]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

  46. [65]

    Cover and Joy A

    Thomas M. Cover and Joy A. Thomas. Elements of information theory. John Wiley & Sons, 2 edition, 2006

  47. [66]

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale. Advances in neural information processing systems, 35: 0 30318--30332, 2022

  48. [67]

    Extreme compression of large language models via additive quantization

    Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. Extreme compression of large language models via additive quantization. arXiv preprint arXiv:2401.06118, 2024

  49. [68]

    Successive Refinement of Information

    William Howard Robinson Equitz. Successive Refinement of Information. PhD thesis, Stanford University, 1989

  50. [69]

    Successive refinement of information

    William HR Equitz and Thomas M Cover. Successive refinement of information. IEEE Transactions on information theory, 37 0 (2): 0 269--275, 1991

  51. [70]

    Remote inference over dynamic links via adaptive rate deep task-oriented vector quantization

    Eyal Fishel, May Malka, Shai Ginzach, and Nir Shlezinger. Remote inference over dynamic links via adaptive rate deep task-oriented vector quantization. arXiv preprint arXiv:2501.02521, 2025

  52. [71]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022

  53. [72]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  54. [73]

    Matryoshka query transformer for large vision-language models

    Wenbo Hu, Zi-Yi Dou, Liunian Li, Amita Kamath, Nanyun Peng, and Kai-Wei Chang. Matryoshka query transformer for large vision-language models. Advances in Neural Information Processing Systems, 37: 0 50168--50188, 2024

  55. [74]

    Residual quantization with implicit neural codebooks

    Iris AM Huijben, Matthijs Douze, Matthew Muckley, Ruud JG Van Sloun, and Jakob Verbeek. Residual quantization with implicit neural codebooks. arXiv preprint arXiv:2401.14732, 2024

  56. [75]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. URL https://arxi...

  57. [76]

    Adabits: Neural network quantization with adaptive bit-widths

    Qing Jin, Linjie Yang, and Zhenyu Liao. Adabits: Neural network quantization with adaptive bit-widths. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 2146--2156, 2020

  58. [77]

    Matryoshka representation learning

    Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, et al. Matryoshka representation learning. Advances in Neural Information Processing Systems, 35: 0 30233--30249, 2022

  59. [78]

    Awq: Activation-aware weight quantization for on-device llm compression and acceleration

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of machine learning and systems, 6: 0 87--100, 2024

  60. [79]

    Fastbert: a self-distilling bert with adaptive inference time

    Weijie Liu, Peng Zhou, Zhiruo Wang, Zhe Zhao, Haotang Deng, and Qi Ju. Fastbert: a self-distilling bert with adaptive inference time. In Proceedings of the 58th annual meeting of the association for computational linguistics, pp.\ 6035--6044, 2020

  61. [80]

    Vptq: Extreme low-bit vector post-training quantization for large language models

    Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, and Mao Yang. Vptq: Extreme low-bit vector post-training quantization for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp.\ 8181--8196, 2024 a

  62. [81]

    Cachegen: Kv cache compression and streaming for fast large language model serving

    Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, et al. Cachegen: Kv cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference, pp.\ 38--56, 2024 b

  63. [82]

    Learning multi-rate vector quantization for remote deep inference

    May Malka, Shai Ginzach, and Nir Shlezinger. Learning multi-rate vector quantization for remote deep inference. In 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), pp.\ 1--5. IEEE, 2023

  64. [83]

    Pointer Sentinel Mixture Models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016

  65. [84]

    Matryoshka quantization

    Pranav Nair, Puranjay Datta, Jeff Dean, Prateek Jain, and Aditya Kusupati. Matryoshka quantization. arXiv preprint arXiv:2502.06786, 2025

  66. [85]

    A comprehensive overview of large language models

    Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A comprehensive overview of large language models. ACM Transactions on Intelligent Systems and Technology, 16 0 (5): 0 1--72, 2025

  67. [86]

    Anybcq: Hardware efficient flexible binary-coded quantization for multi-precision llms

    Gunho Park, Jeongin Bae, Beomseok Kwon, Byeongwook Kim, Se Jung Kwon, and Dongsoo Lee. Anybcq: Hardware efficient flexible binary-coded quantization for multi-precision llms. arXiv preprint arXiv:2510.10467, 2025

  68. [87]

    Any-precision llm: Low-cost deployment of multiple, different-sized llms

    Yeonhong Park, Jake Hyun, SangLyul Cho, Bonggeun Sim, and Jae W Lee. Any-precision llm: Low-cost deployment of multiple, different-sized llms. arXiv preprint arXiv:2402.10517, 2024

  69. [88]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21 0 (140): 0 1--67, 2020

  70. [89]

    Dynamicvit: Efficient vision transformers with dynamic token sparsification

    Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in neural information processing systems, 34: 0 13937--13949, 2021

  71. [90]

    Winogrande: An adversarial winograd schema challenge at scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp.\ 8732--8740, 2020

  72. [91]

    Sakrison

    David J. Sakrison. The rate distortion function for a gaussian process with a weighted square error criterion. IEEE Transactions on Information Theory, 14 0 (5): 0 506–508, 1968

  73. [92]

    Resq: Mixed-precision quantization of large language models with low-rank residuals

    Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, and Xin Wang. Resq: Mixed-precision quantization of large language models with low-rank residuals. arXiv preprint arXiv:2412.14363, 2024

  74. [93]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivi \`e re, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024

  75. [94]

    Qwen2.5 Technical Report

    Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

  76. [95]

    Algorithmic progress in language models: A decomposition of perplexity

    Nicholas Thompson, Enric Carr, Christopher Chiang, Johannes Erner, Maarten Hobbhahn, Sara Hooker, William Mann, David Wallace, and Jason Wei. Algorithmic progress in language models: A decomposition of perplexity. arXiv preprint arXiv:2505.04075, 2024

  77. [96]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothee Lacroix, Baptiste Rozi \`e re, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  78. [97]

    Quip\#: Even better llm quantization with hadamard incoherence and lattice codebooks

    Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. Quip\#: Even better llm quantization with hadamard incoherence and lattice codebooks. Proceedings of machine learning research, 235: 0 48630, 2024

  79. [98]

    Qinco2: Vector compression and search with improved implicit neural codebooks

    Th \'e ophane Vallaeys, Matthew Muckley, Jakob Verbeek, and Matthijs Douze. Qinco2: Vector compression and search with improved implicit neural codebooks. arXiv preprint arXiv:2501.03078, 2025

  80. [99]

    Are we there yet? a measurement study of efficiency for llm applications on mobile devices

    Xiao Yan and Yi Ding. Are we there yet? a measurement study of efficiency for llm applications on mobile devices. arXiv preprint arXiv:2504.00002, 2025

Showing first 80 references.