EntroLLM: Entropy Encoded Weight Compression for Efficient Large Language Model Inference on Edge Devices

Arnab Sanyal; Gourav Datta; Michael Orshansky; Prithwish Mukherjee; Sandeep P. Chinchali

arxiv: 2505.02380 · v4 · submitted 2025-05-05 · 💻 cs.LG

Pith reviewed 2026-05-22 16:28 UTC · model grok-4.3

classification 💻 cs.LG

keywords large language models · model compression · quantization · entropy coding · Huffman coding · edge inference · weight compression · post-training quantization

The pith

Tensor-level quantization lowers the entropy of LLM weights, allowing Huffman coding to compress 8-bit models 7 times better and 4-bit models 11.3 times better than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports that applying mixed unsigned and asymmetric quantization at the tensor level reduces the entropy of large language model weight values. Lower entropy makes the weights far more compressible by standard entropy coders such as Huffman, without any retraining. The resulting storage savings reach up to 30 percent versus uint8 baselines and up to 65 percent versus uint4 baselines on models of up to 7 billion parameters. On memory-constrained edge hardware the compressed weights also make inference 32 to 147 percent faster while keeping task accuracy comparable to ordinary quantized models.
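As a rough illustration of why entropy matters here, the sketch below quantizes a synthetic tensor to 8-bit codes and measures the empirical Shannon entropy of those codes. The Gaussian weights, the helper names `shannon_entropy_bits` and `quantize_per_tensor`, and the min/max scaling rule are assumptions made for this example, not the paper's implementation.

```python
import math
import random
from collections import Counter


def shannon_entropy_bits(symbols):
    """Empirical Shannon entropy, in bits per symbol, of a sequence of integer codes."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def quantize_per_tensor(weights, bits=8):
    """Map floats to unsigned codes in [0, 2**bits - 1] using one range for the whole tensor."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels or 1.0  # guard against a constant tensor
    return [round((w - lo) / scale) for w in weights]


rng = random.Random(0)
# Synthetic stand-in for one weight tensor (assumption: roughly bell-shaped values).
weights = [rng.gauss(0.0, 0.02) for _ in range(50_000)]
codes = quantize_per_tensor(weights, bits=8)
entropy = shannon_entropy_bits(codes)
print(f"entropy: {entropy:.2f} bits/weight vs. 8 bits of raw uint8 storage")
```

The gap between the measured entropy and the fixed 8 bits is the headroom an entropy coder can exploit. How large that gap is on real LLM tensors is what the paper's experiments measure.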

Core claim

Tensor-level mixed quantization produces an entropy-reducing effect on LLM weights that markedly improves the compression ratio achieved by subsequent Huffman encoding. The framework delivers 7 times better compression for 8-bit weights and 11.3 times better compression for 4-bit weights relative to existing post-training methods, while a parallel decoding scheme keeps retrieval latency low. These gains require no retraining and integrate directly with current quantization pipelines, yielding up to 30 percent storage reduction versus uint8 and 65 percent versus uint4 on edge-scale models together with substantial inference speed-ups on devices such as the NVIDIA Jetson.

What carries the argument

Mixed unsigned and asymmetric tensor-level quantization that lowers weight entropy before Huffman entropy coding.
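For orientation, the textbook form of asymmetric (affine) uniform quantization is sketched below. The symbols $s$, $z$, $q$, and $\hat{w}$ are introduced here for illustration; the page does not state the paper's exact formulation, so this should be read as the standard definition rather than the authors' own.

```latex
% Standard asymmetric (affine) uniform quantization of a tensor W to b bits.
% s: scale, z: zero-point, q: stored unsigned code, \hat{w}: dequantized weight.
\begin{align}
  s &= \frac{\max(W) - \min(W)}{2^{b} - 1}, \\
  z &= \operatorname{round}\!\left(-\frac{\min(W)}{s}\right), \\
  q &= \operatorname{clip}\!\left(\operatorname{round}\!\left(\frac{w}{s}\right) + z,\; 0,\; 2^{b} - 1\right), \\
  \hat{w} &= s\,(q - z).
\end{align}
```

The zero-point $z$ is what lets negative weights be stored as unsigned codes. In the usual unsigned special case, a tensor with no negative values takes $z = 0$ and $s = \max(W) / (2^{b} - 1)$.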

If this is right

Storage requirements drop by up to 30 percent relative to uint8 models and 65 percent relative to uint4 models.
Inference runs 31.9 to 146.6 percent faster on memory-limited edge hardware such as the NVIDIA Jetson P3450.
No model retraining is required, so the method slots into existing post-training quantization flows.
Parallel decoding keeps the added latency of entropy decoding negligible during inference.
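The Huffman stage and the idea of decoding in parallel can be sketched as follows. Splitting the symbol stream into independently encoded chunks is an assumption made for this example (the page does not describe the paper's decoding layout), and all function names are hypothetical. CPython threads will not give a real speed-up for this pure-Python loop, so the block only shows the structure.

```python
import heapq
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def build_huffman_code(symbols):
    """Return {symbol: bitstring} for a Huffman code built from symbol counts."""
    counts = Counter(symbols)
    if len(counts) == 1:  # a lone symbol still needs a one-bit codeword
        return {next(iter(counts)): "0"}
    # Heap items are (count, tie_breaker, {symbol: partial_code}); the unique
    # tie_breaker ensures the dictionaries are never compared.
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n0, _, c0 = heapq.heappop(heap)
        n1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c0.items()}
        merged.update({s: "1" + b for s, b in c1.items()})
        heapq.heappush(heap, (n0 + n1, tie, merged))
        tie += 1
    return heap[0][2]


def encode(symbols, code):
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Greedy decoding; correct because Huffman codes are prefix-free."""
    inverse = {b: s for s, b in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


def encode_chunks(symbols, code, chunk_size):
    """Encode fixed-size chunks separately so each can be decoded on its own."""
    return [encode(symbols[i:i + chunk_size], code)
            for i in range(0, len(symbols), chunk_size)]


def parallel_decode(chunks, code, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda chunk: decode(chunk, code), chunks))
    return [s for part in parts for s in part]


# Toy 4-bit code stream with one dominant value (illustrative counts only).
codes = [8] * 70 + [7] * 10 + [9] * 10 + [6] * 5 + [10] * 5
table = build_huffman_code(codes)
chunks = encode_chunks(codes, table, chunk_size=16)
restored = parallel_decode(chunks, table)
total_bits = sum(len(c) for c in chunks)
print(f"{total_bits / len(codes):.2f} bits/symbol vs. 4 bits for a fixed-width code")
```

On this toy stream the Huffman code averages 1.6 bits per symbol. A real decoder would also need to record chunk boundaries in the packed bitstream, which this sketch avoids by keeping the chunks as separate strings.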

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The entropy reduction might extend to other entropy coders beyond Huffman, potentially allowing even higher compression ratios.
Lower-entropy quantized weights could pair with pruning or sparsity techniques for further memory savings on edge devices.
Because the method is post-training and training-free, it could be applied to already-deployed models to extend their usable hardware range.
The observed compressibility gain may correlate with reduced weight variance, which could be tested directly on additional model families.
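The first point above, on coders beyond Huffman, rests on a well-known property: a Huffman code spends at least one whole bit per symbol, so its average length $L$ can sit well above the entropy $H$ when one symbol dominates. A small worked example with made-up probabilities (not taken from the paper) shows the size of that gap.

```latex
% Toy three-symbol source; probabilities are illustrative, not from the paper.
\begin{align}
  p &= (0.9,\; 0.05,\; 0.05), \\
  H(p) &= -0.9\log_2 0.9 - 2 \cdot 0.05 \log_2 0.05 \approx 0.569 \text{ bits/symbol}, \\
  L_{\text{Huffman}} &= 0.9 \cdot 1 + 0.05 \cdot 2 + 0.05 \cdot 2 = 1.1 \text{ bits/symbol}, \\
  H(p) &\le L_{\text{Huffman}} < H(p) + 1 .
\end{align}
```

Coders that are not tied to whole-bit codewords, such as arithmetic coding or asymmetric numeral systems, can approach $H(p)$ more closely. Whether that matters for the weight distributions in this paper is untested here.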

Load-bearing premise

The mixed quantization scheme preserves downstream task accuracy at levels comparable to standard uint8 and uint4 baselines across the tested models and tasks.

What would settle it

Running the same models on a new task or dataset and finding accuracy more than 1 to 2 percentage points below the corresponding uint8 or uint4 baseline would falsify the accuracy-preservation claim.

Original abstract

Large Language Models (LLMs) achieve strong performance across tasks, but face storage and compute challenges on edge devices. We propose EntroLLM, a compression framework combining mixed quantization and entropy coding to reduce storage while preserving accuracy. We use a combination of unsigned and asymmetric quantization. Tensor-level quantization produces an entropy-reducing effect, increasing weight compressibility, and improving downstream Huffman encoding by $7\times$ (8-bit) and $11.3\times$ (4-bit) over state-of-the-art methods. Huffman coding further reduces memory bandwidth demands, while a parallel decoding strategy enables efficient weight retrieval with minimal latency. Experiments on edge-scale LLMs (smolLM-1.7B, phi3-mini-4k, mistral-7B) show up to $30\%$ storage savings over uint8 and $65\%$ over uint4 models, with $31.9-146.6\%$ faster inference on memory-limited devices like the NVIDIA JETSON P3450. EntroLLM requires no retraining and is compatible with existing post-training quantization pipelines, making it practical for edge LLM deployment.
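The headline percentages in the abstract can be converted into more directly comparable quantities. The sketch below is plain arithmetic under two stated assumptions: that the storage savings are measured against fixed 8-bit and 4-bit widths, and that "X% faster" means throughput scales by 1 + X/100. The function names are hypothetical.

```python
def effective_bits(base_bits, storage_saving):
    """Average stored bits per weight after a fractional saving over a fixed-width baseline."""
    return base_bits * (1.0 - storage_saving)


def latency_reduction(percent_faster):
    """Fractional latency reduction, assuming 'X% faster' means throughput x (1 + X/100)."""
    return 1.0 - 1.0 / (1.0 + percent_faster / 100.0)


bits_vs_uint8 = effective_bits(8, 0.30)
bits_vs_uint4 = effective_bits(4, 0.65)
low_cut = latency_reduction(31.9)
high_cut = latency_reduction(146.6)

print(f"uint8 baseline: about {bits_vs_uint8:.1f} bits/weight")
print(f"uint4 baseline: about {bits_vs_uint4:.1f} bits/weight")
print(f"latency reduction: {low_cut:.1%} to {high_cut:.1%}")
```

Under these assumptions the best-case savings correspond to roughly 5.6 and 1.4 bits per weight, and the speed-up range corresponds to a latency reduction of roughly 24 to 59 percent. The 1.4-bit figure is in line with the 1.39 average bit-width quoted from the paper's conclusions further down the page.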

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes EntroLLM, a post-training compression framework for LLMs on edge devices that combines mixed unsigned/asymmetric tensor-level quantization with Huffman entropy coding. It claims this produces lower-entropy weight distributions that improve Huffman compressibility by 7× (8-bit) and 11.3× (4-bit) over SOTA methods, yielding up to 30% storage savings vs uint8 and 65% vs uint4, plus 31.9-146.6% faster inference on memory-limited hardware like NVIDIA Jetson P3450, all without retraining and while preserving downstream accuracy on models including smolLM-1.7B, phi3-mini-4k, and mistral-7B.

Significance. If the accuracy preservation and entropy-reduction claims hold under strong controls, the work offers a practical, retraining-free addition to post-training quantization pipelines that could meaningfully lower memory bandwidth for edge LLM inference. The parallel decoding strategy and empirical results across three models are positive aspects; the approach is compatible with existing PTQ methods.

major comments (2)

Abstract and Experiments: The central claim that tensor-level quantization produces an entropy-reducing effect (improving downstream Huffman encoding) rests on comparisons to uint8/uint4 baselines, but the manuscript does not specify whether these baselines use per-tensor or per-channel scaling. Standard practice (e.g., GPTQ, AWQ) employs per-channel scaling precisely to control quantization error; without explicit comparison to such strong per-channel baselines, the accuracy-preservation premise and the attribution of entropy reduction to the tensor-level choice remain unsecured.
Experiments section: No error bars, exact baseline implementation details, or full ablation tables are reported for the accuracy, storage, and speedup numbers. This makes it difficult to assess the robustness of the 30%/65% storage reductions and the 7×/11.3× Huffman gains, especially given possible post-hoc selection of quantization types.
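The per-tensor versus per-channel question raised in the first major comment can be probed on synthetic data. The sketch below builds a matrix whose rows have different spreads, quantizes it both ways, and compares the entropy of the pooled 8-bit codes. The data, the spread range, and the helper names are assumptions for illustration. It says nothing about accuracy and does not replace the comparison the referee asks for.

```python
import math
import random
from collections import Counter


def entropy_bits(codes):
    """Empirical Shannon entropy of a list of integer codes, in bits per code."""
    counts = Counter(codes)
    total = len(codes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def quantize(values, lo, hi, bits=8):
    """Uniform quantization of values to unsigned codes over the range [lo, hi]."""
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels or 1.0  # guard against a zero-width range
    return [round((v - lo) / scale) for v in values]


rng = random.Random(0)
# Synthetic matrix: 32 rows (channels) whose standard deviation grows from 0.01 to 0.10.
rows = [[rng.gauss(0.0, 0.01 + 0.09 * r / 31) for _ in range(2048)] for r in range(32)]

# Per-tensor: one (lo, hi) range shared by every row.
flat = [w for row in rows for w in row]
per_tensor = quantize(flat, min(flat), max(flat))

# Per-channel: each row gets its own range, so every row spreads over all 256 codes.
per_channel = [q for row in rows for q in quantize(row, min(row), max(row))]

h_tensor = entropy_bits(per_tensor)
h_channel = entropy_bits(per_channel)
print(f"per-tensor: {h_tensor:.2f} bits, per-channel: {h_channel:.2f} bits")
```

On this synthetic matrix the shared range crowds the narrow rows into a few codes, so the pooled per-tensor codes have lower entropy. The same crowding is the source of the extra quantization error that per-channel scaling is designed to avoid, which is why the accuracy comparison matters.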

minor comments (2)

Abstract: The reported inference speedup range (31.9-146.6%) is very broad; clarify the exact conditions, models, and hardware configurations that produce the lower and upper ends.
Notation: Define 'mixed quantization' and 'asymmetric' more precisely at first use, including how unsigned values are handled for negative weights.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments point by point below, providing clarifications on our experimental setup and committing to specific revisions that strengthen the presentation of results without altering the core claims.

Referee: Abstract and Experiments: The central claim that tensor-level quantization produces an entropy-reducing effect (improving downstream Huffman encoding) rests on comparisons to uint8/uint4 baselines, but the manuscript does not specify whether these baselines use per-tensor or per-channel scaling. Standard practice (e.g., GPTQ, AWQ) employs per-channel scaling precisely to control quantization error; without explicit comparison to such strong per-channel baselines, the accuracy-preservation premise and the attribution of entropy reduction to the tensor-level choice remain unsecured.

Authors: We appreciate this observation. Our uint8 and uint4 baselines were implemented using standard per-tensor uniform quantization (as in basic PTQ pipelines without per-channel scaling factors), which aligns with the tensor-level scope of our mixed unsigned/asymmetric quantization. This choice was deliberate to isolate the entropy-reduction benefit of our tensor-level approach for subsequent Huffman coding. Per-channel methods like those in GPTQ or AWQ optimize for accuracy but typically yield higher-entropy weight distributions that are less compressible by entropy coding. We will revise the manuscript to explicitly state the per-tensor nature of the baselines, add a direct comparison table against per-channel quantized versions (reporting both accuracy and post-quantization entropy), and clarify that our entropy-coding stage is orthogonal to the initial scaling choice while still preserving downstream task accuracy. revision: partial
Referee: Experiments section: No error bars, exact baseline implementation details, or full ablation tables are reported for the accuracy, storage, and speedup numbers. This makes it difficult to assess the robustness of the 30%/65% storage reductions and the 7×/11.3× Huffman gains, especially given possible post-hoc selection of quantization types.

Authors: We agree that the absence of error bars, precise implementation details, and comprehensive ablations limits evaluation of robustness. In the revised manuscript we will: (1) report error bars for inference latency and storage measurements obtained from repeated runs on the NVIDIA Jetson P3450; (2) provide exact baseline details including library versions, quantization bit-widths, and scaling methods; and (3) include full ablation tables that vary quantization type (unsigned vs. asymmetric, per-tensor vs. per-channel) and demonstrate consistent entropy reduction and compression gains across all tested configurations on smolLM-1.7B, phi3-mini-4k, and mistral-7B. These additions will eliminate any ambiguity regarding post-hoc selection. revision: yes
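One minimal way to report the error bars promised in point (1) is a mean with its standard error over repeated runs. The latency values below are placeholders chosen for the example, not measurements from the paper, and `mean_with_error` is a hypothetical helper.

```python
import math
import statistics


def mean_with_error(samples):
    """Return (mean, standard error of the mean) for repeated measurements."""
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, sem


# Placeholder latencies in milliseconds, NOT measurements from the paper.
latencies_ms = [101.0, 99.0, 100.0, 102.0, 98.0]
mean, sem = mean_with_error(latencies_ms)
print(f"{mean:.1f} ± {sem:.2f} ms over {len(latencies_ms)} runs")
```

With only a handful of runs, reporting the run count alongside the interval lets readers judge how much weight the error bar can carry.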

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical measurements against external baselines

The paper presents EntroLLM as a practical compression pipeline whose headline gains (7×/11.3× Huffman improvement, 30%/65% storage reduction) are obtained by direct experimental comparison to state-of-the-art methods on concrete models (smolLM-1.7B, phi3-mini, mistral-7B). Tensor-level quantization is described as producing an entropy-reducing effect that is then measured, not derived by construction from any fitted parameter or self-referential definition. No uniqueness theorems, ansatzes smuggled via self-citation, or predictions that reduce to the input data appear in the abstract or method description. The work is explicitly post-training and compatible with existing pipelines, confirming that the reported improvements are externally falsifiable rather than tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on standard post-training quantization assumptions and the empirical observation that tensor-level quantization lowers entropy; no new physical entities or unstated mathematical axioms are introduced beyond common compression techniques.

free parameters (1)

Quantization bit widths and asymmetry choices
Specific unsigned/asymmetric settings per tensor are selected to achieve the reported entropy reduction but are not derived from first principles.

axioms (1)

domain assumption: Tensor-level mixed quantization preserves model accuracy sufficiently for the target tasks
Invoked when claiming storage savings while preserving accuracy; location is the abstract statement on compatibility with post-training pipelines.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation to the paper passage: unclear

Paper passage: Tensor-level quantization produces an entropy-reducing effect, increasing weight compressibility, and improving downstream Huffman encoding by 7× (8-bit) and 11.3× (4-bit) over state-of-the-art methods.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 3 internal anchors

[1] INTRODUCTION. Large language models (LLMs) have demonstrated remarkable performance in various domains [1, 2], but their substantial size poses challenges for deployment, especially on resource-constrained edge devices [3]. For example, even a smaller LLM such as mistral-7B-Instruct [4] requires more than 14 GB of memory with weights encoded in 16-bit float...

[2] RELATED WORKS. Before LLMs, compression techniques such as pruning [11], quantization [12, 13], knowledge distillation [14], and alternative number formats such as logarithmic [15] and posit [16] were developed to improve deep learning efficiency. These typically require retraining, which is impractical for LLMs due to extreme memory demands. Recent work ...

[3] PROPOSED METHOD. Entropy coding compresses the data by exploiting symbol frequencies. Huffman encoding [26] optimally assigns shorter codes to frequent symbols and is the core of the beyond-quantization compression framework in our work. Mixed Quantization Scheme: We aim to compress weights beyond post-training quantization. State-of-the-art models ofte...

[4] EXPERIMENTS. We evaluated our proposed compression scheme on three edge-based LLMs: smolLM-1.7B-Instruct (1.7 billion parameters) [9], phi3-mini-4k-Instruct (3.8 billion parameters) [10] and mistral-7B-Instruct (7 billion parameters) [4]. Our code and models are available online [27]. Table 1 shows their baseline fp16 sizes and subsequent sizes after quantizatio...

[5] CONCLUSIONS. We present EntroLLM, a framework combining mixed quantization, Huffman compression, and parallel decoding to enable efficient LLM deployment on edge devices. Our method reduces average bit-width to 1.39 for 4-bit weights, improving downstream entropy coding by 7×–11.3× over state-of-the-art techniques. Parallel decoding keeps decompressio...
[6] Tom Brown, Benjamin Mann, Nick Ryder, et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, Eds., 2020, vol. 33, pp. 1877–1901, Curran Associates, Inc.

[7] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample, “Llama: Open and efficient foundation language models,” ArXiv, vol. abs/2302.13971, 2023.

[8] Zixuan Zhou, Xuefei Ning, Ke Hong, et al., “A survey on efficient inference for large language models,” arXiv preprint arXiv:2404.14294, 2024.

[9] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed, “Mistral 7b,” 2023.

[10] Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael Mahoney, and Kurt Keutzer, “Squeezellm: Dense-and-sparse quantization,” arXiv preprint arXiv:2306.07629, 2023.

[11] Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W. Mahoney, and Kurt Keutzer, “Ai and memory wall,” IEEE Micro, vol. 44, no. 3, pp. 33–39, 2024.

[12] Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen, and Tijmen Blankevoort, “A white paper on neural network quantization,” 2021.

[13] Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He, “Zeroquant: Efficient and affordable post-training quantization for large-scale transformers,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., 2022, vol. 35, pp. 27168–27183, Curran Associates, Inc.

[14] Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Leandro von Werra, and Thomas Wolf, “Smollm - blazingly fast and remarkably powerful,” 2024.

[15] Marah Abdin et al., “Phi-3 technical report: A highly capable language model locally on your phone,” 2024.

[16] Song Han, Jeff Pool, John Tran, and William Dally, “Learning both weights and connections for efficient neural network,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 1135–1143.

[17] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” Journal of Machine Learning Research, vol. 18, no. 187, pp. 1–30, 2018.

[18] Ron Banner, Itay Hubara, Elad Hoffer, and Daniel Soudry, “Post-training 4-bit quantization of convolutional networks for rapid-deployment,” Advances in Neural Information Processing Systems, vol. 32, 2019.

[19] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean, “Distilling the knowledge in a neural network,” in NIPS Deep Learning and Representation Learning Workshop, 2015.

[20] Arnab Sanyal, Peter A. Beerel, and Keith M. Chugg, “Neural network training with approximate logarithmic computations,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 3122–3126.

[21] Zachariah Carmichael, Hamed F. Langroudi, Char Khazanov, Jeffrey Lillie, John L. Gustafson, and Dhireesha Kudithipudi, “Deep positron: A deep neural network using the posit number system,” in 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2019, vol. 1, pp. 1421–1426.

[22] Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter, “A simple and effective pruning approach for large language models,” in The Twelfth International Conference on Learning Representations, 2024.

[23] Xinyin Ma, Gongfan Fang, and Xinchao Wang, “Llm-pruner: On the structural pruning of large language models,” in Advances in Neural Information Processing Systems, 2023.

[24] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh, “OPTQ: Accurate quantization for generative pre-trained transformers,” in The Eleventh International Conference on Learning Representations, 2023.

[25] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han, “Awq: Activation-aware weight quantization for llm compression and acceleration,” in MLSys, 2024.

[26] Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh, “Spqr: A sparse-quantized representation for near-lossless llm weight compression,” arXiv preprint arXiv:2306.03078, 2023.

[27] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer, “QLoRA: Efficient finetuning of quantized LLMs,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023.

[28] Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort, “Understanding and overcoming the challenges of efficient transformer quantization,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, Nov. 2021, pp. 7947–7969, Association for Computational Linguistics.

[29] Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, and Xianglong Liu, “Outlier suppression: Pushing the limit of low-bit transformer language models,” in Advances in Neural Information Processing Systems, Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, Eds., 2022.

[30] Tianqi Chen, Zhe Li, Weixiang Xu, Zeyu Zhu, Dong Li, Lu Tian, Emad Barsoum, Peisong Wang, and Jian Cheng, “Ternaryllm: Ternarized large language model,” 2024.

[31] David A. Huffman, “A method for the construction of minimum-redundancy codes,” Proceedings of the IRE, vol. 40, no. 9, pp. 1098–1101, 1952.

[32] Arnab Sanyal, “EntroLLM: Efficient LLM with entropy-coded Weights,” https://github.com/arnabsanyal/EntroLLM

[33] ARM Holdings, “Cortex-a57 technical specifications,” https://developer.arm.com/Processors/Cortex-A57, 2025, Accessed: March 20, 2025.

[34] Reiner Pope et al., “Efficiently scaling transformer inference,” in Machine Learning and Systems, 2023, vol. 5, pp. 1–19.

[35] Noam Shazeer, “Fast transformer decoding: One write-head is all you need,” in Advances in Neural Information Processing Systems, 2019.

[36] Woosuk Kwon et al., “Efficient memory management for large language model serving with pagedattention,” in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. 2314–2324.

[37] Zhenyu Fang et al., “Turbomixer: Fast and efficient transformers via low-rank adaptors and hardware specialization,” in Proceedings of the 40th International Conference on Machine Learning, 2023, pp. 9711–9721.

[1] [1]

INTRODUCTION Large language models (LLMs) have demonstrated remark- able performance in various domains [1, 2], but their sub- stantial size poses challenges for deployment, especially on resource-constrained edge devices [3]. For example, even a smaller LLM such asmistral-7B-Instruct[4] requires more than14GB of memory with weights encoded in16-bit float...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

These typically require retraining, which is impractical for LLMs due to extreme memory demands

RELATED WORKS Before LLMs, compression techniques such as pruning [11], quantization [12,13], knowledge distillation [14], and alterna- tive number formats such as logarithmic [15] and posit [16] were developed to improve deep learning efficiency. These typically require retraining, which is impractical for LLMs due to extreme memory demands. Recent work ...

work page

[3] [3]

Huffman encoding [26] optimally assigns shorter codes to frequent symbols and is the core of the beyond- quantization compression framework in our work

PROPOSED METHOD Entropy coding compresses the data by exploiting symbol fre- quencies. Huffman encoding [26] optimally assigns shorter codes to frequent symbols and is the core of the beyond- quantization compression framework in our work. Mixed Quantization Scheme: We aim to compress weights beyond post-training quantization. State-of-the-art models ofte...

work page

[4] [4]

bucketing

EXPERIMENTS We evaluated our proposed compression scheme on three edge-based LLMs:smolLM-1.7B-Instruct(1.7 billion param- eters) [9],phi3-mini-4k-Instruct(3.8 billion parameters) [10] andmistral-7B-Instruct(7 billion parameters) [4]. Our code and models are available online [27]. Table 1 shows their baselinefp16sizes and subsequent sizes after quantizatio...

work page

[5] [5]

Our method reduces average bit-width to 1.39 for 4-bit weights, improv- ing downstream entropy coding by7×–11.3×over state-of- the-art techniques

CONCLUSIONS We present EntroLLM, a framework combining mixed quan- tization, Huffman compression, and parallel decoding to en- able efficient LLM deployment on edge devices. Our method reduces average bit-width to 1.39 for 4-bit weights, improv- ing downstream entropy coding by7×–11.3×over state-of- the-art techniques. Parallel decoding keeps decompressio...

work page

[6] [6]

Language models are few-shot learners,

Tom Brown, Benjamin Mann, Nick Ryder, et al., “Language models are few-shot learners,” inAdvances in Neural Information Process- ing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, Eds. 2020, vol. 33, pp. 1877–1901, Curran Associates, Inc

work page 2020

[7] [7]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth ´ee Lacroix, Baptiste Rozi `ere, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample, “Llama: Open and efficient foundation language models,”ArXiv, vol. abs/2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

A Survey on Efficient Inference for Large Language Models

Zixuan Zhou, Xuefei Ning, Ke Hong, et al., “A survey on efficient in- ference for large language models,”arXiv preprint arXiv:2404.14294, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Mis- tral 7b,

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bam- ford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L ´elio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth´ee Lacroix, and William El Sayed, “Mis- tral 7b,” 2023

work page 2023

[10] [10]

W., and Keutzer, K

Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael Mahoney, and Kurt Keutzer, “Squeezellm: Dense-and-sparse quantization,”arXiv preprint arXiv:2306.07629, 2023

work page arXiv 2023

[11] [11]

Ai and memory wall,

Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W. Mahoney, and Kurt Keutzer, “Ai and memory wall,”IEEE Micro, vol. 44, no. 3, pp. 33–39, 2024

work page 2024

[12] [12]

A white paper on neural network quantization,

Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bon- darenko, Mart van Baalen, and Tijmen Blankevoort, “A white paper on neural network quantization,” 2021

work page 2021

[13] [13]

Zeroquant: Efficient and affordable post-training quantization for large-scale transformers,

Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He, “Zeroquant: Efficient and affordable post-training quantization for large-scale transformers,” inAdvances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds. 2022, vol. 35, pp. 27168–27183, Curran Associates, Inc

work page 2022

[14] [14]

Smollm - blazingly fast and remarkably powerful,

Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Leandro von Werra, and Thomas Wolf, “Smollm - blazingly fast and remarkably powerful,” 2024

work page 2024

[15] [15]

Phi-3 technical report: A highly capable language model locally on your phone,

Marah Abdin et al., “Phi-3 technical report: A highly capable language model locally on your phone,” 2024

work page 2024

[16] [16]

Learning both weights and connections for efficient neural network,

Song Han, Jeff Pool, John Tran, and William Dally, “Learning both weights and connections for efficient neural network,” inAdvances in Neural Information Processing Systems (NIPS), 2015, pp. 1135–1143

work page 2015

[17] [17]

Quantized neural networks: Training neural networks with low precision weights and activations,

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,”Journal of Machine Learn- ing Research, vol. 18, no. 187, pp. 1–30, 2018

work page 2018

[18] Ron Banner, Itay Hubara, Elad Hoffer, and Daniel Soudry, “Post-training 4-bit quantization of convolutional networks for rapid-deployment,” Advances in Neural Information Processing Systems, vol. 32, 2019

[19] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean, “Distilling the knowledge in a neural network,” in NIPS Deep Learning and Representation Learning Workshop, 2015

[20] Arnab Sanyal, Peter A. Beerel, and Keith M. Chugg, “Neural network training with approximate logarithmic computations,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 3122–3126

[21] Zachariah Carmichael, Hamed F. Langroudi, Char Khazanov, Jeffrey Lillie, John L. Gustafson, and Dhireesha Kudithipudi, “Deep positron: A deep neural network using the posit number system,” in 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2019, vol. 1, pp. 1421–1426

[22] Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter, “A simple and effective pruning approach for large language models,” in The Twelfth International Conference on Learning Representations, 2024

[23] Xinyin Ma, Gongfan Fang, and Xinchao Wang, “LLM-Pruner: On the structural pruning of large language models,” in Advances in Neural Information Processing Systems, 2023

[24] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh, “OPTQ: Accurate quantization for generative pre-trained transformers,” in The Eleventh International Conference on Learning Representations, 2023

[25] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han, “AWQ: Activation-aware weight quantization for LLM compression and acceleration,” in MLSys, 2024

[26] Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh, “SpQR: A sparse-quantized representation for near-lossless LLM weight compression,” arXiv preprint arXiv:2306.03078, 2023

[27] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer, “QLoRA: Efficient finetuning of quantized LLMs,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023

[28] Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort, “Understanding and overcoming the challenges of efficient transformer quantization,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, Nov. 2021, pp. 7947–7969, Association for Computational Linguistics

[29] Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, and Xianglong Liu, “Outlier suppression: Pushing the limit of low-bit transformer language models,” in Advances in Neural Information Processing Systems, Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, Eds., 2022

[30] Tianqi Chen, Zhe Li, Weixiang Xu, Zeyu Zhu, Dong Li, Lu Tian, Emad Barsoum, Peisong Wang, and Jian Cheng, “TernaryLLM: Ternarized large language model,” 2024

[31] David A. Huffman, “A method for the construction of minimum-redundancy codes,” Proceedings of the IRE, vol. 40, no. 9, pp. 1098–1101, 1952

[32] Arnab Sanyal, “EntroLLM: Efficient LLM with entropy-coded Weights,” https://github.com/arnabsanyal/EntroLLM

[33] ARM Holdings, “Cortex-A57 technical specifications,” https://developer.arm.com/Processors/Cortex-A57, 2025, Accessed: March 20, 2025

[34] Reiner Pope et al., “Efficiently scaling transformer inference,” in Machine Learning and Systems, 2023, vol. 5, pp. 1–19

[35] Noam Shazeer, “Fast transformer decoding: One write-head is all you need,” in Advances in Neural Information Processing Systems, 2019

[36] Woosuk Kwon et al., “Efficient memory management for large language model serving with PagedAttention,” in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. 2314–2324

[37] Zhenyu Fang et al., “Turbomixer: Fast and efficient transformers via low-rank adaptors and hardware specialization,” in Proceedings of the 40th International Conference on Machine Learning, 2023, pp. 9711–9721