EntroLLM: Entropy Encoded Weight Compression for Efficient Large Language Model Inference on Edge Devices

2025 · cs.LG · arXiv 2505.02380

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

Large Language Models (LLMs) achieve strong performance across tasks, but face storage and compute challenges on edge devices. We propose EntroLLM, a compression framework combining mixed quantization and entropy coding to reduce storage while preserving accuracy. We use a combination of unsigned and asymmetric quantization. Tensor-level quantization produces an entropy-reducing effect, increasing weight compressibility and improving downstream Huffman encoding by $7\times$ (8-bit) and $11.3\times$ (4-bit) over state-of-the-art methods. Huffman coding further reduces memory bandwidth demands, while a parallel decoding strategy enables efficient weight retrieval with minimal latency. Experiments on edge-scale LLMs (smolLM-1.7B, phi3-mini-4k, mistral-7B) show up to $30\%$ storage savings over uint8 and $65\%$ over uint4 models, with $31.9\%$ to $146.6\%$ faster inference on memory-limited devices like the NVIDIA JETSON P3450. EntroLLM requires no retraining and is compatible with existing post-training quantization pipelines, making it practical for edge LLM deployment.
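
The pipeline the abstract describes (quantize a weight tensor, then entropy-code the resulting integers) can be illustrated with a minimal Python sketch. This is not the paper's implementation. It assumes plain per-tensor asymmetric uint8 quantization and a textbook Huffman construction, it runs on synthetic Gaussian weights rather than a real model, and it omits the mixed quantization choice and the parallel decoder. All function names here (`quantize_asymmetric`, `shannon_entropy`, `huffman_code_lengths`, `average_code_length`) are hypothetical and chosen for illustration.

```python
import heapq
import math
import random
from collections import Counter


def quantize_asymmetric(weights, bits=8):
    """Per-tensor asymmetric quantization to unsigned ints in [0, 2**bits - 1]."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return q, scale, lo  # dequantize with q * scale + lo


def shannon_entropy(symbols):
    """Empirical entropy in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tiebreak, {symbol: depth so far}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: d + 1 for s, d in d1.items()}
        merged.update({s: d + 1 for s, d in d2.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def average_code_length(symbols):
    """Average Huffman code length in bits per symbol."""
    lengths = huffman_code_lengths(symbols)
    counts = Counter(symbols)
    return sum(counts[s] * lengths[s] for s in counts) / len(symbols)


# Synthetic stand-in for one weight tensor (an assumption, not real model data).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(20000)]

q, scale, zero = quantize_asymmetric(weights, bits=8)
entropy = shannon_entropy(q)
avg_bits = average_code_length(q)
saving = 1.0 - avg_bits / 8.0  # fraction saved relative to raw uint8 storage

print(f"entropy: {entropy:.3f} bits/weight")
print(f"Huffman: {avg_bits:.3f} bits/weight")
print(f"saving vs uint8: {saving:.1%}")
```

The average Huffman code length always lies between the empirical entropy and the entropy plus one bit, so the lower the entropy of the quantized tensor, the fewer bits each weight needs on average. The saving printed here reflects only the synthetic distribution and should not be read as a reproduction of the reported results.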

representative citing papers

EntroLLM: Entropy Encoded Weight Compression for Efficient Large Language Model Inference on Edge Devices

cs.LG · 2025-05-05 · conditional · novelty 5.0

EntroLLM applies tensor-level mixed quantization to reduce weight entropy, then uses Huffman coding for up to 65% storage savings and faster inference on memory-limited edge devices without retraining.

citing papers explorer

Showing 1 of 1 citing paper.

EntroLLM: Entropy Encoded Weight Compression for Efficient Large Language Model Inference on Edge Devices
cs.LG · 2025-05-05 · conditional · none · ref 1 · internal anchor
EntroLLM applies tensor-level mixed quantization to reduce weight entropy, then uses Huffman coding for up to 65% storage savings and faster inference on memory-limited edge devices without retraining.
