EntroLLM applies tensor-level mixed quantization to reduce weight entropy then uses Huffman coding for up to 65% storage savings and faster inference on memory-limited edge devices without retraining.
EntroLLM: Entropy Encoded Weight Compression for Efficient Large Language Model Inference on Edge Devices
1 Pith paper cite this work. Polarity classification is still indexing.
abstract
Large Language Models (LLMs) achieve strong performance across tasks, but face storage and compute challenges on edge devices. We propose EntroLLM, a compression framework combining mixed quantization and entropy coding to reduce storage while preserving accuracy. We use a combination of unsigned and asymmetric quantization. Tensor-level quantization produces an entropy-reducing effect, increasing weight compressibility, and improving downstream Huffman encoding by $7\times$ (8-bit) and $11.3\times$ (4-bit) over state-of-the-art methods. Huffman coding further reduces memory bandwidth demands, while a parallel decoding strategy enables efficient weight retrieval with minimal latency. Experiments on edge-scale LLMs (smolLM-1.7B, phi3-mini-4k, mistral-7B) show up to $30\%$ storage savings over uint8 and $65\%$ over uint4 models, with $31.9-146.6\%$ faster inference on memory-limited devices like the NVIDIA JETSON P3450. EntroLLM requires no retraining and is compatible with existing post-training quantization pipelines, making it practical for edge LLM deployment.
fields
cs.LG 1years
2025 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
EntroLLM: Entropy Encoded Weight Compression for Efficient Large Language Model Inference on Edge Devices
EntroLLM applies tensor-level mixed quantization to reduce weight entropy then uses Huffman coding for up to 65% storage savings and faster inference on memory-limited edge devices without retraining.