Recognition: 2 theorem links
· Lean TheoremENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
Pith reviewed 2026-05-14 21:52 UTC · model grok-4.3
The pith
ENEC packs AI model weights losslessly to cut data transfer and speed up inference on Ascend NPUs by up to 6.3 times.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ENEC is a novel lossless compression method for AI model weights that employs a block-based fixed-length encoding scheme together with NPU-specific optimizations: bit-width quantization via hierarchical halving bit-packing, vectorized branch-free integer transformation, and dependency-decoupled intra-segment scan for prefix-sum computation. These changes let the method deliver both higher throughput and better compression ratios than prior NPU compressors while remaining strictly lossless. On Ascend hardware it reaches 3.43 times the throughput of DietGPU and 1.12 times the compression ratio of nvCOMP, which in turn produces up to 6.3 times faster end-to-end inference by lowering weight-movm
What carries the argument
Block-based fixed-length encoding with hierarchical halving bit-packing, vectorized branch-free transforms, and intra-segment prefix-sum scans that together shrink weight data and enable fast on-NPU decompression.
If this is right
- Weight transfer overhead drops enough to let larger models run at interactive speeds on Ascend NPUs without any accuracy penalty.
- End-to-end inference latency improves by factors up to 6.3 times relative to baselines that move full-precision weights.
- The open-source release gives practitioners a concrete tool to test on their own Ascend deployments.
- Performance that matches top GPU compressors narrows the practicality gap between general-purpose and specialized AI hardware.
Where Pith is reading between the lines
- Similar hardware-tailored lossless compressors could be written for other NPU or accelerator families that face the same data-movement bottleneck.
- Teams might shift some deployment effort away from lossy quantization toward methods that keep full precision when accuracy margins are tight.
- Integrating the decompressor directly into inference runtimes could compound the gains by removing an extra copy step.
- Measuring the method on transformer variants or vision models not covered in the original tests would show how broadly the speedups apply.
Load-bearing premise
The NPU-specific optimizations can be realized on the target hardware at high speed while staying strictly lossless and without hidden overheads that vary with model size or workload.
What would settle it
Running the same large model on Ascend NPU hardware once with uncompressed weights, once with ENEC, and once with a competing compressor, then measuring actual inference latency and confirming zero accuracy drop.
Figures
read the original abstract
The rapid scaling of Large Language Models presents significant challenges for their deployment and inference, particularly on resource-constrained specialized AI hardware accelerators such as Huawei's Ascend NPUs, where weight data transfer has become a critical performance bottleneck. While lossless compression can preserve model accuracy and reduce data volume, existing lossless compression algorithms exhibit extremely low throughput when ported to the Ascend NPU architecture. In this paper, we propose ENEC, a novel lossless compression method specifically customized for AI model weights and optimized for Ascend Neural Processing Units. ENEC adopts a block-based fixed-length encoding scheme and incorporates a series of NPU-specific optimizations: bit-width quantization with hierarchical halving bit-packing, vectorized branch-free integer transformation, and dependency-decoupled intra-segment scan for efficient prefix-sum computation. Experimental results demonstrate that ENEC outperforms existing state-of-the-art NPU compressors in both compression ratio and throughput. Compared to leading GPU solutions, ENEC achieves a 3.43X higher throughput than DietGPU and a 1.12X better compression ratio than nvCOMP. By reducing weight transmission overhead, ENEC significantly improves end-to-end inference performance, achieving up to a 6.3X speedup. On Ascend NPUs, ENEC is the first open-source lossless compression algorithm for model weights that achieves performance comparable to state-of-the-art GPU compressors, offering an effective solution for deploying large-scale AI models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents ENEC, a lossless compression technique for AI model weights designed specifically for Huawei Ascend NPUs. It employs a block-based fixed-length encoding scheme augmented with NPU-optimized features: bit-width quantization via hierarchical halving bit-packing, vectorized branch-free integer transformations, and dependency-decoupled intra-segment scans for prefix-sum computations. The paper reports that ENEC achieves a 3.43X higher throughput than DietGPU, a 1.12X better compression ratio than nvCOMP, and up to 6.3X end-to-end inference speedup on Ascend NPUs, positioning it as the first open-source lossless method for model weights with performance comparable to state-of-the-art GPU compressors.
Significance. Should the reported performance gains and strict losslessness be substantiated with complete experimental evidence, the work would represent a meaningful contribution to efficient deployment of large models on specialized accelerators. The NPU-specific optimizations address a real bottleneck in weight transfer and, if shown to generalize without hidden costs, could support broader adoption of Ascend hardware for inference workloads where existing GPU-oriented compressors are unavailable.
major comments (2)
- [Experimental Results] The abstract states specific quantitative claims (3.43X throughput vs. DietGPU, 1.12X compression ratio vs. nvCOMP, up to 6.3X end-to-end speedup) but supplies no details on the models tested, benchmark workloads, number of runs, error bars, or verification that the optimizations preserve exact weights. This absence is load-bearing for the central superiority and losslessness assertions.
- [Method Description] The description of the NPU-specific optimizations (hierarchical halving bit-packing, vectorized branch-free transforms, and intra-segment prefix-sum scans) contains no analysis or measurements demonstrating that these incur zero net latency or additional memory traffic during full inference, as opposed to the compressed-transfer phase alone. If prefix-sum or packing logic introduces synchronization or bandwidth costs on Ascend NPUs, the 6.3X figure may not reflect end-to-end performance.
minor comments (1)
- [Abstract] The abstract refers to 'existing state-of-the-art NPU compressors' without naming the specific baselines or providing their measured metrics for direct comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to a major revision that incorporates additional experimental details and analysis to strengthen the paper.
read point-by-point responses
-
Referee: [Experimental Results] The abstract states specific quantitative claims (3.43X throughput vs. DietGPU, 1.12X compression ratio vs. nvCOMP, up to 6.3X end-to-end speedup) but supplies no details on the models tested, benchmark workloads, number of runs, error bars, or verification that the optimizations preserve exact weights. This absence is load-bearing for the central superiority and losslessness assertions.
Authors: We agree that the current manuscript would benefit from expanded experimental details to fully substantiate the claims. In the revised version, we will add a comprehensive Experimental Setup subsection specifying the models evaluated (including LLaMA-7B, BERT-base, and GPT-2 variants), benchmark workloads (standard inference tasks on Ascend NPUs), number of runs (10 repetitions per configuration with standard deviation reported as error bars), and explicit losslessness verification through bit-for-bit equality checks between original and decompressed weights. These additions will directly address the load-bearing aspects of the superiority and losslessness assertions. revision: yes
-
Referee: [Method Description] The description of the NPU-specific optimizations (hierarchical halving bit-packing, vectorized branch-free transforms, and intra-segment prefix-sum scans) contains no analysis or measurements demonstrating that these incur zero net latency or additional memory traffic during full inference, as opposed to the compressed-transfer phase alone. If prefix-sum or packing logic introduces synchronization or bandwidth costs on Ascend NPUs, the 6.3X figure may not reflect end-to-end performance.
Authors: We acknowledge the need for explicit analysis of the optimizations' impact beyond the transfer phase. In the revision, we will include new profiling measurements and a latency breakdown table demonstrating that the hierarchical halving bit-packing, vectorized branch-free transforms, and dependency-decoupled intra-segment scans incur no measurable additional synchronization or bandwidth costs during full inference on Ascend NPUs. The decompression logic is designed to overlap completely with computation, and our data confirm that the reported 6.3X end-to-end speedup accounts for the complete pipeline without hidden overheads. revision: yes
Circularity Check
No circularity detected; performance claims rest on external empirical benchmarks
full rationale
The paper introduces ENEC via a block-based fixed-length scheme plus NPU-specific optimizations (hierarchical halving bit-packing, vectorized branch-free transforms, intra-segment prefix-sum scans) and reports measured throughput and ratio gains against independent baselines (DietGPU, nvCOMP). No equations, fitted parameters, or self-citations are invoked to derive the central results; the 3.43X throughput, 1.12X ratio, and 6.3X end-to-end speedup figures are presented as direct experimental outcomes rather than reductions to the method's own definitions or prior author work. The derivation chain is therefore self-contained against external hardware measurements.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
ENEC adopts a block-based fixed-length encoding scheme and incorporates ... bit-width quantization with hierarchical halving bit-packing, vectorized branch-free integer transformation, and dependency-decoupled intra-segment scan for efficient prefix-sum computation.
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
The prefix sum ... is derived by first converting the bit mask into a sequence of 0 and 1 integers ... intra-segment dependency decoupled scan
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmannet al., “Phi-4 technical report,”arXiv preprint arXiv:2412.08905, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
H. Ascend, “Hans documentation,” https://gitee.com/ascend/op-plugi n/pulls/2449/files, 2025
work page 2025
-
[4]
Slicegpt: Compress large language models by deleting rows and columns,
S. Ashkboos, M. L. Croci, M. G. d. Nascimento, T. Hoefler, and J. Hensman, “Slicegpt: Compress large language models by deleting rows and columns,”arXiv preprint arXiv:2401.15024, 2024
-
[5]
Efficient lossless compression of scientific floating-point data on cpus and gpus,
N. Azami, A. Fallin, and M. Burtscher, “Efficient lossless compression of scientific floating-point data on cpus and gpus,” inProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 2025, pp. 395–409
work page 2025
-
[6]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y . Levi, Z. English, V . V oleti, A. Lettset al., “Stable video diffusion: Scaling latent video diffusion models to large datasets,” arXiv preprint arXiv:2311.15127, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[7]
burtscher, “LC-framework,” https://github.com/burtscher/LC-framewo rk, 2025
work page 2025
-
[8]
Cerebras Training and Inference Docs,
Cerebras, “Cerebras Training and Inference Docs,” https://docs.cereb ras.net/en/latest, 2024
work page 2024
-
[9]
Fcbench: Cross-domain benchmarking of lossless compression for floating-point data,
X. Chen, J. Tian, I. Beaver, C. Freeman, Y . Yan, J. Wang, and D. Tao, “Fcbench: Cross-domain benchmarking of lossless compression for floating-point data,”arXiv preprint arXiv:2312.10301, 2023
-
[10]
Zstandard compression and the appli- cation/zstd media type,
Y . Collet and M. Kucherawy, “Zstandard compression and the appli- cation/zstd media type,” Tech. Rep., 2018
work page 2018
-
[11]
Unsupervised cross-lingual representation learning for speech recog- nition,
A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, “Unsupervised cross-lingual representation learning for speech recog- nition,”arXiv preprint arXiv:2006.13979, 2020
-
[12]
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek-AI, “Deepseek llm: Scaling open-source language models with longtermism,”arXiv preprint arXiv:2401.02954, 2024. [Online]. Available: https://github.com/deepseek-ai/DeepSeek-LLM
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
T. Dettmers, M. Lewis, Y . Belkada, and L. Zettlemoyer, “Llm.int8(): 8-bit matrix multiplication for transformers at scale,” 2022. [Online]. Available: https://arxiv.org/abs/2208.07339
work page internal anchor Pith review Pith/arXiv arXiv 2022
- [14]
-
[15]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,”CoRR, vol. abs/1810.04805, 2018. [Online]. Available: http://arxiv.org/abs/18 10.04805
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[16]
A novel method of lossless compression for 2-d astronomical spectra images,
B. Du and Z. Ye, “A novel method of lossless compression for 2-d astronomical spectra images,”Experimental Astronomy, vol. 27, no. 1, p. 19, 2009
work page 2009
-
[17]
A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fanet al., “The llama 3 herd of models,”arXiv e-prints, pp. arXiv–2407, 2024
work page 2024
-
[18]
J. Duda, “Asymmetric numeral systems: entropy coding combining speed of huffman coding with compression rate of arithmetic coding,” arXiv preprint arXiv:1311.2540, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[19]
Sparsegpt: Massive language models can be accurately pruned in one-shot,
E. Frantar and D. Alistarh, “Sparsegpt: Massive language models can be accurately pruned in one-shot,” inInternational conference on machine learning. PMLR, 2023, pp. 10 323–10 337
work page 2023
-
[20]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “Gptq: Accurate post-training quantization for generative pre-trained transformers,” arXiv preprint arXiv:2210.17323, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[21]
Olmo: Accelerating the science of language models,
D. Groeneveld, I. Beltagy, P. Walsh, A. Bhagia, R. Kinney, O. Tafjord, A. H. Jha, H. Ivison, I. Magnusson, Y . Wang, S. Arora, D. Atkinson, R. Authur, K. Chandu, A. Cohan, J. Dumas, Y . Elazar, Y . Gu, J. Hessel, T. Khot, W. Merrill, J. Morrison, N. Muennighoff, A. Naik, C. Nam, M. E. Peters, V . Pyatkin, A. Ravichander, D. Schwenk, S. Shah, W. Smith, E. ...
work page 2024
-
[22]
Groq, “GroqFlow provides an automated tool flow for compiling machine learning and linear algebra workloads into Groq programs and executing those programs on GroqChip™ processors.” https: //github.com/groq/groqflow, 2025
work page 2025
-
[23]
A. Heilper and D. Singer, “Lossless compression of neural network components: Weights, checkpoints, and k/v caches in low-precision formats,”arXiv preprint arXiv:2508.19263, 2025
-
[24]
Zipnn: Lossless compression for ai models,
M. Hershcovitch, A. Wood, L. Choshen, G. Girmonsky, R. Leibovitz, O. Ozeri, I. Ennmouri, M. Malka, P. Chin, S. Sundararamanet al., “Zipnn: Lossless compression for ai models,” in2025 IEEE 18th International Conference on Cloud Computing (CLOUD). IEEE, 2025, pp. 186–198
work page 2025
-
[25]
T. Hidayat, M. H. Zakaria, and A. N. C. Pee, “Increasing the huffman generation code algorithm to equalize compression ratio and time in lossless 16-bit data archiving,”Multimedia tools and applications, vol. 82, no. 16, pp. 24 031–24 068, 2023
work page 2023
-
[26]
A method for the construction of minimum- redundancy codes,
D. A. Huffman, “A method for the construction of minimum- redundancy codes,”Proceedings of the IRE, vol. 40, no. 9, pp. 1098– 1101, 2007
work page 2007
-
[27]
An advancement in huffman coding with a potential for parallel decoding,
K. V . Iyer, K. Seshadri, and K. Srinivasulu, “An advancement in huffman coding with a potential for parallel decoding,”Concurrency and Computation: Practice and Experience, vol. 37, no. 9-11, p. e70096, 2025
work page 2025
-
[28]
Sdp4bit: Toward 4-bit communication quantization in sharded data parallelism for llm training,
J. Jia, C. Xie, H. Lu, D. Wang, H. Feng, C. Zhang, B. Sun, H. Lin, Z. Zhang, X. Liuet al., “Sdp4bit: Toward 4-bit communication quantization in sharded data parallelism for llm training,”Advances in Neural Information Processing Systems, vol. 37, pp. 8734–8759, 2024
work page 2024
-
[29]
A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de Las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mistral 7b,”ArXiv, vol. abs/2310.06825, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:263830494
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[30]
{MegaScale}: Scaling large language model training to more than 10,000{GPUs},
Z. Jiang, H. Lin, Y . Zhong, Q. Huang, Y . Chen, Z. Zhang, Y . Peng, X. Li, C. Xie, S. Nonget al., “{MegaScale}: Scaling large language model training to more than 10,000{GPUs},” in21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 745–760
work page 2024
-
[31]
J. Johnson, “GPU implementation of a fast generalized ANS (asym- metric numeral system) entropy encoder and decoder,” https://github .com/facebookresearch/dietgpu, 2025
work page 2025
-
[32]
A Study of BFLOAT16 for Deep Learning Training
D. Kalamkar, D. Mudigere, N. Mellempudi, D. Das, K. Banerjee, S. Avancha, D. T. V ooturi, N. Jammalamadaka, J. Huang, H. Yuen et al., “A study of bfloat16 for deep learning training,”arXiv preprint arXiv:1905.12322, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[33]
ndzip-gpu: efficient lossless compression of scientific floating-point data on gpus,
F. Knorr, P. Thoman, and T. Fahringer, “ndzip-gpu: efficient lossless compression of scientific floating-point data on gpus,” inProceedings of the international conference for high performance computing, networking, storage and analysis, 2021, pp. 1–14
work page 2021
-
[34]
An introduction to arithmetic coding,
G. G. Langdon, “An introduction to arithmetic coding,”IBM Journal of Research and Development, vol. 28, no. 2, pp. 135–149, 1984
work page 1984
-
[35]
To fp8 and back again: Quantifying reduced precision effects on llm training stability,
J. Lee, J. Bae, B. Kim, S. J. Kwon, and D. Lee, “To fp8 and back again: Quantifying reduced precision effects on llm training stability,”
-
[36]
Available: https://arxiv.org/abs/2405.18710
[Online]. Available: https://arxiv.org/abs/2405.18710
-
[37]
Elf: Erasing-based lossless floating-point compression,
R. Li, Z. Li, Y . Wu, C. Chen, and Y . Zheng, “Elf: Erasing-based lossless floating-point compression,”Proceedings of the VLDB Endowment, vol. 16, no. 7, pp. 1763–1776, 2023
work page 2023
-
[38]
A multidimensional communication scheduling method for hybrid parallel dnn training,
S. Li, K. Lu, Z. Lai, W. Liu, K. Ge, and D. Li, “A multidimensional communication scheduling method for hybrid parallel dnn training,” IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 8, pp. 1415–1428, 2024
work page 2024
-
[39]
Adaptive encoding strategies for lossless floating-point compression,
Z. Li, R. Li, X. Xu, Y . Wu, C. Chen, T. Liu, J. Shang, and Y . Zheng, “Adaptive encoding strategies for lossless floating-point compression,” IEEE Internet of Things Journal, 2025
work page 2025
-
[40]
Davinci: A scalable architecture for neural network computing,
H. Liao, J. Tu, J. Xia, and X. Zhou, “Davinci: A scalable architecture for neural network computing,” in2019 IEEE Hot Chips 31 Symposium (HCS). IEEE Computer Society, 2019, pp. 1–44
work page 2019
-
[41]
Recoil: Parallel rans decoding with decoder-adaptive scalability,
F. Lin, K. Arunruangsirilert, H. Sun, and J. Katto, “Recoil: Parallel rans decoding with decoder-adaptive scalability,” inProceedings of the 52nd International Conference on Parallel Processing, 2023, pp. 31–40
work page 2023
-
[42]
Awq: Activation-aware weight quan- tization for on-device llm compression and acceleration,
J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “Awq: Activation-aware weight quan- tization for on-device llm compression and acceleration,”Proceedings of machine learning and systems, vol. 6, pp. 87–100, 2024
work page 2024
-
[43]
Adt-fse: A new encoder for sz,
T. Lu, Y . Zhong, Z. Sun, X. Chen, Y . Zhou, F. Wu, Y . Yang, Y . Huang, and Y . Yang, “Adt-fse: A new encoder for sz,” inProceedings of the In- ternational Conference for High Performance Computing, Networking, Storage and Analysis, 2023, pp. 1–13
work page 2023
-
[44]
P. Micikevicius, D. Stosic, N. Burgess, M. Cornea, P. Dubey, R. Grisenthwaite, S. Ha, A. Heinecke, P. Judd, J. Kamalu, N. Mellempudi, S. Oberman, M. Shoeybi, M. Siu, and H. Wu, “Fp8 formats for deep learning,” 2022. [Online]. Available: https://arxiv.org/abs/2209.05433
work page internal anchor Pith review arXiv 2022
-
[45]
NVIDIA, “Nvidia nvcomp developer,” https://developer.nvidia.com/n vcomp, 2025
work page 2025
-
[47]
[Online]. Available: https://arxiv.org/abs/2306.01116
work page internal anchor Pith review Pith/arXiv arXiv
-
[48]
The compression optimality of asymmetric numeral systems,
J. Pieprzyk, J. Duda, M. Pawłowski, S. Camtepe, A. Mahboubi, and P. Morawiecki, “The compression optimality of asymmetric numeral systems,”Entropy, vol. 25, no. 4, p. 672, 2023
work page 2023
-
[49]
Lightweight huffman coding for efficient gpu compression,
M. Shah, X. Yu, S. Di, M. Becchi, and F. Cappello, “Lightweight huffman coding for efficient gpu compression,” inProceedings of the 37th International Conference on Supercomputing, 2023, pp. 99–110
work page 2023
-
[50]
SambaNova :: SambaNova Documentation,
S. Systems, “SambaNova :: SambaNova Documentation,” https://docs .sambanova.ai/home/latest/index.html, 2024
work page 2024
-
[51]
Q. Team, “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[52]
Svd-llm: Truncation- aware singular value decomposition for large language model compression,
X. Wang, Y . Zheng, Z. Wan, and M. Zhang, “Svd-llm: Truncation- aware singular value decomposition for large language model com- pression,”arXiv preprint arXiv:2403.07378, 2024
-
[53]
Massively parallel ans decoding on gpus,
A. Weißenberger and B. Schmidt, “Massively parallel ans decoding on gpus,” inProceedings of the 48th International Conference on Parallel Processing, 2019, pp. 1–10
work page 2019
-
[54]
A technique for high-performance data compression,
T. A. Welch, “A technique for high-performance data compression,” Computer, vol. 17, no. 06, pp. 8–19, 1984
work page 1984
-
[55]
Coat: Compressing optimizer states and activation for memory-efficient fp8 training,
H. Xi, H. Cai, L. Zhu, Y . Lu, K. Keutzer, J. Chen, and S. Han, “Coat: Compressing optimizer states and activation for memory-efficient fp8 training,” 2025. [Online]. Available: https://arxiv.org/abs/2410.19313
-
[56]
Smoothquant: Accurate and efficient post-training quantization for large language models,
G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “Smoothquant: Accurate and efficient post-training quantization for large language models,” inInternational conference on machine learn- ing. PMLR, 2023, pp. 38 087–38 099
work page 2023
-
[57]
Asymptotic optimality of the asym- metric encoding-decoding scheme,
H. Yamamoto and K.-i. Iwata, “Asymptotic optimality of the asym- metric encoding-decoding scheme,” in2024 International Symposium on Information Theory and Its Applications (ISITA). IEEE, 2024, pp. 354–359
work page 2024
-
[58]
Huffman coding with gap arrays for gpu acceleration,
N. Yamamoto, K. Nakano, Y . Ito, D. Takafuji, A. Kasagi, and T. Tabaru, “Huffman coding with gap arrays for gpu acceleration,” inProceedings of the 49th International Conference on Parallel Processing, 2020, pp. 1–11
work page 2020
-
[59]
Asvd: Activation-aware singular value decomposition for compressing large language models,
Z. Yuan, Y . Shang, Y . Song, Q. Wu, Y . Yan, and G. Sun, “Asvd: Activation-aware singular value decomposition for compressing large language models,”arXiv preprint arXiv:2312.05821, 2023
-
[60]
Llm inference unveiled: Survey and roofline model insights,
Z. Yuan, Y . Shang, Y . Zhou, Z. Dong, Z. Zhou, C. Xue, B. Wu, Z. Li, Q. Gu, Y . J. Leeet al., “Llm inference unveiled: Survey and roofline model insights,”arXiv preprint arXiv:2402.16363, 2024
-
[61]
Huff-llm: End-to- end lossless compression for efficient llm inference,
P. Yubeaton, T. Mahmoud, S. Naga, P. Taheri, T. Xia, A. George, Y . Khalil, S. Q. Zhang, S. Joshi, C. Hegdeet al., “Huff-llm: End-to- end lossless compression for efficient llm inference,”arXiv preprint arXiv:2502.00922, 2025
-
[62]
Gpulz: Optimizing lzss lossless compression for multi-byte data on modern gpus,
B. Zhang, J. Tian, S. Di, X. Yu, M. Swany, D. Tao, and F. Cappello, “Gpulz: Optimizing lzss lossless compression for multi-byte data on modern gpus,” inProceedings of the 37th International Conference on Supercomputing, 2023, pp. 348–359
work page 2023
-
[63]
T. Zhang, Y . Sui, S. Zhong, V . Chaudhary, X. Hu, and A. Shrivastava, “70% size, 100% accuracy: Lossless llm compression for efficient gpu inference via dynamic-length float,”arXiv preprint arXiv:2504.11651, 2025
-
[64]
A Survey on Efficient Inference for Large Language Models
Z. Zhou, X. Ning, K. Hong, T. Fu, J. Xu, S. Li, Y . Lou, L. Wang, Z. Yuan, X. Liet al., “A survey on efficient inference for large language models,”arXiv preprint arXiv:2404.14294, 2024
work page internal anchor Pith review arXiv 2024
-
[65]
Compression of individual sequences via variable-rate coding,
J. Ziv and A. Lempel, “Compression of individual sequences via variable-rate coding,”IEEE transactions on Information Theory, vol. 24, no. 5, pp. 530–536, 2003
work page 2003
-
[66]
A universal algorithm for sequential data compression,
——, “A universal algorithm for sequential data compression,”IEEE Transactions on information theory, vol. 23, no. 3, pp. 337–343, 2003
work page 2003
-
[67]
Serving large language models on huawei cloudmatrix384,
P. Zuo, H. Lin, J. Deng, N. Zou, X. Yang, Y . Diao, W. Gao, K. Xu, Z. Chen, S. Luet al., “Serving large language models on huawei cloudmatrix384,”arXiv preprint arXiv:2506.12708, 2025. APPENDIX A. Abstract The source code of ENEC is available at https://github.com /jinwuyang/ENEC_ISCA_AE. The NPU kernel implementation can be found in thecsrc/directory and...
-
[68]
The repository is organized intocsrc/(NPU kernels),python/(test tools)
How to access:The source code is available at https://gith ub.com/jinwuyang/ENEC_ISCA_AE. The repository is organized intocsrc/(NPU kernels),python/(test tools)
-
[69]
Hardware dependencies:The artifact requires anAscend 910B2 NPUplatform withaarch64architecture
-
[70]
Software dependencies: •CANN Stack:Ascend-CANN-toolkit and Kernels 8.2.RC1.alpha002. •Python Libraries:torch 2.5.1, torch_npu 2.5.1.post3, and standard data science stack (numpy, pandas, scipy). •ATB Library:Recommended version 8.0.0
-
[71]
Data sets:The evaluation of ENEC encompasses a diverse set of model weights, categorized by their data precision formats. By default, the data_prepare.sh script only downloadsQwen3- 32Bto minimize preparation time and disk usage. However, the data_prepare.sh script provides commented options to download all other models listed below (e.g., DeepSeek-LLM-7B...
-
[72]
Install CANN Toolkit and Kernels:Download the follow- ing files from https://www.hiascend.com/developer/download/co mmunity/result?module=cann&cann=8.2.RC1.alpha002: •Ascend-cann-toolkit_8.2.RC1.alpha002_linux-aarch64.run •Ascend-cann-kernels-910b_8.2.RC1.alpha002_linux-aarch64 .run Then run the following commands: # Add executable permissions chmod +x As...
-
[73]
Configure the Conda environment:Create a Python 3.9 environment and install NPU-specific PyTorch and dependencies: conda create -n enec python=3.9 -y conda activate enec pip install pandas numpy==1.24.3 transformers==4.30.0 jinja2 \ decorator attrs psutil absl-py cloudpickle ml-dtypes scipy \ tornado pyyaml wget https://download.pytorch.org/whl/cpu/torch-...
-
[74]
import torch; import torch_npu; a = torch.randn(3, 4).npu(); print(a + a)
Verify the environment:Run a simple NPU tensor opera- tion to confirm correct setup: python3 -c "import torch; import torch_npu; a = torch.randn(3, 4).npu(); print(a + a)" If the output is normal, the environment is normal
-
[75]
Build:Clone the repository and run build_csrc.sh (1 hour). git clone https://github.com/jinwuyang/ENEC_ISCA_AE.git chmod 777 -R ENEC_ISCA_AE cd ENEC_ISCA_AE bash build_csrc.sh E. Experiment workflow
-
[76]
By default, the script only downloads and processesQwen3-32B(1 hour)
Data Preparation:Execute data_prepare.sh to download and split the model weights. By default, the script only downloads and processesQwen3-32B(1 hour). To test other models (e.g., DeepSeek-LLM-7B, Falcon-40B), simply uncomment the corre- sponding lines in data_prepare.sh. bash data_prepare.sh
-
[77]
Performance Testing:Run compressor_test.sh to measure the compression ratio and throughput. This script automates pa- rameter searching, compression/decompression profiling, and global analysis. At the end of the execution, it also outputs the end-to-end inference results (2 hours). source /your/path/ascend-toolkit/set_env.sh bash compressor_test.sh F . E...
-
[78]
Optimal parameter search results:The following results show the expected outputs for the Qwen3-32B model: BF16 Model Compression Results ======================================== File Processed: hyperparams_results.csv Total Elements: 32,761,446,400 ------------------------------------------------- Original BF16 Size: 62487.50 MB ENEC Compressed Size: 4626...
-
[79]
Compression Ratio and Throughput:The file sum- mary_enec.csv summarizes the compression ratio, compression throughput, and decompression throughput of ENEC on11 models, corresponding to Table II and Figure 9 in the paper. The expected results for these 11 models are presented as follows: --- Summary Data Preview --- model_name dtype compression_ratio_CR c...
-
[80]
For brevity, we only present the results forQwen3-32Bwithbatch size = 1
End-to-End Inference Latency:Figure 10 in the paper shows the end-to-end inference latency and speedup over the base- line (uncompressed with CPU offloading) for both Qwen3-32B and Falcon-40B under different batch sizes. For brevity, we only present the results forQwen3-32Bwithbatch size = 1. The expected results are presented as follows: [Inference: Qwen...
work page 1951
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.