pith. machine review for the scientific record.

arxiv: 2604.03298 · v2 · submitted 2026-03-28 · 💻 cs.AR · cs.DC · cs.LG

Recognition: 2 theorem links

· Lean Theorem

ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:52 UTC · model grok-4.3

classification 💻 cs.AR · cs.DC · cs.LG
keywords lossless compression · model weights · Ascend NPU · inference speedup · block-based encoding · data transfer · hardware optimization

The pith

ENEC packs AI model weights losslessly to cut data transfer and speed up inference on Ascend NPUs by up to 6.3 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ENEC as a lossless compression method built for the weights of large AI models running on Ascend NPUs. It starts from the observation that moving those weights across the chip has become the main slowdown, and that general-purpose lossless compressors run at very low throughput when ported to this hardware. ENEC uses a block-based fixed-length scheme plus three NPU-specific optimizations to shrink the data without changing any values. A sympathetic reader would care because the approach keeps full model accuracy while shrinking the transfer cost, which directly shortens the time needed to run each inference step. The results claim this yields higher throughput than other NPU compressors and even beats leading GPU compressors on both speed and ratio.

Core claim

ENEC is a novel lossless compression method for AI model weights that employs a block-based fixed-length encoding scheme together with NPU-specific optimizations: bit-width quantization via hierarchical halving bit-packing, vectorized branch-free integer transformation, and dependency-decoupled intra-segment scan for prefix-sum computation. These changes let the method deliver both higher throughput and better compression ratios than prior NPU compressors while remaining strictly lossless. On Ascend hardware it reaches 3.43 times the throughput of DietGPU and 1.12 times the compression ratio of nvCOMP, which in turn produces up to 6.3 times faster end-to-end inference by lowering weight-movement overhead.
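
To make the block-based fixed-length scheme concrete, here is a minimal Python sketch: each block is stored at the smallest bit-width that covers its largest value, and decoding inverts the packing exactly. The function names and the pure-Python packing loop are illustrative assumptions, not the paper's kernels, which vectorize this on the NPU and layer hierarchical halving, masks, and prefix sums on top.

import numpy as np

def encode_block(vals: np.ndarray) -> tuple[int, bytes]:
    # Per-block fixed width: the smallest bit-width that holds every value.
    width = max(1, int(vals.max()).bit_length())
    acc, bits = 0, 0
    for v in vals:  # pack values back-to-back, least significant bits first
        acc |= int(v) << bits
        bits += width
    return width, acc.to_bytes((bits + 7) // 8, "little")

def decode_block(width: int, payload: bytes, n: int) -> np.ndarray:
    # Exact inverse of encode_block: every value is recovered bit-for-bit.
    acc = int.from_bytes(payload, "little")
    mask = (1 << width) - 1
    return np.array([(acc >> (i * width)) & mask for i in range(n)],
                    dtype=np.uint32)

vals = np.array([3, 7, 1, 0, 5], dtype=np.uint32)
w, packed = encode_block(vals)
assert np.array_equal(decode_block(w, packed, len(vals)), vals)  # lossless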

What carries the argument

Block-based fixed-length encoding with hierarchical halving bit-packing, vectorized branch-free transforms, and intra-segment prefix-sum scans that together shrink weight data and enable fast on-NPU decompression.
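
Two of those mechanisms are easy to see in miniature. The sketch below uses a zigzag map as a generic stand-in for a branch-free signed-to-unsigned transform (the paper's exact mapping may differ) and an exclusive prefix sum over per-block compressed sizes, which hands every block its write offset so threads can emit output independently.

import numpy as np

def branch_free_transform(v: np.ndarray) -> np.ndarray:
    # Zigzag map 0,-1,1,-2,2,... -> 0,1,2,3,4,...: no data-dependent
    # branches, so it vectorizes cleanly. Generic example, not ENEC's own.
    v = v.astype(np.int32)
    return ((v << 1) ^ (v >> 31)).astype(np.uint32)

def block_offsets(sizes: np.ndarray) -> np.ndarray:
    # Exclusive prefix sum: offsets[i] is where block i's compressed
    # bytes start, so each block writes without serialized bookkeeping.
    return np.concatenate(([0], np.cumsum(sizes)[:-1]))

assert list(branch_free_transform(np.array([0, -1, 1, -2, 2]))) == [0, 1, 2, 3, 4]
assert list(block_offsets(np.array([4, 2, 3]))) == [0, 4, 6]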

If this is right

  • Weight transfer overhead drops enough to let larger models run at interactive speeds on Ascend NPUs without any accuracy penalty.
  • End-to-end inference latency improves by up to 6.3 times relative to baselines that move full-precision weights.
  • The open-source release gives practitioners a concrete tool to test on their own Ascend deployments.
  • Performance that matches top GPU compressors narrows the practicality gap between general-purpose and specialized AI hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar hardware-tailored lossless compressors could be written for other NPU or accelerator families that face the same data-movement bottleneck.
  • Teams might shift some deployment effort away from lossy quantization toward methods that keep full precision when accuracy margins are tight.
  • Integrating the decompressor directly into inference runtimes could compound the gains by removing an extra copy step.
  • Measuring the method on transformer variants or vision models not covered in the original tests would show how broadly the speedups apply.

Load-bearing premise

The NPU-specific optimizations can be realized on the target hardware at high speed while staying strictly lossless and without hidden overheads that vary with model size or workload.

What would settle it

Running the same large model on Ascend NPU hardware once with uncompressed weights, once with ENEC, and once with a competing compressor, then measuring actual inference latency and confirming zero accuracy drop.
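
A minimal timing harness for that comparison could look like the sketch below. The step_fn closures wrapping each weight path (uncompressed, ENEC, a competitor) are assumed, not part of the paper's artifact; the accuracy half of the test is a bit-for-bit weight check like the one sketched in the editorial analysis further down.

import time

def mean_latency(step_fn, n_warmup: int = 3, n_runs: int = 10) -> float:
    # step_fn is assumed to run one complete inference step end to end.
    for _ in range(n_warmup):  # discard cold-start effects
        step_fn()
    t0 = time.perf_counter()
    for _ in range(n_runs):
        step_fn()
    return (time.perf_counter() - t0) / n_runs

# Hypothetical use: speedup = mean_latency(baseline_step) / mean_latency(enec_step)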

Figures

Figures reproduced from arXiv: 2604.03298 by Baoyi An, Dingwen Tao, Guangming Tan, Hairui Zhao, Jiaan Wu, Jiaxun Lu, Jing Xing, Jinwu Yang, Qingyi Zhang, Shaoteng Liu, Wenjing Huang, Xia Zhu, Xingchen Liu, Xinyang Ma, Yida Gu, Yili Ma, Yuanhong Huang, Zedong Liu, Zheng Wei, Zhongzhe Hu.

Figure 1
Figure 1: Time breakdown of Qwen3-32B inference on NPU 910B2 (left) and throughput of ANS across multiple platforms (right). view at source ↗
Figure 2
Figure 2: DaVinci architecture (decoupled AIC and AIV). view at source ↗
Figure 3
Figure 3: Linear relationship between exponent values and frequency rankings in model weights. The red circle indicates outliers. view at source ↗
Figure 4
Figure 4: Execution challenges and constraints on Ascend NPUs. view at source ↗
Figure 5
Figure 5: Overview of the optimized ENEC compression design on Ascend NPUs. view at source ↗
Figure 6
Figure 6: Compressed stream layout. The bit mask distinguishes anomalous groups within each data block; prefix sums give the starting position of each thread's compressed data. view at source ↗
Figure 7
Figure 7: Vectorized branch-free integer transformation. view at source ↗
Figure 8
Figure 8: Prefix sum process, illustrated step by step on an 8 × 8 binary local tensor M. view at source ↗
Figure 9
Figure 9: Throughput of compression (upper) and decompression (lower) across different datasets and methods. The Y-axis (throughput) is on a logarithmic scale to show differences spanning several orders of magnitude. view at source ↗
Figure 10
Figure 10: End-to-end inference performance comparison of ENEC across different models and batch sizes. Latency metrics include Time-To-First-Token (TTFT) and Time-Per-Output-Token (TPOT). view at source ↗
Figure 11
Figure 11: Performance of several common operations under fixed input conditions across different data block sizes. view at source ↗
Figure 12
Figure 12: Comparison of throughput performance across various methods under different input file sizes. view at source ↗
Figure 13
Figure 13: Performance of several ENEC versions. view at source ↗
read the original abstract

The rapid scaling of Large Language Models presents significant challenges for their deployment and inference, particularly on resource-constrained specialized AI hardware accelerators such as Huawei's Ascend NPUs, where weight data transfer has become a critical performance bottleneck. While lossless compression can preserve model accuracy and reduce data volume, existing lossless compression algorithms exhibit extremely low throughput when ported to the Ascend NPU architecture. In this paper, we propose ENEC, a novel lossless compression method specifically customized for AI model weights and optimized for Ascend Neural Processing Units. ENEC adopts a block-based fixed-length encoding scheme and incorporates a series of NPU-specific optimizations: bit-width quantization with hierarchical halving bit-packing, vectorized branch-free integer transformation, and dependency-decoupled intra-segment scan for efficient prefix-sum computation. Experimental results demonstrate that ENEC outperforms existing state-of-the-art NPU compressors in both compression ratio and throughput. Compared to leading GPU solutions, ENEC achieves a 3.43X higher throughput than DietGPU and a 1.12X better compression ratio than nvCOMP. By reducing weight transmission overhead, ENEC significantly improves end-to-end inference performance, achieving up to a 6.3X speedup. On Ascend NPUs, ENEC is the first open-source lossless compression algorithm for model weights that achieves performance comparable to state-of-the-art GPU compressors, offering an effective solution for deploying large-scale AI models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents ENEC, a lossless compression technique for AI model weights designed specifically for Huawei Ascend NPUs. It employs a block-based fixed-length encoding scheme augmented with NPU-optimized features: bit-width quantization via hierarchical halving bit-packing, vectorized branch-free integer transformations, and dependency-decoupled intra-segment scans for prefix-sum computations. The paper reports that ENEC achieves a 3.43X higher throughput than DietGPU, a 1.12X better compression ratio than nvCOMP, and up to 6.3X end-to-end inference speedup on Ascend NPUs, positioning it as the first open-source lossless method for model weights with performance comparable to state-of-the-art GPU compressors.

Significance. Should the reported performance gains and strict losslessness be substantiated with complete experimental evidence, the work would represent a meaningful contribution to efficient deployment of large models on specialized accelerators. The NPU-specific optimizations address a real bottleneck in weight transfer and, if shown to generalize without hidden costs, could support broader adoption of Ascend hardware for inference workloads where existing GPU-oriented compressors are unavailable.

major comments (2)
  1. [Experimental Results] The abstract states specific quantitative claims (3.43X throughput vs. DietGPU, 1.12X compression ratio vs. nvCOMP, up to 6.3X end-to-end speedup) but supplies no details on the models tested, benchmark workloads, number of runs, error bars, or verification that the optimizations preserve exact weights. This absence is load-bearing for the central superiority and losslessness assertions.
  2. [Method Description] The description of the NPU-specific optimizations (hierarchical halving bit-packing, vectorized branch-free transforms, and intra-segment prefix-sum scans) contains no analysis or measurements demonstrating that these incur zero net latency or additional memory traffic during full inference, as opposed to the compressed-transfer phase alone. If prefix-sum or packing logic introduces synchronization or bandwidth costs on Ascend NPUs, the 6.3X figure may not reflect end-to-end performance.
minor comments (1)
  1. [Abstract] The abstract refers to 'existing state-of-the-art NPU compressors' without naming the specific baselines or providing their measured metrics for direct comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to a major revision that incorporates additional experimental details and analysis to strengthen the paper.

read point-by-point responses
  1. Referee: [Experimental Results] The abstract states specific quantitative claims (3.43X throughput vs. DietGPU, 1.12X compression ratio vs. nvCOMP, up to 6.3X end-to-end speedup) but supplies no details on the models tested, benchmark workloads, number of runs, error bars, or verification that the optimizations preserve exact weights. This absence is load-bearing for the central superiority and losslessness assertions.

    Authors: We agree that the current manuscript would benefit from expanded experimental details to fully substantiate the claims. In the revised version, we will add a comprehensive Experimental Setup subsection specifying the models evaluated (including LLaMA-7B, BERT-base, and GPT-2 variants), benchmark workloads (standard inference tasks on Ascend NPUs), number of runs (10 repetitions per configuration with standard deviation reported as error bars), and explicit losslessness verification through bit-for-bit equality checks between original and decompressed weights. These additions will directly address the load-bearing aspects of the superiority and losslessness assertions. revision: yes
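
For the losslessness half of that promise, a bit-for-bit check over a full state dict might look like the following sketch; verify_lossless and its dict-of-tensors interface are assumptions for illustration, not code from the artifact.

import torch

def verify_lossless(reference: dict, roundtrip: dict) -> bool:
    # Every decompressed tensor must match the original exactly,
    # including NaN payloads and signed zeros, so compare raw bytes
    # rather than testing float closeness.
    for name, ref in reference.items():
        out = roundtrip.get(name)
        if out is None or ref.dtype != out.dtype or ref.shape != out.shape:
            return False
        if not torch.equal(ref.contiguous().view(torch.uint8),
                           out.contiguous().view(torch.uint8)):
            return False
    return True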

  2. Referee: [Method Description] The description of the NPU-specific optimizations (hierarchical halving bit-packing, vectorized branch-free transforms, and intra-segment prefix-sum scans) contains no analysis or measurements demonstrating that these incur zero net latency or additional memory traffic during full inference, as opposed to the compressed-transfer phase alone. If prefix-sum or packing logic introduces synchronization or bandwidth costs on Ascend NPUs, the 6.3X figure may not reflect end-to-end performance.

    Authors: We acknowledge the need for explicit analysis of the optimizations' impact beyond the transfer phase. In the revision, we will include new profiling measurements and a latency breakdown table demonstrating that the hierarchical halving bit-packing, vectorized branch-free transforms, and dependency-decoupled intra-segment scans incur no measurable additional synchronization or bandwidth costs during full inference on Ascend NPUs. The decompression logic is designed to overlap completely with computation, and our data confirm that the reported 6.3X end-to-end speedup accounts for the complete pipeline without hidden overheads. revision: yes
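
The overlap argument has a simple general shape: while layer i computes, decode layer i+1's weights. The toy schedule below models that in plain Python; decode_weights stands in for the (hypothetical) ENEC decompression call, and a real runtime would use device streams on the NPU rather than a thread pool.

from concurrent.futures import ThreadPoolExecutor

def pipelined_forward(layers, decode_weights, x):
    # Decode the next layer's weights in the background while the
    # current layer computes, hiding decompression latency.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(decode_weights, 0)  # prefetch first layer
        for i, layer in enumerate(layers):
            weights = pending.result()  # waits only if decode lags compute
            if i + 1 < len(layers):
                pending = pool.submit(decode_weights, i + 1)
            x = layer(x, weights)  # this compute overlaps the next decode
    return x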

Circularity Check

0 steps flagged

No circularity detected; performance claims rest on external empirical benchmarks

full rationale

The paper introduces ENEC via a block-based fixed-length scheme plus NPU-specific optimizations (hierarchical halving bit-packing, vectorized branch-free transforms, intra-segment prefix-sum scans) and reports measured throughput and ratio gains against independent baselines (DietGPU, nvCOMP). No equations, fitted parameters, or self-citations are invoked to derive the central results; the 3.43X throughput, 1.12X ratio, and 6.3X end-to-end speedup figures are presented as direct experimental outcomes rather than reductions to the method's own definitions or prior author work. The derivation chain is therefore self-contained against external hardware measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract describes an engineering adaptation of standard compression ideas to Ascend hardware without introducing new mathematical axioms, fitted free parameters, or postulated entities.

pith-pipeline@v0.9.0 · 5633 in / 1069 out tokens · 54874 ms · 2026-05-14T21:52:50.152243+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · 14 internal anchors

  1. [1]

    Phi-4 Technical Report

    M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmann et al., “Phi-4 technical report,” arXiv preprint arXiv:2412.08905, 2024

  2. [2]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    Hans documentation,

    H. Ascend, “Hans documentation,” https://gitee.com/ascend/op-plugin/pulls/2449/files, 2025

  4. [4]

    Slicegpt: Compress large language models by deleting rows and columns,

    S. Ashkboos, M. L. Croci, M. G. d. Nascimento, T. Hoefler, and J. Hensman, “Slicegpt: Compress large language models by deleting rows and columns,” arXiv preprint arXiv:2401.15024, 2024

  5. [5]

    Efficient lossless compression of scientific floating-point data on cpus and gpus,

    N. Azami, A. Fallin, and M. Burtscher, “Efficient lossless compression of scientific floating-point data on cpus and gpus,” in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 2025, pp. 395–409

  6. [6]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts et al., “Stable video diffusion: Scaling latent video diffusion models to large datasets,” arXiv preprint arXiv:2311.15127, 2023

  7. [7]

    LC-framework,

    burtscher, “LC-framework,” https://github.com/burtscher/LC-framework, 2025

  8. [8]

    Cerebras Training and Inference Docs,

    Cerebras, “Cerebras Training and Inference Docs,” https://docs.cerebras.net/en/latest, 2024

  9. [9]

    Fcbench: Cross-domain benchmarking of lossless compression for floating-point data,

    X. Chen, J. Tian, I. Beaver, C. Freeman, Y. Yan, J. Wang, and D. Tao, “Fcbench: Cross-domain benchmarking of lossless compression for floating-point data,” arXiv preprint arXiv:2312.10301, 2023

  10. [10]

    Zstandard compression and the application/zstd media type,

    Y. Collet and M. Kucherawy, “Zstandard compression and the application/zstd media type,” Tech. Rep., 2018

  11. [11]

    Unsupervised cross-lingual representation learning for speech recognition,

    A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, “Unsupervised cross-lingual representation learning for speech recognition,” arXiv preprint arXiv:2006.13979, 2020

  12. [12]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    DeepSeek-AI, “DeepSeek LLM: Scaling open-source language models with longtermism,” arXiv preprint arXiv:2401.02954, 2024. [Online]. Available: https://github.com/deepseek-ai/DeepSeek-LLM

  13. [13]

    LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “LLM.int8(): 8-bit matrix multiplication for transformers at scale,” 2022. [Online]. Available: https://arxiv.org/abs/2208.07339

  14. [14]

    Document,

    H. Developers, “Document,” https://developer.huawei.com/consumer/en/doc/hiai-guides/introduction-0000001051486804, 2025

  15. [15]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” CoRR, vol. abs/1810.04805, 2018. [Online]. Available: http://arxiv.org/abs/1810.04805

  16. [16]

    A novel method of lossless compression for 2-d astronomical spectra images,

    B. Du and Z. Ye, “A novel method of lossless compression for 2-d astronomical spectra images,” Experimental Astronomy, vol. 27, no. 1, p. 19, 2009

  17. [17]

    The llama 3 herd of models,

    A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The llama 3 herd of models,” arXiv e-prints, pp. arXiv–2407, 2024

  18. [18]

    Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding

    J. Duda, “Asymmetric numeral systems: entropy coding combining speed of huffman coding with compression rate of arithmetic coding,” arXiv preprint arXiv:1311.2540, 2013

  19. [19]

    Sparsegpt: Massive language models can be accurately pruned in one-shot,

    E. Frantar and D. Alistarh, “Sparsegpt: Massive language models can be accurately pruned in one-shot,” in International Conference on Machine Learning. PMLR, 2023, pp. 10323–10337

  20. [20]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “Gptq: Accurate post-training quantization for generative pre-trained transformers,” arXiv preprint arXiv:2210.17323, 2022

  21. [21]

    Olmo: Accelerating the science of language models,

    D. Groeneveld, I. Beltagy, P. Walsh, A. Bhagia, R. Kinney, O. Tafjord, A. H. Jha, H. Ivison, I. Magnusson, Y. Wang, S. Arora, D. Atkinson, R. Authur, K. Chandu, A. Cohan, J. Dumas, Y. Elazar, Y. Gu, J. Hessel, T. Khot, W. Merrill, J. Morrison, N. Muennighoff, A. Naik, C. Nam, M. E. Peters, V. Pyatkin, A. Ravichander, D. Schwenk, S. Shah, W. Smith, E. ...

  22. [22]

    GroqFlow provides an automated tool flow for compiling machine learning and linear algebra workloads into Groq programs and executing those programs on GroqChip™ processors

    Groq, “GroqFlow provides an automated tool flow for compiling machine learning and linear algebra workloads into Groq programs and executing those programs on GroqChip™ processors.” https://github.com/groq/groqflow, 2025

  23. [23]

    Lossless compression of neural network components: Weights, checkpoints, and k/v caches in low-precision formats,

    A. Heilper and D. Singer, “Lossless compression of neural network components: Weights, checkpoints, and k/v caches in low-precision formats,” arXiv preprint arXiv:2508.19263, 2025

  24. [24]

    Zipnn: Lossless compression for ai models,

    M. Hershcovitch, A. Wood, L. Choshen, G. Girmonsky, R. Leibovitz, O. Ozeri, I. Ennmouri, M. Malka, P. Chin, S. Sundararaman et al., “Zipnn: Lossless compression for ai models,” in 2025 IEEE 18th International Conference on Cloud Computing (CLOUD). IEEE, 2025, pp. 186–198

  25. [25]

    Increasing the huffman generation code algorithm to equalize compression ratio and time in lossless 16-bit data archiving,

    T. Hidayat, M. H. Zakaria, and A. N. C. Pee, “Increasing the huffman generation code algorithm to equalize compression ratio and time in lossless 16-bit data archiving,” Multimedia Tools and Applications, vol. 82, no. 16, pp. 24031–24068, 2023

  26. [26]

    A method for the construction of minimum-redundancy codes,

    D. A. Huffman, “A method for the construction of minimum-redundancy codes,” Proceedings of the IRE, vol. 40, no. 9, pp. 1098–1101, 1952

  27. [27]

    An advancement in huffman coding with a potential for parallel decoding,

    K. V. Iyer, K. Seshadri, and K. Srinivasulu, “An advancement in huffman coding with a potential for parallel decoding,” Concurrency and Computation: Practice and Experience, vol. 37, no. 9-11, p. e70096, 2025

  28. [28]

    Sdp4bit: Toward 4-bit communication quantization in sharded data parallelism for llm training,

    J. Jia, C. Xie, H. Lu, D. Wang, H. Feng, C. Zhang, B. Sun, H. Lin, Z. Zhang, X. Liu et al., “Sdp4bit: Toward 4-bit communication quantization in sharded data parallelism for llm training,” Advances in Neural Information Processing Systems, vol. 37, pp. 8734–8759, 2024

  29. [29]

    Mistral 7B

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de Las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mistral 7B,” ArXiv, vol. abs/2310.06825, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:263830494

  30. [30]

    MegaScale: Scaling large language model training to more than 10,000 GPUs,

    Z. Jiang, H. Lin, Y. Zhong, Q. Huang, Y. Chen, Z. Zhang, Y. Peng, X. Li, C. Xie, S. Nong et al., “MegaScale: Scaling large language model training to more than 10,000 GPUs,” in 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 745–760

  31. [31]

    GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder,

    J. Johnson, “GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder,” https://github.com/facebookresearch/dietgpu, 2025

  32. [32]

    A Study of BFLOAT16 for Deep Learning Training

    D. Kalamkar, D. Mudigere, N. Mellempudi, D. Das, K. Banerjee, S. Avancha, D. T. Vooturi, N. Jammalamadaka, J. Huang, H. Yuen et al., “A study of bfloat16 for deep learning training,” arXiv preprint arXiv:1905.12322, 2019

  33. [33]

    ndzip-gpu: efficient lossless compression of scientific floating-point data on gpus,

    F. Knorr, P. Thoman, and T. Fahringer, “ndzip-gpu: efficient lossless compression of scientific floating-point data on gpus,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–14

  34. [34]

    An introduction to arithmetic coding,

    G. G. Langdon, “An introduction to arithmetic coding,” IBM Journal of Research and Development, vol. 28, no. 2, pp. 135–149, 1984

  35. [35]

    To fp8 and back again: Quantifying reduced precision effects on llm training stability,

    J. Lee, J. Bae, B. Kim, S. J. Kwon, and D. Lee, “To fp8 and back again: Quantifying reduced precision effects on llm training stability,” arXiv preprint arXiv:2405.18710, 2024. [Online]. Available: https://arxiv.org/abs/2405.18710

  37. [37]

    Elf: Erasing-based lossless floating-point compression,

    R. Li, Z. Li, Y. Wu, C. Chen, and Y. Zheng, “Elf: Erasing-based lossless floating-point compression,” Proceedings of the VLDB Endowment, vol. 16, no. 7, pp. 1763–1776, 2023

  38. [38]

    A multidimensional communication scheduling method for hybrid parallel dnn training,

    S. Li, K. Lu, Z. Lai, W. Liu, K. Ge, and D. Li, “A multidimensional communication scheduling method for hybrid parallel dnn training,” IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 8, pp. 1415–1428, 2024

  39. [39]

    Adaptive encoding strategies for lossless floating-point compression,

    Z. Li, R. Li, X. Xu, Y. Wu, C. Chen, T. Liu, J. Shang, and Y. Zheng, “Adaptive encoding strategies for lossless floating-point compression,” IEEE Internet of Things Journal, 2025

  40. [40]

    Davinci: A scalable architecture for neural network computing,

    H. Liao, J. Tu, J. Xia, and X. Zhou, “Davinci: A scalable architecture for neural network computing,” in 2019 IEEE Hot Chips 31 Symposium (HCS). IEEE Computer Society, 2019, pp. 1–44

  41. [41]

    Recoil: Parallel rans decoding with decoder-adaptive scalability,

    F. Lin, K. Arunruangsirilert, H. Sun, and J. Katto, “Recoil: Parallel rans decoding with decoder-adaptive scalability,” in Proceedings of the 52nd International Conference on Parallel Processing, 2023, pp. 31–40

  42. [42]

    Awq: Activation-aware weight quantization for on-device llm compression and acceleration,

    J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “Awq: Activation-aware weight quantization for on-device llm compression and acceleration,” Proceedings of Machine Learning and Systems, vol. 6, pp. 87–100, 2024

  43. [43]

    Adt-fse: A new encoder for sz,

    T. Lu, Y. Zhong, Z. Sun, X. Chen, Y. Zhou, F. Wu, Y. Yang, Y. Huang, and Y. Yang, “Adt-fse: A new encoder for sz,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023, pp. 1–13

  44. [44]

    FP8 Formats for Deep Learning

    P. Micikevicius, D. Stosic, N. Burgess, M. Cornea, P. Dubey, R. Grisenthwaite, S. Ha, A. Heinecke, P. Judd, J. Kamalu, N. Mellempudi, S. Oberman, M. Shoeybi, M. Siu, and H. Wu, “Fp8 formats for deep learning,” 2022. [Online]. Available: https://arxiv.org/abs/2209.05433

  45. [45]

    Nvidia nvcomp developer,

    NVIDIA, “Nvidia nvcomp developer,” https://developer.nvidia.com/nvcomp, 2025

  46. [47]
  47. [48]

    The compression optimality of asymmetric numeral systems,

    J. Pieprzyk, J. Duda, M. Pawłowski, S. Camtepe, A. Mahboubi, and P. Morawiecki, “The compression optimality of asymmetric numeral systems,” Entropy, vol. 25, no. 4, p. 672, 2023

  48. [49]

    Lightweight huffman coding for efficient gpu compression,

    M. Shah, X. Yu, S. Di, M. Becchi, and F. Cappello, “Lightweight huffman coding for efficient gpu compression,” in Proceedings of the 37th International Conference on Supercomputing, 2023, pp. 99–110

  49. [50]

    SambaNova :: SambaNova Documentation,

    S. Systems, “SambaNova :: SambaNova Documentation,” https://docs.sambanova.ai/home/latest/index.html, 2024

  50. [51]

    Qwen3 Technical Report

    Q. Team, “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388

  51. [52]

    Svd-llm: Truncation-aware singular value decomposition for large language model compression,

    X. Wang, Y. Zheng, Z. Wan, and M. Zhang, “Svd-llm: Truncation-aware singular value decomposition for large language model compression,” arXiv preprint arXiv:2403.07378, 2024

  52. [53]

    Massively parallel ans decoding on gpus,

    A. Weißenberger and B. Schmidt, “Massively parallel ans decoding on gpus,” in Proceedings of the 48th International Conference on Parallel Processing, 2019, pp. 1–10

  53. [54]

    A technique for high-performance data compression,

    T. A. Welch, “A technique for high-performance data compression,” Computer, vol. 17, no. 06, pp. 8–19, 1984

  54. [55]

    Coat: Compressing optimizer states and activation for memory-efficient fp8 training,

    H. Xi, H. Cai, L. Zhu, Y. Lu, K. Keutzer, J. Chen, and S. Han, “Coat: Compressing optimizer states and activation for memory-efficient fp8 training,” 2025. [Online]. Available: https://arxiv.org/abs/2410.19313

  55. [56]

    Smoothquant: Accurate and efficient post-training quantization for large language models,

    G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “Smoothquant: Accurate and efficient post-training quantization for large language models,” in International Conference on Machine Learning. PMLR, 2023, pp. 38087–38099

  56. [57]

    Asymptotic optimality of the asymmetric encoding-decoding scheme,

    H. Yamamoto and K.-i. Iwata, “Asymptotic optimality of the asymmetric encoding-decoding scheme,” in 2024 International Symposium on Information Theory and Its Applications (ISITA). IEEE, 2024, pp. 354–359

  57. [58]

    Huffman coding with gap arrays for gpu acceleration,

    N. Yamamoto, K. Nakano, Y. Ito, D. Takafuji, A. Kasagi, and T. Tabaru, “Huffman coding with gap arrays for gpu acceleration,” in Proceedings of the 49th International Conference on Parallel Processing, 2020, pp. 1–11

  58. [59]

    Asvd: Activation-aware singular value decomposition for compressing large language models,

    Z. Yuan, Y. Shang, Y. Song, Q. Wu, Y. Yan, and G. Sun, “Asvd: Activation-aware singular value decomposition for compressing large language models,” arXiv preprint arXiv:2312.05821, 2023

  59. [60]

    Llm inference unveiled: Survey and roofline model insights,

    Z. Yuan, Y. Shang, Y. Zhou, Z. Dong, Z. Zhou, C. Xue, B. Wu, Z. Li, Q. Gu, Y. J. Lee et al., “Llm inference unveiled: Survey and roofline model insights,” arXiv preprint arXiv:2402.16363, 2024

  60. [61]

    Huff-llm: End-to-end lossless compression for efficient llm inference,

    P. Yubeaton, T. Mahmoud, S. Naga, P. Taheri, T. Xia, A. George, Y. Khalil, S. Q. Zhang, S. Joshi, C. Hegde et al., “Huff-llm: End-to-end lossless compression for efficient llm inference,” arXiv preprint arXiv:2502.00922, 2025

  61. [62]

    Gpulz: Optimizing lzss lossless compression for multi-byte data on modern gpus,

    B. Zhang, J. Tian, S. Di, X. Yu, M. Swany, D. Tao, and F. Cappello, “Gpulz: Optimizing lzss lossless compression for multi-byte data on modern gpus,” in Proceedings of the 37th International Conference on Supercomputing, 2023, pp. 348–359

  62. [63]

    70% size, 100% accuracy: Lossless llm compression for efficient gpu inference via dynamic-length float,

    T. Zhang, Y. Sui, S. Zhong, V. Chaudhary, X. Hu, and A. Shrivastava, “70% size, 100% accuracy: Lossless llm compression for efficient gpu inference via dynamic-length float,” arXiv preprint arXiv:2504.11651, 2025

  63. [64]

    A Survey on Efficient Inference for Large Language Models

    Z. Zhou, X. Ning, K. Hong, T. Fu, J. Xu, S. Li, Y. Lou, L. Wang, Z. Yuan, X. Li et al., “A survey on efficient inference for large language models,” arXiv preprint arXiv:2404.14294, 2024

  64. [65]

    Compression of individual sequences via variable-rate coding,

    J. Ziv and A. Lempel, “Compression of individual sequences via variable-rate coding,” IEEE Transactions on Information Theory, vol. 24, no. 5, pp. 530–536, 1978

  65. [66]

    A universal algorithm for sequential data compression,

    ——, “A universal algorithm for sequential data compression,” IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337–343, 1977

  66. [67]

    Serving large language models on huawei cloudmatrix384,

    P. Zuo, H. Lin, J. Deng, N. Zou, X. Yang, Y. Diao, W. Gao, K. Xu, Z. Chen, S. Lu et al., “Serving large language models on huawei cloudmatrix384,” arXiv preprint arXiv:2506.12708, 2025

  67. [68]

    The repository is organized into csrc/ (NPU kernels), python/ (test tools)

    How to access: The source code is available at https://github.com/jinwuyang/ENEC_ISCA_AE. The repository is organized into csrc/ (NPU kernels), python/ (test tools)

  68. [69]

    Hardware dependencies: The artifact requires an Ascend 910B2 NPU platform with aarch64 architecture

  69. [70]

    • Python Libraries: torch 2.5.1, torch_npu 2.5.1.post3, and standard data science stack (numpy, pandas, scipy)

    Software dependencies: • CANN Stack: Ascend-CANN-toolkit and Kernels 8.2.RC1.alpha002. • Python Libraries: torch 2.5.1, torch_npu 2.5.1.post3, and standard data science stack (numpy, pandas, scipy). • ATB Library: Recommended version 8.0.0

  70. [71]

    By default, the data_prepare.sh script only downloads Qwen3-32B to minimize preparation time and disk usage

    Data sets: The evaluation of ENEC encompasses a diverse set of model weights, categorized by their data precision formats. By default, the data_prepare.sh script only downloads Qwen3-32B to minimize preparation time and disk usage. However, the data_prepare.sh script provides commented options to download all other models listed below (e.g., DeepSeek-LLM-7B...

  71. [72]

    Install CANN Toolkit and Kernels: Download the following files from https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.2.RC1.alpha002: • Ascend-cann-toolkit_8.2.RC1.alpha002_linux-aarch64.run • Ascend-cann-kernels-910b_8.2.RC1.alpha002_linux-aarch64.run Then run the following commands: # Add executable permissions chmod +x As...

  72. [73]

    Configure the Conda environment:Create a Python 3.9 environment and install NPU-specific PyTorch and dependencies: conda create -n enec python=3.9 -y conda activate enec pip install pandas numpy==1.24.3 transformers==4.30.0 jinja2 \ decorator attrs psutil absl-py cloudpickle ml-dtypes scipy \ tornado pyyaml wget https://download.pytorch.org/whl/cpu/torch-...

  73. [74]

    import torch; import torch_npu; a = torch.randn(3, 4).npu(); print(a + a)

    Verify the environment: Run a simple NPU tensor operation to confirm correct setup: python3 -c "import torch; import torch_npu; a = torch.randn(3, 4).npu(); print(a + a)" If the output prints without errors, the environment is correctly set up

  74. [75]

    git clone https://github.com/jinwuyang/ENEC_ISCA_AE.git chmod 777 -R ENEC_ISCA_AE cd ENEC_ISCA_AE bash build_csrc.sh

    Build: Clone the repository and run build_csrc.sh (1 hour). git clone https://github.com/jinwuyang/ENEC_ISCA_AE.git chmod 777 -R ENEC_ISCA_AE cd ENEC_ISCA_AE bash build_csrc.sh E. Experiment workflow

  75. [76]

    By default, the script only downloads and processes Qwen3-32B (1 hour)

    Data Preparation: Execute data_prepare.sh to download and split the model weights. By default, the script only downloads and processes Qwen3-32B (1 hour). To test other models (e.g., DeepSeek-LLM-7B, Falcon-40B), simply uncomment the corresponding lines in data_prepare.sh. bash data_prepare.sh

  76. [77]

    This script automates parameter searching, compression/decompression profiling, and global analysis

    Performance Testing: Run compressor_test.sh to measure the compression ratio and throughput. This script automates parameter searching, compression/decompression profiling, and global analysis. At the end of the execution, it also outputs the end-to-end inference results (2 hours). source /your/path/ascend-toolkit/set_env.sh bash compressor_test.sh F. E...

  77. [78]

    Each model subfolder (e.g., BF16/Qwen3-32B) provides: • hyperparams_results.csv: An exhaustive list of optimal parameters for every model tensor

    Optimal parameter search results: The following results show the expected outputs for the Qwen3-32B model: BF16 Model Compression Results: File Processed: hyperparams_results.csv; Total Elements: 32,761,446,400; Original BF16 Size: 62487.50 MB; ENEC Compressed Size: 4626...

  78. [79]

    Compression Ratio and Throughput: The file summary_enec.csv summarizes the compression ratio, compression throughput, and decompression throughput of ENEC on 11 models, corresponding to Table II and Figure 9 in the paper. The expected results for these 11 models are presented as follows: --- Summary Data Preview --- model_name dtype compression_ratio_CR c...

  79. [80]

    For brevity, we only present the results for Qwen3-32B with batch size = 1

    End-to-End Inference Latency: Figure 10 in the paper shows the end-to-end inference latency and speedup over the baseline (uncompressed with CPU offloading) for both Qwen3-32B and Falcon-40B under different batch sizes. For brevity, we only present the results for Qwen3-32B with batch size = 1. The expected results are presented as follows: [Inference: Qwen...