
arxiv: 2601.21198 · v2 · pith:Y65ZRN7Mnew · submitted 2026-01-29 · 💻 cs.DC · cs.AI · cs.LG

ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling

Pith reviewed 2026-05-25 07:41 UTC · model grok-4.3

classification 💻 cs.DC cs.AI cs.LG
keywords Mixture-of-Experts, on-device inference, lossless compression, cache-affinity scheduling, edge computing, MoE serving, inference optimization

The pith

ZipMoE reduces on-device MoE inference latency by up to 72.77% using lossless compression and cache-affinity scheduling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ZipMoE, a system for serving Mixture-of-Experts models on edge devices without accuracy loss. It pairs a compression method that removes statistical redundancy in MoE parameters with a scheduling strategy that keeps data close to the processor. The co-design carries a provable performance bound and moves inference from heavy memory transfers to faster computation. This matters for running powerful sparse models on phones and other constrained hardware where full models exceed available memory. Experiments on real edge platforms confirm large gains in speed and throughput over prior systems.

Core claim

ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters via a caching-scheduling co-design with provable performance guarantee. The design shifts on-device MoE inference from an I/O-bound bottleneck to a compute-centric workflow that enables efficient parallelization.

What carries the argument

The caching-scheduling co-design that pairs semantically lossless compression of MoE parameters with cache-affinity scheduling to minimize data movement.

If this is right

  • Inference latency drops by as much as 72.77 percent on representative edge platforms.
  • Throughput rises by as much as 6.76 times compared with current state-of-the-art on-device systems.
  • MoE models can run without lossy quantization while fitting within edge memory limits.
  • The scheduling component supplies a formal performance guarantee for the combined system.
  • Inference workload moves from memory-bound transfers to parallel compute on the device.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same redundancy-plus-affinity pattern may extend to other sparse architectures beyond MoE.
  • Edge hardware vendors could add explicit cache-affinity primitives to further amplify the gains.
  • Wider testing across additional MoE variants would show whether the compression remains lossless under distribution shift.

Load-bearing premise

The statistical redundancy present in MoE parameters permits a semantically lossless compression scheme that preserves model behavior across real-world workloads on edge hardware.
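
The premise can be made concrete with a small, hedged sketch. bfloat16 shares its 8-bit exponent field with float32, so (under simple truncation) the exponent of a weight can be read from the top bits of its float32 encoding. The snippet below measures the empirical Shannon entropy of those exponents for synthetic weights. The Gaussian weights, their standard deviation, and the helper names `bf16_exponent` and `shannon_entropy_bits` are assumptions for illustration, not taken from the paper; the Figure 2 caption reports 2.651, 2.563, and 2.554 bits for real MoE parameters.

```python
import math
import random
import struct
from collections import Counter


def bf16_exponent(value: float) -> int:
    """Return the 8-bit exponent field that bfloat16 shares with float32."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return (bits >> 23) & 0xFF  # drop the 23 mantissa bits, mask off the sign


def shannon_entropy_bits(symbols) -> float:
    """Empirical Shannon entropy, in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Assumption: zero-mean Gaussian values stand in for real expert weights.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(100_000)]

exponents = [bf16_exponent(w) for w in weights]
entropy = shannon_entropy_bits(exponents)
print(f"exponent entropy: {entropy:.3f} bits out of 8")
```

An entropy well below 8 bits indicates that an entropy coder can shrink the exponent field without changing any weight value, which is the kind of redundancy the premise relies on.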

What would settle it

Running the compressed MoE model on a held-out real-world workload and observing any change in output distributions or task accuracy relative to the uncompressed model.
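
A stricter check than task accuracy is bit-exact equality between the original and the decompressed weights, since identical bits give identical outputs under a deterministic kernel. A minimal sketch follows, assuming synthetic bfloat16 weights and using Python's `zlib` purely as a stand-in for the LZ4HC and ZSTD codecs named in the figure captions; the helper `to_bf16_bytes` is hypothetical.

```python
import hashlib
import random
import struct
import zlib


def to_bf16_bytes(values) -> bytes:
    """Truncate float32 values to bfloat16 and pack them little-endian."""
    out = bytearray()
    for v in values:
        (bits,) = struct.unpack("<I", struct.pack("<f", v))
        out += struct.pack("<H", bits >> 16)  # keep sign, exponent, top 7 mantissa bits
    return bytes(out)


# Assumption: Gaussian values stand in for one expert's weight tensor.
rng = random.Random(0)
original = to_bf16_bytes(rng.gauss(0.0, 0.02) for _ in range(50_000))

# zlib is a stand-in codec here, not the codec used by ZipMoE.
compressed = zlib.compress(original, level=9)
restored = zlib.decompress(compressed)

bit_exact = restored == original
same_digest = hashlib.sha256(restored).digest() == hashlib.sha256(original).digest()
ratio = len(compressed) / len(original)
print(f"bit-exact: {bit_exact}, compressed size: {ratio:.1%} of original")
```

Comparing digests rather than full buffers is convenient when the reference tensor lives on another machine; a mismatch in either check would contradict the losslessness claim directly, without needing a downstream benchmark.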

Figures

Figures reproduced from arXiv: 2601.21198 by Pu Yang, Shaowei Wang, Yaru Zhao, Yuchen Yang, Zhi-Hua Zhou.

Figure 1
Figure 1: Latency break-down of decoding layers in representative MoE models on (a) Server environment, where experts are offloaded to CPU with 512GB RAM; and (b) Edge environment, where experts are offloaded to NVMe SSD (Aigo DP35) with 2GB/s read speed. … to perform within a constant factor of the global optimum. Ultimately, our design rethinks expert loading in MoE inference by computing the required expert tenso…
Figure 2
Figure 2: The probability mass heat maps of integer representations of the exponent bits extracted from different MoE parameters. The Shannon entropy values for the three models are 2.651 bits, 2.563 bits, and 2.554 bits, respectively. (Adjacent plot: compression ratio (%) of ZSTD, LZ4HC, and uncompressed storage for DeepSeekV2-Lite, Qwen1.5-MoE, and SwitchTransformers-Large-128.)
Figure 5
Figure 5: ZipMoE System Overview. … are equipped with multi-core CPUs, we benchmark the decompression throughput of LZ4HC and ZSTD against raw tensor I/O. Our results in …
Figure 6
Figure 6: DAG structures of expert requests with different compression states.
Figure 8
Figure 8: Comparison of system throughput on diverse MoE models under different batch sizes. BS denotes batch size. … ware 1) and 32GB (Hardware 2) as our edge computing testbeds. Both devices operate on Jetpack 6.2.1 with Ubuntu 22.04, and are equipped with a Samsung 970 EVO SSD that provides a disk read speed of 3.5 GB/s. Baselines. We compare ZipMoE with the following state-of-the-art LLM serving systems: (1) MoE-…
Figure 7
Figure 7: Comparison of TPOT and TTFT performances on diverse MoE models across different memory budgets. … ZipMoE pre-allocates cache pools as contiguous memory regions to reduce memory fragmentation. To benefit from UMA, we adopt a zero-copy paradigm that reads SM-chunks directly into host-pinned memory registered for direct GPU access, avoiding redundant data transfers. Motivated by operating system (OS) consider…
Figure 9
Figure 9: Comparison of end-to-end latency on diverse MoE models across different output lengths. (Adjacent plot: latency (s) versus throughput (token/s) for DeepSeekV2-Lite and Qwen1.5-MoE; legend: FIFO, Marking, LRU, ZipMoE w/o Cache Planning, ZipMoE.)
Figure 10
Figure 10: Impact of cache management strategies on the trade-off between latency and throughput. … memory and I/O efficiency. The results highlight ZipMoE's superiority in real-time and interactive tasks.
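
FIFO and LRU, two of the baseline policies named in the Figure 10 caption, can be compared on a toy expert-request trace. The sketch below is a generic cache simulation, not ZipMoE's cache planning: the trace, the Zipf-like popularity, the expert count, and the cache capacity are all assumptions chosen for illustration.

```python
import random
from collections import OrderedDict, deque


def hit_rate_lru(trace, capacity):
    """Fraction of requests served from an LRU cache of the given capacity."""
    cache = OrderedDict()
    hits = 0
    for expert in trace:
        if expert in cache:
            hits += 1
            cache.move_to_end(expert)      # refresh recency on a hit
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[expert] = True
    return hits / len(trace)


def hit_rate_fifo(trace, capacity):
    """Fraction of requests served from a FIFO cache of the given capacity."""
    cache = set()
    order = deque()
    hits = 0
    for expert in trace:
        if expert in cache:
            hits += 1                      # a hit does not change eviction order
        else:
            if len(cache) >= capacity:
                cache.discard(order.popleft())  # evict the oldest insertion
            cache.add(expert)
            order.append(expert)
    return hits / len(trace)


# Assumption: expert popularity follows a Zipf-like 1/rank law.
rng = random.Random(0)
num_experts, capacity = 64, 16
popularity = [1.0 / (rank + 1) for rank in range(num_experts)]
trace = rng.choices(range(num_experts), weights=popularity, k=20_000)

lru = hit_rate_lru(trace, capacity)
fifo = hit_rate_fifo(trace, capacity)
print(f"LRU hit rate:  {lru:.3f}")
print(f"FIFO hit rate: {fifo:.3f}")
```

Every miss in such a simulation corresponds to an expert load from storage, so the hit rate is a rough proxy for how much I/O a policy avoids; real routing traces may behave differently from this synthetic one.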
Original abstract

While Mixture-of-Experts (MoE) architectures substantially bolster the expressive power of large-language models, their prohibitive memory footprint severely impedes the practical deployment on resource-constrained edge devices, especially when model behavior must be preserved without relying on lossy quantization. In this paper, we present ZipMoE, an efficient and semantically lossless on-device MoE serving system. ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters via a caching-scheduling co-design with provable performance guarantee. Fundamentally, our design shifts the paradigm of on-device MoE inference from an I/O-bound bottleneck to a compute-centric workflow that enables efficient parallelization. We implement a prototype of ZipMoE and conduct extensive experiments on representative edge computing platforms using popular open-source MoE models and real-world workloads. Our evaluation reveals that ZipMoE achieves up to $72.77\%$ inference latency reduction and up to $6.76\times$ higher throughput than the state-of-the-art systems. Our code is available at: https://github.com/npnothard/ZipMoE-ICML26.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper claims to introduce ZipMoE, a system for on-device serving of Mixture-of-Experts models using a caching-scheduling co-design with lossless compression and cache-affinity scheduling that has a provable performance guarantee. It reports up to 72.77% reduction in inference latency and 6.76× higher throughput than state-of-the-art on edge platforms with real-world workloads.

Significance. Should the claims be substantiated, particularly the semantically lossless nature of the compression and the provable guarantee, this could have substantial impact on enabling large MoE models on edge devices without relying on quantization, by exploiting statistical redundancy and hardware properties to shift to compute-centric inference. Code availability supports reproducibility.

minor comments (1)
  1. [Abstract] The abstract asserts specific quantitative gains and a 'provable performance guarantee' without any supporting derivation, controls, or error analysis visible, which is typical for an abstract but requires the full text for evaluation.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their review of our manuscript on ZipMoE. We appreciate the recognition of the potential substantial impact on enabling large MoE models on edge devices. The recommendation is listed as uncertain, with emphasis on substantiating the semantically lossless compression and provable guarantee. The manuscript provides formal proofs for the performance guarantee along with empirical evaluations confirming semantic losslessness via preserved model accuracy on real workloads. As the report contains no enumerated major comments, we have no specific points to address below.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The abstract and available context present ZipMoE as an implemented system with experimental results on latency and throughput, plus a claimed provable guarantee on the co-design. No equations, fitted parameters, self-citations, or derivation steps are supplied that would allow any reduction of outputs to inputs by construction. The central performance claims are framed as measured outcomes from prototype evaluation on edge platforms, not as quantities defined from the same data or prior self-work. This is the expected non-finding for a systems paper whose abstract contains no load-bearing mathematical steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated premise that MoE parameter redundancy is both present and exploitable in a lossless manner on target hardware.

pith-pipeline@v0.9.0 · 5755 in / 1034 out tokens · 18775 ms · 2026-05-25T07:41:10.683026+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving

    cs.LG 2026-04 unverdicted novelty 6.0

    FluxMoE decouples MoE expert weights from persistent GPU residency via on-demand paging, achieving up to 3x throughput gains over vLLM in memory-constrained inference without accuracy loss.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    URL http://www.jstor.org/stable/2337119

    ISSN 00063444. URL http://www.jstor.org/stable/2337119. Collet, Y. et al. LZ4: Extremely fast compression algorithm. https://github.com/lz4/lz4,

  2. [2]

    Fast inference of mixture-of-experts language models with offloading

    Eliseev, A. and Mazur, D. Fast inference of mixture-of-experts language models with offloading. arXiv preprint arXiv:2312.17238,

  3. [3]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate post-training compression for generative pretrained transformers. arXiv preprint arXiv:2210.17323,

  4. [4]

    Neuzip: Memory-efficient training and inference with dynamic compression of neural networks. arXiv preprint arXiv:2410.20650,

    Hao, Y., Cao, Y., and Mou, L. Neuzip: Memory-efficient training and inference with dynamic compression of neural networks. arXiv preprint arXiv:2410.20650,

  5. [5]

    A Study of BFLOAT16 for Deep Learning Training

    Kalamkar, D., Mudigere, D., Mellempudi, N., Das, D., Banerjee, K., Avancha, S., Vooturi, D. T., Jammalamadaka, N., Huang, J., Yuen, H., et al. A study of bfloat16 for deep learning training. arXiv preprint arXiv:1905.12322,

  6. [6]

    Fiddler: Cpu-gpu orchestration for fast inference of mixture-of-experts models. arXiv preprint arXiv:2402.07033,

    Kamahori, K., Tang, T., Gu, Y., Zhu, K., and Kasikci, B. Fiddler: Cpu-gpu orchestration for fast inference of mixture-of-experts models. arXiv preprint arXiv:2402.07033,

  7. [7]

    Increased llm vulnerabilities from fine-tuning and quantization. arXiv preprint arXiv:2404.04392,

    Kumar, D., Kumar, A., Agarwal, S., and Harshangi, P. Increased llm vulnerabilities from fine-tuning and quantization. arXiv preprint arXiv:2404.04392,

  8. [8]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Dengr, C., Ruan, C., Dai, D., Guo, D., et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434,

  9. [9]

    Quantization hurts reasoning? an empirical study on quantized reasoning models. arXiv preprint arXiv:2504.04823,

    Liu, R., Sun, Y., Zhang, M., Bai, H., Yu, X., Yu, T., Yuan, C., and Hou, L. Quantization hurts reasoning? an empirical study on quantized reasoning models. arXiv preprint arXiv:2504.04823,

  10. [10]

    doi: 10.1109/TE.2024.3467912. NVIDIA. nvCOMP: GPU-accelerated compression and decompression library. https://developer.nvidia.com/nvcomp,

  11. [11]

    2025.3527641

    doi: 10.1109/COMST.2025.3527641. Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Chen, B., Liang, P., Ré, C., Stoica, I., and Zhang, C. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning, pp. 31094–31116. PMLR,

  12. [12]

    Promoe: Fast moe-based llm serving using proactive caching. arXiv preprint arXiv:2410.22134,

    Song, X., Zhong, Z., Chen, R., and Chen, H. Promoe: Fast moe-based llm serving using proactive caching. arXiv preprint arXiv:2410.22134,

  13. [13]

    10610948

    doi: 10.1109/ICRA57147.2024.10610948. Wang, H., Zhou, Q., Hong, Z., and Guo, S. D2moe: Dual routing and dynamic scheduling for efficient on-device moe-based llm serving. In Proceedings of the 31st Annual International Conference on Mobile Computing and Networking, pp. 574–588,

  14. [14]

    Transformers: State-of-the-art natural language processing

    Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Confe...

  15. [15]

    URL https://www.aclweb

    Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.emnlp-demos.6. Xu, M. Sharegpt-gpt4. https://huggingface.co/datasets/shibing624/sharegpt_gpt4,

  16. [16]

    Moe-infinity: Offloading-efficient moe model serving

    Hugging Face dataset. Xue, L., Fu, Y., Lu, Z., Mai, L., and Marina, M. Moe-infinity: Offloading-efficient moe model serving. arXiv preprint arXiv:2401.14361,

  17. [17]

    Qwen2 Technical Report

    Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., Dong, G., Wei, H., Lin, H., Tang, J., Wang, J., Yang, J., Tu, J., Zhang, J., Ma, J., Xu, J., Zhou, J., Bai, J., He, J., Lin, J., Dang, K., Lu, K., Chen, K., Yang, K., Li, M., Xue, M., Ni, N., Zhang, P., Wang, P., Peng, R., Men, R., Gao, R., Lin, R., Wang, S., Bai...

  18. [18]

    doi: 10.1109/TMC.2025.3546466. Yu, H., Cui, X., Zhang, H., and Wang, H. Taming latency-memory trade-off in moe-based llm serving via fine-grained expert offloading

  19. [19]

    Huff-llm: End-to-end lossless compression for efficient llm inference

    Yubeaton, P., Mahmoud, T., Naga, S., Taheri, P., Xia, T., George, A., Khalil, Y., Zhang, S. Q., Joshi, S., Hegde, C., et al. Huff-llm: End-to-end lossless compression for efficient llm inference. arXiv preprint arXiv:2502.00922,

  20. [20]

    Such set of $w_i^*$ can be found via a modified iterative proportional fitting algorithm (Chen et al., 1994), which we restate as follows

    that for any feasible set of inclusion probabilities $\{f_i\}_{i=1}^{N}$, there exists a unique set of positive weights $\{w_i^*\}_{i=1}^{N}$ (up to a scaling factor) such that the resulting distribution satisfies: (i) $P(S) \propto \prod_{i \in S} w_i^*$; (ii) $\sum_{S \ni i,\, |S| = k} P(S) = f_i,\ \forall i \in \mathcal{N}$; and (iii) $\sum_{i \in \mathcal{N}} f_i = k$. Such set of $w_i^*$ can be found via a modified iterative proportional fitting algorithm...