ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling

Pu Yang; Shaowei Wang; Yaru Zhao; Yuchen Yang; Zhi-Hua Zhou

arxiv: 2601.21198 · v2 · pith:Y65ZRN7Mnew · submitted 2026-01-29 · 💻 cs.DC · cs.AI · cs.LG

Pith reviewed 2026-05-25 07:41 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.LG

keywords Mixture-of-Experts · on-device inference · lossless compression · cache-affinity scheduling · edge computing · MoE serving · inference optimization

The pith

ZipMoE reduces on-device MoE inference latency by up to 72.77% using lossless compression and cache-affinity scheduling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ZipMoE, a system for serving Mixture-of-Experts models on edge devices without accuracy loss. It pairs a compression method that removes statistical redundancy in MoE parameters with a scheduling strategy that keeps data close to the processor. The co-design carries a provable performance bound and moves inference from heavy memory transfers to faster computation. This matters for running powerful sparse models on phones and other constrained hardware where full models exceed available memory. Experiments on real edge platforms confirm large gains in speed and throughput over prior systems.

Core claim

ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters via a caching-scheduling co-design with provable performance guarantee. The design shifts on-device MoE inference from an I/O-bound bottleneck to a compute-centric workflow that enables efficient parallelization.

What carries the argument

The caching-scheduling co-design that pairs semantically lossless compression of MoE parameters with cache-affinity scheduling to minimize data movement.

If this is right

Inference latency drops by as much as 72.77 percent on representative edge platforms.
Throughput rises by as much as 6.76 times compared with current state-of-the-art on-device systems.
MoE models can run without lossy quantization while fitting within edge memory limits.
The scheduling component supplies a formal performance guarantee for the combined system.
Inference workload moves from memory-bound transfers to parallel compute on the device.
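
To put the two headline numbers on the same scale, a latency reduction can be converted into an equivalent speedup factor. The sketch below is plain arithmetic on the figure reported in the abstract; the function name `reduction_to_speedup` is illustrative and does not come from the paper. Both reported values are "up to" figures and may come from different configurations.

```python
def reduction_to_speedup(reduction_pct: float) -> float:
    """Convert a latency reduction in percent into a speedup factor (old / new)."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


# 72.77% is the best-case latency reduction quoted in the abstract.
best_case = reduction_to_speedup(72.77)
print(f"72.77% lower latency is roughly a {best_case:.2f}x speedup")
```

Read this way, the best-case latency figure corresponds to roughly a 3.7-fold speedup.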

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same redundancy-plus-affinity pattern may extend to other sparse architectures beyond MoE.
Edge hardware vendors could add explicit cache-affinity primitives to further amplify the gains.
Wider testing across additional MoE variants would show whether the compression remains lossless under distribution shift.

Load-bearing premise

The statistical redundancy present in MoE parameters permits a semantically lossless compression scheme that preserves model behavior across real-world workloads on edge hardware.

What would settle it

Running the compressed MoE model on a held-out real-world workload and observing any change in output distributions or task accuracy relative to the uncompressed model.
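
A cheaper first check than comparing task accuracy is to confirm that decompressed weights are bit-identical to the originals. The sketch below illustrates that check under stated assumptions: it uses `zlib` from the Python standard library as a stand-in codec (the figures mention LZ4HC and ZSTD, which are not used here), and the synthetic Gaussian weights and helper names are hypothetical.

```python
import random
import struct
import zlib


def pack_weights(values):
    """Serialize floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(values)}f", *values)


def roundtrip_is_bit_exact(raw: bytes, level: int = 6) -> bool:
    """Compress and decompress, then compare every byte with the input."""
    compressed = zlib.compress(raw, level)
    restored = zlib.decompress(compressed)
    return restored == raw


# Synthetic stand-in for an expert tensor; real checkpoints would be read from disk.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]
raw = pack_weights(weights)
ok = roundtrip_is_bit_exact(raw)
print("bit-exact round trip:", ok)
```

If the bytes match and the inference kernels are deterministic, output distributions should not change. A mismatch on any tensor would falsify the lossless claim for that codec.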

Figures

Figures reproduced from arXiv: 2601.21198 by Pu Yang, Shaowei Wang, Yaru Zhao, Yuchen Yang, Zhi-Hua Zhou.

**Figure 1.** Latency breakdown of decoding layers in representative MoE models in (a) a server environment, where experts are offloaded to CPU with 512 GB RAM, and (b) an edge environment, where experts are offloaded to an NVMe SSD (Aigo DP35) with 2 GB/s read speed. Adjacent body text: "… to perform within a constant factor of the global optimum. Ultimately, our design rethinks expert loading in MoE inference by computing the required expert tenso…"

**Figure 2.** The probability mass heat maps of integer representations of the exponent bits extracted from different MoE parameters. The Shannon entropy values for the three models are 2.651 bits, 2.563 bits, and 2.554 bits, respectively. Plot labels: models DeepSeekV2-Lite, Qwen1.5-MoE, SwitchTransformers-Large-128; axis Compression Ratio (%); series ZSTD, LZ4HC, Uncompressed.
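
The entropy figures in the Figure 2 caption can be read against the 8 bits that the bfloat16 format reserves for the exponent. The sketch below shows how such an entropy could be measured. It is an illustration, not the paper's procedure: the weights are synthetic Gaussian samples, the conversion to bfloat16 is by truncation (real converters usually round to nearest), and both function names are hypothetical.

```python
import math
import random
import struct
from collections import Counter


def bf16_exponent(x: float) -> int:
    """Return the 8-bit exponent field of x after truncation to bfloat16."""
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    bits16 = bits32 >> 16        # bfloat16 keeps the top 16 bits of float32
    return (bits16 >> 7) & 0xFF  # layout: 1 sign bit, 8 exponent bits, 7 mantissa bits


def shannon_entropy(symbols) -> float:
    """Empirical Shannon entropy in bits of a sequence of hashable symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Synthetic weights; the standard deviation 0.02 is an arbitrary illustrative choice.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(100_000)]
entropy = shannon_entropy(bf16_exponent(w) for w in weights)
print(f"exponent entropy: {entropy:.3f} bits out of 8")
```

An entropy well below 8 bits means an entropy coder can store the exponent field in fewer bits without losing information, which is the kind of redundancy the caption points to.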

**Figure 5.** ZipMoE system overview. Adjacent body text: "… are equipped with multi-core CPUs, we benchmark the decompression throughput of LZ4HC and ZSTD against raw tensor I/O. Our results in …"
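
The comparison of decompression throughput against raw tensor I/O mentioned alongside Figure 5 can be framed with a simple back-of-the-envelope model. The sketch below is an assumption, not the paper's analysis: it supposes disk reads and decompression overlap perfectly, so the delivered rate is capped by whichever stage is slower. The 3.5 GB/s disk speed is the testbed value quoted in the figure text; the compression ratio and decompression rates are made-up illustrative inputs.

```python
def effective_load_gbps(disk_gbps: float, ratio: float, decomp_gbps: float) -> float:
    """Uncompressed GB/s delivered when reads and decompression fully overlap.

    ratio is compressed_size / original_size, with 0 < ratio <= 1.
    decomp_gbps is measured in uncompressed output bytes per second.
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    return min(disk_gbps / ratio, decomp_gbps)


disk = 3.5  # GB/s, SSD read speed quoted for the edge testbed
fast = effective_load_gbps(disk, ratio=0.7, decomp_gbps=8.0)  # limited by disk I/O
slow = effective_load_gbps(disk, ratio=0.7, decomp_gbps=2.0)  # limited by the CPU
print(f"fast decompressor: {fast:.2f} GB/s, slow decompressor: {slow:.2f} GB/s")
```

Under this model, compression only helps when the decompressor outruns the raw disk bandwidth, which is why the decompression benchmark matters on multi-core edge CPUs.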

**Figure 6.** DAG structures of expert requests with different compression states.

**Figure 8.** Comparison of system throughput on diverse MoE models under different batch sizes. BS denotes batch size. Adjacent body text: "… ware 1) and 32GB (Hardware 2) as our edge computing testbeds. Both devices operate on Jetpack 6.2.1 with Ubuntu 22.04, and are equipped with a Samsung 970 EVO SSD that provides a disk read speed of 3.5 GB/s. Baselines. We compare ZipMoE with the following state-of-the-art LLM serving systems: (1) MoE-…"

**Figure 7.** Comparison of TPOT and TTFT performance on diverse MoE models across different memory budgets. Adjacent body text: "ZipMoE pre-allocates cache pools as contiguous memory regions to reduce memory fragmentation. To benefit from UMA, we adopt a zero-copy paradigm that reads SM-chunks directly into host-pinned memory registered for direct GPU access, avoiding redundant data transfers. Motivated by operating system (OS) consider…"
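
The pre-allocated, contiguous cache pool described in the text captured with Figure 7 can be illustrated with a small host-side sketch. This is a hypothetical toy, not the ZipMoE implementation: the class `CachePool`, its slot sizes, and its methods are invented for illustration, and pinned memory or GPU registration is not modeled.

```python
class CachePool:
    """Fixed-size slots carved out of one contiguous, pre-allocated buffer."""

    def __init__(self, slot_bytes: int, num_slots: int):
        self.slot_bytes = slot_bytes
        self._buffer = bytearray(slot_bytes * num_slots)  # single allocation up front
        self._view = memoryview(self._buffer)
        self._free = list(range(num_slots - 1, -1, -1))   # stack of free slot indices

    def acquire(self) -> int:
        """Hand out a free slot index without allocating new memory."""
        if not self._free:
            raise MemoryError("cache pool exhausted; evict a slot first")
        return self._free.pop()

    def release(self, index: int) -> None:
        """Return a slot to the pool so it can be reused."""
        self._free.append(index)

    def slot(self, index: int) -> memoryview:
        """Zero-copy window onto the bytes of one slot."""
        start = index * self.slot_bytes
        return self._view[start:start + self.slot_bytes]


pool = CachePool(slot_bytes=1024, num_slots=4)
first = pool.acquire()
pool.slot(first)[:5] = b"chunk"  # write in place, no intermediate copy
print(bytes(pool.slot(first)[:5]))
```

Because every slot has the same size and lives in one buffer, repeated acquire and release cycles cannot fragment the region, which is the property the quoted text is after.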

**Figure 9.** Comparison of end-to-end latency on diverse MoE models across different output lengths. Plot labels captured with the caption: axes Throughput (token/s) and Latency (s); panels DeepSeekV2-Lite and Qwen1.5-MoE; series FIFO, Marking, LRU, ZipMoE w/o Cache Planning, ZipMoE.

**Figure 10.** Impact of cache management strategies on the trade-off between latency and throughput. Adjacent body text: "… memory and I/O efficiency. The results highlight ZipMoE's superiority in real-time and interactive tasks. System Throughput …"
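
The figure labels above name FIFO, Marking, and LRU as baseline cache policies. To show why the eviction policy matters for expert caching, the sketch below counts cache hits for LRU and FIFO on a toy expert-access trace. The trace, the capacity, and the function names are illustrative assumptions; neither the Marking policy nor ZipMoE's own cache planning is reproduced here.

```python
from collections import OrderedDict, deque


def lru_hits(trace, capacity):
    """Count hits when the least recently used expert is evicted."""
    cache = OrderedDict()
    hits = 0
    for expert in trace:
        if expert in cache:
            hits += 1
            cache.move_to_end(expert)      # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used entry
            cache[expert] = True
    return hits


def fifo_hits(trace, capacity):
    """Count hits when the oldest inserted expert is evicted."""
    queue = deque()
    resident = set()
    hits = 0
    for expert in trace:
        if expert in resident:
            hits += 1
        else:
            if len(queue) >= capacity:
                resident.discard(queue.popleft())  # evict in insertion order
            queue.append(expert)
            resident.add(expert)
    return hits


# Toy trace in which expert 0 is reused on every other request.
trace = [0, 1, 0, 2, 0, 3, 0, 1, 0, 2]
print("LRU hits:", lru_hits(trace, 2), "FIFO hits:", fifo_hits(trace, 2))
```

On this trace LRU keeps the frequently reused expert resident while FIFO repeatedly evicts it, so each FIFO miss would translate into an extra expert load from storage.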

Original abstract

While Mixture-of-Experts (MoE) architectures substantially bolster the expressive power of large-language models, their prohibitive memory footprint severely impedes the practical deployment on resource-constrained edge devices, especially when model behavior must be preserved without relying on lossy quantization. In this paper, we present ZipMoE, an efficient and semantically lossless on-device MoE serving system. ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters via a caching-scheduling co-design with provable performance guarantee. Fundamentally, our design shifts the paradigm of on-device MoE inference from an I/O-bound bottleneck to a compute-centric workflow that enables efficient parallelization. We implement a prototype of ZipMoE and conduct extensive experiments on representative edge computing platforms using popular open-source MoE models and real-world workloads. Our evaluation reveals that ZipMoE achieves up to $72.77\%$ inference latency reduction and up to $6.76\times$ higher throughput than the state-of-the-art systems. Our code is available at: https://github.com/npnothard/ZipMoE-ICML26.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ZipMoE pairs lossless MoE compression with cache-affinity scheduling for edge inference and reports large gains, but the abstract leaves the actual compression method and proof details uncheckable.

The main takeaway is that this paper describes a system called ZipMoE that tries to make Mixture-of-Experts models runnable on edge hardware without quantization by compressing parameters in a way claimed to be semantically lossless, then pairing that with a scheduler that keeps relevant experts in cache. The reported numbers are 72.77% lower latency and 6.76x throughput versus prior systems, plus a provable performance guarantee from the co-design. That combination of techniques aimed at the specific memory and I/O constraints of edge devices is the concrete new piece here. The work also ships a prototype, runs it on representative platforms with real workloads and open MoE models, and releases the code, which is useful for anyone who wants to reproduce or extend it. Those are the parts that stand on their own from the abstract.

The soft spots are straightforward. The abstract states the compression is lossless and the guarantee is provable, yet gives no derivation, no description of how redundancy is exploited without changing behavior, and no error analysis or controls. Without those, it is impossible to tell whether the speedups come from the claimed mechanism or from other factors, and whether the lossless property actually holds across the workloads tested. The assumption that statistical redundancy in MoE weights permits a general lossless scheme is the load-bearing one, and it is not demonstrated in the provided text.

This paper is aimed at researchers and engineers working on on-device LLM serving, especially those already looking at MoE architectures for mobile or IoT. A reader who needs concrete system ideas and numbers to try will get value from the prototype and the reported deltas even if the deeper claims require the full manuscript. It is worth sending to peer review because the problem is real, the co-design direction is specific, and the code release makes verification possible; a referee can check the compression details and the proof directly.

Referee Report

0 major / 1 minor

Summary. The paper claims to introduce ZipMoE, a system for on-device serving of Mixture-of-Experts models using a caching-scheduling co-design with lossless compression and cache-affinity scheduling that has a provable performance guarantee. It reports up to 72.77% reduction in inference latency and 6.76× higher throughput than state-of-the-art on edge platforms with real-world workloads.

Significance. Should the claims be substantiated, particularly the semantically lossless nature of the compression and the provable guarantee, this could have substantial impact on enabling large MoE models on edge devices without relying on quantization, by exploiting statistical redundancy and hardware properties to shift to compute-centric inference. Code availability supports reproducibility.

minor comments (1)

[Abstract] The abstract asserts specific quantitative gains and a 'provable performance guarantee' without any supporting derivation, controls, or error analysis visible, which is typical for an abstract but requires the full text for evaluation.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their review of our manuscript on ZipMoE. We appreciate the recognition of the potential substantial impact on enabling large MoE models on edge devices. The recommendation is listed as uncertain, with emphasis on substantiating the semantically lossless compression and provable guarantee. The manuscript provides formal proofs for the performance guarantee along with empirical evaluations confirming semantic losslessness via preserved model accuracy on real workloads. As the report contains no enumerated major comments, we have no specific points to address below.

Circularity Check

0 steps flagged

No significant circularity identified

The abstract and available context present ZipMoE as an implemented system with experimental results on latency and throughput, plus a claimed provable guarantee on the co-design. No equations, fitted parameters, self-citations, or derivation steps are supplied that would allow any reduction of outputs to inputs by construction. The central performance claims are framed as measured outcomes from prototype evaluation on edge platforms, not as quantities defined from the same data or prior self-work. This is the expected non-finding for a systems paper whose abstract contains no load-bearing mathematical steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated premise that MoE parameter redundancy is both present and exploitable in a lossless manner on target hardware.

pith-pipeline@v0.9.0 · 5755 in / 1034 out tokens · 18775 ms · 2026-05-25T07:41:10.683026+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving
cs.LG 2026-04 unverdicted novelty 6.0

FluxMoE decouples MoE expert weights from persistent GPU residency via on-demand paging, achieving up to 3x throughput gains over vLLM in memory-constrained inference without accuracy loss.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1]

URL http://www.jstor.org/stable/2337119

ISSN 00063444. URL http://www.jstor.org/stable/2337119. Collet, Y. et al. LZ4: Extremely fast compression algorithm. https://github.com/lz4/lz4,

work page arXiv
[2]

Fast inference of mixture-of-experts language models with offloading

Eliseev, A. and Mazur, D. Fast inference of mixture-of-experts language models with offloading. arXiv preprint arXiv:2312.17238,

work page arXiv
[3]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate post-training compression for generative pretrained transformers. arXiv preprint arXiv:2210.17323,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Neuzip: Memory-efficient training and inference with dynamic compression of neural networks

Hao, Y., Cao, Y., and Mou, L. Neuzip: Memory-efficient training and inference with dynamic compression of neural networks. arXiv preprint arXiv:2410.20650,

work page arXiv
[5]

A Study of BFLOAT16 for Deep Learning Training

Kalamkar, D., Mudigere, D., Mellempudi, N., Das, D., Banerjee, K., Avancha, S., Vooturi, D. T., Jammalamadaka, N., Huang, J., Yuen, H., et al. A study of bfloat16 for deep learning training. arXiv preprint arXiv:1905.12322,

work page internal anchor Pith review Pith/arXiv arXiv 1905
[6]

Fiddler: Cpu-gpu orchestration for fast inference of mixture-of-experts models

Kamahori, K., Tang, T., Gu, Y., Zhu, K., and Kasikci, B. Fiddler: Cpu-gpu orchestration for fast inference of mixture-of-experts models. arXiv preprint arXiv:2402.07033,

work page arXiv
[7]

Increased llm vulnerabilities from fine-tuning and quantization

Kumar, D., Kumar, A., Agarwal, S., and Harshangi, P. Increased llm vulnerabilities from fine-tuning and quantization. arXiv preprint arXiv:2404.04392,

work page arXiv
[8]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Dengr, C., Ruan, C., Dai, D., Guo, D., et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Quantization hurts reasoning? an empirical study on quantized reasoning models

Liu, R., Sun, Y., Zhang, M., Bai, H., Yu, X., Yu, T., Yuan, C., and Hou, L. Quantization hurts reasoning? an empirical study on quantized reasoning models. arXiv preprint arXiv:2504.04823,

work page arXiv
[10]

doi: 10.1109/TE.2024.3467912. NVIDIA. nvCOMP: GPU-accelerated compression and decompression library. https://developer.nvidia.com/nvcomp,

work page doi:10.1109/te.2024 2024
[11]

2025.3527641

doi: 10.1109/COMST.2025.3527641. Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Chen, B., Liang, P., Ré, C., Stoica, I., and Zhang, C. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning, pp. 31094–31116. PMLR,

work page doi:10.1109/comst 2025
[12]

Promoe: Fast moe-based llm serving using proactive caching

Song, X., Zhong, Z., Chen, R., and Chen, H. Promoe: Fast moe-based llm serving using proactive caching. arXiv preprint arXiv:2410.22134,

work page arXiv
[13]

10610948

doi: 10.1109/ICRA57147.2024.10610948. Wang, H., Zhou, Q., Hong, Z., and Guo, S. D2moe: Dual routing and dynamic scheduling for efficient on-device moe-based llm serving. In Proceedings of the 31st Annual International Conference on Mobile Computing and Networking, pp. 574–588,

work page doi:10.1109/icra57147.2024 2024
[14]

Transformers: State-of-the-art natural language processing

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Confe...

work page 2020
[15]

URL https://www.aclweb

Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.emnlp-demos.6. Xu, M. Sharegpt-gpt4. https://huggingface.co/datasets/shibing624/sharegpt_gpt4,

work page 2020
[16]

Moe-infinity: Offloading-efficient moe model serving

Hugging Face dataset. Xue, L., Fu, Y., Lu, Z., Mai, L., and Marina, M. Moe-infinity: Offloading-efficient moe model serving. arXiv preprint arXiv:2401.14361,

work page arXiv
[17]

Qwen2 Technical Report

Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., Dong, G., Wei, H., Lin, H., Tang, J., Wang, J., Yang, J., Tu, J., Zhang, J., Ma, J., Xu, J., Zhou, J., Bai, J., He, J., Lin, J., Dang, K., Lu, K., Chen, K., Yang, K., Li, M., Xue, M., Ni, N., Zhang, P., Wang, P., Peng, R., Men, R., Gao, R., Lin, R., Wang, S., Bai...

work page internal anchor Pith review Pith/arXiv arXiv
[18]

doi: 10.1109/TMC.2025.3546466. Yu, H., Cui, X., Zhang, H., and Wang, H. Taming latency-memory trade-off in moe-based llm serving via fine-grained expert offloading

work page doi:10.1109/tmc.2025 2025
[19]

Huff-llm: End-to-end lossless compression for efficient llm inference

Yubeaton, P., Mahmoud, T., Naga, S., Taheri, P., Xia, T., George, A., Khalil, Y., Zhang, S. Q., Joshi, S., Hegde, C., et al. Huff-llm: End-to-end lossless compression for efficient llm inference. arXiv preprint arXiv:2502.00922,

work page arXiv
[20]

Such set of $w_i^*$ can be found via a modified iterative proportional fitting algorithm (Chen et al., 1994), which we restate as follows

that for any feasible set of inclusion probabilities $\{f_i\}_{i=1}^{N}$, there exists a unique set of positive weights $\{w_i^*\}_{i=1}^{N}$ (up to a scaling factor) such that the resulting distribution satisfies: (i) $P(S) \propto \prod_{i \in S} w_i^*$; (ii) $\sum_{S \ni i,\, |S| = k} P(S) = f_i,\ \forall i \in \mathcal{N}$; and (iii) $\sum_{i \in \mathcal{N}} f_i = k$. Such set of $w_i^*$ can be found via a modified iterative proportional fitting algorithm...

work page 1994
