ZipMoE delivers up to 72.77% lower inference latency and 6.76x higher throughput for on-device MoE models via lossless compression and cache-affinity scheduling with a claimed provable guarantee.
Such set of w∗ i can be found via amodified iterative proportional fitting algorithm(Chen et al., 1994), which we restate as follows
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.DC 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling
ZipMoE delivers up to 72.77% lower inference latency and 6.76x higher throughput for on-device MoE models via lossless compression and cache-affinity scheduling with a claimed provable guarantee.