MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.
arXiv preprint arXiv:2411.16102 , year=
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
PipeMax integrates pipeline parallelism with offloading to achieve up to 2.51x higher throughput than vLLM for offline LLM inference on commodity 8-GPU servers.
citing papers explorer
-
MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.
-
PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers
PipeMax integrates pipeline parallelism with offloading to achieve up to 2.51x higher throughput than vLLM for offline LLM inference on commodity 8-GPU servers.