In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), pp
11 Pith papers cite this work.
citing papers explorer
- Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
  Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
- Frontier: Towards Comprehensive and Accurate LLM Inference Simulation
  Frontier is a new discrete-event simulator for disaggregated LLM serving that incorporates co-location, PDD, AFD, and optimizations, achieving under 4% throughput error and large reductions in latency prediction error versus prior simulators.
- Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
  EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.
- Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads
  A renewal-reward analysis yields a closed-form mean-field rule for the optimal Attention/FFN provisioning ratio in disaggregated LLM serving that accounts for stochastic KV-cache growth and matches simulation optima within 10%.
- DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism
  DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.
- AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving
  AMMA is a memory-centric multi-chiplet architecture using HBM-PNM cubes, custom logic dies, hybrid parallelism, and reordered collectives that delivers 15.5X lower attention latency and 6.9X lower energy than NVIDIA H100 for 1M context serving.
- CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing
  CascadeInfer partitions LLM instances into length-specialized groups, uses dynamic programming for stage partitioning, and applies runtime refinement plus decentralized load balancing to cut latency and raise throughput.
- Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference
  Comprehensive profiling of expert selection in frontier MoE models reveals temporal and spatial patterns that enable 6.6x speedup on wafer-scale GPUs and 1.25x on existing systems via targeted optimizations.
- UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training
  UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.
- Understanding and Improving Communication Performance in Multi-node LLM Inference
  Performance analysis of multi-node LLM inference identifies all-reduce bottlenecks and introduces NVRAR hierarchical all-reduce achieving 1.9-3.6x lower latency than NCCL and up to 1.72x end-to-end batch latency reduction for Llama 3.1 405B in decode-heavy tensor-parallel workloads.
- Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics
  LLM serving requires mathematical optimization and algorithms with provable guarantees rather than generic heuristics that fail unpredictably on LLM workloads.