Serving large language models on Huawei CloudMatrix384. arXiv preprint arXiv:2506.12708
12 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
citation-polarity summary: background 1
representative citing papers
LLM serving requires mathematical optimization and algorithms with provable guarantees rather than generic heuristics that fail unpredictably on LLM workloads.
citing papers explorer
- UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing
UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.
- Lynx: Progressive Speculative Quantization for Accelerating KV Transfer in Long-Context Inference
Lynx partitions KV cache bits into anchor and residual streams for progressive transfer, enabling speculative decoding on partial data followed by verification to match BF16 accuracy at 4-bit-like TTFT.
- ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
- Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads
A renewal-reward analysis yields a closed-form mean-field rule for the optimal Attention/FFN provisioning ratio in disaggregated LLM serving that accounts for stochastic KV-cache growth and matches simulation optima within 10%.
- ASAP: A Disaggregated and Asynchronous Inference System for MoE Prefill
ASAP is a disaggregated asynchronous inference system for the prefill phase of MoE models that removes DP-EP synchronization barriers and reports 90% higher SLO-compliant throughput than synchronous baselines.
- Achieving Cloud-Grade SLOs for Local Mixture-of-Experts Inference through CPU-GPU Hybrid Design
A CPU-GPU hybrid design with stream-loading prefill, expert parallelism, and disaggregation achieves cloud SLOs for local MoE inference on dual-socket CPUs and consumer GPUs.
- Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs
NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better energy efficiency.
- PICO: Performance Insights for Collective Operations
PICO is a benchmarking framework for collective operations that decouples portable setup from platform execution and supplies reference MPI implementations. It shows that default choices can be up to 5x slower and reports end-to-end training time reductions of up to 44% in simulator replays.
- Ascend-RaBitQ: Heterogeneous NPU-CPU Acceleration of Billion-Scale Similarity Search with 1-bit Quantization
Ascend-RaBitQ is the first heterogeneous NPU-CPU optimized IVF-RaBitQ system for billion-scale vector search that decouples coarse ranking on NPU from fine ranking on CPU to leverage optimal hardware per stage.
- Fairness-Aware and Latency-Controllable Scheduling for Chunked-Prefill LLM Serving
The paper introduces an aging-based scheduler with LPRS and APC for chunked-prefill LLM engines that cuts mean end-to-end latency by over 10% and lowers P99 tail latency versus FCFS on real hardware.
- SMC-AI: Scaling Monte Carlo Simulation to Four Trillion Atoms with AI Accelerators
SMC-AI scales Monte Carlo simulations to 4 trillion atoms on AI hardware clusters, achieving 32 times larger systems and 1.3 times higher throughput than prior records while decoupling ML models from the simulation core.