Serving Large Language Models on Huawei CloudMatrix384. arXiv preprint arXiv:2506.12708

Pengfei Zuo, Huimin Lin, Junbo Deng, Nan Zou, Xingkun Yang, Yingyu Diao, Weifeng Gao, Ke Xu, Zhangyu Chen, Shirui Lu · 2025 · arXiv 2506.12708

12 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing

cs.DC · 2026-06-02 · unverdicted · novelty 8.0

UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.

Lynx: Progressive Speculative Quantization for accelerating KV Transfer in Long-Context Inference

cs.DC · 2026-07-02 · unverdicted · novelty 7.0

Lynx partitions KV cache bits into anchor and residual streams for progressive transfer, enabling speculative decoding on partial data followed by verification to match BF16 accuracy at 4-bit-like TTFT.

ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs

cs.AR · 2026-03-28 · unverdicted · novelty 7.0

ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.

Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads

cs.LG · 2026-01-29 · unverdicted · novelty 7.0

A renewal-reward analysis yields a closed-form mean-field rule for the optimal Attention/FFN provisioning ratio in disaggregated LLM serving that accounts for stochastic KV-cache growth and matches simulation optima within 10%.

ASAP: A Disaggregated and Asynchronous Inference System for MoE Prefill

cs.DC · 2026-06-21 · unverdicted · novelty 6.0

ASAP is a disaggregated asynchronous inference system for the prefill phase of MoE models that removes DP-EP synchronization barriers and reports 90% higher SLO-compliant throughput than synchronous baselines.

Achieving Cloud-Grade SLOs for Local Mixture-of-Experts Inference through CPU-GPU Hybrid Design

cs.DC · 2026-06-09 · unverdicted · novelty 6.0

A CPU-GPU hybrid design with stream-loading prefill, expert parallelism, and disaggregation achieves cloud SLOs for local MoE inference on dual-socket CPUs and consumer GPUs.

Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better energy efficiency.

PICO: Performance Insights for Collective Operations

cs.DC · 2025-08-22 · unverdicted · novelty 6.0

PICO is a benchmarking framework for collective operations that decouples portable setup from platform execution, supplies reference MPI implementations, and shows default choices can be up to 5x slower with up to 44% end-to-end training time reductions in simulator replays.

Ascend-RaBitQ: Heterogeneous NPU-CPU Acceleration of Billion-Scale Similarity Search with 1-bit Quantization

cs.IR · 2026-05-15 · unverdicted · novelty 5.0

Ascend-RaBitQ is the first heterogeneous NPU-CPU optimized IVF-RaBitQ system for billion-scale vector search that decouples coarse ranking on NPU from fine ranking on CPU to leverage optimal hardware per stage.

Fairness-Aware and Latency-Controllable Scheduling for Chunked-Prefill LLM Serving

cs.DC · 2026-06-08 · unverdicted · novelty 4.0

The paper introduces an aging-based scheduler with LPRS and APC for chunked-prefill LLM engines that cuts mean end-to-end latency by over 10% and lowers P99 tail latency versus FCFS on real hardware.

Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics

cs.DC · 2026-05-02 · accept · novelty 4.0

LLM serving requires mathematical optimization and algorithms with provable guarantees rather than generic heuristics that fail unpredictably on LLM workloads.

SMC-AI: Scaling Monte Carlo Simulation to Four Trillion Atoms with AI Accelerators

physics.comp-ph · 2026-04-09 · unverdicted · novelty 4.0

SMC-AI scales Monte Carlo simulations to 4 trillion atoms on AI hardware clusters, achieving 32 times larger systems and 1.3 times higher throughput than prior records while decoupling ML models from the simulation core.

citing papers explorer

Showing 11 of 11 citing papers after filters.

UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing · cs.DC · 2026-06-02 · unverdicted · none · ref 79
Lynx: Progressive Speculative Quantization for accelerating KV Transfer in Long-Context Inference · cs.DC · 2026-07-02 · unverdicted · none · ref 68
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs · cs.AR · 2026-03-28 · unverdicted · none · ref 67
Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads · cs.LG · 2026-01-29 · unverdicted · none · ref 15
ASAP: A Disaggregated and Asynchronous Inference System for MoE Prefill · cs.DC · 2026-06-21 · unverdicted · none · ref 57
Achieving Cloud-Grade SLOs for Local Mixture-of-Experts Inference through CPU-GPU Hybrid Design · cs.DC · 2026-06-09 · unverdicted · none · ref 61
Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs · cs.LG · 2026-04-20 · unverdicted · none · ref 59
PICO: Performance Insights for Collective Operations · cs.DC · 2025-08-22 · unverdicted · none · ref 58
Ascend-RaBitQ: Heterogeneous NPU-CPU Acceleration of Billion-Scale Similarity Search with 1-bit Quantization · cs.IR · 2026-05-15 · unverdicted · none · ref 49
Fairness-Aware and Latency-Controllable Scheduling for Chunked-Prefill LLM Serving · cs.DC · 2026-06-08 · unverdicted · none · ref 15
SMC-AI: Scaling Monte Carlo Simulation to Four Trillion Atoms with AI Accelerators · physics.comp-ph · 2026-04-09 · unverdicted · none · ref 2
