Serving large language models on Huawei CloudMatrix384. arXiv preprint arXiv:2506.12708

Pengfei Zuo, Huimin Lin, Junbo Deng, Nan Zou, Xingkun Yang, Yingyu Diao, Weifeng Gao, Ke Xu, Zhangyu Chen, Shirui Lu, et al. · 2025 · arXiv 2506.12708

7 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs

cs.AR · 2026-03-28 · unverdicted · novelty 7.0

ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.

Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads

cs.LG · 2026-01-29 · unverdicted · novelty 7.0

A renewal-reward analysis yields a closed-form mean-field rule for the optimal Attention/FFN provisioning ratio in disaggregated LLM serving that accounts for stochastic KV-cache growth and matches simulation optima within 10%.

Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better energy efficiency.

PICO: Performance Insights for Collective Operations

cs.DC · 2025-08-22 · unverdicted · novelty 6.0

PICO is a benchmarking framework for collective operations that decouples portable setup from platform execution, supplies reference MPI implementations, and shows default choices can be up to 5x slower with up to 44% end-to-end training time reductions in simulator replays.

Ascend-RaBitQ: Heterogeneous NPU-CPU Acceleration of Billion-Scale Similarity Search with 1-bit Quantization

cs.IR · 2026-05-15 · unverdicted · novelty 5.0

Ascend-RaBitQ is the first heterogeneous NPU-CPU optimized IVF-RaBitQ system for billion-scale vector search that decouples coarse ranking on NPU from fine ranking on CPU to leverage optimal hardware per stage.

Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics

cs.DC · 2026-05-02 · accept · novelty 4.0

LLM serving requires mathematical optimization and algorithms with provable guarantees rather than generic heuristics that fail unpredictably on LLM workloads.

SMC-AI: Scaling Monte Carlo Simulation to Four Trillion Atoms with AI Accelerators

physics.comp-ph · 2026-04-09 · unverdicted · novelty 4.0

SMC-AI scales Monte Carlo simulations to 4 trillion atoms on AI hardware clusters, achieving 32 times larger systems and 1.3 times higher throughput than prior records while decoupling ML models from the simulation core.

citing papers explorer

PICO: Performance Insights for Collective Operations cs.DC · 2025-08-22 · unverdicted · none · ref 58
