ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
arXiv preprint arXiv:2506.12708 , year=
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
A renewal-reward analysis yields a closed-form mean-field rule for the optimal Attention/FFN provisioning ratio in disaggregated LLM serving that accounts for stochastic KV-cache growth and matches simulation optima within 10%.
NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better energy efficiency.
PICO is a benchmarking framework for collective operations that decouples portable setup from platform execution, supplies reference MPI implementations, and shows default choices can be up to 5x slower with up to 44% end-to-end training time reductions in simulator replays.
Ascend-RaBitQ is the first heterogeneous NPU-CPU optimized IVF-RaBitQ system for billion-scale vector search that decouples coarse ranking on NPU from fine ranking on CPU to leverage optimal hardware per stage.
LLM serving requires mathematical optimization and algorithms with provable guarantees rather than generic heuristics that fail unpredictably on LLM workloads.
SMC-AI scales Monte Carlo simulations to 4 trillion atoms on AI hardware clusters, achieving 32 times larger systems and 1.3 times higher throughput than prior records while decoupling ML models from the simulation core.
citing papers explorer
-
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
-
Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads
A renewal-reward analysis yields a closed-form mean-field rule for the optimal Attention/FFN provisioning ratio in disaggregated LLM serving that accounts for stochastic KV-cache growth and matches simulation optima within 10%.
-
Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs
NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better energy efficiency.
-
PICO: Performance Insights for Collective Operations
PICO is a benchmarking framework for collective operations that decouples portable setup from platform execution, supplies reference MPI implementations, and shows default choices can be up to 5x slower with up to 44% end-to-end training time reductions in simulator replays.
-
Ascend-RaBitQ: Heterogeneous NPU-CPU Acceleration of Billion-Scale Similarity Search with 1-bit Quantization
Ascend-RaBitQ is the first heterogeneous NPU-CPU optimized IVF-RaBitQ system for billion-scale vector search that decouples coarse ranking on NPU from fine ranking on CPU to leverage optimal hardware per stage.
-
Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics
LLM serving requires mathematical optimization and algorithms with provable guarantees rather than generic heuristics that fail unpredictably on LLM workloads.
-
SMC-AI: Scaling Monte Carlo Simulation to Four Trillion Atoms with AI Accelerators
SMC-AI scales Monte Carlo simulations to 4 trillion atoms on AI hardware clusters, achieving 32 times larger systems and 1.3 times higher throughput than prior records while decoupling ML models from the simulation core.