Performance analysis of multi-node LLM inference identifies all-reduce bottlenecks and introduces NVRAR hierarchical all-reduce achieving 1.9-3.6x lower latency than NCCL and up to 1.72x end-to-end batch latency reduction for Llama 3.1 405B in decode-heavy tensor-parallel workloads.
Title resolution pending
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.DC (1)
years: 2025 (1)
verdicts: CONDITIONAL (1)
representative citing papers
- Understanding and Improving Communication Performance in Multi-node LLM Inference