arxiv: 2511.09557 · v4 · pith:VSGQ6UIInew · submitted 2025-11-12 · 💻 cs.DC · cs.LG

Understanding and Improving Communication Performance in Multi-node LLM Inference

Pith reviewed 2026-05-21 19:19 UTC · model grok-4.3

classification 💻 cs.DC cs.LG
keywords multi-node LLM inference · all-reduce optimization · tensor parallelism · communication performance · NVSHMEM · NCCL · decode latency

The pith

NVRAR, a hierarchical all-reduce built on recursive doubling and NVSHMEM, reduces communication latency in multi-node LLM inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how communication overheads slow down large language model inference when models are split across multiple nodes and GPUs. It pinpoints all-reduce operations as a frequent bottleneck in tensor-parallel decode phases. The authors introduce NVRAR to handle these operations more efficiently than existing libraries. Experiments on high-performance networks show clear improvements in both raw communication speed and overall batch processing time for a 405B parameter model. The results demonstrate that targeted changes to collective communication can directly lower end-to-end latency in distributed inference.

Core claim

NVRAR achieves up to 1.9×-3.6× lower latency than NCCL for message sizes between 128 KB and 2 MB on HPE Slingshot and InfiniBand interconnects. Integrated into YALIS, NVRAR achieves up to a 1.72× reduction in end-to-end batch latency for the Llama 3.1 405B model in multi-node decode-heavy workloads using tensor parallelism.

What carries the argument

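NVRAR, the hierarchical all-reduce algorithm based on recursive doubling with NVSHMEM. It runs in three phases (intra-node reduce-scatter, inter-node recursive-doubling all-reduce, intra-node all-gather) and, in the tested 128 KB to 2 MB range, completes multi-node all-reduces with lower latency than NCCL.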
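The three-phase structure named in the Figure 5 caption can be sketched as a single-process simulation. This is only an illustration of the data flow, not the authors' NVSHMEM implementation: the function name `hierarchical_all_reduce`, the nested-list layout `data[node][gpu]`, and the power-of-two node count are assumptions made for the example.

```python
from typing import List

Vector = List[float]


def hierarchical_all_reduce(data: List[List[Vector]]) -> List[List[Vector]]:
    """Simulate a hierarchical sum all-reduce; data[n][g] is GPU g's input on node n."""
    num_nodes = len(data)
    gpus = len(data[0])
    length = len(data[0][0])
    # Assumptions of this sketch, not requirements stated by the paper.
    assert num_nodes & (num_nodes - 1) == 0, "power-of-two node count assumed"
    assert length % gpus == 0, "vector must split evenly across local GPUs"
    chunk = length // gpus

    # Phase 1: intra-node reduce-scatter.
    # GPU g on each node ends up owning chunk g, summed over that node's GPUs.
    partial = [
        [
            [sum(data[n][src][i] for src in range(gpus))
             for i in range(g * chunk, (g + 1) * chunk)]
            for g in range(gpus)
        ]
        for n in range(num_nodes)
    ]

    # Phase 2: inter-node recursive doubling.
    # In each step, node n exchanges its chunk with node n XOR distance and both
    # add what they receive; after log2(num_nodes) steps every node holds the
    # global sum of each chunk.
    distance = 1
    while distance < num_nodes:
        partial = [
            [
                [a + b for a, b in zip(partial[n][g], partial[n ^ distance][g])]
                for g in range(gpus)
            ]
            for n in range(num_nodes)
        ]
        distance *= 2

    # Phase 3: intra-node all-gather.
    # Every GPU collects all fully reduced chunks from its local peers.
    return [
        [[x for g in range(gpus) for x in partial[n][g]] for _ in range(gpus)]
        for n in range(num_nodes)
    ]
```

Because only one chunk per GPU crosses the network, each inter-node step moves 1/G of the message (G being the GPUs per node), and the number of inter-node steps grows with log2 of the node count.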

Load-bearing premise

The latency reductions measured on the tested supercomputer interconnects and decode-heavy workloads will generalize to other production inference engines, network configurations, and workload mixes.

What would settle it

Measure end-to-end batch latency for Llama 3.1 405B on an Ethernet-based cluster or with a different inference engine and check whether the reduction stays above 1.5× compared to NCCL.

Figures

Figures reproduced from arXiv: 2511.09557 by Abhinav Bhatele, Akarsh Srivastava, Charles Fredrick Jekel, Harshitha Menon, Lannie Dalton Hough, Prajwal Singhania, Siddharth Singh.

Figure 1: Strong scaling performance of different inference engines on Perlmutter for Llama 3.1 70B Instruct. The Y-axis shows the end-to-end latency per batch in seconds and the X-axis shows the number of GPUs.
Figure 2: Strong scaling performance of different inference engines on Perlmutter for Llama 3.1 405B Instruct. The Y-axis shows the end-to-end latency per batch in seconds and the X-axis shows the number of GPUs.
Figure 4: Synthetic GEMM benchmarks modeling Prefill (left) and Decode (right) matrix multiplications in the MLP layer of the 70B Llama model.
Observation 2: For prefill-heavy workloads, both TP and PP reduce computation time, with PP achieving lower overall latency due to its reduced communication overhead. For decode-heavy workloads, PP does not reduce matrix multiplication time, while TP suffers from significant c…
Figure 3: Performance breakdown of TP (using YALIS) and HP (using vLLM V0) for the prefill-heavy and decode-heavy workloads on Perlmutter for the 70B Llama model. For the prefill-heavy workload (…
Figure 5: Three-phase NVRAR design: (1) intra-node reduce-scatter, (2) inter-node recursive-doubling all-reduce, (3) intra-node all-gather.
… faster within a node, but that its latency increases sharply across nodes and scales poorly. For 512 KB–1 MB messages, NCCL is 1.5–2× slower than MPI, with latency growing faster with message size at any given scale. While Cray-MPICH's implementation is proprietary, the open-sou…
Figure 6: Scaling performance of NCCL and MPI all-reduce for a range of message sizes on Perlmutter.
Observation 3: For small message sizes, typical in the decode phase, NCCL all-reduce exhibits poor scaling across nodes and can at times be slower than MPI.
Section 4, Optimized Multi-node All-reduce: Having established the usefulness of TP for decode-heavy workloads, we now focus on optimizing its communication bottlenecks. On…
Figure 7: Performance comparison of NVRAR and NCCL all-reduce for 256 KB and 1024 KB input sizes, across varying GPU counts on Perlmutter (A100, Slingshot-11) (left) and Vista (GH200, InfiniBand) (right).
… at the same time (avoiding a barrier-like synchronization). This check is performed at the beginning of the all-reduce operation, allowing a rank to finish its all-reduce and use the data immediately, without …
Figure 8: Heatmaps showing the speedup of NVRAR over NCCL all-reduce in the standalone microbenchmark on Perlmutter and Vista.
Figure 9: Relative speedup of YALIS (TP) using NVRAR all-reduce over YALIS (TP) using NCCL all-reduce for the decode-heavy workload on Perlmutter and Vista, across different models and NumPrompts (#P = 8 and #P = 32).
… axis, consistent with our theoretical model (Eq. 9), for both 256 KB and 1024 KB messages. NCCL (blue) exhibits similar scaling for 1024 KB messages, as it consistently uses the Tree algorithm (LL pro…
Figure 10
Figure 11: Strong scaling performance of different inference engines on Perlmutter for the Llama 3.1 70B (top) and 405B (bottom) models, for the decode-heavy workload with NumPrompts = 32. The Y-axis shows the time to completion for a batch of prompts in seconds and the X-axis shows the number of GPUs.
Original abstract

As large language models (LLMs) continue to grow in size, distributed inference has become increasingly important. Model-parallel strategies must now efficiently scale not only across multiple GPUs but also across multiple nodes. In this work, we present a detailed performance study of multi-node distributed inference using LLMs on GPU-based supercomputers. We conduct experiments with several state-of-the-art inference engines alongside YALIS, a research-oriented prototype engine designed for controlled experimentation. We analyze the strong-scaling behavior of different model-parallel schemes and identify key bottlenecks. Because all-reduce operations are a common performance bottleneck, we develop NVRAR, a hierarchical all-reduce algorithm based on recursive doubling with NVSHMEM. NVRAR achieves up to 1.9×–3.6× lower latency than NCCL for message sizes between 128 KB and 2 MB on HPE Slingshot and InfiniBand interconnects. Integrated into YALIS, NVRAR achieves up to a 1.72× reduction in end-to-end batch latency for the Llama 3.1 405B model in multi-node decode-heavy workloads using tensor parallelism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents a performance study of multi-node distributed LLM inference across several engines including the YALIS research prototype. It identifies all-reduce operations as key bottlenecks in model-parallel scaling and introduces NVRAR, a hierarchical all-reduce algorithm based on recursive doubling with NVSHMEM. The paper reports that NVRAR achieves 1.9×–3.6× lower latency than NCCL for 128 KB–2 MB messages on HPE Slingshot and InfiniBand interconnects, and that integrating NVRAR into YALIS yields up to 1.72× reduction in end-to-end batch latency for Llama 3.1 405B in multi-node tensor-parallel decode-heavy workloads.

Significance. If the measured improvements hold under the reported conditions, the work supplies useful empirical data on communication scaling for large-model inference and a concrete algorithmic alternative to NCCL. The direct wall-clock comparisons against NCCL baselines and the use of real supercomputer hardware are strengths. The stress-test concern about generalization does not land as a load-bearing issue because the abstract and results explicitly scope the latency claims to the tested HPE Slingshot/InfiniBand fabrics and the YALIS prototype; no broader generalization is asserted.

major comments (1)
  1. The central empirical claims rest on latency and end-to-end numbers whose reliability cannot be fully assessed without details on run counts, statistical variance, and exact workload definitions (message patterns, batch sizes, and node counts). This information is required to verify that the reported 1.9×–3.6× and 1.72× factors are reproducible rather than single-run artifacts.
minor comments (2)
  1. Clarify the exact NVSHMEM primitives and recursive-doubling schedule used in NVRAR (e.g., which collective is replaced at which hierarchy level) so that the algorithm can be re-implemented outside YALIS.
  2. Add a short discussion of why the observed gains appear only in the 128 KB–2 MB range and whether this range aligns with typical all-reduce sizes in other tensor-parallel workloads.
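As a rough way to relate the 128 KB–2 MB range to decode workloads, the sketch below estimates the all-reduce payload of one tensor-parallel decode step as batch size × hidden size × bytes per element. The helper name `decode_all_reduce_bytes`, the 2-byte (bf16/fp16) element size, and the hidden sizes (8192 for Llama 3.1 70B, 16384 for 405B, taken from the public model configurations) are assumptions of this example and are not stated on this page.

```python
def decode_all_reduce_bytes(batch_size: int, hidden_size: int,
                            bytes_per_element: int = 2) -> int:
    """Estimated payload of one decode-step all-reduce.

    Assumes one new token per sequence, so the reduced activation has shape
    (batch_size, hidden_size). All inputs are illustrative assumptions.
    """
    return batch_size * hidden_size * bytes_per_element


# Hidden sizes assumed from the public Llama 3.1 model configurations.
HIDDEN_SIZE = {"llama-3.1-70b": 8192, "llama-3.1-405b": 16384}

for model, hidden in HIDDEN_SIZE.items():
    for batch in (8, 32):
        kib = decode_all_reduce_bytes(batch, hidden) // 1024
        print(f"{model}: batch={batch} -> {kib} KB")
```

Under these assumptions, batches of 8 and 32 sequences give 128 KB and 512 KB for the 70B model, and 256 KB and 1024 KB for the 405B model, all inside the range where the reported gains appear.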

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and for highlighting the importance of experimental reproducibility. We address the major comment below and will update the manuscript to incorporate the requested details.

Point-by-point responses
  1. Referee: The central empirical claims rest on latency and end-to-end numbers whose reliability cannot be fully assessed without details on run counts, statistical variance, and exact workload definitions (message patterns, batch sizes, and node counts). This information is required to verify that the reported 1.9×–3.6× and 1.72× factors are reproducible rather than single-run artifacts.

    Authors: We agree that the current manuscript would benefit from additional methodological details to support reproducibility. In the revised version we will expand the experimental setup and evaluation sections to report: (i) the number of runs per measurement (typically 100 iterations after a 10-iteration warm-up, with the median and standard deviation shown), (ii) explicit workload parameters including batch sizes (1–32), sequence lengths, node counts (2–8 nodes with 8 GPUs each), and message patterns (all-reduce collectives arising in tensor-parallel decode phases for Llama 3.1 405B), and (iii) the precise hardware and software configurations used on the HPE Slingshot and InfiniBand testbeds. These additions will allow readers to assess the statistical reliability of the 1.9×–3.6× and 1.72× factors. revision: yes
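The measurement protocol described in the response (a warm-up, then repeated timed iterations summarized by median and standard deviation) could look like the generic harness below. The names `benchmark` and `speedup` are hypothetical, and the harness times an arbitrary Python callable on the host; timing a real GPU collective would also need a device synchronization inside `op`, which is left out here.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(op: Callable[[], None], iters: int = 100,
              warmup: int = 10) -> Dict[str, float]:
    """Time `op` over `iters` runs after `warmup` untimed runs."""
    # Warm-up runs are executed but not recorded.
    for _ in range(warmup):
        op()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        op()  # for GPU work, `op` itself must block until the device is done
        samples.append(time.perf_counter() - start)

    return {
        "median_s": statistics.median(samples),
        "stdev_s": statistics.stdev(samples),  # requires iters >= 2
    }


def speedup(baseline: Dict[str, float], candidate: Dict[str, float]) -> float:
    """Ratio of median latencies; values above 1.0 mean the candidate is faster."""
    return baseline["median_s"] / candidate["median_s"]
```

Reporting the median next to the standard deviation makes it easier to judge whether a speedup factor exceeds run-to-run noise.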

Circularity Check

0 steps flagged

No circularity: performance claims rest on direct wall-clock measurements

The manuscript reports empirical latency and end-to-end batch timing results obtained by running the implemented NVRAR algorithm against NCCL baselines on HPE Slingshot and InfiniBand fabrics inside the YALIS prototype. No equations, fitted parameters, or predictions are derived from prior results within the paper; the speedups (1.9–3.6× for 128 KB–2 MB messages, 1.72× batch latency) are presented as direct experimental observations rather than outputs of any self-referential derivation or self-citation chain. The central performance claims therefore remain independent of the paper's own inputs and are externally falsifiable by replication on the stated hardware and workloads.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The work rests on standard assumptions about NVSHMEM availability and the dominance of all-reduce in tensor-parallel decode phases; no free parameters are fitted and no new physical entities are postulated.

axioms (2)
  • domain assumption: All-reduce operations constitute the primary communication bottleneck in multi-node tensor-parallel LLM inference.
    Identified through the performance study described in the abstract.
  • domain assumption: NVSHMEM provides lower-overhead one-sided communication than standard collectives on supported interconnects.
    Basis for choosing NVSHMEM in the NVRAR design.
invented entities (1)
  • NVRAR (no independent evidence)
    purpose: Hierarchical all-reduce algorithm for reduced latency in multi-node LLM inference
    Newly proposed algorithm combining recursive doubling with NVSHMEM hierarchy.
