
arxiv: 2606.20582 · v1 · pith:57OVAVXVnew · submitted 2026-05-08 · 💻 cs.NI

Communication-Semantic-Aware RDMA Loss Recovery for QP-scalable Hyperscale AI Training

Pith reviewed 2026-06-30 23:33 UTC · model grok-4.3

classification 💻 cs.NI
keywords RDMA · loss recovery · collective communication · AI training · unreliable datagram · tail latency · scalability · hyperscale

The pith

CSA-UD recovers RDMA packet losses in AI training by exploiting synchronization semantics to enable scalable unreliable datagram use without per-pair queue pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CSA-UD to overcome the scalability limits of reliable RDMA connections when running large collective operations for trillion-parameter model training. It separates data transmission from loss recovery and tunes the loss detection interval using the global synchronization points already present in distributed training. It also adds multipath transmission and bitmap-guided reassembly so that high throughput can be maintained without lossless network fabrics. Testbed experiments and ns-3 simulations indicate better scalability than standard RC and, under high network load, more than 30 percent lower 99th-percentile flow completion times than the compared transports.

Core claim

CSA-UD decouples data transmission from loss recovery and dynamically adjusts the loss detection interval, accelerating tail recovery and exploiting the synchronization semantics of distributed training. It further supports multipath transmission and bitmap-guided reassembly, enabling high throughput without requiring lossless fabrics.
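The decoupling described above can be illustrated with a toy model. This is a minimal sketch, not the paper's implementation: the function name `run_transfer`, the round-based structure, and the "each listed packet is lost exactly once" loss model are all assumptions made for illustration. The sender streams every packet without waiting for acknowledgements, and a separate receiver-side scan reports only the missing packet sequence numbers (PSNs) for retransmission.

```python
def run_transfer(num_packets, drop_once):
    """Toy model of decoupled transmission and loss recovery.

    num_packets: packets in the message, with PSNs 0..num_packets-1.
    drop_once:   hypothetical set of PSNs lost on their first transmission.
    Returns (rounds, retransmitted_psns).
    """
    received = set()
    pending_drops = set(drop_once)
    to_send = list(range(num_packets))
    retransmitted = []
    rounds = 0
    while to_send:
        rounds += 1
        # Data path: send everything queued, no per-packet ACK wait.
        for psn in to_send:
            if psn in pending_drops:
                pending_drops.discard(psn)  # lost now, succeeds next time
                continue
            received.add(psn)
        # Recovery path: the receiver scans its state and NACKs the gaps.
        to_send = [p for p in range(num_packets) if p not in received]
        retransmitted.extend(to_send)
    return rounds, retransmitted


print(run_transfer(8, {2, 5}))
```

In this toy model only the two lost packets are resent, and the transfer finishes in two rounds. How often the scan runs is the part CSA-UD tunes dynamically.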

What carries the argument

Communication-Semantic-Aware Unreliable Datagram (CSA-UD) loss recovery that uses training synchronization points to set dynamic loss detection intervals.

If this is right

  • CSA-UD achieves better scalability than RC by avoiding RNIC queue-pair cache limits on the order of thousands of entries.
  • Under high network load it reduces 99th-percentile flow completion times by over 30 percent compared with existing RC and UD counterparts.
  • Collective operations such as All-Reduce and All-to-All can run at high throughput without lossless fabrics.
  • Multipath transmission combined with bitmap-guided reassembly keeps tail recovery fast even when individual paths experience loss.
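The first bullet can be made concrete with a back-of-the-envelope count. This is a rough sketch under stated assumptions: one RC QP per peer process (as the abstract describes), one UD QP per process (a UD QP can address any peer), and a hypothetical cache capacity of 4096 entries, since the source only says "a few thousand". The function names are illustrative.

```python
# Hypothetical cache capacity; the source only says "a few thousand" QPs.
ASSUMED_QP_CACHE_ENTRIES = 4096


def rc_qps_per_process(num_processes):
    """RC: one dedicated QP for every peer process."""
    return max(num_processes - 1, 0)


def ud_qps_per_process(num_processes):
    """UD: a single datagram QP can address every peer."""
    return 1 if num_processes > 0 else 0


def exceeds_cache(qps, cache_entries=ASSUMED_QP_CACHE_ENTRIES):
    """True when the active QP count no longer fits in the RNIC cache."""
    return qps > cache_entries


for n in (1024, 8192, 16384):
    rc = rc_qps_per_process(n)
    ud = ud_qps_per_process(n)
    print(n, rc, exceeds_cache(rc), ud, exceeds_cache(ud))
```

Under these assumptions the RC count grows linearly with the number of peers and overflows the assumed cache well before the 10 K QP scale mentioned for MoE training, while the UD count stays constant.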

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same semantic-aware interval adjustment could be applied to other periodically synchronized distributed workloads outside AI training.
  • Bitmap-guided reassembly may allow existing RDMA hardware to support higher message rates without additional lossless-network requirements.
  • Dynamic loss detection tuned to sync points might be tested on workloads that have regular barriers but lack the strict all-to-all pattern of training collectives.

Load-bearing premise

The synchronization semantics of distributed training can be safely exploited to dynamically adjust loss detection intervals without missing losses or adding unacceptable overhead in production hyperscale deployments.

What would settle it

A production-scale training run in which CSA-UD's dynamic detection intervals either miss losses that delay iterations or add overhead that negates the reported tail-latency gains.

Figures

Figures reproduced from arXiv: 2606.20582 by Fengyuan Ren, Jiakun Bao, Kun Zhu, Tong Zhang, Xiaoxiang Hua, Zhenjiang Dong.

Figure 1: QP growth in RC (left) vs. UD (right). … memory and cached in the RNIC’s on-chip SRAM for fast access. In high-concurrency scenarios, the limited RNIC cache capacity can be quickly exhausted. When active QPs exceed the cache limit, cache misses force the RNIC to retrieve QP metadata via PCIe from host memory, significantly increasing access latency, which is a major source of latency amplification, especial…
Figure 2: Scalability Issues of RNICs. C. RDMA Connection Scalability Issue. RDMA connection scalability issues are particularly pronounced in workloads with dense all-to-all communication patterns. A typical application scenario is Large Language Model (LLM) training, especially Mixture-of-Experts (MoE) training, where the communication process often involves more than 10 K QPs. Such a large number of QPs pose sig…
Figure 4: CSA-UD’s tail-aware NACK. The scan scheduler adjusts the bitmap …
Figure 3: CSA-UD Framework. … based loss detection. Rather than using a static scan interval, CSA-UD dynamically adjusts the scanning frequency based on two runtime indicators: remaining data volume, and instantaneous receive rate. During the early transmission stages, the interval remains long to reduce control traffic and avoid spurious losses due to out-of-order packets. In later stages, the scan interval shortens…
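One way to picture a scan interval driven by those two indicators is to tie it to the estimated time left in the transfer. The sketch below is an assumption for illustration only, not the paper's formula: the function name, the clamp bounds `t_min_us` and `t_max_us`, and the `fraction` parameter are all hypothetical.

```python
def scan_interval_us(remaining_bytes, recv_rate_bytes_per_us,
                     t_min_us=5.0, t_max_us=200.0, fraction=0.5):
    """Hypothetical scan-interval rule based on two runtime indicators.

    The interval is a fraction of the estimated time to finish
    (remaining volume / instantaneous receive rate), clamped to
    [t_min_us, t_max_us]. All constants are illustrative.
    """
    if remaining_bytes <= 0 or recv_rate_bytes_per_us <= 0:
        # Nothing left, or the flow has stalled: scan as soon as allowed.
        return t_min_us
    time_to_finish_us = remaining_bytes / recv_rate_bytes_per_us
    return min(max(fraction * time_to_finish_us, t_min_us), t_max_us)


# 100 Gbps is 12,500 bytes per microsecond.
RATE = 12_500
for remaining in (10_000_000, 1_000_000, 100_000):
    print(remaining, scan_interval_us(remaining, RATE))
```

With these illustrative constants the interval stays at the upper bound while most of the data is outstanding and drops toward the lower bound as the tail approaches, which matches the qualitative behavior described in the excerpt.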
Figure 5: Receiver-side bitmap and buffer mapping logic. PSNs are mapped …
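A receiver-side bitmap with PSN-to-buffer mapping might look like the sketch below. It assumes a fixed payload size per packet and a direct mapping `offset = psn * payload_size`. The class name `ReassemblyBuffer` and its methods are hypothetical and do not come from the paper, whose caption is truncated before the actual mapping is described.

```python
class ReassemblyBuffer:
    """Hypothetical bitmap-guided reassembly for one message."""

    def __init__(self, message_len, payload_size):
        self.message_len = message_len
        self.payload_size = payload_size
        self.num_packets = -(-message_len // payload_size)  # ceiling division
        self.buffer = bytearray(message_len)
        self.bitmap = bytearray((self.num_packets + 7) // 8)
        self.received = 0

    def _is_set(self, psn):
        return bool(self.bitmap[psn >> 3] & (1 << (psn & 7)))

    def on_packet(self, psn, payload):
        """Place a packet by PSN; return False for invalid or duplicate data."""
        if not 0 <= psn < self.num_packets or self._is_set(psn):
            return False
        offset = psn * self.payload_size
        expected = min(self.payload_size, self.message_len - offset)
        if len(payload) != expected:
            return False
        # Out-of-order arrival is fine: the PSN fixes the destination offset.
        self.buffer[offset:offset + expected] = payload
        self.bitmap[psn >> 3] |= 1 << (psn & 7)
        self.received += 1
        return True

    def missing_psns(self):
        """PSNs a scan would report in a NACK."""
        return [p for p in range(self.num_packets) if not self._is_set(p)]

    def complete(self):
        return self.received == self.num_packets


msg = bytes(range(10))
buf = ReassemblyBuffer(len(msg), payload_size=4)
buf.on_packet(2, msg[8:10])   # last packet arrives first
buf.on_packet(0, msg[0:4])
print(buf.missing_psns())     # PSN 1 is still outstanding
buf.on_packet(1, msg[4:8])
print(buf.complete(), bytes(buf.buffer) == msg)
```

Because each packet lands at an offset determined by its PSN, packets arriving out of order over multiple paths need no reordering queue, and the bitmap doubles as the input for loss detection.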
Figure 6: Scan interval $T_i$ adaptation under varying bandwidth conditions. … the scan interval to throughput deviation from the ideal rate. It is derived from two operational parameters: the maximum expected degradation ratio $\kappa = R_{\text{ideal}}/R_{\min}$ and the maximum allowed modulation factor $F_{\max}$. Enforcing $F_i \le F_{\max}$ at $R_i = R_{\min}$ gives

$$\left(\frac{R_{\text{ideal}}}{R_{\min}}\right)^{\gamma} = F_{\max} \;\Rightarrow\; \gamma = \frac{\ln F_{\max}}{\ln \kappa}. \tag{7}$$

For example, in a 100 Gbps network with $R_{\min}$…
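Eq. (7) can be checked numerically. The values κ = 10 and F_max = 4 below are hypothetical and are not the paper's example, which is truncated in the excerpt. The helper `modulation_factor` assumes the form F_i = (R_ideal / R_i)^γ capped at F_max, which is what Eq. (7) implies at R_i = R_min but is not stated explicitly in the excerpt.

```python
import math


def modulation_exponent(kappa, f_max):
    """gamma = ln(F_max) / ln(kappa), from Eq. (7)."""
    if kappa <= 1 or f_max < 1:
        raise ValueError("expected kappa > 1 and f_max >= 1")
    return math.log(f_max) / math.log(kappa)


def modulation_factor(r_ideal, r_i, gamma, f_max):
    """Assumed form F_i = (R_ideal / R_i) ** gamma, capped at F_max."""
    return min((r_ideal / r_i) ** gamma, f_max)


# Hypothetical parameters, chosen only for illustration.
kappa, f_max = 10.0, 4.0
gamma = modulation_exponent(kappa, f_max)
print(round(gamma, 3))                               # about 0.602
print(modulation_factor(100.0, 100.0, gamma, f_max))  # ideal rate: factor 1.0
print(modulation_factor(100.0, 5.0, gamma, f_max))    # below R_min: capped at 4.0
```

With this choice of γ the factor equals 1 at the ideal rate, reaches F_max exactly when the rate degrades by the ratio κ, and is capped beyond that point.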
Figure 8: Leaf-spine topology used in simulations.
Figure 9: RNIC under different QP numbers and message sizes.
Figure 10: Average FCT of five RDMA transport mechanisms under (a) low …
Figure 13: 99th-percentile FCT and retransmission ratio under varying load.
Original abstract

Current artificial intelligence (AI) infrastructures widely adopt Remote Direct Memory Access (RDMA) to support high-performance communication. Training trillion-parameter models involves frequent collective communication operations, such as All-Reduce and All-to-All, which generate intensive RDMA traffic. Existing RDMA deployments predominantly use the reliable connection (RC) model, where each process pair requires a dedicated queue pair (QP). This leads to poor scalability: since the RDMA-capable network interface card (RNIC) can cache only a few thousand QPs, excess entries trigger PCIe round-trip penalties. Meanwhile, global synchronization makes training sensitive to tail latency, where a few packet losses can delay iteration completion. To address these challenges, we propose Communication-Semantic-Aware Unreliable Datagram (CSA-UD), a novel RDMA loss recovery mechanism that combines scalability and reliability. CSA-UD decouples data transmission from loss recovery and dynamically adjusts the loss detection interval, accelerating tail recovery and exploiting the synchronization semantics of distributed training. It further supports multipath transmission and bitmap-guided reassembly, enabling high throughput without requiring lossless fabrics. Testbed experiments and ns-3 simulations show that CSA-UD significantly reduces tail latency under large-scale collective communication. Under high network load, it achieves better scalability than RC and over 30% lower 99th percentile flow completion times compared with counterparts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes CSA-UD, a novel RDMA loss recovery mechanism for hyperscale AI training that decouples data transmission from loss recovery, dynamically adjusts loss detection intervals by exploiting collective synchronization semantics (e.g., All-Reduce), supports multipath transmission and bitmap-guided reassembly, and claims to achieve QP scalability superior to RC while reducing tail latency. Testbed experiments and ns-3 simulations are reported to show better scalability than RC and over 30% lower 99th-percentile flow completion times under high network load.

Significance. If the performance claims and safety of the dynamic interval adjustment hold under realistic straggler and jitter conditions, the work could meaningfully improve scalability and tail-latency behavior in large-scale collective communication for trillion-parameter training without requiring lossless fabrics.

major comments (2)
  1. [Abstract / CSA-UD design] Abstract (and CSA-UD design paragraph): the central performance claims (>30% lower 99th-percentile FCT, better scalability than RC) rest on the ability to safely shorten loss-detection intervals via synchronization semantics, yet no quantitative bound, tolerance analysis, or worst-case overhead is supplied for iteration-time variation or stragglers; this directly affects whether the reliability guarantee and tail-latency benefit are preserved.
  2. [Abstract] Abstract: the reported experimental results supply no setup details, baselines, error bars, statistical tests, or data-exclusion rules, rendering the soundness of the 30% FCT reduction and scalability claims unverifiable from the provided text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting areas where the manuscript can be strengthened. We address each major comment below and will incorporate revisions to improve clarity and verifiability of the claims.

Point-by-point responses
  1. Referee: [Abstract / CSA-UD design] Abstract (and CSA-UD design paragraph): the central performance claims (>30% lower 99th-percentile FCT, better scalability than RC) rest on the ability to safely shorten loss-detection intervals via synchronization semantics, yet no quantitative bound, tolerance analysis, or worst-case overhead is supplied for iteration-time variation or stragglers; this directly affects whether the reliability guarantee and tail-latency benefit are preserved.

    Authors: We agree that explicit quantitative bounds and tolerance analysis are needed to fully substantiate the safety of dynamically shortening loss-detection intervals. The manuscript describes how collective synchronization (e.g., All-Reduce barriers) limits iteration-time variation, but does not supply formal worst-case overhead calculations or sensitivity analysis under stragglers/jitter. In revision we will add a new analysis subsection deriving bounds from testbed-measured straggler distributions, including the maximum safe shortening factor and overhead under high jitter, to confirm that reliability and tail-latency gains are preserved. revision: yes

  2. Referee: [Abstract] Abstract: the reported experimental results supply no setup details, baselines, error bars, statistical tests, or data-exclusion rules, rendering the soundness of the 30% FCT reduction and scalability claims unverifiable from the provided text.

    Authors: The abstract is intentionally concise and therefore omits detailed methodology. The full manuscript (Section 5) already contains the testbed configuration, ns-3 parameters, baselines (RC and prior UD schemes), error bars from repeated runs, and statistical tests. To improve verifiability directly from the abstract we will add a short clause summarizing key setup elements (node count, loss rate, baselines) and a pointer to the full evaluation section for error bars and statistical details. revision: partial

Circularity Check

0 steps flagged

No significant circularity in mechanism design or experimental validation

The paper proposes CSA-UD as a new RDMA loss recovery design that decouples transmission from recovery and adjusts detection intervals using collective synchronization semantics. Performance claims rest on testbed experiments and ns-3 simulations rather than any derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps. No equations or self-referential reductions appear in the abstract or described approach; the central claims remain independent of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Ledger populated from abstract only; no free parameters, additional axioms, or invented entities beyond the proposed mechanism are described.

axioms (2)
  • domain assumption: Existing RDMA deployments predominantly use the reliable connection (RC) model, where each process pair requires a dedicated queue pair (QP).
    Stated as current practice leading to scalability issues.
  • domain assumption: Global synchronization in training makes it sensitive to tail latency from packet losses.
    Presented as a key challenge for collective operations.
invented entities (1)
  • CSA-UD (no independent evidence)
    purpose: Communication-semantic-aware unreliable datagram RDMA loss recovery mechanism
    New mechanism introduced to combine scalability and reliability.

pith-pipeline@v0.9.1-grok · 5789 in / 1458 out tokens · 32762 ms · 2026-06-30T23:33:12.466222+00:00 · methodology
