Communication-Semantic-Aware RDMA Loss Recovery for QP-scalable Hyperscale AI Training

Fengyuan Ren; Jiakun Bao; Kun Zhu; Tong Zhang; Xiaoxiang Hua; Zhenjiang Dong

arxiv: 2606.20582 · v1 · pith:57OVAVXVnew · submitted 2026-05-08 · 💻 cs.NI

Communication-Semantic-Aware RDMA Loss Recovery for QP-scalable Hyperscale AI Training

Xiaoxiang Hua , Tong Zhang , Zhenjiang Dong , Kun Zhu , Jiakun Bao , Fengyuan Ren This is my paper

Pith reviewed 2026-06-30 23:33 UTC · model grok-4.3

classification 💻 cs.NI

keywords RDMAloss recoverycollective communicationAI trainingunreliable datagramtail latencyscalabilityhyperscale

0 comments

The pith

CSA-UD recovers RDMA packet losses in AI training by exploiting synchronization semantics to enable scalable unreliable datagram use without per-pair queue pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CSA-UD to overcome the scalability limits of reliable RDMA connections when running large collective operations for trillion-parameter model training. It separates data transmission from loss recovery and tunes detection intervals using the global synchronization points already present in distributed training. This approach adds multipath transmission and bitmap-guided reassembly so that high throughput can be maintained without lossless network fabrics. Test results indicate improved scalability over standard RC and more than 30 percent lower 99th-percentile flow completion times under heavy load.

Core claim

CSA-UD decouples data transmission from loss recovery and dynamically adjusts the loss detection interval, accelerating tail recovery and exploiting the synchronization semantics of distributed training. It further supports multipath transmission and bitmap-guided reassembly, enabling high throughput without requiring lossless fabrics.

What carries the argument

Communication-Semantic-Aware Unreliable Datagram (CSA-UD) loss recovery that uses training synchronization points to set dynamic loss detection intervals.

If this is right

CSA-UD achieves better scalability than RC by avoiding RNIC queue-pair cache limits on the order of thousands of entries.
Under high network load it reduces 99th-percentile flow completion times by over 30 percent compared with existing RC and UD counterparts.
Collective operations such as All-Reduce and All-to-All can run at high throughput without lossless fabrics.
Multipath transmission combined with bitmap-guided reassembly keeps tail recovery fast even when individual paths experience loss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same semantic-aware interval adjustment could be applied to other periodically synchronized distributed workloads outside AI training.
Bitmap-guided reassembly may allow existing RDMA hardware to support higher message rates without additional lossless-network requirements.
Dynamic loss detection tuned to sync points might be tested on workloads that have regular barriers but lack the strict all-to-all pattern of training collectives.

Load-bearing premise

The synchronization semantics of distributed training can be safely exploited to dynamically adjust loss detection intervals without missing losses or adding unacceptable overhead in production hyperscale deployments.

What would settle it

A production-scale training run in which CSA-UD's dynamic detection intervals either miss losses that delay iterations or add overhead that negates the reported tail-latency gains.

Figures

Figures reproduced from arXiv: 2606.20582 by Fengyuan Ren, Jiakun Bao, Kun Zhu, Tong Zhang, Xiaoxiang Hua, Zhenjiang Dong.

**Figure 1.** Figure 1: QP growth in RC (left) vs. UD (right). memory and cached in the RNIC’s on-chip SRAM for fast access. In high-concurrency scenarios, the limited RNIC cache capacity can be quickly exhausted. When active QPs exceed the cache limit, cache misses force the RNIC to retrieve QP metadata via PCIe from host memory, significantly increasing access latency, which is a major source of latency amplification, especial… view at source ↗

**Figure 2.** Figure 2: Scalability Issues of RNICs. C. RDMA Connection Scalability Issue RDMA connection scalability issues are particularly pronounced in workloads with dense all-to-all communication patterns. A typical application scenario is Large Language Model (LLM) training, especially Mixture-of-Experts (MoE) training, where the communication process often involves more than 10 K QPs . Such a large number of QPs pose sig… view at source ↗

**Figure 4.** Figure 4: CSA-UD’s tail-aware NACK. The scan scheduler adjusts the bitmap [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 3.** Figure 3: CSA-UD Framework. based loss detection. Rather than using a static scan interval, CSA-UD dynamically adjusts the scanning frequency based on two runtime indicators: remaining data volume, and instantaneous receive rate. During the early transmission stages, the interval remains long to reduce control traffic and avoid spurious losses due to out-oforder packets. In later stages, the scan interval shortens… view at source ↗

**Figure 5.** Figure 5: Receiver-side bitmap and buffer mapping logic. PSNs are mapped [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 6.** Figure 6: Scan interval Ti adaptation under varying bandwidth conditions. the scan interval to throughput deviation from the ideal rate. It is derived from two operational parameters: the maximum expected degradation ratio κ = Rideal/Rmin and the maximum allowed modulation factor Fmax. Enforcing Fi ≤ Fmax at Ri = Rmin gives Rideal Rmin γ = Fmax ⇒ γ = ln Fmax ln κ . (7) For example, in a 100 Gbps network with Rmin… view at source ↗

**Figure 8.** Figure 8: Leaf-spine topology used in simulations. [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: RNIC under different QP numbers and message sizes. [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗

**Figure 10.** Figure 10: Average FCT of five RDMA transport mechanisms under (a) low [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗

**Figure 13.** Figure 13: 99th-percentile FCT and retransmission ratio under varying load. [PITH_FULL_IMAGE:figures/full_fig_p009_13.png] view at source ↗

read the original abstract

Current artificial intelligence (AI) infrastructures widely adopt Remote Direct Memory Access (RDMA) to support high-performance communication. Training trillion-parameter models involves frequent collective communication operations, such as All-Reduce and All-to-All, which generate intensive RDMA traffic. Existing RDMA deployments predominantly use the reliable connection (RC) model, where each process pair requires a dedicated queue pair (QP). This leads to poor scalability: since the RDMA-capable network interface card (RNIC) can cache only a few thousand QPs, excess entries trigger PCIe round-trip penalties. Meanwhile, global synchronization makes training sensitive to tail latency, where a few packet losses can delay iteration completion. To address these challenges, we propose Communication-Semantic-Aware Unreliable Datagram (CSA-UD), a novel RDMA loss recovery mechanism that combines scalability and reliability. CSA-UD decouples data transmission from loss recovery and dynamically adjusts the loss detection interval, accelerating tail recovery and exploiting the synchronization semantics of distributed training. It further supports multipath transmission and bitmap-guided reassembly, enabling high throughput without requiring lossless fabrics. Testbed experiments and ns-3 simulations show that CSA-UD significantly reduces tail latency under large-scale collective communication. Under high network load, it achieves better scalability than RC and over 30% lower 99th percentile flow completion times compared with counterparts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CSA-UD uses collective sync semantics to shorten RDMA loss detection intervals and claims 30% tail FCT gains, but the abstract supplies no experiment details and the timing assumption looks fragile under jitter.

read the letter

The core pitch is a new RDMA loss recovery scheme called CSA-UD that runs over unreliable datagrams, decouples data from recovery, and shortens detection intervals by leaning on the global synchronization points in All-Reduce and All-to-All. It adds multipath and bitmap reassembly to keep throughput up without lossless fabrics, and the authors say this beats RC on QP scalability while cutting 99th-percentile flow completion time by more than 30% under load.

The practical target is real: RNIC QP caches are small, so thousands of processes force PCIe round-trips, and tail latency from even a few losses stalls entire training iterations. Using the training semantics to adjust timeouts is a reasonable direction and the multipath plus bitmap pieces are concrete engineering choices that could help.

The main weakness is that the performance claims rest on an untested assumption about synchronization regularity. If iteration times vary because of stragglers or load imbalance, the dynamic interval either misses losses or reverts to conservative timeouts, erasing the tail benefit. The abstract gives no bound on tolerable jitter and no description of the testbed, baselines, number of runs, or error bars, so the 30% figure cannot be evaluated. The stress-test note on variable sync points therefore lands; nothing in the provided text shows it has been addressed.

This is aimed at researchers and engineers who build or tune large AI clusters and need alternatives to RC when QP limits bite. A reader already working on RDMA for collectives could extract the mechanism ideas, but the lack of experimental grounding means it is not yet ready to cite or deploy.

I would send it to peer review if the full paper contains reproducible experiments and some analysis or measurement of how much timing variation the scheme tolerates. Without that it stays a proposal rather than a demonstrated result.

Referee Report

2 major / 0 minor

Summary. The paper proposes CSA-UD, a novel RDMA loss recovery mechanism for hyperscale AI training that decouples data transmission from loss recovery, dynamically adjusts loss detection intervals by exploiting collective synchronization semantics (e.g., All-Reduce), supports multipath transmission and bitmap-guided reassembly, and claims to achieve QP scalability superior to RC while reducing tail latency. Testbed experiments and ns-3 simulations are reported to show better scalability than RC and over 30% lower 99th-percentile flow completion times under high network load.

Significance. If the performance claims and safety of the dynamic interval adjustment hold under realistic straggler and jitter conditions, the work could meaningfully improve scalability and tail-latency behavior in large-scale collective communication for trillion-parameter training without requiring lossless fabrics.

major comments (2)

[Abstract / CSA-UD design] Abstract (and CSA-UD design paragraph): the central performance claims (>30% lower 99th-percentile FCT, better scalability than RC) rest on the ability to safely shorten loss-detection intervals via synchronization semantics, yet no quantitative bound, tolerance analysis, or worst-case overhead is supplied for iteration-time variation or stragglers; this directly affects whether the reliability guarantee and tail-latency benefit are preserved.
[Abstract] Abstract: the reported experimental results supply no setup details, baselines, error bars, statistical tests, or data-exclusion rules, rendering the soundness of the 30% FCT reduction and scalability claims unverifiable from the provided text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting areas where the manuscript can be strengthened. We address each major comment below and will incorporate revisions to improve clarity and verifiability of the claims.

read point-by-point responses

Referee: [Abstract / CSA-UD design] Abstract (and CSA-UD design paragraph): the central performance claims (>30% lower 99th-percentile FCT, better scalability than RC) rest on the ability to safely shorten loss-detection intervals via synchronization semantics, yet no quantitative bound, tolerance analysis, or worst-case overhead is supplied for iteration-time variation or stragglers; this directly affects whether the reliability guarantee and tail-latency benefit are preserved.

Authors: We agree that explicit quantitative bounds and tolerance analysis are needed to fully substantiate the safety of dynamically shortening loss-detection intervals. The manuscript describes how collective synchronization (e.g., All-Reduce barriers) limits iteration-time variation, but does not supply formal worst-case overhead calculations or sensitivity analysis under stragglers/jitter. In revision we will add a new analysis subsection deriving bounds from testbed-measured straggler distributions, including the maximum safe shortening factor and overhead under high jitter, to confirm that reliability and tail-latency gains are preserved. revision: yes
Referee: [Abstract] Abstract: the reported experimental results supply no setup details, baselines, error bars, statistical tests, or data-exclusion rules, rendering the soundness of the 30% FCT reduction and scalability claims unverifiable from the provided text.

Authors: The abstract is intentionally concise and therefore omits detailed methodology. The full manuscript (Section 5) already contains the testbed configuration, ns-3 parameters, baselines (RC and prior UD schemes), error bars from repeated runs, and statistical tests. To improve verifiability directly from the abstract we will add a short clause summarizing key setup elements (node count, loss rate, baselines) and a pointer to the full evaluation section for error bars and statistical details. revision: partial

Circularity Check

0 steps flagged

No significant circularity in mechanism design or experimental validation

full rationale

The paper proposes CSA-UD as a new RDMA loss recovery design that decouples transmission from recovery and adjusts detection intervals using collective synchronization semantics. Performance claims rest on testbed experiments and ns-3 simulations rather than any derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps. No equations or self-referential reductions appear in the abstract or described approach; the central claims remain independent of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Ledger populated from abstract only; no free parameters, additional axioms, or invented entities beyond the proposed mechanism are described.

axioms (2)

domain assumption Existing RDMA deployments predominantly use the reliable connection (RC) model where each process pair requires a dedicated queue pair (QP).
Stated as current practice leading to scalability issues.
domain assumption Global synchronization in training makes it sensitive to tail latency from packet losses.
Presented as a key challenge for collective operations.

invented entities (1)

CSA-UD no independent evidence
purpose: Communication-semantic-aware unreliable datagram RDMA loss recovery mechanism
New mechanism introduced to combine scalability and reliability.

pith-pipeline@v0.9.1-grok · 5789 in / 1458 out tokens · 32762 ms · 2026-06-30T23:33:12.466222+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 2 canonical work pages

[1]

Insights into deepseek-v3: Scaling challenges and reflections on hardware for ai architectures,

C. Zhao, C. Deng, C. Ruan, D. Dai, H. Gao, J. Li, L. Zhang, P. Huang, S. Zhou, S. Ma, W. Liang, Y . He, Y . Wang, Y . Liu, and Y . X. Wei, “Insights into deepseek-v3: Scaling challenges and reflections on hardware for ai architectures,”arXiv preprint arXiv:2505.09343, 2025

work page arXiv 2025
[2]

Trends in ai supercomputers,

K. F. Pilz, J. Sanders, R. Rahman, and L. Heim, “Trends in ai supercomputers,”arXiv preprint arXiv:2504.16026, 2025

work page arXiv 2025
[3]

Understanding communication characteristics of distributed training,

W. Li, X. Liu, Y . Li, Y . Jin, H. Tian, Z. Zhong, G. Liu, Y . Zhang, and K. Chen, “Understanding communication characteristics of distributed training,” inProceedings of the 8th Asia-Pacific Workshop on Network- ing (APNet), Sydney, NSW, Australia, Aug. 2024, pp. 1–8

2024
[4]

OHIO: Improving RDMA network scalability in MPI Alltoall through optimized hierarchical and intra/inter-node communication overlap design,

T. Tran, G. K. R. Kuncham, B. Ramesh, S. Xu, H. Subramoni, M. Abdul- jabbar, and D. K. Panda, “OHIO: Improving RDMA network scalability in MPI Alltoall through optimized hierarchical and intra/inter-node communication overlap design,” inProceedings of the 2024 IEEE Symposium on High-Performance Interconnects (HOTI). Santa Clara, CA, USA: IEEE, 2024

2024
[5]

Janus: A unified distributed training framework for sparse mixture-of-experts models,

J. Liu, J. H. Wang, and Y . Jiang, “Janus: A unified distributed training framework for sparse mixture-of-experts models,” inProceedings of the ACM SIGCOMM Conference. ACM, 2023, pp. 486–498

2023
[6]

Rdma over ethernet for distributed AI training at Meta scale,

A. Gangidi, R. Miao, S. Zheng, S. J. Bondu, G. Goes, H. Morsy, R. Puri, M. Riftadi, A. J. Shetty, J. Yang, S. Zhang, M. J. Fernandez, S. Gandham, and H. Zeng, “Rdma over ethernet for distributed AI training at Meta scale,” inProceedings of the ACM SIGCOMM 2024 Conference. Sydney, NSW, Australia: ACM, 2024, pp. 57–70

2024
[7]

Birds of a feather flock together: Scaling rdma rpcs with flock,

S. K. Monga, S. Kashyap, and C. Min, “Birds of a feather flock together: Scaling rdma rpcs with flock,” inProceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (SOSP). ACM, 2021, pp. 212–227

2021
[8]

1rma: Re-envisioning remote memory access for multitenant datacenters,

A. Singhvi, A. Akella, D. Gibson, T. F. Wenisch, M. Wong-Chan, S. Clark, M. M. K. Martin, M. McLaren, P. Chandra, R. Cauble, H. M. G. Wassel, B. Montazeri, S. L. Sabato, J. Scherpelz, and A. Vahdat, “1rma: Re-envisioning remote memory access for multitenant datacenters,” in SIGCOMM ’20: Proceedings of the 2020 Annual Conference of the ACM Special Interest...

2020
[9]

An optimized RDMA QP communication mechanism for hyperscale AI infrastructure,

J. Wang, B. Lin, J. Zhang, M. Sun, and Y . Pan, “An optimized RDMA QP communication mechanism for hyperscale AI infrastructure,”Cluster Computing, vol. 28, p. 66, Nov. 2025

2025
[10]

Congestion control for large-scale rdma deployments,

H. Eran, H. Patel, J. Bruno, and ..., “Congestion control for large-scale rdma deployments,” inProceedings of the 2015 ACM SIGCOMM Conference, 2015

2015
[11]

Timely: Rtt-based congestion control for the datacenter,

R. Mittal, V . T. Lam, N. Dukkipati, E. Blem, H. Wassel, M. Ghobadi, A. Vahdat, Y . Wang, D. Wetherall, and D. Zats, “Timely: Rtt-based congestion control for the datacenter,” inProceedings of the 2015 ACM SIGCOMM Conference, 2015

2015
[12]

Annex 14: Extended reliable connections (xrc),

InfiniBand Trade Association, “Annex 14: Extended reliable connections (xrc),” 2009, https://members.infinibandta.org/apps/org/workgroup/ibta/ documents.php?folder id=102

2009
[13]

Dynamically connected transport: Scalable rdma transport for large clusters,

A. Rosenbaum and A. Margolin, “Dynamically connected transport: Scalable rdma transport for large clusters,” inOpenFabrics Alliance Workshop, 2018

2018
[14]

NVIDIA Corporation,Mellanox Adapters Programmer’s Reference Manual (PRM), NVIDIA, 2020, available: https://network.nvidia.com/ files/doc-2020/ethernet-adapters-programming-manual.pdf

2020
[15]

Revisiting network support for RDMA,

R. Mittal, A. Shpiner, A. Panda, E. Zahavi, A. Krishnamurthy, S. Rat- nasamy, and S. Shenker, “Revisiting network support for RDMA,” in Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication (SIGCOMM ’18). Association for Computing Machinery, 2018, pp. 313–326

2018
[16]

Srnic: A scalable architecture for rdma nics,

Z. Wang, W. Bai, K. Chen, H. Zhang, and R. Miao, “Srnic: A scalable architecture for rdma nics,” inProc. USENIX NSDI, 2023

2023
[17]

A survey of storage systems in the rdma era,

S. Ma, T. Ma, K. Chenet al., “A survey of storage systems in the rdma era,”IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 12, pp. 4395–4409, 2022

2022
[18]

Roud: Scalable rdma over ud in lossy data center networks,

Z. He, Y . Chen, and B. Hua, “Roud: Scalable rdma over ud in lossy data center networks,” inProc. IEEE/ACM CCGrid, 2023

2023
[19]

Memory efficient loss recovery for hardware-based transport in datacenter,

Y . Lu, G. Chen, Z. Ruan, W. Xiao, B. Li, J. Zhang, Y . Xiong, P. Cheng, and E. Chen, “Memory efficient loss recovery for hardware-based transport in datacenter,” inProceedings of the 1st Asia-Pacific Workshop on Networking (APNet), Hong Kong, China, Aug. 2017, pp. 22–28

2017
[20]

Infiniband architecture specification, volume 1, release 1.8,

InfiniBand Trade Association, “Infiniband architecture specification, volume 1, release 1.8,” https://www.infinibandta.org/ibta-specification/, 2024, accessed: 2025-07-31

2024
[21]

Star: Breaking the scalability limit for rdma,

X. Wang, G. Chen, X. Yin, Y . Cheng, and B. Hua, “Star: Breaking the scalability limit for rdma,” inProc. IEEE ICNP, 2021

2021
[22]

Scalable reliable datagram (srd): Reliable high- throughput networking for hpc and ml workloads,

AWS Networking, “Scalable reliable datagram (srd): Reliable high- throughput networking for hpc and ml workloads,” 2022, https://aws. amazon.com/blogs/networking-and-content-delivery/introducing-srd/

2022
[23]

Mp-rdma: En- abling rdma with multi-path transport in datacenters,

G. Chen, Y . Geng, M. Alizadeh, and H. Balakrishnan, “Mp-rdma: En- abling rdma with multi-path transport in datacenters,” inProc. USENIX NSDI, 2018

2018
[24]

Load balancing in pfc-enabled datacenter networks,

J. Hu, C. Zeng, Z. Chen, W. Bai, and K. Chen, “Load balancing in pfc-enabled datacenter networks,” inProc. ACM Asia-Pacific Workshop on Networking (APNet), 2022, pp. 21–28

2022
[25]

Mptd: Optimizing multi-path transport with dynamic target delay in datacenters,

M. Li, S. Wang, T. Huang, and Y . Liu, “Mptd: Optimizing multi-path transport with dynamic target delay in datacenters,”Cluster Computing, vol. 27, no. 8, pp. 11 455–11 469, 2024

2024
[26]

Ud-assisted multi-path transport in rdma,

M. Choi, S. Lee, and Y . Kim, “Ud-assisted multi-path transport in rdma,” inProc. IEEE International Conference on Information and Communication Technology Convergence (ICTC), 2022, pp. 127–129

2022

[1] [1]

Insights into deepseek-v3: Scaling challenges and reflections on hardware for ai architectures,

C. Zhao, C. Deng, C. Ruan, D. Dai, H. Gao, J. Li, L. Zhang, P. Huang, S. Zhou, S. Ma, W. Liang, Y . He, Y . Wang, Y . Liu, and Y . X. Wei, “Insights into deepseek-v3: Scaling challenges and reflections on hardware for ai architectures,”arXiv preprint arXiv:2505.09343, 2025

work page arXiv 2025

[2] [2]

Trends in ai supercomputers,

K. F. Pilz, J. Sanders, R. Rahman, and L. Heim, “Trends in ai supercomputers,”arXiv preprint arXiv:2504.16026, 2025

work page arXiv 2025

[3] [3]

Understanding communication characteristics of distributed training,

W. Li, X. Liu, Y . Li, Y . Jin, H. Tian, Z. Zhong, G. Liu, Y . Zhang, and K. Chen, “Understanding communication characteristics of distributed training,” inProceedings of the 8th Asia-Pacific Workshop on Network- ing (APNet), Sydney, NSW, Australia, Aug. 2024, pp. 1–8

2024

[4] [4]

OHIO: Improving RDMA network scalability in MPI Alltoall through optimized hierarchical and intra/inter-node communication overlap design,

T. Tran, G. K. R. Kuncham, B. Ramesh, S. Xu, H. Subramoni, M. Abdul- jabbar, and D. K. Panda, “OHIO: Improving RDMA network scalability in MPI Alltoall through optimized hierarchical and intra/inter-node communication overlap design,” inProceedings of the 2024 IEEE Symposium on High-Performance Interconnects (HOTI). Santa Clara, CA, USA: IEEE, 2024

2024

[5] [5]

Janus: A unified distributed training framework for sparse mixture-of-experts models,

J. Liu, J. H. Wang, and Y . Jiang, “Janus: A unified distributed training framework for sparse mixture-of-experts models,” inProceedings of the ACM SIGCOMM Conference. ACM, 2023, pp. 486–498

2023

[6] [6]

Rdma over ethernet for distributed AI training at Meta scale,

A. Gangidi, R. Miao, S. Zheng, S. J. Bondu, G. Goes, H. Morsy, R. Puri, M. Riftadi, A. J. Shetty, J. Yang, S. Zhang, M. J. Fernandez, S. Gandham, and H. Zeng, “Rdma over ethernet for distributed AI training at Meta scale,” inProceedings of the ACM SIGCOMM 2024 Conference. Sydney, NSW, Australia: ACM, 2024, pp. 57–70

2024

[7] [7]

Birds of a feather flock together: Scaling rdma rpcs with flock,

S. K. Monga, S. Kashyap, and C. Min, “Birds of a feather flock together: Scaling rdma rpcs with flock,” inProceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (SOSP). ACM, 2021, pp. 212–227

2021

[8] [8]

1rma: Re-envisioning remote memory access for multitenant datacenters,

A. Singhvi, A. Akella, D. Gibson, T. F. Wenisch, M. Wong-Chan, S. Clark, M. M. K. Martin, M. McLaren, P. Chandra, R. Cauble, H. M. G. Wassel, B. Montazeri, S. L. Sabato, J. Scherpelz, and A. Vahdat, “1rma: Re-envisioning remote memory access for multitenant datacenters,” in SIGCOMM ’20: Proceedings of the 2020 Annual Conference of the ACM Special Interest...

2020

[9] [9]

An optimized RDMA QP communication mechanism for hyperscale AI infrastructure,

J. Wang, B. Lin, J. Zhang, M. Sun, and Y . Pan, “An optimized RDMA QP communication mechanism for hyperscale AI infrastructure,”Cluster Computing, vol. 28, p. 66, Nov. 2025

2025

[10] [10]

Congestion control for large-scale rdma deployments,

H. Eran, H. Patel, J. Bruno, and ..., “Congestion control for large-scale rdma deployments,” inProceedings of the 2015 ACM SIGCOMM Conference, 2015

2015

[11] [11]

Timely: Rtt-based congestion control for the datacenter,

R. Mittal, V . T. Lam, N. Dukkipati, E. Blem, H. Wassel, M. Ghobadi, A. Vahdat, Y . Wang, D. Wetherall, and D. Zats, “Timely: Rtt-based congestion control for the datacenter,” inProceedings of the 2015 ACM SIGCOMM Conference, 2015

2015

[12] [12]

Annex 14: Extended reliable connections (xrc),

InfiniBand Trade Association, “Annex 14: Extended reliable connections (xrc),” 2009, https://members.infinibandta.org/apps/org/workgroup/ibta/ documents.php?folder id=102

2009

[13] [13]

Dynamically connected transport: Scalable rdma transport for large clusters,

A. Rosenbaum and A. Margolin, “Dynamically connected transport: Scalable rdma transport for large clusters,” inOpenFabrics Alliance Workshop, 2018

2018

[14] [14]

NVIDIA Corporation,Mellanox Adapters Programmer’s Reference Manual (PRM), NVIDIA, 2020, available: https://network.nvidia.com/ files/doc-2020/ethernet-adapters-programming-manual.pdf

2020

[15] [15]

Revisiting network support for RDMA,

R. Mittal, A. Shpiner, A. Panda, E. Zahavi, A. Krishnamurthy, S. Rat- nasamy, and S. Shenker, “Revisiting network support for RDMA,” in Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication (SIGCOMM ’18). Association for Computing Machinery, 2018, pp. 313–326

2018

[16] [16]

Srnic: A scalable architecture for rdma nics,

Z. Wang, W. Bai, K. Chen, H. Zhang, and R. Miao, “Srnic: A scalable architecture for rdma nics,” inProc. USENIX NSDI, 2023

2023

[17] [17]

A survey of storage systems in the rdma era,

S. Ma, T. Ma, K. Chenet al., “A survey of storage systems in the rdma era,”IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 12, pp. 4395–4409, 2022

2022

[18] [18]

Roud: Scalable rdma over ud in lossy data center networks,

Z. He, Y . Chen, and B. Hua, “Roud: Scalable rdma over ud in lossy data center networks,” inProc. IEEE/ACM CCGrid, 2023

2023

[19] [19]

Memory efficient loss recovery for hardware-based transport in datacenter,

Y . Lu, G. Chen, Z. Ruan, W. Xiao, B. Li, J. Zhang, Y . Xiong, P. Cheng, and E. Chen, “Memory efficient loss recovery for hardware-based transport in datacenter,” inProceedings of the 1st Asia-Pacific Workshop on Networking (APNet), Hong Kong, China, Aug. 2017, pp. 22–28

2017

[20] [20]

Infiniband architecture specification, volume 1, release 1.8,

InfiniBand Trade Association, “Infiniband architecture specification, volume 1, release 1.8,” https://www.infinibandta.org/ibta-specification/, 2024, accessed: 2025-07-31

2024

[21] [21]

Star: Breaking the scalability limit for rdma,

X. Wang, G. Chen, X. Yin, Y . Cheng, and B. Hua, “Star: Breaking the scalability limit for rdma,” inProc. IEEE ICNP, 2021

2021

[22] [22]

Scalable reliable datagram (srd): Reliable high- throughput networking for hpc and ml workloads,

AWS Networking, “Scalable reliable datagram (srd): Reliable high- throughput networking for hpc and ml workloads,” 2022, https://aws. amazon.com/blogs/networking-and-content-delivery/introducing-srd/

2022

[23] [23]

Mp-rdma: En- abling rdma with multi-path transport in datacenters,

G. Chen, Y . Geng, M. Alizadeh, and H. Balakrishnan, “Mp-rdma: En- abling rdma with multi-path transport in datacenters,” inProc. USENIX NSDI, 2018

2018

[24] [24]

Load balancing in pfc-enabled datacenter networks,

J. Hu, C. Zeng, Z. Chen, W. Bai, and K. Chen, “Load balancing in pfc-enabled datacenter networks,” inProc. ACM Asia-Pacific Workshop on Networking (APNet), 2022, pp. 21–28

2022

[25] [25]

Mptd: Optimizing multi-path transport with dynamic target delay in datacenters,

M. Li, S. Wang, T. Huang, and Y . Liu, “Mptd: Optimizing multi-path transport with dynamic target delay in datacenters,”Cluster Computing, vol. 27, no. 8, pp. 11 455–11 469, 2024

2024

[26] [26]

Ud-assisted multi-path transport in rdma,

M. Choi, S. Lee, and Y . Kim, “Ud-assisted multi-path transport in rdma,” inProc. IEEE International Conference on Information and Communication Technology Convergence (ICTC), 2022, pp. 127–129

2022