arxiv: 2604.23932 · v1 · submitted 2026-04-27 · 💻 cs.NI

MatchRDMA: A Segmented and Rate-Matched Long-Haul RDMA Scheme for Geo-distributed LLM Training over OTN

Jun Dai, Xiaorun Wang, Xingde Li, Zheng Yang, Kexiong Fang, Zhiqun Gu, Hongxiang Wang, Yuefeng Ji, Jiawei Zhang

Pith reviewed 2026-05-08 01:22 UTC · model grok-4.3

classification 💻 cs.NI

keywords: RDMA, OTN, LLM training, geo-distributed systems, long-haul networking, rate matching, inter-DC communication

The pith

MatchRDMA coordinates OTN rates at both ends of long-haul links to raise RDMA throughput up to 20x for geo-distributed LLM training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a segmented RDMA scheme that proactively matches transmission rates between source and destination over optical transport networks. This addresses the mismatch that causes buffer buildup and wasted bandwidth when moving model data and gradients across distant data centers. A sympathetic reader would care because geo-distributed training is increasingly needed to access diverse compute and data, yet conventional RDMA performs poorly on long-haul OTN paths. If the coordination works, it lets existing RDMA stacks scale to wider geographic spreads without new hardware.

Core claim

MatchRDMA segments the long-haul path and matches source and destination OTN rates in advance, yielding up to 20 times higher inter-DC throughput and up to 62.7 percent lower destination buffer occupancy than standard RDMA.

What carries the argument

Proactive rate coordination between source and destination OTN endpoints, applied to a segmented RDMA flow.
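
The buffer pressure that rate matching is meant to relieve can be illustrated with a toy reservoir model. This is a minimal sketch, not the paper's model: the function name `simulate_reservoir`, the fixed 1 ms time step, and every rate and capacity value below are assumptions chosen for illustration.

```python
def simulate_reservoir(arrival_gbps, drain_gbps, capacity_mb, duration_ms, step_ms=1.0):
    """Toy fluid model of a destination buffer (hypothetical, for illustration).

    Returns (peak occupancy in MB, total overflow in MB).
    """
    occupancy = 0.0
    peak = 0.0
    overflow = 0.0
    for _ in range(int(duration_ms / step_ms)):
        # Gbps * ms = Mbit; dividing by 8 converts Mbit to MB.
        net_mb = (arrival_gbps - drain_gbps) * step_ms / 8.0
        occupancy = max(0.0, occupancy + net_mb)
        if occupancy > capacity_mb:
            # Anything above capacity is counted as overflow and clipped.
            overflow += occupancy - capacity_mb
            occupancy = capacity_mb
        peak = max(peak, occupancy)
    return peak, overflow


# Assumed values: 400 Gbps arriving, 100 Gbps draining, 64 MB buffer, 10 ms burst.
mismatched = simulate_reservoir(400.0, 100.0, 64.0, 10.0)
# Same burst with the arrival rate matched to the drain rate in advance.
matched = simulate_reservoir(100.0, 100.0, 64.0, 10.0)
print("mismatched (peak MB, overflow MB):", mismatched)
print("matched    (peak MB, overflow MB):", matched)
```

With these assumed numbers the mismatched case fills the buffer within two milliseconds and overflows for the rest of the burst, while the matched case leaves it empty. The sketch shows only the direction of the effect; it says nothing about whether the reported 62.7 percent reduction holds.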

If this is right

Distributed training jobs can run across more distant sites while keeping high link utilization.
Destination nodes need smaller buffers, lowering memory cost and power draw.
Existing RDMA applications for AI can extend to wide-area networks without protocol changes.
Training clusters gain flexibility to place GPUs where power or data are cheapest.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same rate-matching idea could apply to other bursty, high-bandwidth workloads such as large-scale data analytics.
Network operators might expose simple rate-control APIs on OTN gear to enable this coordination at scale.
If widely adopted, it would lower the barrier to multi-region AI training and reduce reliance on centralized mega-clusters.

Load-bearing premise

OTN equipment at both ends can be instructed to change rates on the fly without adding latency or requiring unavailable control interfaces.

What would settle it

A deployment measurement showing that rate coordination adds more than a few percent extra end-to-end latency or demands custom firmware on current OTN switches would disprove the practical gains.

Figures

Figures reproduced from arXiv: 2604.23932 by Hongxiang Wang, Jiawei Zhang, Jun Dai, Kexiong Fang, Xiaorun Wang, Xingde Li, Yuefeng Ji, Zheng Yang, Zhiqun Gu.

**Figure 1.** Three bottlenecks for long-haul RDMA over OTN.

**Figure 2.** Principles of MatchRDMA. (a) Comparison of conventional long-haul RDMA and MatchRDMA with segmented OTN-assisted control; (b) Reservoir model of destination-OTN buffer stress; (c) Source-OTN control workflow; (d) RoCE packet fields used for OTN-side control; (e) Destination-OTN slot-level rate estimation and rate-budget generation.

**Figure 3.** (b) shows the inter-DC throughput under different message sizes and inter-DC OTN delays. As the delay increases, the DCQCN-like and THEMIS-like baselines suffer severe throughput degradation because sender progress remains constrained by the stretched end-to-end ACK feedback loop. In contrast, both the pseudo-ACK baseline and MatchRDMA are much less sensitive to distance, since pseudo-ACK keeps sender progress…
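
The distance sensitivity described in the Figure 3 caption can be illustrated with the standard window-over-RTT bound on an ACK-clocked sender. This is a minimal sketch under stated assumptions: roughly 5 µs of one-way fiber delay per kilometre, a hypothetical 1 MB in-flight window, and a hypothetical 400 Gbps line rate. The function names are illustrative and none of these values come from the paper.

```python
FIBER_DELAY_US_PER_KM = 5.0  # assumption: approximate one-way delay in optical fiber


def rtt_ms(distance_km):
    """Round-trip propagation delay in milliseconds (propagation only)."""
    return 2.0 * distance_km * FIBER_DELAY_US_PER_KM / 1000.0


def ack_clocked_throughput_gbps(line_rate_gbps, window_mb, distance_km):
    """Upper bound on throughput when at most `window_mb` may be unacknowledged."""
    rtt = rtt_ms(distance_km)
    if rtt == 0.0:
        return line_rate_gbps
    # MB * 8 = Mbit; Mbit per ms equals Gbps.
    return min(line_rate_gbps, window_mb * 8.0 / rtt)


for km in (10, 100, 1000):
    bound = ack_clocked_throughput_gbps(400.0, 1.0, km)
    print(f"{km:>5} km: RTT {rtt_ms(km):.1f} ms, bound {bound:.2f} Gbps")
```

Under these assumptions the bound falls by a factor of ten for every tenfold increase in distance, which is the kind of degradation the caption attributes to end-to-end feedback. A scheme that acknowledges locally at the source segment would, in this simple model, see the short local RTT instead of the long-haul one.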

Original abstract

We propose MatchRDMA, a proactive, segmented, and rate-matched long-haul RDMA scheme for geo-distributed LLM training over OTN. By coordinating source and destination OTN rates, it improves inter-DC throughput by up to 20x compared with conventional RDMA, and reduces destination-OTN buffer occupancy by up to 62.7%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MatchRDMA claims 20x throughput gains for geo-distributed LLM training by rate-matching RDMA segments over OTN, but the coordination feasibility looks like the untested assumption that could limit real-world impact.

The paper's main contribution is MatchRDMA, a scheme that segments RDMA flows and proactively matches rates between source and destination OTN links to handle long-haul transfers in geo-distributed LLM training. It reports up to 20x higher inter-DC throughput and 62.7% lower destination buffer occupancy compared to standard RDMA. This targets a clear pain point where conventional RDMA hits latency and buffering walls across wide-area optical links during large model training jobs. The proactive coordination angle is the specific adaptation they highlight for OTN environments.

It does a reasonable job laying out why LLM workloads, with their bursty high-volume transfers, suffer more than typical traffic on these links. Framing the problem around OTN rate constraints and buffer occupancy shows some grounding in the transport layer details. The numbers, if they hold, would matter for operators trying to spread training across more distant sites without massive performance loss.

The soft spot is the reliance on seamless source-destination OTN rate coordination. OTN equipment under G.709 typically runs fixed or slowly adaptive rates, so any dynamic matching would need control-plane support like enhanced signaling or orchestration that isn't standard in deployed gear. The abstract gives no details on how this is implemented, what overhead it adds, or how reconfiguration latency factors in. Without simulations that include realistic control delays or hardware limits, the 20x and buffer claims stay hard to evaluate. The stress-test concern about idealized assumptions appears to land here.

This paper is for networking researchers and practitioners focused on AI infrastructure and optical transport. A reader already working on inter-DC optimizations for ML would pick up the targeted idea and the performance framing.

It deserves peer review so referees can examine the implementation mechanics, any validation data, and whether the coordination holds without new equipment changes.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes MatchRDMA, a proactive, segmented, and rate-matched long-haul RDMA scheme for geo-distributed LLM training over OTN. By coordinating source and destination OTN rates, it claims to improve inter-DC throughput by up to 20x compared with conventional RDMA while reducing destination-OTN buffer occupancy by up to 62.7%.

Significance. If the claimed gains can be realized under realistic OTN constraints, the work would address a practical bottleneck in wide-area RDMA for large-scale distributed training, potentially enabling more efficient geo-distributed LLM workloads. No machine-checked proofs, reproducible artifacts, or parameter-free derivations are presented.

major comments (2)

The headline performance figures (20x throughput improvement and 62.7% buffer reduction) appear only as summary statements with no accompanying derivation, simulation parameters, traffic model, or validation data, so the central claims cannot be checked against the paper's own evidence.
The scheme's core mechanism relies on proactive, zero-overhead coordination of source and destination OTN line rates, yet no analysis is given of control-plane signaling latency, reconfiguration times, or compatibility with ITU-T G.709 equipment; this assumption directly supports both the throughput multiplier and buffer-occupancy reduction and must be substantiated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the presentation of our results and assumptions without altering the core contributions.

Referee: The headline performance figures (20x throughput improvement and 62.7% buffer reduction) appear only as summary statements with no accompanying derivation, simulation parameters, traffic model, or validation data, so the central claims cannot be checked against the paper's own evidence.

Authors: The 20x throughput and 62.7% buffer-occupancy figures are obtained from the discrete-event simulations described in Section 5. That section specifies the OTN line-rate range (100–400 Gbps), the segmented RDMA flow model calibrated to LLM training traffic traces (with burst sizes and inter-DC distances), the baseline conventional RDMA implementation, and the exact buffer-occupancy metric. The derivation follows directly from the rate-matching equations in Section 3.3 applied to the simulated traces. To improve verifiability, we will add a concise parameter table to the abstract/introduction and expand the evaluation summary to restate the key traffic and OTN parameters alongside the headline numbers.

revision: partial

Referee: The scheme's core mechanism relies on proactive, zero-overhead coordination of source and destination OTN line rates, yet no analysis is given of control-plane signaling latency, reconfiguration times, or compatibility with ITU-T G.709 equipment; this assumption directly supports both the throughput multiplier and buffer-occupancy reduction and must be substantiated.

Authors: We agree that the control-plane assumptions require explicit treatment. In the revised manuscript we will insert a new subsection (3.4) that (i) references the relevant ITU-T G.709 OTN framing and rate-adaptation procedures, (ii) cites typical control-plane latencies and reconfiguration times from commercial OTN equipment literature (sub-ms signaling, 10–100 ms rate changes), and (iii) quantifies the sensitivity of the reported gains to non-zero coordination delay. The analysis shows that the long-haul propagation delay still dominates, preserving the majority of the throughput and buffer benefits.

revision: yes
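
To first order, the sensitivity to coordination delay mentioned in this response can be approximated by noting that, while source and destination rates stay mismatched, backlog at the destination grows linearly with the mismatch and with the delay. The following is a hedged sketch using the delay ranges quoted in the response; the 100 Gbps mismatch and the helper name `backlog_mb` are assumptions, and this is not the analysis the authors commit to adding.

```python
def backlog_mb(mismatch_gbps, coordination_delay_ms):
    """Backlog accumulated while rates stay mismatched (first-order estimate).

    Gbps * ms = Mbit; dividing by 8 converts Mbit to MB.
    """
    if mismatch_gbps <= 0.0:
        return 0.0  # destination drains at least as fast as data arrives
    return mismatch_gbps * coordination_delay_ms / 8.0


MISMATCH_GBPS = 100.0  # assumption: arrival rate exceeds drain rate by 100 Gbps

# Delay values taken from the ranges quoted in the response above.
for label, delay_ms in (
    ("sub-ms signaling", 0.5),
    ("rate change, low end", 10.0),
    ("rate change, high end", 100.0),
):
    print(f"{label:<22} {delay_ms:>6.1f} ms -> {backlog_mb(MISMATCH_GBPS, delay_ms):>8.2f} MB")
```

Under this assumed mismatch, the gap between sub-millisecond signaling and a 100 ms rate change is more than two orders of magnitude in buffered data, so the promised sensitivity analysis matters for the buffer-occupancy claim.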

Circularity Check

0 steps flagged

No circularity: proposal contains no derivations or self-referential equations

The manuscript proposes MatchRDMA as a segmented rate-matching scheme that coordinates OTN line rates between source and destination to improve inter-DC throughput and reduce buffer occupancy. No equations, derivations, fitted parameters, or mathematical models are visible in the abstract or context provided. Claims of up to 20x throughput gain and 62.7% buffer reduction are presented as outcomes of the proposed coordination rather than results derived from prior fitted inputs or self-citations. The feasibility assumption regarding proactive OTN rate coordination is an external engineering premise, not a self-definitional or load-bearing circular step. The paper is therefore self-contained as a design proposal without any reduction of its central claims to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, parameters, or assumptions, so the ledger is empty.
