pith. machine review for the scientific record.

arxiv: 2604.16880 · v1 · submitted 2026-04-18 · 💻 cs.NI

Recognition: unknown

Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 06:58 UTC · model grok-4.3

classification 💻 cs.NI
keywords ring collectives · step misalignment · in-network mitigation · collective communication · distributed AI training · congestion signals · pipeline synchronization

The pith

Symphony mitigates step misalignments in ring-based collectives by tracking per-job pipeline progress in the network and using congestion signals to selectively throttle faster flows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Ring-based collective operations in distributed AI training rely on synchronized step-wise progression for efficient pipelining, but network jitter and congestion often cause misalignments that increase overall completion time. Symphony introduces an in-network approach that maintains a lightweight tracker of per-job pipeline progress and applies selective throttling to outpacing flows via existing congestion signals, letting lagging flows catch up. This avoids any need for global coordination or for modifications outside the network itself. The design reduces collective communication time in Astra-Sim simulations, with reported gains of up to 54 percent. A prototype on an Intel Tofino2 programmable switch further indicates the mechanism can be realized in hardware.
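To make the failure mode concrete, here is a toy model of one bottleneck link shared by two consecutive ring steps. This is an editorial sketch for intuition, not the paper's simulation setup; the sizes and bandwidth are normalized placeholders.

```python
# Toy model of one bottleneck link shared by two consecutive ring steps.
# All numbers are illustrative placeholders, not the paper's parameters.
S = 1.0  # bytes sent per step (normalized)
B = 1.0  # link bandwidth (normalized)

# Aligned pipeline: step k fully drains before step k+1 starts transmitting.
aligned_step_k = S / B       # finishes at t = 1.0
aligned_step_k1 = 2 * S / B  # finishes at t = 2.0

# Misaligned: jitter releases step k+1 while step k is still in flight.
# Under fair sharing both flows run at B/2, so the lagging step k now
# finishes at 2S/B instead of S/B; every node waiting on it stalls, which
# pushes later steps into overlap as well.
overlapped_step_k = 2 * S / B  # t = 2.0: the urgent step's latency doubled

print(aligned_step_k, overlapped_step_k)  # 1.0 vs. 2.0
```

The point is that misalignment does not just add the jitter once: the lagging step loses bandwidth to traffic from later steps, so its completion, and everything pipelined behind it, slips further.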

Core claim

Symphony is an in-network solution that detects pipeline step misalignment in ring collectives and mitigates its impact through a lightweight per-job progress tracker and novel use of congestion signals to selectively throttle outpacing flows, allowing lagging flows to catch up without global coordination.

What carries the argument

Lightweight per-job pipeline progress tracker paired with selective throttling of outpacing flows via congestion signals.
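A minimal sketch of how a tracker plus selective throttling could fit together, written as Python pseudocode for readability. The paper's design runs in P4 on a programmable switch; the packet fields, the lead threshold, and ECN marking as the throttle are assumptions made here for illustration, not the paper's specification.

```python
# Hypothetical sketch of per-job progress tracking with selective throttling.
# Assumes each collective packet exposes a job ID and a step number, and that
# throttling is done by ECN-marking flows that run ahead; the paper's actual
# packet format and policy may differ.
from collections import defaultdict

LEAD_THRESHOLD = 2  # hypothetical: max steps a flow may run ahead of the slowest

class ProgressTracker:
    def __init__(self) -> None:
        self.flow_step = defaultdict(dict)  # job_id -> {flow_id: latest step seen}

    def on_packet(self, job_id: str, flow_id: str, step: int) -> bool:
        """Return True if this packet should be ECN-marked to throttle its flow."""
        steps = self.flow_step[job_id]
        steps[flow_id] = max(steps.get(flow_id, 0), step)
        slowest = min(steps.values())  # flows count once they have sent a packet
        # Flows that have outpaced the slowest participant get marked; their
        # senders back off, freeing bandwidth for the lagging step.
        return steps[flow_id] - slowest > LEAD_THRESHOLD

tracker = ProgressTracker()
for flow, step in [("f0", 5), ("f1", 5), ("f2", 3), ("f0", 7)]:
    verdict = "mark" if tracker.on_packet("job-a", flow, step) else "pass"
    print(flow, step, verdict)  # only f0 at step 7 (4 ahead of f2) gets marked
```

The property that matters is that the decision is per-packet and local to the switch; no endpoint or controller has to agree on a global schedule, which is what lets lagging flows catch up without global coordination.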

If this is right

  • Collective communication time in ring-based operations decreases by up to 54 percent when network jitter and congestion are present.
  • Ring collectives maintain better step synchronization without application changes or global coordination.
  • The mechanism reuses existing congestion signals, allowing compatibility with current network hardware.
  • Implementation on programmable switches is feasible, as demonstrated by a working prototype.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The selective throttling idea could apply to other step-synchronized communication patterns beyond rings.
  • Lowering misalignment might reduce latency variance across large distributed training jobs.
  • Similar in-network controls could address related performance variability in data-center networks.
  • Combining the tracker with application-level scheduling might further improve end-to-end training throughput.

Load-bearing premise

A lightweight per-job progress tracker can be maintained accurately at line rate and selective throttling via existing congestion signals will let lagging flows catch up without creating new bottlenecks.
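As a rough scale check on that premise, a back-of-the-envelope state budget (every number below is a hypothetical chosen for illustration; the referee report notes the paper leaves this overhead unquantified):

```python
# Hypothetical state budget for a per-job progress tracker. These figures are
# assumptions for scale, not measurements from the paper.
jobs = 1024            # concurrently tracked collectives
flows_per_job = 64     # ring participants tracked per job
bytes_per_counter = 4  # one 32-bit step counter per flow

total_bytes = jobs * flows_per_job * bytes_per_counter
print(f"{total_bytes / 2**10:.0f} KiB of counter state")  # 256 KiB
# A few hundred KiB is small next to the SRAM on modern programmable switches;
# the harder half of the premise is whether the read-modify-write fits the
# per-packet pipeline budget, which is what line-rate operation requires.
```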

What would settle it

A measurement showing that the progress tracker cannot operate accurately at line rate or that selective throttling increases overall collective completion time instead of reducing it.

Figures

Figures reproduced from arXiv: 2604.16880 by Mun Choon Chan, Ruyi Yao, Xin Zhe Khooi, Yuze Jin.

Figure 1. Ideal vs. realistic behavior of ring-based collective operations. (a–b) Even small path imbalances (e.g., ECMP hash collisions) cause uneven link sharing and flow accumulation, unlike the ideal case. (c) Logical view of a single ring. (d) Under ideal conditions, steps progress in perfect lockstep. (e) In practice, congestion on a single link delays a step, causing subsequent steps to overlap and creating p…
Figure 2. Degree of step overlap and normalized step…
Figure 3. Overview of Symphony. The switch performs progress tracking to identify and throttle outpacing flows, proactively freeing up bandwidth for lagging steps without global coordination.
Figure 4. Step misalignment mitigation effectiveness.
Figure 5. CDF of CCT for Ring AllReduce.
Figure 6. Normalized JCT for Transformer training under varying computation reduction ratios.
Figure 8. Impact of different network conditions and…
Figure 9. HW prototype evaluation. We show the trans…
read the original abstract

Ring-based collective operations are widely used in distributed AI training due to their efficient bandwidth utilization. While ring communication excels at pipelining, its performance is heavily dependent on having synchronized step-wise progression. This presents a mismatch to the underlying network conditions in practice: collective operations are vulnerable to network jitter and congestion, leading to step misalignment and increased collective completion time. To that end, we propose Symphony, an in-network solution that detects pipeline step misalignment and mitigates its impact. Symphony introduces (1) a lightweight mechanism to track per-job pipeline progress and (2) a novel use of congestion signals to selectively throttle outpacing flows, allowing lagging flows to catch up without global coordination. Through simulations using Astra-Sim, we show that Symphony effectively mitigates step misalignments in ring-based collectives, resulting in up to 54% improvement in job/collective communication time. Finally, we prototype and validate Symphony on an Intel Tofino2 programmable switch to demonstrate its practicality.
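For readers new to ring collectives, this is the standard cost model behind "efficient bandwidth utilization" and step-wise progression (textbook material, not a contribution of this paper; the example numbers are illustrative):

```python
# Standard ring AllReduce cost model (textbook, not specific to this paper):
# N nodes each hold S bytes; reduce-scatter plus all-gather take 2*(N-1)
# steps, each moving S/N bytes per node, so completion depends on every one
# of the 2*(N-1) steps finishing before the next can proceed.

def ring_allreduce_time(n_nodes: int, total_bytes: float, bw_bytes_per_s: float) -> float:
    steps = 2 * (n_nodes - 1)
    chunk = total_bytes / n_nodes
    return steps * chunk / bw_bytes_per_s

# Illustrative: 8 nodes, 1 GiB of gradients, 100 Gb/s links (12.5 GB/s).
t = ring_allreduce_time(8, 2**30, 100e9 / 8)
print(f"{t * 1e3:.0f} ms ideal")  # ~150 ms; jitter on any one step delays all later ones
```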

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Symphony, an in-network mechanism to detect and mitigate step misalignments in ring-based collective operations (e.g., for distributed AI training) caused by network jitter and congestion. It introduces a lightweight per-job pipeline progress tracker and selective throttling of outpacing flows via existing congestion signals to let lagging participants catch up without global coordination. Simulations in Astra-Sim report up to 54% reduction in job/collective communication time, with a basic feasibility prototype implemented on an Intel Tofino2 programmable switch.

Significance. If the reported gains prove robust, Symphony offers a practical, deployable improvement to collective communication performance in AI training clusters by addressing a real mismatch between ring pipelining assumptions and network conditions. The in-network design avoids host or collective modifications, and the Tofino2 prototype provides concrete evidence of implementability on modern programmable hardware, which strengthens the work's engineering contribution.

major comments (3)
  1. [§5] §5 (Evaluation), simulation results paragraph: The claim of 'up to 54% improvement in job/collective communication time' is presented without any description of the baselines (e.g., unmodified ring collectives or alternative mitigation schemes), the distribution of network jitter/congestion parameters tested, the number of simulation runs, or measures of variability/statistical significance. This makes it impossible to assess whether the gains are load-bearing or sensitive to modeling assumptions.
  2. [§3.2] §3.2 (Selective Throttling Mechanism): The argument that congestion-signal-based throttling of faster flows will reliably allow lagging ring participants to catch up without inducing secondary bottlenecks, fairness violations, or oscillations rests on unverified assumptions about per-job tracker accuracy at line rate and switch behavior under concurrent jobs. The Tofino2 prototype only validates basic feasibility and does not measure end-to-end collective completion time or these dynamics.
  3. [§4] §4 (Prototype Implementation): While the Tofino2 prototype demonstrates that the tracker and throttling logic can be realized in P4, it provides no quantitative data on state overhead, lookup latency, or interaction with existing congestion control under realistic multi-job workloads, leaving the 'lightweight' and 'line-rate' claims unquantified.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from explicitly naming the ring-based collectives evaluated (e.g., AllReduce, AllGather) and the range of job sizes or ring lengths tested.
  2. [§3] Notation for the per-job progress tracker (e.g., variables for step counters and congestion thresholds) should be introduced with a small table or diagram in §3 for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where we will revise the manuscript to improve clarity and completeness.

read point-by-point responses
  1. Referee: [§5] §5 (Evaluation), simulation results paragraph: The claim of 'up to 54% improvement in job/collective communication time' is presented without any description of the baselines (e.g., unmodified ring collectives or alternative mitigation schemes), the distribution of network jitter/congestion parameters tested, the number of simulation runs, or measures of variability/statistical significance. This makes it impossible to assess whether the gains are load-bearing or sensitive to modeling assumptions.

    Authors: We agree that the evaluation section requires additional detail to substantiate the reported gains. In the revised manuscript, we will expand §5 to describe the baselines (unmodified ring collectives without Symphony and any alternative schemes considered), the specific distributions and ranges of network jitter and congestion parameters used in Astra-Sim, the number of independent simulation runs per configuration, and statistical measures including means, standard deviations, and confidence intervals. revision: yes

  2. Referee: [§3.2] §3.2 (Selective Throttling Mechanism): The argument that congestion-signal-based throttling of faster flows will reliably allow lagging ring participants to catch up without inducing secondary bottlenecks, fairness violations, or oscillations rests on unverified assumptions about per-job tracker accuracy at line rate and switch behavior under concurrent jobs. The Tofino2 prototype only validates basic feasibility and does not measure end-to-end collective completion time or these dynamics.

    Authors: The Astra-Sim simulations demonstrate that selective throttling mitigates misalignments without inducing the noted secondary effects across the evaluated scenarios. We will revise §3.2 to provide a more explicit discussion of the underlying assumptions, supported by simulation evidence on tracker accuracy and dynamics. We will also clarify that the Tofino2 prototype serves to validate implementability at line rate while end-to-end performance and multi-job behavior are evaluated via simulation, and note concurrent-job interactions as an area for future quantification. revision: partial

  3. Referee: [§4] §4 (Prototype Implementation): While the Tofino2 prototype demonstrates that the tracker and throttling logic can be realized in P4, it provides no quantitative data on state overhead, lookup latency, or interaction with existing congestion control under realistic multi-job workloads, leaving the 'lightweight' and 'line-rate' claims unquantified.

    Authors: We agree that quantitative metrics would strengthen the prototype claims. In the revised §4, we will add data from the Tofino2 implementation, including state overhead (e.g., table entries and memory usage per job), measured lookup and processing latencies, and observations on interactions with standard congestion control under multi-job workloads where tested. revision: yes

Circularity Check

0 steps flagged

No circularity: performance claims are empirical simulation results, not derived from self-referential equations or fits

full rationale

The paper is a systems design work proposing Symphony's per-job progress tracker and congestion-signal throttling for ring collectives. Its central claim (up to 54% improvement) is demonstrated via Astra-Sim simulations and Tofino2 prototype validation rather than any mathematical derivation chain. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The evaluation is independent and falsifiable through external simulation runs, satisfying the criteria for a non-circular empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, mathematical axioms, or newly invented entities are described in the abstract. The contribution rests on standard assumptions about programmable switch capabilities and the semantics of congestion signals.

pith-pipeline@v0.9.0 · 5478 in / 1174 out tokens · 37129 ms · 2026-05-10T06:58:49.875523+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

93 extracted references · 49 canonical work pages · 4 internal anchors

  1. [1]

    Grattafiori Aaron, Dubey Abhimanyu, Jauhri Abhinav, Pandey Abhinav, Kadian Abhishek, Al-Dahle Ahmad, Letman Aiesha, Mathur Akhil, Schelten Alan, et al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783

  2. [2]

    Saksham Agarwal, Shijin Rajakrishnan, Akshay Narayan, Rachit Agarwal, David Shmoys, and Amin Vahdat. 2018. Sincronia: near-optimal network design for coflows. In ACM SIGCOMM

  3. [3]

    Albert Gran Alcoz, Alexander Dietmüller, and Laurent Vanbever. 2020. SP-PIFO: Approximating Push-In First-Out Behaviors using Strict-Priority Queues. In USENIX NSDI

  4. [4]

    Mohammad Alizadeh, Tom Edsall, Sarang Dharmapurikar, Ramanan Vaidyanathan, Kevin Chu, Andy Fingerhut, Vinh The Lam, Francis Matus, Rong Pan, Navindra Yadav, and George Varghese. 2014. CONGA: distributed congestion-aware load balancing for datacenters. In ACM SIGCOMM. doi:10.1145/2619239.2626316

  5. [5]

    Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown, Balaji Prabhakar, and Scott Shenker. 2013. pFabric: minimal near-optimal datacenter transport. In ACM SIGCOMM

  6. [6]

    Sanjith Athlur, Nitika Saran, Muthian Sivathanu, Ramachandran Ramjee, and Nipun Kwatra. 2022. Varuna: scalable, low-cost training of massive deep learning models. In ACM EuroSys. doi:10.1145/3492321.3519584

  7. [7]

    Wei Bai, Kai Chen, Li Chen, Changhoon Kim, and Haitao Wu. 2016. Enabling ECN over Generic Packet Scheduling. In ACM CoNEXT

  8. [8]

    Wei Bai, Li Chen, Kai Chen, and Haitao Wu. 2016. Enabling ECN in multi-service multi-queue data centers. In USENIX NSDI

  9. [9]

    Baidu Research. 2017. Baidu AllReduce: A Lightweight Library for Distributed Deep Learning. https://github.com/baidu-research/baidu-allreduce

  10. [10]

    Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, and David Walker. 2014. P4: programming protocol-independent packet processors. SIGCOMM Comput. Commun. Rev. 44, 3 (July 2014), 87–95. doi:10.1145/2656877.2656890

  11. [11]

    Pat Bosshart, Glen Gibb, Hun-Seok Kim, George Varghese, Nick McKeown, Martin Izzard, Fernando Mujica, and Mark Horowitz. 2013. Forwarding metamorphosis: fast programmable match-action processing in hardware for SDN. In ACM SIGCOMM. doi:10.1145/2486001.2486011

  12. [12]

    Jiamin Cao, Yu Guan, Kun Qian, Jiaqi Gao, Wencong Xiao, Jianbo Dong, Binzhang Fu, Dennis Cai, and Ennan Zhai. 2024. Crux: GPU-Efficient Communication Scheduling for Deep Learning Training. In ACM SIGCOMM. doi:10.1145/3651890.3672239

  13. [13]

    Houssine Chetto and Maryline Chetto. 1989. Some Results of the Earliest Deadline Scheduling Algorithm. IEEE Trans. Softw. Eng. 15, 10 (Oct. 1989), 1261–1269. doi:10.1109/TSE.1989.559777

  14. [14]

    Mosharaf Chowdhury and Ion Stoica. 2015. Efficient Coflow Scheduling Without Prior Knowledge. In ACM SIGCOMM

  15. [15]

    Mosharaf Chowdhury, Yuan Zhong, and Ion Stoica. 2014. Efficient coflow scheduling with Varys. In ACM SIGCOMM

  16. [16]

    Ultra Ethernet Consortium. 2025. Ultra Ethernet Specification v1.0 (UEC Specification 6.11.25). Specification: https://ultraethernet.org/wp-content/uploads/sites/20/2025/06/UE-Specification-6.11.25.pdf

  17. [17]

    A. Demers, S. Keshav, and S. Shenker. 1989. Analysis and simulation of a fair queueing algorithm. In ACM SIGCOMM

  18. [18]

    Yangtao Deng, Lei Zhang, Qinlong Wang, Xiaoyun Zhi, Xinlei Zhang, Zhuo Jiang, Haohan Xu, Lei Wang, Zuquan Song, Gaohong Liu, Yang Bai, Shuguang Wang, Wencong Xiao, Jianxi Ye, Minlan Yu, and Hong Xu. 2025. Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training. In ACM SOSP. doi:10.1145/3731569.3764848

  19. [19]

    Sally Floyd and Van Jacobson. 1993. Random early detection gateways for congestion avoidance. IEEE/ACM Trans. Netw. 1, 4 (Aug. 1993), 397–413. doi:10.1109/90.251892

  20. [20]

    Sally Floyd, Dr. K. K. Ramakrishnan, and David L. Black. 2001. The Addition of Explicit Congestion Notification (ECN) to IP. RFC 3168. doi:10.17487/RFC3168

  21. [21]

    Adithya Gangidi, Rui Miao, Shengbao Zheng, Sai Jayesh Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Mohammad Riftadi, Ashmitha Jeevaraj Shetty, Jingyi Yang, Shuqiang Zhang, Mikel Jimenez Fernandez, Shashidhar Gandham, and Hongyi Zeng. 2024. RDMA over Ethernet for Distributed Training at Meta Scale. In ACM SIGCOMM. doi:10.1145/3651890.3672233

  22. [22]

    Soudeh Ghorbani, Zibin Yang, P. Brighten Godfrey, Yashar Ganjali, and Amin Firoozshahian. 2017. DRILL: Micro Load Balancing for Low-latency Data Center Networks. In ACM SIGCOMM. doi:10.1145/3098822.3098839

  23. [23]

    Andrew Gibiansky. 2017. Bringing HPC Techniques to Deep Learning (ring allreduce). https://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/

  24. [24]

    Pawan Goyal, Harrick M. Vin, and Haichen Chen. 1996. Start-time fair queueing: a scheduling algorithm for integrated services packet switching networks. In ACM SIGCOMM

  25. [25]

    Greg Dorai. 2025. Cisco: The New Family of Cisco Smart Switches: Built to Power What’s Next. https://www.cisco.com/site/us/en/products/networking/silicon-one/index.html

  26. [26]

    Matthew P. Grosvenor, Malte Schwarzkopf, Ionel Gog, Robert N. M. Watson, Andrew W. Moore, Steven Hand, and Jon Crowcroft. 2015. Queues Don’t Matter When You Can JUMP Them!. In USENIX NSDI

  27. [27]

    Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitu Padhye, and Marina Lipshteyn. 2016. RDMA over Commodity Ethernet at Scale. In ACM SIGCOMM. doi:10.1145/2934872.2934908

  28. [28]

    Aaron Harlap, Henggang Cui, Wei Dai, Jinliang Wei, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, and Eric P. Xing. 2016. Addressing the straggler problem for iterative convergent parallel ML. In ACM SoCC. doi:10.1145/2987550.2987554

  29. [29]

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs.CV] https://arxiv.org/abs/1512.03385

  30. [30]

    T. Hoeiland-Joergensen, P. McKenney, D. Taht, J. Gettys, and E. Dumazet. 2018. RFC 8290: The Flow Queue CoDel Packet Scheduler and Active Queue Management Algorithm

  31. [31]

    Zhiyi Hu, Siyuan Shen, Tommaso Bonato, Sylvain Jeaugey, Cedell Alexander, Eric Spada, James Dinan, Jeff Hammond, and Torsten Hoefler. 2025. Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms. arXiv:2507.04786 [cs.DC] https://arxiv.org/abs/2507.04786

  32. [32]

    Hanlin Huang, Ke Xu, Tong Li, Zhuotao Liu, Xinle Du, and Xiangyu Gao. 2025. DiffECN: Differential ECN Marking for Datacenter Networks. IEEE Transactions on Networking 33, 1 (2025), 210–225. doi:10.1109/TNET.2024.3477511

  33. [33]

    Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Jinyang Liu, Yafan Huang, Ken Raffenetti, Hui Zhou, Kai Zhao, Xiaoyi Lu, Zizhong Chen, Franck Cappello, Yanfei Guo, and Rajeev Thakur. 2024. gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters. In ACM ICS. doi:10.1145/3650200.3656636

  34. [34]

    Peng Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao. 2017. Gray Failure: The Achilles’ Heel of Cloud-Scale Systems. In ACM HotOS. doi:10.1145/3102980.3103005

  35. [35]

    Intel Corporation. [n. d.]. Intel® Tofino™ 2. https://www.intel.com/content/www/us/en/products/network-io/programmable-ethernet-switch/tofino-2-series.html

  36. [36]

    Akshay Jajoo, Y. Charlie Hu, and Xiaojun Lin. 2019. Your Coflow has Many Flows: Sampling them for Fun and Speed. In USENIX ATC

  37. [37]

    Insu Jang, Zhenning Yang, Zhen Zhang, Xin Jin, and Mosharaf Chowdhury. 2023. Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates. In ACM SOSP. doi:10.1145/3600006.3613152

  38. [38]

    Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin J...

  39. [39]

    Raj Joshi, Cha Hwan Song, Xin Zhe Khooi, Nishant Budhdev, Ayush Mishra, Mun Choon Chan, and Ben Leong. 2023. Masking Corruption Packet Losses in Datacenter Networks with Link-local Retransmission. In ACM SIGCOMM. doi:10.1145/3603269.3604853

  40. [40]

    Can Karakus, Yifan Sun, Suhas Diggavi, and Wotao Yin. 2017. Straggler mitigation in distributed optimization through data encoding. In NIPS. Curran Associates Inc.

  41. [41]

    Naga Katta, Mukesh Hira, Changhoon Kim, Anirudh Sivaraman, and Jennifer Rexford. 2016. HULA: Scalable Load Balancing Using Programmable Data Planes. In ACM Proceedings of the Symposium on SDN Research (SOSR ’16). doi:10.1145/2890955.2890968

  42. [42]

    Xin Zhe Khooi, Levente Csikor, Jialin Li, Min Suk Kang, and Dinil Mon Divakaran. 2021. Revisiting Heavy-Hitter Detection on Commodity Programmable Switches. In 2021 IEEE 7th International Conference on Network Softwarization (NetSoft)

  43. [43]

    ChonLam Lao, Yanfang Le, Kshiteej Mahajan, Yixi Chen, Wenfei Wu, Aditya Akella, and Michael Swift. 2021. ATP: In-network Aggregation for Multi-tenant Learning. In USENIX NSDI

  44. [44]

    Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. 2016. Speeding up distributed machine learning using codes. In IEEE International Symposium on Information Theory (ISIT)

  45. [45]

    Wenxue Li, Xiangzhou Liu, Yunxuan Zhang, Zihao Wang, Wei Gu, Tao Qian, Gaoxiong Zeng, Shoushou Ren, Xinyang Huang, Zhenghang Ren, Bowen Liu, Junxue Zhang, Kai Chen, and Bingyang Liu. 2025. Revisiting RDMA Reliability for Lossy Fabrics. In ACM SIGCOMM. doi:10.1145/3718958.3750480

  46. [46]

    Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng, Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Alizadeh, and Minlan Yu. 2019. HPCC: high precision congestion control. In ACM SIGCOMM. doi:10.1145/3341302.3342085

  47. [47]

    Xudong Liao, Yijun Sun, Han Tian, Xinchen Wan, Yilun Jin, Zilong Wang, Zhenghang Ren, Xinyang Huang, Wenxue Li, Kin Fai Tse, Zhizhen Zhong, Guyue Liu, Ying Zhang, Xiaofeng Ye, Yiming Zhang, and Kai Chen. 2025. MixNet: A Runtime Reconfigurable Optical-Electrical Fabric for Distributed Mixture-of-Experts Training. In ACM SIGCOMM. doi:10.1145/3718958.3750465

  48. [48]

    Hwijoon Lim, Juncheol Ye, Sangeetha Abdu Jyothi, and Dongsu Han. 2024. Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs. In ACM SIGCOMM. doi:10.1145/3651890.3672228

  50. [50]

    Shengkai Lin, Qinwei Yang, Zengyin Yang, Yuchuan Wang, and Shizhen Zhao. 2024. LubeRDMA: A Fail-safe Mechanism of RDMA. In ACM APNET. doi:10.1145/3663408.3663411

  51. [51]

    linux-rdma contributors. 2024. rdma-core: RDMA Core Userspace Libraries and Daemons. https://github.com/linux-rdma/rdma-core

  52. [52]

    Shuo Liu, Qiaoling Wang, Junyi Zhang, Wenfei Wu, Qinliang Lin, Yao Liu, Meng Xu, Marco Canini, Ray C. C. Cheung, and Jianfei He. 2023. In-Network Aggregation with Transport Transparency for Distributed Training. In ACM ASPLOS. doi:10.1145/3582016.3582037

  53. [53]

    Tongrui Liu, Chenyang Hei, Fuliang Li, Chengxi Gao, Jiamin Cao, Tianshu Wang, Ennan Zhai, and Xingwei Wang. 2025. ResCCL: Resource-Efficient Scheduling for Collective Communication. In ACM SIGCOMM. doi:10.1145/3718958.3750514

  54. [54]

    Xuting Liu, Behnaz Arzani, Siva Kesava Reddy Kakarla, Liangyu Zhao, Vincent Liu, Miguel Castro, Srikanth Kandula, and Luke Marshall. 2024. Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem. In ACM SIGCOMM. doi:10.1145/3651890.3672249

  56. [56]

    Zhangqiang Ming, Yuchong Hu, Xinjue Zheng, Wenxiang Zhou, and Dan Feng. 2025. SAFusion: Efficient Tensor Fusion with Sparsification Ahead for High-Performance Distributed DNN Training. In ACM HPDC. New York, NY, USA. doi:10.1145/3731545.3731581

  57. [57]

    Radhika Mittal, Vinh The Lam, Nandita Dukkipati, Emily Blem, Hassan Wassel, Monia Ghobadi, Amin Vahdat, Yaogong Wang, David Wetherall, and David Zats. 2015. TIMELY: RTT-based Congestion Control for the Datacenter. In ACM SIGCOMM. doi:10.1145/2785956.2787510

  58. [58]

    Behnam Montazeri, Yilong Li, Mohammad Alizadeh, and John Ousterhout. 2018. Homa: a receiver-driven low-latency transport protocol using network priorities. In ACM SIGCOMM. doi:10.1145/3230543.3230564

  59. [59]

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN training. In ACM SOSP. doi:10.1145/3341301.3359646

  60. [60]

    Kathleen Nichols and Van Jacobson. 2012. Controlling queue delay. Commun. ACM 55, 7 (July 2012), 42–50. doi:10.1145/2209249.2209264

  61. [61]

    nsnam. 2025. ns-3 Network Simulator. https://www.nsnam.org/

  62. [62]

    Nvidia. 2022. Doubling all2all Performance with NVIDIA Collective Communication Library 2.12. https://developer.nvidia.com/blog/doubling-all2all-performance-with-nvidia-collective-communication-library-2-12/

  63. [63]

    NVIDIA. 2025. NVIDIA Collective Communication Library (NCCL) Documentation. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html

  64. [64]

    Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. 2023. Zero Bubble Pipeline Parallelism. arXiv:2401.10241 [cs.DC] https://arxiv.org/abs/2401.10241

  65. [65]

    Kun Qian, Yongqing Xi, Jiamin Cao, Jiaqi Gao, Yichi Xu, Yu Guan, Binzhang Fu, Xuemei Shi, Fangbo Zhu, Rui Miao, Chao Wang, Peng Wang, Pengcheng Zhang, Xianlong Zeng, Eddie Ruan, Zhiping Yao, Ennan Zhai, and Dennis Cai. 2024. Alibaba HPN: A Data Center Network for Large Language Model Training. In ACM SIGCOMM. doi:10.1145/3651890.3672265

  66. [66]

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2019. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. arXiv:1910.02054 [cs.LG] https://arxiv.org/abs/1910.02054

  68. [68]

    Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna. 2020. ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms. In IEEE ISPASS

  69. [69]

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In ACM SIGKDD. doi:10.1145/3394486.3406703

  70. [70]

    Joshua Romero, Junqi Yin, Nouamane Laanait, Bing Xie, M. Todd Young, Sean Treichler, Vitalii Starchenko, Albina Borisevich, Alex Sergeev, and Michael Matheson. 2022. Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks. In USENIX NSDI 22

  71. [71]

    Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan Ports, and Peter Richtarik. 2021. Scaling Distributed Machine Learning with In-Network Aggregation. In USENIX NSDI

  72. [72]

    Linus E Schrage and Louis W Miller. 1966. The Queue M/G/1 with the Shortest Remaining Processing Time Discipline. Operations Research 14, 4 (1966), 670–684

  73. [73]

    Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv:1802.05799 [cs.LG] https://arxiv.org/abs/1802.05799

  74. [74]

    Aashaka Shah, Vijay Chidambaram, Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi, and Rachee Singh. 2023. TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches. In USENIX NSDI

  75. [75]

    Naveen Kr. Sharma, Ming Liu, Kishore Atreya, and Arvind Krishnamurthy. 2018. Approximating fair queueing on reconfigurable switches. In USENIX NSDI

  76. [76]

    M. Shreedhar and G. Varghese. 1996. Efficient fair queuing using deficit round-robin. IEEE/ACM Transactions on Networking 4, 3 (1996), 375–385. doi:10.1109/90.502236

  77. [77]

    Min Si, Pavan Balaji, Yongzhou Chen, Ching-Hsiang Chu, Adi Gangidi, Saif Hasan, Subodh Iyengar, Dan Johnson, Bingzhe Liu, Regina Ren, Deep Shah, Ashmitha Jeevaraj Shetty, Greg Steinbrecher, Yulun Wang, Bruce Wu, Xinfeng Xie, Jingyi Yang, Mingran Yang, Kenny Yu, Minlan Yu, Cen Zhao, Wes Bland, Denis Boyda, Suman Gumudavelli, Prashanth Kannan, Cristian Lu...

  78. [78]

    Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 [cs.CV] https://arxiv.org/abs/1409.1556

  79. [79]

    Cha Hwan Song, Xin Zhe Khooi, Raj Joshi, Inho Choi, Jialin Li, and Mun Choon Chan. 2023. Network Load Balancing with In-network Reordering Support for RDMA. In ACM SIGCOMM. doi:10.1145/3603269.3604849

  80. [80]

    Rashish Tandon, Qi Lei, Alexandros G. Dimakis, and Nikos Karampatziakis. 2017. Gradient coding: avoiding stragglers in distributed learning. In ICML. JMLR.org

Showing first 80 references.