EPIC: Abstraction and Polymorphism of In-Network Collectives on Ethernet

Binzhang Fu; Chao Jiang; Chenqi Zhao; Jianbo Dong; Jianglong Nie; Jiangyuan Chen; Jiaqi Sun; Jing Lin; Junjie Wang; Junkai Chen

arxiv: 2605.18683 · v1 · pith:XX7EZSL2new · submitted 2026-05-18 · 💻 cs.DC


Yitao Yuan, Jianglong Nie, Tianyu Bai, Ruizhe Zhou, Siyuan Cao, Xujie Fan, Yuchen Xu, Junkai Chen, Chenqi Zhao, Nengyuan Zhang, Shaoke Fang, Jiangyuan Chen, Yuanfeng Chen, Jiaqi Sun, Zhan Wang, Xiaohua Xu, Yuchao Zhang, Yang Liu, Xiangrui Yang, Jing Lin, Xiaohe Hu, Yang Li, Chao Jiang, Limin Xiao, Weifeng Zhang, Junjie Wang, Wei Cheng, Yazhu Lan, Jianbo Dong, Binzhang Fu, Wenfei Wu


Pith reviewed 2026-05-20 08:04 UTC · model grok-4.3

classification 💻 cs.DC

keywords: in-network collective, Ethernet protocol, abstraction, polymorphism, AI acceleration, network hardware, collective operations, INC


The pith

EPIC introduces a standard-Ethernet-compatible abstraction for in-network collectives that supports polymorphic realizations across different hardware capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes EPIC to enable in-network collective acceleration for AI on Ethernet by creating a unified abstraction that aligns functional boundaries with participant roles. This abstraction supports polymorphic realizations that adapt to different hardware capabilities. It tackles adoption problems in the open Ethernet ecosystem through a modular design for incremental hardware development, formal verification of correctness, and a versatile resource management model. A sympathetic reader would care because this could speed up AI training and inference without requiring proprietary hardware.

Core claim

EPIC (Ethernet Polymorphic In-network Collective) is an INC protocol specification and reference system built on 'Unified Abstraction, Polymorphic Realization'. It introduces an abstraction compatible with standard Ethernet that aligns functional boundaries with participant roles, while offering polymorphic realizations tailored to varying hardware capabilities. The design addresses its three stated challenges with a modular structure that allows incremental evolution, formal proofs of correctness for all modes, and a unified resource management model.

What carries the argument

The unified abstraction in EPIC that aligns functional boundaries with participant roles to enable polymorphic realizations for hardware of varying capabilities.

If this is right

Vendors can incrementally develop hardware from simple to complex implementations without losing compatibility.
Formal verification confirms the correctness of all proposed polymorphic modes.
A unified resource management model supports a wide range of in-network collective scenarios.
Performance gains are realized in AI training and inference workloads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This abstraction could be adapted to other networking standards to broaden INC adoption.
Real-world deployments might reveal optimizations for specific AI model types.
The modular approach could template other cross-layer network protocols.

Load-bearing premise

A modular design enables an evolutionary path from simple to complex implementations, allowing vendors to iterate on their hardware incrementally.

What would settle it

A failure in formal verification of any polymorphic mode or the absence of performance gains in Tofino testbed experiments compared to standard Ethernet collectives would falsify the claims.

Figures

Figures reproduced from arXiv: 2605.18683 by Binzhang Fu, Chao Jiang, Chenqi Zhao, Jianbo Dong, Jianglong Nie, Jiangyuan Chen, Jiaqi Sun, Jing Lin, Junjie Wang, Junkai Chen, Limin Xiao, Nengyuan Zhang, Ruizhe Zhou, Shaoke Fang, Siyuan Cao, Tianyu Bai, Wei Cheng, Weifeng Zhang, Wenfei Wu, Xiangrui Yang, Xiaohe Hu, Xiaohua Xu, Xujie Fan, Yang Li, Yang Liu, Yazhu Lan, Yitao Yuan, Yuanfeng Chen, Yuchao Zhang, Yuchen Xu, Zhan Wang.

**Figure 1.** EPIC’s abstraction isolation and contention-based sharing, transforming fragmented resources into a flexible, programmable fabric. This approach minimizes administrative overhead while maximizing resource utilization across dynamic AI scenarios (§6). Adjacent text from the paper (§3 EPIC Overview, §3.1 Abstraction): We define EPIC’s abstraction and map it to Ethernet clusters. EPIC’s realization can support 6 collective primitives. INC Tre…

**Figure 3.** EPIC Architecture. Adjacent text from the paper (§3.3 Workflow, §3.3.1 System and Group Lifecycle): Bootup. The lifecycle begins with component initialization: CommLib daemons launch on hosts, and IncEngine is enabled on switches. All instances report local states to the IncManager, which constructs the global topology and manages resources (§6.1). When an application calls InitGroup(), the IncManager computes and maps a logical IncTree on…

**Figure 4.** Flow Control in CommLib

**Figure 6.** A pitfall in ModeIII design. Adjacent text from the paper: …age the same pre-configured routing rules and polymorphic RoCE functions, ensuring high-performance communication without the bidirectional result distribution required by AllReduce. Interaction with RoCE Flow Control and Congestion Control. EPIC’s flow control means keeping inflight data volume from overwhelming the switch buffers. The three mechanisms: EPIC flow control, Ro…

**Figure 8.** Lookup Tables in Mode-III. Diagram labels: Recv States (psnStart, arrived[N], epsn); Send States (lastAcked, payload buffer); Endpoint 1; Endpoint A; degree buffer; Pipe; data pkt; ACK; incoming endpoints; outgoing endpoints. Annotation: ACK iff pkt.psn ∈ [psnStart, psnStart+N) ≡ min{o.lastAcked | o∈outgoing}+1
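The annotation in the Figure 8 text ("ACK iff pkt.psn ∈ [psnStart, psnStart+N) ≡ min{o.lastAcked | o∈outgoing}+1") reads like a sliding-window acceptance rule. The sketch below is one possible reading, not the paper's implementation: it assumes that psnStart is defined as one past the smallest lastAcked among the outgoing endpoints and that N is the window size. The names `SendState`, `window_start`, and `should_ack` are hypothetical, and PSN wraparound is ignored for simplicity.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SendState:
    """Hypothetical per-outgoing-endpoint state (assumption, not from the paper).

    last_acked is the highest PSN that this outgoing endpoint has acknowledged.
    """
    last_acked: int


def window_start(outgoing: Sequence[SendState]) -> int:
    """Assumed definition: psnStart = min(o.lastAcked for o in outgoing) + 1.

    The slowest outgoing endpoint determines where the window begins.
    Raises ValueError if there are no outgoing endpoints.
    """
    return min(o.last_acked for o in outgoing) + 1


def should_ack(psn: int, outgoing: Sequence[SendState], window: int) -> bool:
    """ACK iff psn lies in the half-open range [psnStart, psnStart + window)."""
    start = window_start(outgoing)
    return start <= psn < start + window


if __name__ == "__main__":
    # Two outgoing endpoints; the slower one has acknowledged up to PSN 9.
    outgoing = [SendState(last_acked=9), SendState(last_acked=12)]
    print("window start:", window_start(outgoing))
    print("PSNs that would be ACKed:",
          [psn for psn in range(8, 16) if should_ack(psn, outgoing, window=4)])
```

Under this reading, the window cannot advance until every outgoing endpoint has acknowledged the oldest packet, which would bound the buffered data to N packets per pipe.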

**Figure 10.** Comparison of Modes (outer is better). Adjacent text from the paper: …requires full message reception prior to processing, whereas Mode-II/III utilize packet-level (MTU) pipelining. Consequently, Mode-II/III reduces end-to-end latency by approximately (2H − 1)(M − 1)U/B, offering advantages unless the message size S is exceptionally large or the propagation delay L dominates. Logic Complexity. Considering the module organization, M…
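The (2H − 1)(M − 1)U/B saving quoted in the Figure 10 text can be reproduced from a simple store-and-forward model. The sketch below is an illustration rather than the paper's derivation. It assumes that H is the tree height (so data crosses 2H store-and-forward stages, H up and H down), M is the number of MTU-sized packets per message, U is the packet size in bits, and B is the link bandwidth in bits per second. These symbol meanings are not defined in the excerpt, so all of them are assumptions, and the function names are hypothetical. Propagation delay L is omitted because it would add the same term to both models.

```python
import math


def message_level_latency(h: int, m: int, u_bits: float, b_bps: float) -> float:
    """Message-granularity store-and-forward (assumed model).

    Each of the 2H stages must receive all M packets before forwarding,
    so every stage costs M * U / B seconds.
    """
    return 2 * h * m * u_bits / b_bps


def packet_level_latency(h: int, m: int, u_bits: float, b_bps: float) -> float:
    """Packet-granularity pipelining (assumed model).

    The first packet crosses 2H stages at U / B each; the remaining
    M - 1 packets follow back to back behind it.
    """
    return (2 * h + m - 1) * u_bits / b_bps


def pipelining_saving(h: int, m: int, u_bits: float, b_bps: float) -> float:
    """Closed form quoted in the text: (2H - 1)(M - 1) U / B."""
    return (2 * h - 1) * (m - 1) * u_bits / b_bps


if __name__ == "__main__":
    # Illustrative numbers only: height-2 tree, 256 packets of 4096 bytes, 100 Gb/s links.
    h, m, u_bits, b_bps = 2, 256, 4096 * 8, 100e9
    diff = message_level_latency(h, m, u_bits, b_bps) - packet_level_latency(h, m, u_bits, b_bps)
    print("difference of the two models (s):", diff)
    print("closed-form saving (s):          ", pipelining_saving(h, m, u_bits, b_bps))
    print("agree:", math.isclose(diff, pipelining_saving(h, m, u_bits, b_bps)))
```

The two models differ by 2HM − (2H + M − 1) = (2H − 1)(M − 1) packet times, which matches the quoted expression under these assumptions.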

**Figure 13.** [Testbed, Tree-2-8] Collective Algorithm

**Figure 12.** [Testbed, Tree-2-8] AllReduce Algorithm Throughput. Adjacent text from the paper: …finish normally. EPIC-III has more complex switch workflow, so its throughput is lower than EPIC-II. It is hard to make a fair performance comparison (and not the goal of emulation) among EPIC-I vs EPIC-II/III vs MPI, because their software stack differs (details in Appendix §J). Evolvability. The development of EPIC-II and III follows the modular design.…

**Figure 14.** [Testbed, Tree-2-8] Bandwidth Complements between ReduceScatter and AllGather. Adjacent text from the paper: …receive for both ring-based collectives, causing bandwidth contention. Conversely, EPIC breaks this limit by exploiting directional link independence. By orchestrating nodes to transmit for RS while receiving for AG (and vice versa), EPIC ensures non-overlapping directional flows. Consequently, EPIC achieves a theoretical thro…

**Figure 15.** [Packet Simulation, Tree-2-8] AllReduce Al…

**Figure 16.** [Flow Simulation] Tail 15% JCT of Alibaba…

**Figure 17.** EPIC in switch micro architecture. Adjacent text from the paper: Modern high-performance switches typically utilize a pipelined architecture comprising an Ingress Pipeline, a Traffic Manager (TM), and an Egress Pipeline. The Ingress Pipeline performs packet parsing and forwarding decisions, while the TM serves as the data plane core, managing on-chip shared memory for buffering and scheduling. The Egress Pipeline completes the proces…

**Figure 18.** Transmission Efficiency. Adjacent text from the paper: Switches operate in Store-and-Forward model but at different granularities across modes. Following the definitions in…

**Figure 19.** IncManager and policies

**Figure 20.** [Flow Simulation] JCT CDF of Trace 1 on 2048-GPU Fat-tree. Panels: (a) All Jobs; (b) Tail 15% Jobs. Axes: Job Completion Time (s) against CDF. Legend: Ring, EDT, Spatial Mux, Temporal Mux.

**Figure 23.** [Flow Simulation] 8-GPU’s JCT CDF of Trace…

**Figure 24.** [Flow Simulation] 8-GPU’s JCT CDF of Trace…

Original abstract

In-Network Collective (INC) acceleration holds immense potential for optimizing AI training and inference; however, its cross-layer nature has historically hindered investment and adoption within the open Ethernet ecosystem. To bridge this gap, we propose EPIC (Ethernet Polymorphic In-network Collective), an INC protocol specification and reference system built on the principle of "Unified Abstraction, Polymorphic Realization." EPIC introduces an abstraction compatible with standard Ethernet that aligns functional boundaries with participant roles, while offering polymorphic realizations tailored to varying hardware capabilities. We address three fundamental challenges: first, we employ a modular design that enables an evolutionary path from simple to complex implementations, allowing vendors to iterate their hardware incrementally; second, we apply formal verification methodologies to prove the correctness of all proposed polymorphic modes; and third, we develop a unified resource management model versatile enough for diverse INC scenarios. Extensive validation -- spanning model checking, packet/flow simulations, VM emulation, Tofino Testbed, and FPGA/RTL verification -- confirms EPIC's correctness, performance gain, and feasibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

EPIC's unified abstraction and polymorphic modes for in-network collectives on Ethernet are the fresh part, but whether the basic realizations actually run on unmodified commodity switches is the part that still needs checking.


The main takeaway is that this paper defines EPIC as a protocol spec that keeps one abstraction layer while letting different hardware do the work in different ways. The modular evolutionary path from simple to complex implementations is the piece that feels new compared to earlier INC work that usually locked into one hardware style or another. They also lay out a resource management model meant to cover multiple collective patterns without rewriting everything each time. That combination of abstraction plus polymorphism plus incremental hardware path is what they are really selling.

The formal verification across modes and the spread of validation from model checking through simulations, emulation, Tofino, and FPGA/RTL is more concrete than most papers in this area manage. Credit for shipping that level of checking instead of stopping at high-level claims.

The soft spot sits right where the stress-test note points. The abstract keeps saying the design is compatible with standard Ethernet, yet every hardware result they show uses programmable platforms. If even the lowest mode still needs custom packet processing or stateful tables, then the claimed starting point on plain L2/L3 switches does not hold and the evolutionary story loses its first rung. That is not a minor detail; it directly affects whether vendors without programmable silicon can begin adopting it. The performance numbers are plausible but still rest on the testbed setups rather than large-scale production traces. No circular fitting or invented entities jump out.

This is for distributed-systems people who care about collective communication in Ethernet-based AI clusters and want a concrete protocol spec plus verification approach to build on. A reader already working on in-network acceleration or hardware offload would get usable ideas from the design sections. It is worth sending to peer review. The topic matters and the structure is solid enough that referees can give targeted feedback on the compatibility and realization details.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes EPIC (Ethernet Polymorphic In-network Collective), a protocol specification and reference system for in-network collective acceleration on Ethernet. It introduces a unified abstraction compatible with standard Ethernet that aligns functional boundaries with participant roles and supports polymorphic realizations for different hardware capabilities. The work addresses three challenges: modular design for evolutionary hardware implementation, formal verification of correctness for all modes, and a unified resource management model. Validation spans model checking, simulations, VM emulation, Tofino testbed, and FPGA/RTL verification, claiming correctness, performance gains, and feasibility.

Significance. If the central claims hold, particularly the compatibility with unmodified Ethernet and the polymorphic approach enabling incremental vendor adoption, this could facilitate broader investment and adoption of in-network collectives in the open Ethernet ecosystem for AI training and inference, addressing a historical barrier due to cross-layer nature.

major comments (1)

[Abstract] The abstract states that EPIC is 'compatible with standard Ethernet' and offers 'polymorphic realizations tailored to varying hardware capabilities', with validation including 'Tofino Testbed, and FPGA/RTL verification'. If even the basic modes require programmable switch features (as implied by the testbeds used for all modes), this contradicts the claim of compatibility with unmodified standard Ethernet hardware and the modular evolutionary path starting from simple implementations on existing vendor silicon. This is load-bearing for the central claim and requires clarification or evidence of a non-programmable basic mode.

minor comments (1)

[Abstract] The abstract mentions 'extensive validation' but does not specify quantitative performance gains or specific metrics used to confirm feasibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for identifying a point that is central to the paper's claims. We address the major comment below and commit to revisions that improve clarity without altering the core technical contributions.


Referee: [Abstract] The abstract states that EPIC is 'compatible with standard Ethernet' and offers 'polymorphic realizations tailored to varying hardware capabilities', with validation including 'Tofino Testbed, and FPGA/RTL verification'. If even the basic modes require programmable switch features (as implied by the testbeds used for all modes), this contradicts the claim of compatibility with unmodified standard Ethernet hardware and the modular evolutionary path starting from simple implementations on existing vendor silicon. This is load-bearing for the central claim and requires clarification or evidence of a non-programmable basic mode.

Authors: We appreciate the referee highlighting this need for clarification. The EPIC abstraction is defined to be compatible with standard Ethernet by using unmodified Ethernet frame formats, standard multicast groups, and conventional switch forwarding tables for the basic mode; no programmable data-plane features are required for correct operation or role alignment in this mode. Polymorphic realizations then layer optional in-network computation on top when programmable hardware (Tofino, FPGA) is present. The listed testbeds were selected to exercise the full spectrum of modes and to provide rigorous verification of the advanced realizations, but they do not imply that the basic mode depends on programmability. We agree that the manuscript would benefit from an explicit statement of per-mode hardware requirements and a short illustrative example of the basic mode on commodity silicon. We will therefore revise the abstract, add a clarifying paragraph in Section 3, and include a table summarizing hardware prerequisites for each polymorphic variant. These changes will be incorporated in the revised manuscript.

Revision: yes

Circularity Check

0 steps flagged

No circularity: design proposal rests on independent specification and validation


The paper presents EPIC as a new protocol specification and reference system based on the principle of unified abstraction with polymorphic realizations. Claims about compatibility with standard Ethernet, modular evolutionary path, formal verification, and resource management are introduced as design choices rather than derived from fitted parameters or prior self-referential results. Validation spans model checking, simulations, emulation, Tofino testbed, and FPGA/RTL, providing external checks. No equations, self-citations, or reductions to inputs by construction appear in the abstract or described structure. The derivation chain is self-contained as a proposed architecture with stated assumptions that do not presuppose the target outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The proposal rests on the assumption that formal verification can cover all polymorphic modes and that the modular structure supports incremental hardware development without additional free parameters or invented entities beyond the protocol itself.

axioms (1)

domain assumption: Formal verification methodologies can prove the correctness of all proposed polymorphic modes.
Stated in the abstract as one of the three fundamental challenges addressed.

invented entities (1)

EPIC abstraction and polymorphic realizations (no independent evidence)
purpose: To provide a unified, Ethernet-compatible interface for in-network collectives with hardware-specific implementations.
Newly defined protocol elements introduced to solve the cross-layer adoption problem.

pith-pipeline@v0.9.0 · 5825 in / 1135 out tokens · 32653 ms · 2026-05-20T08:04:23.335894+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: EPIC introduces an abstraction compatible with standard Ethernet that aligns functional boundaries with participant roles, while offering polymorphic realizations tailored to varying hardware capabilities.

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

Paper passage: We employ a modular design that enables an evolutionary path from simple to complex implementations

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages · 2 internal anchors

[1]

AMD. 2024. RCCL. (2024). https://github.com/ROCm/rccl

work page 2024
[2]

Wei Bai, Shanim Sainul Abdeen, Ankit Agrawal, Krishan Kumar Attre, Paramvir Bahl, Ameya Bhagat, Gowri Bhaskara, Tanya Brokhman, Lei Cao, Ahmad Cheema, et al . 2023. Empowering azure storage with RDMA. In20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 49–67

work page 2023
[3]

Marcel Blöcher, Lin Wang, Patrick Eugster, and Max Schmidt. 2021. Switches for HIRE: Resource scheduling for data center in-network computing. InProceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 268–285

work page 2021
[4]

Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, et al. 2014. P4: Programming protocol-independent packet processors.ACM SIGCOMM Computer Communication Review44, 3 (2014), 87–95

work page 2014
[5]

Broadcom Inc. 2026. StrataXGS Tomahawk 5 Series: 51.2 Tb/s Ethernet Switch ASIC Family. https://www.broadcom.com/products/ethernet -connectivity/switching/strataxgs/bcm78920-series. (2026). Accessed: 2026-02-06

work page 2026
[6]

Jiamin Cao, Yu Guan, Kun Qian, Jiaqi Gao, Wencong Xiao, Jianbo Dong, Binzhang Fu, Dennis Cai, and Ennan Zhai. 2024. Crux: Gpu-efficient communication scheduling for deep learning training. InProceedings of the ACM SIGCOMM 2024 Conference. 1–15

work page 2024
[7]

Daniele De Sensi, Salvatore Di Girolamo, Saleh Ashkboos, Shigang Li, and Torsten Hoefler. 2021. Flare: Flexible in-network allreduce. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–16

work page 2021
[8]

Salvatore Di Girolamo, Andreas Kurth, Alexandru Calotoiu, Thomas Benz, Timo Schneider, Jakub Beránek, Luca Benini, and Torsten Hoefler

work page
[9]

In2021 ACM/IEEE 48th Annual Interna- tional Symposium on Computer Architecture (ISCA)

A RISC-V in-network accelerator for flexible high-performance low-power packet processing. In2021 ACM/IEEE 48th Annual Interna- tional Symposium on Computer Architecture (ISCA). IEEE, 958–971

work page
[10]

Mihai Dobrescu, Norbert Egi, Katerina Argyraki, Byung-Gon Chun, Kevin Fall, Gianluca Iannaccone, Allan Knies, Maziar Manesh, and Sylvia Ratnasamy. 2009. RouteBricks: Exploiting parallelism to scale software routers. InProceedings of the ACM SIGOPS 22nd symposium on Operating systems principles. 15–28

work page 2009
[11]

Jianbo Dong, Bin Luo, Jun Zhang, Pengcheng Zhang, Fei Feng, Yikai Zhu, Ang Liu, Zian Chen, Yi Shi, Hairong Jiao, et al . 2024. Boost- ing large-scale parallel training efficiency with c4: A communication- driven approach.arXiv preprint arXiv:2406.04594(2024)

work page arXiv 2024
[12]

Jianbo Dong, Shaochuang Wang, Fei Feng, Zheng Cao, Heng Pan, Lingbo Tang, Pengcheng Li, Hao Li, Qianyuan Ran, Yiqun Guo, et al

work page
[13]

ACCL: Architecting Highly Scalable Distributed Training Sys- tems with Highly Efficient Collective Communication Library.IEEE Micro41, 5 (2021), 85–92

work page 2021
[14]

Shichen Dong, Zhixiong Niu, Mingchao Zhang, Zhiying Xu, Chuntao Hu, Pengzhi Zhu, Qingchun Song, Lei Qu, Peng Cheng, Cam-Tu Nguyen, et al . 2025. Mina: Fine-Grained In-network Aggregation Resource Scheduling for Machine Learning Service. InIEEE INFOCOM 2025-IEEE Conference on Computer Communications. IEEE, 1–10

work page 2025
[15]

Cheng Tien Ee, Rodrigo Fonseca, Sukun Kim, Daekyeong Moon, Ar- salan Tavakoli, David E Culler, Scott Shenker, and Ion Stoica. 2006. A Modular Network Layer for Sensornets.. InOSDI, Vol. 6. 249–262

work page 2006
[16]

Jin Fang, Gongming Zhao, Hongli Xu, Changbo Wu, and Zhuolong Yu. 2023. GRID: Gradient routing with in-network aggregation for distributed training.IEEE/ACM Transactions on Networking31, 5 (2023), 2267–2280

work page 2023
[17]

Fagg, George Bosilca, Thara Angskun, Jack J

Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. 2004. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. InPro- ceedings of the 11th...

work page 2004
[18]

Massimo Gallo and Rafael Laufer. 2018. ClickNF: a Modular Stack for Custom Network Functions. In2018 USENIX Annual Technical Conference (USENIX ATC 18). 745–757

work page 2018
[19]

Adithya Gangidi, Rui Miao, Shengbao Zheng, Sai Jayesh Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Mohammad Riftadi, Ashmitha Jeevaraj Shetty, Jingyi Yang, et al . 2024. Rdma over eth- ernet for distributed training at meta scale. InProceedings of the ACM SIGCOMM 2024 Conference. 57–70

work page 2024
[20]

Yixiao Gao, Qiang Li, Lingbo Tang, Yongqing Xi, Pengcheng Zhang, Wenwen Peng, Bo Li, Yaohui Wu, Shaozong Liu, Lei Yan, et al. 2021. When cloud storage meets RDMA. In18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21). 519–533

work page 2021
[21]

Nadeen Gebara, Manya Ghobadi, and Paolo Costa. 2021. In-network ag- gregation for shared machine learning clusters.Proceedings of Machine Learning and Systems3 (2021), 829–844

work page 2021
[22]

Richard L Graham, Devendar Bureddy, Pak Lui, Hal Rosenstock, Gilad Shainer, Gil Bloch, Dror Goldenerg, Mike Dubman, Sasha Kotchu- bievsky, Vladimir Koushnir, et al. 2016. Scalable hierarchical aggre- gation protocol (SHArP): A hardware architecture for efficient data reduction. In2016 First International Workshop on Communication Op- timizations in HPC (C...

work page 2016
[23]

Richard L Graham, Lion Levi, Devendar Burredy, Gil Bloch, Gilad Shainer, David Cho, George Elias, Daniel Klein, Joshua Ladd, Ophir Maor, et al. 2020. Scalable hierarchical aggregation and reduction protocol (sharp) tm streaming-aggregation hardware design and eval- uation. InInternational Conference on High Performance Computing. Springer, 41–59

work page 2020
[24]

Yongchao He, Wenfei Wu, Yanfang Le, Ming Liu, and ChonLam Lao

work page
[25]

InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2

A generic service to provide in-network aggregation for key- value streams. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 33–47

work page
[26]

Mert Hidayetoglu, Simon Garcia de Gonzalo, Elliott Slaughter, Pinku Surana, Wen-mei Hwu, William Gropp, and Alex Aiken. 2024. HiCCL: A Hierarchical Collective Communication Library.arXiv preprint arXiv:2408.05962(2024)

work page arXiv 2024
[27]

Torsten Hoefler, Tommaso Bonato, Daniele De Sensi, Salvatore Di Giro- lamo, Shigang Li, Marco Heddes, Deepak Goel, Miguel Castro, and Steve Scott. 2024. Hammingmesh: A network topology for large-scale deep learning.Commun. ACM67, 12 (2024), 97–105

work page 2024
[28]

Torsten Hoefler, Mikhail Khalilov, Josiah Clark, Surendra Anubolu, Mohan Kalkunte, Karen Schramm, Eric Spada, Duncan Roweth, Keith Underwood, Adrian Caulfield, et al. 2025. In-Network Collective Op- erations: Game Changer or Challenge for AI Workloads?Computer 59, 1 (2025), 24–33

work page 2025
[29]

Torsten Hoefler, Andrew Lumsdaine, and Wolfgang Rehm. 2007. Im- plementation and performance analysis of non-blocking collective operations for MPI. InProceedings of the 2007 ACM/IEEE conference on Supercomputing. 1–10

work page 2007
[30]

Torsten Hoefler and Dmitry Moor. 2014. Energy, memory, and run- time tradeoffs for implementing collective communication operations. Supercomputing frontiers and innovations1, 2 (2014), 58–75

work page 2014
[31]

Chengyuan Huang, Yixiao Gao, Wei Chen, Duoxing Li, Yibo Xiao, Ruyi Zhang, Chen Tian, Xiaoliang Wang, Wanchun Dou, Guihai Chen, et al

work page
[32]

In2023 IEEE 31st 15 Y

MC-RDMA: Improving Replication Performance of RDMA-based Distributed Systems with Reliable Multicast Support. In2023 IEEE 31st 15 Y. Yuan, J. Nie, T. Bai, R. Zhou, S. Cao, X. Fan, et al. International Conference on Network Protocols (ICNP). IEEE, 1–11

work page
[33]

Guyue Huang, Hao Li, Le Qin, Jiayi Huang, Yangwook Kang, Yufei Ding, and Yuan Xie. 2025. TRACI: Network Acceleration of Input- Dynamic Communication for Large-Scale Deep Learning Recommen- dation Model. InProceedings of the 52nd Annual International Sympo- sium on Computer Architecture. 1880–1893

work page 2025
[34]

InfiniBand Trade Association. [n. d.].Supplement to InfiniBand™Archi- tecture Specification Volume 1 Release 1.2.1: Annex A17: RoCEv2. Tech- nical Specification Supplement Release 1.2.1, Annex A17. InfiniBand Trade Association. Proprietary document; available via InfiniBand Trade Association membership

work page
[35]

Intel. 2024. Intel Tofino 2. (2024). https://www.intel.com/content/ww w/us/en/products/details/network-io/intelligent-fabric-processors/ tofino-2.html

work page 2024
[36]

Intel. 2024. oneAPI Collective Communications Library (oneCCL). (2024). https://github.com/oneapi-src/oneCCL

work page 2024
[37]

Norm Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, et al. 2023. Tpu v4: An optically reconfigurable supercom- puter for machine learning with hardware support for embeddings. In Proceedings of the 50th annual international symposium on computer architecture. 1–14

work page 2023
[38]

Mikhail Khalilov, Salvatore Di Girolamo, Marcin Chrapek, Rami Nudelman, Gil Bloch, and Torsten Hoefler. 2024. Network-offloaded bandwidth-optimal broadcast and Allgather for distributed AI. InSC24: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–17

work page 2024
[39]

Heehoon Kim, Junyeol Ryu, and Jaejin Lee. 2024. TCCL: Discovering Better Communication Paths for PCIe GPU Clusters. InProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 999–1015

work page 2024
[40]

Benjamin Klenk, Nan Jiang, Greg Thorson, and Larry Dennison. 2020. An in-network architecture for accelerating shared-memory multi- processor collectives. In2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 996–1009

work page 2020
[41]

Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M Frans Kaashoek. 2000. The Click modular router.ACM Transactions on Computer Systems (TOCS)18, 3 (2000), 263–297

work page 2000
[42]

2002.Specifying systems: The TLA+ language and tools for hardware and software engineers

Leslie Lamport. 2002.Specifying systems: The TLA+ language and tools for hardware and software engineers. Addison-Wesley

work page 2002
[43]

ChonLam Lao, Jiaqi Gao, Jiamin Cao, Zhipeng Zhang, Pengcheng Zhang, Jiangfei Duan, Minlan Yu, Aditya Akella, Zhilong Zheng, Yu Guan, Yichi Xu, Yong Li, Ennan Zhai, Dennis Cai, Zhengping Qian, and Jingren Zhou. 2026. Continuum: An Interruption-Resilient Runtime for ML Training. InOSDI

work page 2026
[44]

ChonLam Lao, Yanfang Le, Kshiteej Mahajan, Yixi Chen, Wenfei Wu, Aditya Akella, and Michael Swift. 2021. ATP: In-network aggregation for multi-tenant learning. In18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21). 741–761

[45]

Wenxue Li, Xiangzhou Liu, Yuxuan Li, Yilun Jin, Han Tian, Zhizhen Zhong, Guyue Liu, Ying Zhang, and Kai Chen. 2024. Understanding communication characteristics of distributed training. In Proceedings of the 8th Asia-Pacific Workshop on Networking. 1–8

[46]

Wenxue Li, Xiangzhou Liu, Yunxuan Zhang, Zihao Wang, Wei Gu, Tao Qian, Gaoxiong Zeng, Shoushou Ren, Xinyang Huang, Zhenghang Ren, et al. 2025. Revisiting RDMA Reliability for Lossy Fabrics. In Proceedings of the ACM SIGCOMM 2025 Conference. 85–98

[47]

Wenxue Li, Junyi Zhang, Yufei Liu, Gaoxiong Zeng, Zilong Wang, Chaoliang Zeng, Pengpeng Zhou, Qiaoling Wang, and Kai Chen. 2024. Cepheus: Accelerating datacenter applications with high-performance RoCE-capable multicast. In 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 908–921

[48]

Youjie Li, Iou-Jen Liu, Yifan Yuan, Deming Chen, Alexander Schwing, and Jian Huang. 2019. Accelerating distributed reinforcement learning with in-switch computing. In Proceedings of the 46th International Symposium on Computer Architecture. 279–291

[49]

Zhaoyi Li, Jiawei Huang, Yijun Li, Aikun Xu, Shengwen Zhou, Jingling Liu, and Jianxin Wang. 2023. A2TP: Aggregator-aware in-network aggregation for multi-tenant learning. In Proceedings of the Eighteenth European Conference on Computer Systems. 639–653

[50]

Zhaoyi Li, Jiawei Huang, Tao Zhang, Shengwen Zhou, Qile Wang, Yijun Li, Jingling Liu, Wanchun Jiang, and Jianxin Wang. 2023. PA-ATP: Progress-Aware Transmission Protocol for In-Network Aggregation. In 2023 IEEE 31st International Conference on Network Protocols (ICNP). IEEE, 1–11

[51]

Linux man-pages project. [n. d.]. rxe(7): Software RDMA over Ethernet (RoCE) driver. ([n. d.]). https://man7.org/linux/man-pages/man7/rxe.7.html Documents Linux kernel module rdma_rxe (Soft-RoCE/RXE)

[52]

Linux RDMA Community. 2024. libibverbs: Userspace InfiniBand Verbs Library. https://github.com/linux-rdma/rdma-core/tree/master/libibverbs. (2024). Part of rdma-core; provides the ibv_* API for RDMA device management, QP/CQ/MR operations, and data transfer

[53]

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437 (2024)

[55]

Shuo Liu, Qiaoling Wang, Junyi Zhang, Wenfei Wu, Qinliang Lin, Yao Liu, Meng Xu, Marco Canini, Ray CC Cheung, and Jianfei He. 2023. In-network aggregation with transport transparency for distributed training. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 376–391

[56]

Qingkai Meng, Hao Zheng, Zhenhui Zhang, ChonLam Lao, Chengyuan Huang, Baojia Li, Ziyuan Zhu, Hao Lu, Weizhen Dang, Zitong Lin, et al. 2025. Astral: A datacenter infrastructure for large language model training at scale. In Proceedings of the ACM SIGCOMM 2025 Conference. 609–625

[58]

Zili Meng, Jun Bi, Haiping Wang, Chen Sun, and Hongxin Hu. 2019. MicroNF: An efficient framework for enabling modularized service chains in NFV. IEEE Journal on Selected Areas in Communications 37, 8 (2019), 1851–1865

[59]

Microsoft. 2023. MSCCL. (2023). https://github.com/microsoft/msccl

[60]

NVIDIA. 2024. NCCL. (2024). https://github.com/NVIDIA/nccl

[61]

NVIDIA Corporation. [n. d.]. NVIDIA NVLink High-Speed Interconnect: Application Performance. Whitepaper. NVIDIA Corporation

[62]

OMNeT++ Community. 2024. OMNeT++ Discrete Event Simulator. https://omnetpp.org. (2024). Version 6.2

[63]

OpenInfra Foundation. [n. d.]. OpenStack: Open source cloud computing infrastructure. ([n. d.]). https://www.openstack.org/

[64]

Aurojit Panda, Sangjin Han, Keon Jang, Melvin Walls, Sylvia Ratnasamy, and Scott Shenker. 2016. NetBricks: Taking the V out of NFV. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 203–216

[65]

Ben Pfaff, Justin Pettit, Teemu Koponen, Ethan J. Jackson, Andy Zhou, Jarno Rajahalme, Jesse Gross, Alex Wang, Jonathan Stringer, Pravin Shelar, Keith Amidon, and Martin Casado. 2015. The Design and Implementation of Open vSwitch. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’15). USENIX Association, 117–130. https://www....

[66]

Chenchen Qi, Wenfei Wu, Yongcan Wang, Keqiang He, Yu-Hsiang Kao, Zongying He, Chen-Yu Yen, Zhuo Jiang, Feng Luo, Surendra Anubolu, et al. 2025. SGLB: Scalable and Robust Global Load Balancing in Commodity AI Clusters. In Proceedings of the ACM SIGCOMM 2025 Conference. 626–644

[67]

Kun Qian, Yongqing Xi, Jiamin Cao, Jiaqi Gao, Yichi Xu, Yu Guan, Binzhang Fu, Xuemei Shi, Fangbo Zhu, Rui Miao, et al. 2024. Alibaba HPN: A data center network for large language model training. In Proceedings of the ACM SIGCOMM 2024 Conference. 691–706

[68]

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–16

[70]

Ori Rottenstreich and Jose Yallouz. 2024. Edge-disjoint tree allocation for multi-tenant cloud security in datacenter topologies. IEEE/ACM Transactions on Networking 32, 4 (2024), 2858–2874

[71]

Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan Ports, and Peter Richtárik. 2021. Scaling distributed machine learning with In-Network aggregation. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21). 785–808

[72]

Raz Segal, Chen Avin, and Gabriel Scalosub. 2021. SOAR: Minimizing network utilization with bounded in-network computing. In Proceedings of the 17th International Conference on emerging Networking EXperiments and Technologies. 16–29

[73]

Raz Segal, Chen Avin, and Gabriel Scalosub. 2022. Constrained in-network computing with low congestion in datacenter networks. In IEEE INFOCOM 2022-IEEE Conference on Computer Communications. IEEE, 1639–1648

[74]

Aashaka Shah, Vijay Chidambaram, Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi, and Rachee Singh. 2023. TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 593–612

[75]

The Tcpdump Group. [n. d.]. libpcap: Portable packet capture library. ([n. d.]). https://www.tcpdump.org/

[76]

UAlink Consortium. 2024. UAlink Consortium. Online Consortium Website. (2024). https://ualinkconsortium.org/

[77]

Ultra Ethernet Consortium. 2024. Ultra Ethernet Specification Update. Ultra Ethernet Consortium Blog. (29 August 2024). https://ultraethernet.org/ultra-ethernet-specification-update/ Accessed: 2026-02-06

[78]

Xinchen Wan, Luyang Li, Han Tian, Xudong Liao, Xinyang Huang, Chaoliang Zeng, Zilong Wang, Xinyu Yang, Ke Cheng, Qingsong Ning, et al. 2025. A Generic and Efficient Communication Framework for Message-level In-Network Computing. In IEEE INFOCOM 2025-IEEE Conference on Computer Communications. IEEE, 1–10

[79]

Ruiqi Wang, Dezun Dong, Fei Lei, Junchao Ma, Ke Wu, and Kai Lu. 2023. Roar: A router microarchitecture for in-network allreduce. In Proceedings of the 37th International Conference on Supercomputing. 423–436


