
arxiv: 2605.18683 · v1 · pith:XX7EZSL2new · submitted 2026-05-18 · 💻 cs.DC

EPIC: Abstraction and Polymorphism of In-Network Collectives on Ethernet

Pith reviewed 2026-05-20 08:04 UTC · model grok-4.3

classification 💻 cs.DC
keywords: in-network collective, Ethernet protocol, abstraction, polymorphism, AI acceleration, network hardware, collective operations, INC

The pith

EPIC introduces a standard-Ethernet-compatible abstraction for in-network collectives that supports polymorphic realizations across different hardware capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes EPIC to enable in-network collective (INC) acceleration for AI on Ethernet through a unified abstraction that aligns functional boundaries with participant roles and supports polymorphic realizations adapted to different hardware capabilities. It targets adoption barriers in the open Ethernet ecosystem with three pieces: a modular design for incremental hardware development, formal verification of correctness, and a versatile resource management model. Sympathetic readers would care because this could speed up AI training and inference without requiring proprietary hardware.

Core claim

EPIC (Ethernet Polymorphic In-network Collective) is an INC protocol specification and reference system built on the principle of 'Unified Abstraction, Polymorphic Realization'. It introduces an abstraction compatible with standard Ethernet that aligns functional boundaries with participant roles, while offering polymorphic realizations tailored to varying hardware capabilities. The design addresses three challenges with, respectively, a modular structure that can evolve incrementally, formal proofs of correctness for all modes, and a unified resource management model.

What carries the argument

The unified abstraction in EPIC that aligns functional boundaries with participant roles to enable polymorphic realizations for hardware of varying capabilities.

If this is right

  • Vendors can incrementally develop hardware from simple to complex implementations without losing compatibility.
  • Formal verification confirms the correctness of all proposed polymorphic modes.
  • A unified resource management model supports a wide range of in-network collective scenarios.
  • Performance gains are realized in AI training and inference workloads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This abstraction could be adapted to other networking standards to broaden INC adoption.
  • Real-world deployments might reveal optimizations for specific AI model types.
  • The modular approach could template other cross-layer network protocols.

Load-bearing premise

A modular design enables an evolutionary path from simple to complex implementations allowing vendors to iterate their hardware incrementally.

What would settle it

A failure in formal verification of any polymorphic mode or the absence of performance gains in Tofino testbed experiments compared to standard Ethernet collectives would falsify the claims.

Figures

Figures reproduced from arXiv: 2605.18683 by Binzhang Fu, Chao Jiang, Chenqi Zhao, Jianbo Dong, Jianglong Nie, Jiangyuan Chen, Jiaqi Sun, Jing Lin, Junjie Wang, Junkai Chen, Limin Xiao, Nengyuan Zhang, Ruizhe Zhou, Shaoke Fang, Siyuan Cao, Tianyu Bai, Wei Cheng, Weifeng Zhang, Wenfei Wu, Xiangrui Yang, Xiaohe Hu, Xiaohua Xu, Xujie Fan, Yang Li, Yang Liu, Yazhu Lan, Yitao Yuan, Yuanfeng Chen, Yuchao Zhang, Yuchen Xu, Zhan Wang.

Figure 1
Figure 1: EPIC’s abstraction
… isolation and contention-based sharing, transforming fragmented resources into a flexible, programmable fabric. This approach minimizes administrative overhead while maximizing resource utilization across dynamic AI scenarios (§6). 3 EPIC OVERVIEW. 3.1 Abstraction. We define EPIC’s abstraction and map it to Ethernet clusters. EPIC’s realization can support 6 collective primitives. INC Tre…
Figure 3
Figure 3: EPIC Architecture
3.3 Workflow. 3.3.1 System and Group Lifecycle. Bootup. The lifecycle begins with component initialization: CommLib daemons launch on hosts, and IncEngine is enabled on switches. All instances report local states to the IncManager, which constructs the global topology and manages resources (§6.1). When an application calls InitGroup(), the IncManager computes and maps a logical IncTree on…
Figure 4
Figure 4: Flow Control in CommLib
Figure 6
Figure 6: A pitfall in Mode-III design
…age the same pre-configured routing rules and polymorphic RoCE functions, ensuring high-performance communication without the bidirectional result distribution required by AllReduce. Interaction with RoCE Flow Control and Congestion Control. EPIC’s flow control means keeping inflight data volume from overwhelming the switch buffers. The three mechanisms, EPIC flow control, Ro…
Figure 8
Figure 8: Lookup Tables in Mode-III
Labels in the figure: psnStart; Recv States (arrived[N], epsn); Send States (lastAcked, payload buffer); Endpoint 1; Endpoint A; degree buffer; Pipe; data pkt; ACK; incoming endpoints; outgoing endpoints. Rule shown: ACK iff pkt.psn ∈ [psnStart, psnStart+N), with psnStart ≡ min{o.lastAcked | o ∈ outgoing} + 1.
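The ACK rule in the Figure 8 labels can be read as a sliding-window check. Below is a minimal Python sketch under that reading: it assumes psnStart is one past the smallest lastAcked among the outgoing endpoints and that N is the window size. The names SendState, psn_start, and should_ack are illustrative and do not come from the paper, and PSN wraparound is ignored for simplicity.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SendState:
    """Send-side state for one outgoing endpoint (hypothetical container)."""
    last_acked: int  # highest PSN acknowledged by the downstream peer


def psn_start(outgoing: Sequence[SendState]) -> int:
    """Window start: one past the slowest outgoing endpoint's acknowledged PSN."""
    if not outgoing:
        raise ValueError("at least one outgoing endpoint is required")
    return min(o.last_acked for o in outgoing) + 1


def should_ack(pkt_psn: int, outgoing: Sequence[SendState], window: int) -> bool:
    """ACK iff the packet's PSN lies in [psnStart, psnStart + window)."""
    start = psn_start(outgoing)
    return start <= pkt_psn < start + window


if __name__ == "__main__":
    # Two outgoing endpoints; the slower one has acknowledged up to PSN 4.
    outgoing = [SendState(last_acked=9), SendState(last_acked=4)]
    for psn in (4, 5, 8, 9):
        print(psn, should_ack(psn, outgoing, window=4))
```

Under this reading, the window only advances when the slowest outgoing endpoint has acknowledged data, so a buffer slot is not reused while any downstream peer may still need a retransmission.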
Figure 10
Figure 10: Comparison of Modes (outer is better).
… requires full message reception prior to processing, whereas Mode-II/III utilize packet-level (MTU) pipelining. Consequently, Mode-II/III reduces end-to-end latency by approximately (2H − 1)(M − 1)U/B, offering advantages unless the message size S is exceptionally large or the propagation delay L dominates. Logic Complexity. Considering the module organization, M…
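To get a feel for the approximate saving (2H − 1)(M − 1)U/B quoted in the Figure 10 text, the following Python sketch evaluates it. The excerpt does not define the symbols, so the sketch assumes H is the number of switch levels (hops) in the tree, M is the number of MTU-sized packets per message, U is the MTU in bytes, and B is the link bandwidth in bytes per second. The function name and the example values are hypothetical.

```python
def pipelining_latency_saving(h: int, m: int, u_bytes: float, b_bytes_per_s: float) -> float:
    """Approximate latency saving in seconds: (2H - 1)(M - 1)U / B.

    Symbol meanings are assumptions (see the lead-in), not definitions
    taken from the paper.
    """
    if h < 1 or m < 1:
        raise ValueError("h and m must be at least 1")
    if u_bytes <= 0 or b_bytes_per_s <= 0:
        raise ValueError("u_bytes and b_bytes_per_s must be positive")
    return (2 * h - 1) * (m - 1) * u_bytes / b_bytes_per_s


if __name__ == "__main__":
    # Arbitrary illustrative values: 2 levels, 256 packets of 4096 bytes,
    # and a 400 Gb/s link expressed as 50e9 bytes per second.
    saving = pipelining_latency_saving(h=2, m=256, u_bytes=4096, b_bytes_per_s=50e9)
    print(f"approximate saving: {saving * 1e6:.1f} microseconds")
```

With a single-packet message (M = 1) the expression is zero, which matches the intuition that packet-level pipelining only helps when a message spans more than one packet.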
Figure 13
Figure 13: [Testbed, Tree-2-8] Collective Algorithm
Figure 12
Figure 12: [Testbed, Tree-2-8] AllReduce Algorithm Throughput
… finish normally. EPIC-III has more complex switch workflow, so its throughput is lower than EPIC-II. It is hard to make a fair performance comparison (and not the goal of emulation) among EPIC-I vs EPIC-II/III vs MPI, because their software stack differs (details in Appendix §J). Evolvability. The development of EPIC-II and III follows the modular design.…
Figure 14
Figure 14: [Testbed, Tree-2-8] Bandwidth Complements between ReduceScatter and AllGather
… receive for both ring-based collectives, causing bandwidth contention. Conversely, EPIC breaks this limit by exploiting directional link independence. By orchestrating nodes to transmit for RS while receiving for AG (and vice versa), EPIC ensures non-overlapping directional flows. Consequently, EPIC achieves a theoretical thro…
Figure 15
Figure 15: [Packet Simulation, Tree-2-8] AllReduce Al
Figure 16
Figure 16: [Flow Simulation] Tail 15% JCT of Alibaba
Figure 17
Figure 17: EPIC in switch micro architecture
Modern high-performance switches typically utilize a pipelined architecture comprising an Ingress Pipeline, a Traffic Manager (TM), and an Egress Pipeline. The Ingress Pipeline performs packet parsing and forwarding decisions, while the TM serves as the data plane core, managing on-chip shared memory for buffering and scheduling. The Egress Pipeline completes the proces…
Figure 18
Figure 18: Transmission Efficiency
Switches operate in Store-and-Forward model but at different granularities across modes. Following the definitions in …
Figure 19
Figure 19: IncManager and policies
Figure 20
Figure 20: [Flow Simulation] JCT CDF of Trace 1 on 2048-GPU Fat-tree. Panels: (a) All Jobs, (b) Tail 15% Jobs. x-axis: Job Completion Time (s); y-axis: CDF. Series: Ring, EDT, Spatial Mux, Temporal Mux.
Figure 23
Figure 23: [Flow Simulation] 8-GPU’s JCT CDF of Trace
Figure 24
Figure 24: [Flow Simulation] 8-GPU’s JCT CDF of Trace
Original abstract

In-Network Collective (INC) acceleration holds immense potential for optimizing AI training and inference; however, its cross-layer nature has historically hindered investment and adoption within the open Ethernet ecosystem. To bridge this gap, we propose EPIC (Ethernet Polymorphic In-network Collective), an INC protocol specification and reference system built on the principle of "Unified Abstraction, Polymorphic Realization." EPIC introduces an abstraction compatible with standard Ethernet that aligns functional boundaries with participant roles, while offering polymorphic realizations tailored to varying hardware capabilities. We address three fundamental challenges: first, we employ a modular design that enables an evolutionary path from simple to complex implementations, allowing vendors to iterate their hardware incrementally; second, we apply formal verification methodologies to prove the correctness of all proposed polymorphic modes; and third, we develop a unified resource management model versatile enough for diverse INC scenarios. Extensive validation -- spanning model checking, packet/flow simulations, VM emulation, Tofino Testbed, and FPGA/RTL verification -- confirms EPIC's correctness, performance gain, and feasibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes EPIC (Ethernet Polymorphic In-network Collective), a protocol specification and reference system for in-network collective acceleration on Ethernet. It introduces a unified abstraction compatible with standard Ethernet that aligns functional boundaries with participant roles and supports polymorphic realizations for different hardware capabilities. The work addresses three challenges: modular design for evolutionary hardware implementation, formal verification of correctness for all modes, and a unified resource management model. Validation spans model checking, simulations, VM emulation, Tofino testbed, and FPGA/RTL verification, claiming correctness, performance gains, and feasibility.

Significance. If the central claims hold, particularly the compatibility with unmodified Ethernet and the polymorphic approach enabling incremental vendor adoption, this could encourage broader investment in and adoption of in-network collectives in the open Ethernet ecosystem for AI training and inference, lowering the historical barrier posed by INC's cross-layer nature.

major comments (1)
  1. [Abstract] Abstract: The abstract states that EPIC is 'compatible with standard Ethernet' and offers 'polymorphic realizations tailored to varying hardware capabilities', with validation including 'Tofino Testbed, and FPGA/RTL verification'. If even the basic modes require programmable switch features (as implied by the testbeds used for all modes), this contradicts the claim of compatibility with unmodified standard Ethernet hardware and the modular evolutionary path starting from simple implementations on existing vendor silicon. This is load-bearing for the central claim and requires clarification or evidence of a non-programmable basic mode.
minor comments (1)
  1. [Abstract] Abstract: The abstract mentions 'extensive validation' but does not specify quantitative performance gains or specific metrics used to confirm feasibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for identifying a point that is central to the paper's claims. We address the major comment below and commit to revisions that improve clarity without altering the core technical contributions.

point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract states that EPIC is 'compatible with standard Ethernet' and offers 'polymorphic realizations tailored to varying hardware capabilities', with validation including 'Tofino Testbed, and FPGA/RTL verification'. If even the basic modes require programmable switch features (as implied by the testbeds used for all modes), this contradicts the claim of compatibility with unmodified standard Ethernet hardware and the modular evolutionary path starting from simple implementations on existing vendor silicon. This is load-bearing for the central claim and requires clarification or evidence of a non-programmable basic mode.

    Authors: We appreciate the referee highlighting this important clarification need. The EPIC abstraction is defined to be compatible with standard Ethernet by using unmodified Ethernet frame formats, standard multicast groups, and conventional switch forwarding tables for the basic mode; no programmable data-plane features are required for correct operation or role alignment in this mode. Polymorphic realizations then layer optional in-network computation on top when programmable hardware (Tofino, FPGA) is present. The listed testbeds were selected to exercise the full spectrum of modes and to provide rigorous verification of the advanced realizations, but they do not imply that the basic mode depends on programmability. We agree that the manuscript would benefit from an explicit statement of per-mode hardware requirements and a short illustrative example of the basic mode on commodity silicon. We will therefore revise the abstract, add a clarifying paragraph in Section 3, and include a table summarizing hardware prerequisites for each polymorphic variant. These changes will be incorporated in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: design proposal rests on independent specification and validation

full rationale

The paper presents EPIC as a new protocol specification and reference system based on the principle of unified abstraction with polymorphic realizations. Claims about compatibility with standard Ethernet, modular evolutionary path, formal verification, and resource management are introduced as design choices rather than derived from fitted parameters or prior self-referential results. Validation spans model checking, simulations, emulation, Tofino testbed, and FPGA/RTL, providing external checks. No equations, self-citations, or reductions to inputs by construction appear in the abstract or described structure. The derivation chain is self-contained as a proposed architecture with stated assumptions that do not presuppose the target outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The proposal rests on the assumption that formal verification can cover all polymorphic modes and that the modular structure supports incremental hardware development without additional free parameters or invented entities beyond the protocol itself.

axioms (1)
  • [domain assumption] Formal verification methodologies can prove the correctness of all proposed polymorphic modes.
    Stated in the abstract as one of the three fundamental challenges addressed.
invented entities (1)
  • EPIC abstraction and polymorphic realizations (no independent evidence)
    purpose: To provide a unified, Ethernet-compatible interface for in-network collectives with hardware-specific implementations.
    Newly defined protocol elements introduced to solve the cross-layer adoption problem.

pith-pipeline@v0.9.0 · 5825 in / 1135 out tokens · 32653 ms · 2026-05-20T08:04:23.335894+00:00 · methodology



Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages · 2 internal anchors

  1. [1]

    AMD. 2024. RCCL. (2024). https://github.com/ROCm/rccl

  2. [2]

    Wei Bai, Shanim Sainul Abdeen, Ankit Agrawal, Krishan Kumar Attre, Paramvir Bahl, Ameya Bhagat, Gowri Bhaskara, Tanya Brokhman, Lei Cao, Ahmad Cheema, et al . 2023. Empowering azure storage with RDMA. In20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 49–67

  3. [3]

    Marcel Blöcher, Lin Wang, Patrick Eugster, and Max Schmidt. 2021. Switches for HIRE: Resource scheduling for data center in-network computing. InProceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 268–285

  4. [4]

    Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, et al. 2014. P4: Programming protocol-independent packet processors.ACM SIGCOMM Computer Communication Review44, 3 (2014), 87–95

  5. [5]

    Broadcom Inc. 2026. StrataXGS Tomahawk 5 Series: 51.2 Tb/s Ethernet Switch ASIC Family. https://www.broadcom.com/products/ethernet -connectivity/switching/strataxgs/bcm78920-series. (2026). Accessed: 2026-02-06

  6. [6]

    Jiamin Cao, Yu Guan, Kun Qian, Jiaqi Gao, Wencong Xiao, Jianbo Dong, Binzhang Fu, Dennis Cai, and Ennan Zhai. 2024. Crux: Gpu-efficient communication scheduling for deep learning training. InProceedings of the ACM SIGCOMM 2024 Conference. 1–15

  7. [7]

    Daniele De Sensi, Salvatore Di Girolamo, Saleh Ashkboos, Shigang Li, and Torsten Hoefler. 2021. Flare: Flexible in-network allreduce. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–16

  8. [8]

    Salvatore Di Girolamo, Andreas Kurth, Alexandru Calotoiu, Thomas Benz, Timo Schneider, Jakub Beránek, Luca Benini, and Torsten Hoefler

  9. [9]

    In2021 ACM/IEEE 48th Annual Interna- tional Symposium on Computer Architecture (ISCA)

    A RISC-V in-network accelerator for flexible high-performance low-power packet processing. In2021 ACM/IEEE 48th Annual Interna- tional Symposium on Computer Architecture (ISCA). IEEE, 958–971

  10. [10]

    Mihai Dobrescu, Norbert Egi, Katerina Argyraki, Byung-Gon Chun, Kevin Fall, Gianluca Iannaccone, Allan Knies, Maziar Manesh, and Sylvia Ratnasamy. 2009. RouteBricks: Exploiting parallelism to scale software routers. InProceedings of the ACM SIGOPS 22nd symposium on Operating systems principles. 15–28

  11. [11]

    Jianbo Dong, Bin Luo, Jun Zhang, Pengcheng Zhang, Fei Feng, Yikai Zhu, Ang Liu, Zian Chen, Yi Shi, Hairong Jiao, et al . 2024. Boost- ing large-scale parallel training efficiency with c4: A communication- driven approach.arXiv preprint arXiv:2406.04594(2024)

  12. [12]

    Jianbo Dong, Shaochuang Wang, Fei Feng, Zheng Cao, Heng Pan, Lingbo Tang, Pengcheng Li, Hao Li, Qianyuan Ran, Yiqun Guo, et al

  13. [13]

    ACCL: Architecting Highly Scalable Distributed Training Sys- tems with Highly Efficient Collective Communication Library.IEEE Micro41, 5 (2021), 85–92

  14. [14]

    Shichen Dong, Zhixiong Niu, Mingchao Zhang, Zhiying Xu, Chuntao Hu, Pengzhi Zhu, Qingchun Song, Lei Qu, Peng Cheng, Cam-Tu Nguyen, et al . 2025. Mina: Fine-Grained In-network Aggregation Resource Scheduling for Machine Learning Service. InIEEE INFOCOM 2025-IEEE Conference on Computer Communications. IEEE, 1–10

  15. [15]

    Cheng Tien Ee, Rodrigo Fonseca, Sukun Kim, Daekyeong Moon, Ar- salan Tavakoli, David E Culler, Scott Shenker, and Ion Stoica. 2006. A Modular Network Layer for Sensornets.. InOSDI, Vol. 6. 249–262

  16. [16]

    Jin Fang, Gongming Zhao, Hongli Xu, Changbo Wu, and Zhuolong Yu. 2023. GRID: Gradient routing with in-network aggregation for distributed training.IEEE/ACM Transactions on Networking31, 5 (2023), 2267–2280

  17. [17]

    Fagg, George Bosilca, Thara Angskun, Jack J

    Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. 2004. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. InPro- ceedings of the 11th...

  18. [18]

    Massimo Gallo and Rafael Laufer. 2018. ClickNF: a Modular Stack for Custom Network Functions. In2018 USENIX Annual Technical Conference (USENIX ATC 18). 745–757

  19. [19]

    Adithya Gangidi, Rui Miao, Shengbao Zheng, Sai Jayesh Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Mohammad Riftadi, Ashmitha Jeevaraj Shetty, Jingyi Yang, et al . 2024. Rdma over eth- ernet for distributed training at meta scale. InProceedings of the ACM SIGCOMM 2024 Conference. 57–70

  20. [20]

    Yixiao Gao, Qiang Li, Lingbo Tang, Yongqing Xi, Pengcheng Zhang, Wenwen Peng, Bo Li, Yaohui Wu, Shaozong Liu, Lei Yan, et al. 2021. When cloud storage meets RDMA. In18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21). 519–533

  21. [21]

    Nadeen Gebara, Manya Ghobadi, and Paolo Costa. 2021. In-network ag- gregation for shared machine learning clusters.Proceedings of Machine Learning and Systems3 (2021), 829–844

  22. [22]

    Richard L Graham, Devendar Bureddy, Pak Lui, Hal Rosenstock, Gilad Shainer, Gil Bloch, Dror Goldenerg, Mike Dubman, Sasha Kotchu- bievsky, Vladimir Koushnir, et al. 2016. Scalable hierarchical aggre- gation protocol (SHArP): A hardware architecture for efficient data reduction. In2016 First International Workshop on Communication Op- timizations in HPC (C...

  23. [23]

    Richard L Graham, Lion Levi, Devendar Burredy, Gil Bloch, Gilad Shainer, David Cho, George Elias, Daniel Klein, Joshua Ladd, Ophir Maor, et al. 2020. Scalable hierarchical aggregation and reduction protocol (sharp) tm streaming-aggregation hardware design and eval- uation. InInternational Conference on High Performance Computing. Springer, 41–59

  24. [24]

    Yongchao He, Wenfei Wu, Yanfang Le, Ming Liu, and ChonLam Lao

  25. [25]

    InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2

    A generic service to provide in-network aggregation for key- value streams. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 33–47

  26. [26]

    Mert Hidayetoglu, Simon Garcia de Gonzalo, Elliott Slaughter, Pinku Surana, Wen-mei Hwu, William Gropp, and Alex Aiken. 2024. HiCCL: A Hierarchical Collective Communication Library.arXiv preprint arXiv:2408.05962(2024)

  27. [27]

    Torsten Hoefler, Tommaso Bonato, Daniele De Sensi, Salvatore Di Giro- lamo, Shigang Li, Marco Heddes, Deepak Goel, Miguel Castro, and Steve Scott. 2024. Hammingmesh: A network topology for large-scale deep learning.Commun. ACM67, 12 (2024), 97–105

  28. [28]

    Torsten Hoefler, Mikhail Khalilov, Josiah Clark, Surendra Anubolu, Mohan Kalkunte, Karen Schramm, Eric Spada, Duncan Roweth, Keith Underwood, Adrian Caulfield, et al. 2025. In-Network Collective Op- erations: Game Changer or Challenge for AI Workloads?Computer 59, 1 (2025), 24–33

  29. [29]

    Torsten Hoefler, Andrew Lumsdaine, and Wolfgang Rehm. 2007. Im- plementation and performance analysis of non-blocking collective operations for MPI. InProceedings of the 2007 ACM/IEEE conference on Supercomputing. 1–10

  30. [30]

    Torsten Hoefler and Dmitry Moor. 2014. Energy, memory, and run- time tradeoffs for implementing collective communication operations. Supercomputing frontiers and innovations1, 2 (2014), 58–75

  31. [31]

    Chengyuan Huang, Yixiao Gao, Wei Chen, Duoxing Li, Yibo Xiao, Ruyi Zhang, Chen Tian, Xiaoliang Wang, Wanchun Dou, Guihai Chen, et al

  32. [32]

    In2023 IEEE 31st 15 Y

    MC-RDMA: Improving Replication Performance of RDMA-based Distributed Systems with Reliable Multicast Support. In2023 IEEE 31st 15 Y. Yuan, J. Nie, T. Bai, R. Zhou, S. Cao, X. Fan, et al. International Conference on Network Protocols (ICNP). IEEE, 1–11

  33. [33]

    Guyue Huang, Hao Li, Le Qin, Jiayi Huang, Yangwook Kang, Yufei Ding, and Yuan Xie. 2025. TRACI: Network Acceleration of Input- Dynamic Communication for Large-Scale Deep Learning Recommen- dation Model. InProceedings of the 52nd Annual International Sympo- sium on Computer Architecture. 1880–1893

  34. [34]

    InfiniBand Trade Association. [n. d.].Supplement to InfiniBand™Archi- tecture Specification Volume 1 Release 1.2.1: Annex A17: RoCEv2. Tech- nical Specification Supplement Release 1.2.1, Annex A17. InfiniBand Trade Association. Proprietary document; available via InfiniBand Trade Association membership

  35. [35]

    Intel. 2024. Intel Tofino 2. (2024). https://www.intel.com/content/ww w/us/en/products/details/network-io/intelligent-fabric-processors/ tofino-2.html

  36. [36]

    Intel. 2024. oneAPI Collective Communications Library (oneCCL). (2024). https://github.com/oneapi-src/oneCCL

  37. [37]

    Norm Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, et al. 2023. Tpu v4: An optically reconfigurable supercom- puter for machine learning with hardware support for embeddings. In Proceedings of the 50th annual international symposium on computer architecture. 1–14

  38. [38]

    Mikhail Khalilov, Salvatore Di Girolamo, Marcin Chrapek, Rami Nudelman, Gil Bloch, and Torsten Hoefler. 2024. Network-offloaded bandwidth-optimal broadcast and Allgather for distributed AI. InSC24: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–17

  39. [39]

    Heehoon Kim, Junyeol Ryu, and Jaejin Lee. 2024. TCCL: Discovering Better Communication Paths for PCIe GPU Clusters. InProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 999–1015

  40. [40]

    Benjamin Klenk, Nan Jiang, Greg Thorson, and Larry Dennison. 2020. An in-network architecture for accelerating shared-memory multi- processor collectives. In2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 996–1009

  41. [41]

    Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M Frans Kaashoek. 2000. The Click modular router.ACM Transactions on Computer Systems (TOCS)18, 3 (2000), 263–297

  42. [42]

    2002.Specifying systems: The TLA+ language and tools for hardware and software engineers

    Leslie Lamport. 2002.Specifying systems: The TLA+ language and tools for hardware and software engineers. Addison-Wesley

  43. [43]

    ChonLam Lao, Jiaqi Gao, Jiamin Cao, Zhipeng Zhang, Pengcheng Zhang, Jiangfei Duan, Minlan Yu, Aditya Akella, Zhilong Zheng, Yu Guan, Yichi Xu, Yong Li, Ennan Zhai, Dennis Cai, Zhengping Qian, and Jingren Zhou. 2026. Continuum: An Interruption-Resilient Runtime for ML Training. InOSDI

  44. [44]

    ChonLam Lao, Yanfang Le, Kshiteej Mahajan, Yixi Chen, Wenfei Wu, Aditya Akella, and Michael Swift. 2021. ATP: In-network aggregation for multi-tenant learning. In18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21). 741–761

  45. [45]

    Wenxue Li, Xiangzhou Liu, Yuxuan Li, Yilun Jin, Han Tian, Zhizhen Zhong, Guyue Liu, Ying Zhang, and Kai Chen. 2024. Understanding communication characteristics of distributed training. InProceedings of the 8th Asia-Pacific Workshop on Networking. 1–8

  46. [46]

    Wenxue Li, Xiangzhou Liu, Yunxuan Zhang, Zihao Wang, Wei Gu, Tao Qian, Gaoxiong Zeng, Shoushou Ren, Xinyang Huang, Zhenghang Ren, et al. 2025. Revisiting RDMA Reliability for Lossy Fabrics. In Proceedings of the ACM SIGCOMM 2025 Conference. 85–98

  47. [47]

    Wenxue Li, Junyi Zhang, Yufei Liu, Gaoxiong Zeng, Zilong Wang, Chaoliang Zeng, Pengpeng Zhou, Qiaoling Wang, and Kai Chen. 2024. Cepheus: accelerating datacenter applications with high-performance roce-capable multicast. In2024 IEEE International Symposium on High- Performance Computer Architecture (HPCA). IEEE, 908–921

  48. [48]

    Youjie Li, Iou-Jen Liu, Yifan Yuan, Deming Chen, Alexander Schwing, and Jian Huang. 2019. Accelerating distributed reinforcement learning with in-switch computing. InProceedings of the 46th International Symposium on Computer Architecture. 279–291

  49. [49]

    Zhaoyi Li, Jiawei Huang, Yijun Li, Aikun Xu, Shengwen Zhou, Jingling Liu, and Jianxin Wang. 2023. A2TP: Aggregator-aware in-network aggregation for multi-tenant learning. InProceedings of the Eighteenth European Conference on Computer Systems. 639–653

  50. [50]

    Zhaoyi Li, Jiawei Huang, Tao Zhang, Shengwen Zhou, Qile Wang, Yijun Li, Jingling Liu, Wanchun Jiang, and Jianxin Wang. 2023. PA-ATP: Progress-Aware Transmission Protocol for In-Network Aggregation. In2023 IEEE 31st International Conference on Network Protocols (ICNP). IEEE, 1–11

  51. [51]

    Linux man-pages project. [n. d.].rxe(7): Software RDMA over Ethernet (RoCE) driver. ([n. d.]). https://man7.org/linux/man-pages/man7/rxe. 7.html Documents Linux kernel modulerdma_rxe(Soft-RoCE/RXE)

  52. [52]

    Linux RDMA Community. 2024. libibverbs: Userspace InfiniBand Verbs Library. https://github.com/linux-rdma/rdma-core/tree/maste r/libibverbs. (2024). Part of rdma-core; provides the ibv_* API for RDMA device management, QP/CQ/MR operations, and data transfer

  53. [53]

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al

  54. [54]

    Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437 (2024)

  55. [55]

    Shuo Liu, Qiaoling Wang, Junyi Zhang, Wenfei Wu, Qinliang Lin, Yao Liu, Meng Xu, Marco Canini, Ray CC Cheung, and Jianfei He. 2023. In-network aggregation with transport transparency for distributed training. InProceedings of the 28th ACM International Conference on Ar- chitectural Support for Programming Languages and Operating Systems, Volume 3. 376–391

  56. [56]

    Qingkai Meng, Hao Zheng, Zhenhui Zhang, ChonLam Lao, Chengyuan Huang, Baojia Li, Ziyuan Zhu, Hao Lu, Weizhen Dang, Zitong Lin, et al

  57. [57]

    InProceedings of the ACM SIGCOMM 2025 Conference

    Astral: A datacenter infrastructure for large language model training at scale. InProceedings of the ACM SIGCOMM 2025 Conference. 609–625

  58. [58]

    Zili Meng, Jun Bi, Haiping Wang, Chen Sun, and Hongxin Hu. 2019. MicroNF: An efficient framework for enabling modularized service chains in NFV.IEEE Journal on Selected Areas in Communications37, 8 (2019), 1851–1865

  59. [59]

    Microsoft. 2023. MSCCL. (2023). https://github.com/microsoft/msccl

  60. [60]

    NVIDIA. 2024. NCCL. (2024). https://github.com/NVIDIA/nccl

  61. [61]

    NVIDIA Corporation. [n. d.]. NVIDIA NVLink High-Speed Interconnect: Application Performance. Whitepaper. NVIDIA Corporation

  62. [62]

    OMNeT++ Community. 2024. OMNeT++ Discrete Event Simulator. https://omnetpp.org. (2024). Version 6.2

  63. [63]

    OpenInfra Foundation. [n. d.]. OpenStack: Open source cloud computing infrastructure. ([n. d.]). https://www.openstack.org/

  64. [64]

    Aurojit Panda, Sangjin Han, Keon Jang, Melvin Walls, Sylvia Ratnasamy, and Scott Shenker. 2016. NetBricks: Taking the V out of NFV. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 203–216

  65. [65]

    Ben Pfaff, Justin Pettit, Teemu Koponen, Ethan J. Jackson, Andy Zhou, Jarno Rajahalme, Jesse Gross, Alex Wang, Jonathan Stringer, Pravin Shelar, Keith Amidon, and Martin Casado. 2015. The Design and Implementation of Open vSwitch. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’15). USENIX Association, 117–130. https://www....

  66. [66]

    Chenchen Qi, Wenfei Wu, Yongcan Wang, Keqiang He, Yu-Hsiang Kao, Zongying He, Chen-Yu Yen, Zhuo Jiang, Feng Luo, Surendra Anubolu, et al. 2025. SGLB: Scalable and Robust Global Load Balancing in Commodity AI Clusters. In Proceedings of the ACM SIGCOMM 2025 Conference. 626–644

  67. [67]

    Kun Qian, Yongqing Xi, Jiamin Cao, Jiaqi Gao, Yichi Xu, Yu Guan, Binzhang Fu, Xuemei Shi, Fangbo Zhu, Rui Miao, et al. 2024. Alibaba HPN: A data center network for large language model training. In Proceedings of the ACM SIGCOMM 2024 Conference. 691–706

  68. [68]

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–16

  70. [70]

    Ori Rottenstreich and Jose Yallouz. 2024. Edge-disjoint tree allocation for multi-tenant cloud security in datacenter topologies. IEEE/ACM Transactions on Networking 32, 4 (2024), 2858–2874

  71. [71]

    Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan Ports, and Peter Richtárik. 2021. Scaling distributed machine learning with In-Network aggregation. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21). 785–808

  72. [72]

    Raz Segal, Chen Avin, and Gabriel Scalosub. 2021. SOAR: Minimizing network utilization with bounded in-network computing. In Proceedings of the 17th International Conference on emerging Networking EXperiments and Technologies. 16–29

  73. [73]

    Raz Segal, Chen Avin, and Gabriel Scalosub. 2022. Constrained in-network computing with low congestion in datacenter networks. In IEEE INFOCOM 2022-IEEE Conference on Computer Communications. IEEE, 1639–1648

  74. [74]

    Aashaka Shah, Vijay Chidambaram, Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi, and Rachee Singh. 2023. TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 593–612

  75. [75]

    The Tcpdump Group. [n. d.]. libpcap: Portable packet capture library. ([n. d.]). https://www.tcpdump.org/

  76. [76]

    UAlink Consortium. 2024. UAlink Consortium. Online Consortium Website. (2024). https://ualinkconsortium.org/

  77. [77]

    Ultra Ethernet Consortium. 2024. Ultra Ethernet Specification Update. Ultra Ethernet Consortium Blog. (29 August 2024). https://ultraethernet.org/ultra-ethernet-specification-update/ Accessed: 2026-02-06

  78. [78]

    Xinchen Wan, Luyang Li, Han Tian, Xudong Liao, Xinyang Huang, Chaoliang Zeng, Zilong Wang, Xinyu Yang, Ke Cheng, Qingsong Ning, et al. 2025. A Generic and Efficient Communication Framework for Message-level In-Network Computing. In IEEE INFOCOM 2025-IEEE Conference on Computer Communications. IEEE, 1–10

  79. [79]

    Ruiqi Wang, Dezun Dong, Fei Lei, Junchao Ma, Ke Wu, and Kai Lu. 2023. Roar: A router microarchitecture for in-network allreduce. In Proceedings of the 37th International Conference on Supercomputing. 423–436

Showing first 80 references.