Switching Efficiency: A Novel Framework for Dissecting AI Data Center Network Efficiency
Pith reviewed 2026-05-10 10:25 UTC · model grok-4.3
The pith
Switching Efficiency Framework quantifies effective data throughput per unit switching capacity to identify bottlenecks in AI data center networks for LLM training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Switching Efficiency Framework introduces the core metric η, which quantifies computationally effective data throughput per unit switching capacity. It further decomposes η into three factors—Data, Routing Efficiency, and Port Utilization—to isolate distinct communication bottlenecks. Application of the framework shows that the symmetric distributed switching of 3D-Torus and the centralized hierarchical switching of Rail-Optimized architectures align with sparse or imbalanced LLM training traffic, whereas All-to-All traffic from Mixture-of-Experts models severely degrades port utilization and routing efficiency. The analysis further demonstrates that design choices such as adjusting switching resource allocation, expanding server size, adopting in-network computing, and using multi-plane designs each improve distinct facets of communication efficiency.
What carries the argument
Switching Efficiency metric η, defined as computationally effective data throughput per unit switching capacity and decomposed into Data, Routing Efficiency, and Port Utilization factors that isolate distinct communication bottlenecks.
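The abstract defines η only in words. As a minimal sketch, assuming the three factors compose multiplicatively (the paper's exact definitions may differ), one plausible algebraic form is:

```latex
% Hedged sketch, not the paper's stated equations: one plausible telescoping
% factorization of Switching Efficiency.
%   C          aggregate switching capacity of the fabric (bits/s)
%   T_wire     traffic actually carried on switch ports (bits/s)
%   T_routed   carried traffic that follows efficient (intended, minimal-hop) routes
%   T_eff      computationally effective data (gradients, activations) delivered
\eta = \frac{T_{\mathrm{eff}}}{C}
     = \underbrace{\frac{T_{\mathrm{eff}}}{T_{\mathrm{routed}}}}_{\text{Data}}
       \cdot
       \underbrace{\frac{T_{\mathrm{routed}}}{T_{\mathrm{wire}}}}_{\text{Routing Efficiency}}
       \cdot
       \underbrace{\frac{T_{\mathrm{wire}}}{C}}_{\text{Port Utilization}}
```

Under this reading each factor is a ratio in [0, 1], so a drop in any single factor caps η from above, which is what allows a bottleneck to be attributed to one factor at a time.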
If this is right
- 3D-Torus symmetric distributed switching aligns with sparse LLM training traffic patterns.
- Rail-Optimized centralized hierarchical switching suits imbalanced traffic patterns.
- All-to-All traffic from Mixture-of-Experts models reduces port utilization and routing efficiency.
- Adjusting switching resource allocation, expanding server size, adopting in-network computing, and using multi-plane designs each improve specific efficiency factors.
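If these claims hold, each intervention should move mainly one factor. A toy illustration under the multiplicative reading above; every baseline factor value and gain multiplier below is hypothetical, not taken from the paper:

```python
# Toy illustration only: assumes eta = data * routing * ports and uses
# made-up baseline factor values and improvement multipliers.
BASELINE = {"data": 0.90, "routing": 0.60, "ports": 0.50}

# Hypothetical mapping from each design choice to the factor it mainly improves.
DESIGN_CHOICES = {
    "adjust switching resource allocation": ("ports", 1.30),
    "expand server (scale-up) size":        ("data", 1.05),
    "in-network computing":                 ("data", 1.20),
    "multi-plane design":                   ("routing", 1.25),
}

def eta(f):
    """Switching efficiency under the assumed multiplicative decomposition."""
    return f["data"] * f["routing"] * f["ports"]

def apply_choice(f, choice):
    """Apply one design choice by scaling its target factor, capped at 1.0."""
    factor, gain = DESIGN_CHOICES[choice]
    out = dict(f)
    out[factor] = min(1.0, out[factor] * gain)
    return out

if __name__ == "__main__":
    base = eta(BASELINE)
    for choice in DESIGN_CHOICES:
        print(f"{choice:40s} eta {base:.3f} -> {eta(apply_choice(BASELINE, choice)):.3f}")
```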
Where Pith is reading between the lines
- The framework could be applied during early simulation stages to compare proposed network topologies before physical construction.
- Training software could incorporate monitoring of the three factors to dynamically adjust job placement and reduce specific bottlenecks.
- The same decomposition might apply to other collective communication patterns beyond training, such as distributed inference serving.
- Collecting traces from production AI clusters would allow direct calibration of how closely the three factors predict observed slowdowns.
Load-bearing premise
The decomposition of switching efficiency into Data, Routing Efficiency, and Port Utilization factors accurately isolates independent communication bottlenecks that match real LLM training traffic patterns.
What would settle it
Measurements from actual LLM training runs on different network architectures: if gains predicted from improving one factor fail to appear in measured effective throughput, the decomposition does not capture the real bottlenecks.
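A minimal sketch of that check, assuming the multiplicative reading of η and taking the factor estimates and measured speedup as inputs (all values here are hypothetical):

```python
def decomposition_consistent(before, after, measured_speedup, tol=0.10):
    """Compare the speedup predicted from the factor decomposition with measurement.

    before / after: dicts with 'data', 'routing', 'ports' values in [0, 1],
    estimated for the cluster before and after a single design change.
    measured_speedup: ratio of measured effective training throughput (after / before).
    Returns False when the prediction misses by more than `tol` relative error,
    i.e. the decomposition fails to capture the real bottleneck.
    """
    eta = lambda f: f["data"] * f["routing"] * f["ports"]
    predicted = eta(after) / eta(before)
    return abs(predicted - measured_speedup) / predicted <= tol
```

For example, a multi-plane upgrade predicted to lift routing efficiency by 25% but yielding no measured speedup would fail this check.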
read the original abstract
Communication is pivotal in LLM training, and a thorough analysis of the communication efficiency of AI data center (AIDC) network is essential for guiding the design of these capital-intensive clusters. However, conventional metrics are inadequate for such analysis, as they do not directly link network activity to computational progress and lack granularity to diagnose the impact of different network design patterns. To address this, we introduce a metric framework, the Switching Efficiency Framework, whose core metric - Switching Efficiency ($\eta$) - quantifies computationally effective data throughput per unit switching capacity. We further decompose $\eta$ into three factors - Data, Routing Efficiency, and Port Utilization to facilitate analysis of distinct communication bottlenecks. Using this metric framework, we demonstrate how the symmetric, distributed switching of 3D-Torus and the centralized, hierarchical switching of Rail-Optimized architecture align with sparse or imbalanced LLM training traffic, and show that All-to-All traffic from Mixture-of-Experts models severely degrades their port utilization and routing efficiency. Our analysis also demonstrates how key design choices - such as adjusting switching resource allocation, expanding server size, adopting in-network computing, and multi-plane design - positively influence distinct facets of communication efficiency. Ultimately, the Switching Efficiency Framework provides an analytical tool for analyzing efficiency bottlenecks, thereby informing the design of future-generation AIDC networks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Switching Efficiency Framework for AI data center (AIDC) networks. Its core metric η quantifies computationally effective data throughput per unit switching capacity and is decomposed into three factors (Data, Routing Efficiency, and Port Utilization) to diagnose communication bottlenecks. The paper claims this framework shows how symmetric 3D-Torus and hierarchical Rail-Optimized architectures align with sparse/imbalanced LLM training traffic, how MoE All-to-All traffic degrades port utilization and routing efficiency, and how design choices (switching allocation, server scaling, in-network compute, multi-plane) positively affect distinct efficiency facets, ultimately providing an analytical tool to inform future AIDC network design.
Significance. If the decomposition isolates distinct, measurable bottlenecks that align with real LLM collective patterns without additional calibration, the framework could address gaps in conventional metrics by directly linking network activity to computational progress and guiding architecture choices such as torus vs. rail-optimized or in-network compute. No machine-checked proofs, reproducible code, or falsifiable predictions are present to strengthen the assessment.
major comments (2)
- [Abstract and decomposition of η] Abstract and decomposition section: the central claim that the three-factor decomposition of η accurately isolates distinct communication bottlenecks and aligns with real LLM training traffic (sparse/imbalanced vs. MoE All-to-All) lacks any derivations, packet-trace validation, error analysis, or empirical calibration; without this, demonstrations of architectural alignment or degradation remain interpretive rather than predictive, directly undermining the utility as a design-guiding analytical tool.
- [Analysis of design choices] Claims on design choices (e.g., multi-plane, server-size scaling, in-network compute): these are asserted to positively influence distinct facets of η, but no quantitative results, sensitivity analysis, or comparison against baselines are supplied to show the factors are orthogonal and load-bearing for the efficiency conclusions.
minor comments (2)
- [Introduction] Notation for η and its factors should be defined with explicit equations early in the manuscript to remove ambiguity about how 'computationally effective data throughput' is measured and normalized by switching capacity.
- [Conclusion] The manuscript would benefit from a clear statement of assumptions (e.g., traffic model parameters) and limitations of the framework to set reader expectations.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments correctly identify areas where additional rigor would strengthen the presentation of the Switching Efficiency Framework. We address each major comment below and outline the revisions we will incorporate.
read point-by-point responses
- Referee: [Abstract and decomposition of η] Abstract and decomposition section: the central claim that the three-factor decomposition of η accurately isolates distinct communication bottlenecks and aligns with real LLM training traffic (sparse/imbalanced vs. MoE All-to-All) lacks any derivations, packet-trace validation, error analysis, or empirical calibration; without this, demonstrations of architectural alignment or degradation remain interpretive rather than predictive, directly undermining the utility as a design-guiding analytical tool.
  Authors: We agree that the current manuscript presents the decomposition at a conceptual level without explicit step-by-step derivations or new empirical calibration. The three factors follow directly from rewriting η = (computationally effective throughput) / (switching capacity) by separating the numerator into data volume transferred, the fraction of traffic that follows efficient routes, and the fraction of ports actively carrying useful traffic. In the revised version we will add a dedicated subsection that derives each factor algebraically from the definition of η and maps them to observable quantities (e.g., bytes of model gradients versus total bytes on the wire). The alignment claims rest on standard traffic patterns reported in the LLM-training literature rather than new packet traces; we will cite those sources explicitly and note that the framework is intended as an analytical lens that can be validated against traces in follow-on work. This revision will make the demonstrations less purely interpretive while preserving the paper’s scope as a framework introduction. revision: partial
- Referee: [Analysis of design choices] Claims on design choices (e.g., multi-plane, server-size scaling, in-network compute): these are asserted to positively influence distinct facets of η, but no quantitative results, sensitivity analysis, or comparison against baselines are supplied to show the factors are orthogonal and load-bearing for the efficiency conclusions.
  Authors: The design-choice analysis is currently qualitative, showing directional effects on individual factors (for example, multi-plane topologies increase the routing-efficiency term by providing additional low-diameter paths). We accept that quantitative support and explicit checks for orthogonality would improve the claims. In revision we will add a short analytical section containing simplified closed-form expressions and numerical sensitivity examples that quantify the impact of each design choice on its target factor while holding the others constant. We will also compare the resulting η values against a single-plane baseline under the same traffic matrices. These additions will demonstrate that the factors are separable in the model and that the conclusions rest on the decomposition rather than on unexamined interactions. revision: partial
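As an editorial aside, a sketch of the kind of one-at-a-time sensitivity check the authors promise, again under the assumed multiplicative form of η and with purely illustrative single-plane and multi-plane factor values:

```python
def eta(data, routing, ports):
    # Assumed multiplicative decomposition of Switching Efficiency (not the paper's equations).
    return data * routing * ports

# Hypothetical factor values under the same traffic matrix; only the routing
# term is intended to differ between the single-plane baseline and the
# multi-plane variant.
single_plane = {"data": 0.90, "routing": 0.55, "ports": 0.60}
multi_plane  = {"data": 0.90, "routing": 0.75, "ports": 0.60}

# One-at-a-time sensitivity: perturb each factor of the baseline by +10%
# while holding the others fixed, and report the resulting change in eta.
for name in ("data", "routing", "ports"):
    perturbed = dict(single_plane)
    perturbed[name] = min(1.0, perturbed[name] * 1.10)
    print(f"+10% {name:8s}: eta {eta(**single_plane):.3f} -> {eta(**perturbed):.3f}")

print(f"single-plane eta = {eta(**single_plane):.3f}, multi-plane eta = {eta(**multi_plane):.3f}")
```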
Circularity Check
Switching Efficiency Framework introduced as novel definition with no circular derivation steps
full rationale
The paper presents the Switching Efficiency metric η and its decomposition into Data, Routing Efficiency, and Port Utilization as an original definitional framework for analyzing AIDC networks. No load-bearing equations, predictions, or uniqueness claims reduce to fitted inputs, self-citations, or prior ansatzes by construction. The abstract and described structure treat the decomposition as an algebraic partitioning introduced to diagnose bottlenecks, with all demonstrations following from this new definition rather than circularly presupposing the target results. This is a self-contained definitional contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Conventional metrics are inadequate because they do not directly link network activity to computational progress and lack granularity for different design patterns.
invented entities (1)
- Switching Efficiency (η): no independent evidence