Switching Efficiency: A Novel Framework for Dissecting AI Data Center Network Efficiency
Pith reviewed 2026-05-10 10:25 UTC · model grok-4.3
The pith
Switching Efficiency Framework quantifies effective data throughput per unit switching capacity to identify bottlenecks in AI data center networks for LLM training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Switching Efficiency Framework introduces the core metric η, which quantifies computationally effective data throughput per unit switching capacity. It further decomposes η into three factors—Data, Routing Efficiency, and Port Utilization—to isolate distinct communication bottlenecks. Application of the framework shows that the symmetric distributed switching of 3D-Torus and the centralized hierarchical switching of Rail-Optimized architectures align with sparse or imbalanced LLM training traffic, whereas All-to-All traffic from Mixture-of-Experts models severely degrades port utilization and routing efficiency. The analysis further demonstrates that design choices such as adjusting switching resource allocation, expanding server size, adopting in-network computing, and using multi-plane designs each improve distinct facets of communication efficiency.
What carries the argument
Switching Efficiency metric η, defined as computationally effective data throughput per unit switching capacity and decomposed into Data, Routing Efficiency, and Port Utilization factors that isolate distinct communication bottlenecks.
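The abstract defines η only in words. As a minimal sketch, assuming the three factors compose multiplicatively (the paper's exact definitions may differ), one plausible algebraic form is:

```latex
% Hedged sketch, not the paper's stated equations: one plausible telescoping
% factorization of Switching Efficiency.
%   C          aggregate switching capacity of the fabric (bits/s)
%   T_wire     traffic actually carried on switch ports (bits/s)
%   T_routed   carried traffic that follows efficient (intended, minimal-hop) routes
%   T_eff      computationally effective data (gradients, activations) delivered
\eta = \frac{T_{\mathrm{eff}}}{C}
     = \underbrace{\frac{T_{\mathrm{eff}}}{T_{\mathrm{routed}}}}_{\text{Data}}
       \cdot
       \underbrace{\frac{T_{\mathrm{routed}}}{T_{\mathrm{wire}}}}_{\text{Routing Efficiency}}
       \cdot
       \underbrace{\frac{T_{\mathrm{wire}}}{C}}_{\text{Port Utilization}}
```

Under this reading each factor is a ratio in [0, 1], so a drop in any single factor caps η from above, which is what allows a bottleneck to be attributed to one factor at a time.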
If this is right
- 3D-Torus symmetric distributed switching aligns with sparse LLM training traffic patterns.
- Rail-Optimized centralized hierarchical switching suits imbalanced traffic patterns.
- All-to-All traffic from Mixture-of-Experts models reduces port utilization and routing efficiency.
- Adjusting switching resource allocation, expanding server size, adopting in-network computing, and using multi-plane designs each improve specific efficiency factors.
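If these claims hold, each intervention should move mainly one factor. A toy illustration under the multiplicative reading above; every baseline factor value and gain multiplier below is hypothetical, not taken from the paper:

```python
# Toy illustration only: assumes eta = data * routing * ports and uses
# made-up baseline factor values and improvement multipliers.
BASELINE = {"data": 0.90, "routing": 0.60, "ports": 0.50}

# Hypothetical mapping from each design choice to the factor it mainly improves.
DESIGN_CHOICES = {
    "adjust switching resource allocation": ("ports", 1.30),
    "expand server (scale-up) size":        ("data", 1.05),
    "in-network computing":                 ("data", 1.20),
    "multi-plane design":                   ("routing", 1.25),
}

def eta(f):
    """Switching efficiency under the assumed multiplicative decomposition."""
    return f["data"] * f["routing"] * f["ports"]

def apply_choice(f, choice):
    """Apply one design choice by scaling its target factor, capped at 1.0."""
    factor, gain = DESIGN_CHOICES[choice]
    out = dict(f)
    out[factor] = min(1.0, out[factor] * gain)
    return out

if __name__ == "__main__":
    base = eta(BASELINE)
    for choice in DESIGN_CHOICES:
        print(f"{choice:40s} eta {base:.3f} -> {eta(apply_choice(BASELINE, choice)):.3f}")
```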
Where Pith is reading between the lines
- The framework could be applied during early simulation stages to compare proposed network topologies before physical construction.
- Training software could incorporate monitoring of the three factors to dynamically adjust job placement and reduce specific bottlenecks.
- The same decomposition might apply to other collective communication patterns beyond training, such as distributed inference serving.
- Collecting traces from production AI clusters would allow direct calibration of how closely the three factors predict observed slowdowns.
Load-bearing premise
The decomposition of switching efficiency into Data, Routing Efficiency, and Port Utilization factors accurately isolates independent communication bottlenecks that match real LLM training traffic patterns.
What would settle it
Measurements from actual LLM training runs on different network architectures: if gains predicted from improving one factor fail to appear in measured effective throughput, the decomposition does not capture the real bottlenecks.
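A minimal sketch of that check, assuming the multiplicative reading of η and taking the factor estimates and measured speedup as inputs (all values here are hypothetical):

```python
def decomposition_consistent(before, after, measured_speedup, tol=0.10):
    """Compare the speedup predicted from the factor decomposition with measurement.

    before / after: dicts with 'data', 'routing', 'ports' values in [0, 1],
    estimated for the cluster before and after a single design change.
    measured_speedup: ratio of measured effective training throughput (after / before).
    Returns False when the prediction misses by more than `tol` relative error,
    i.e. the decomposition fails to capture the real bottleneck.
    """
    eta = lambda f: f["data"] * f["routing"] * f["ports"]
    predicted = eta(after) / eta(before)
    return abs(predicted - measured_speedup) / predicted <= tol
```

For example, a multi-plane upgrade predicted to lift routing efficiency by 25% but yielding no measured speedup would fail this check.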
read the original abstract
Communication is pivotal in LLM training, and a thorough analysis of the communication efficiency of AI data center (AIDC) network is essential for guiding the design of these capital-intensive clusters. However, conventional metrics are inadequate for such analysis, as they do not directly link network activity to computational progress and lack granularity to diagnose the impact of different network design patterns. To address this, we introduce a metric framework, the Switching Efficiency Framework, whose core metric - Switching Efficiency ($\eta$) - quantifies computationally effective data throughput per unit switching capacity. We further decompose $\eta$ into three factors - Data, Routing Efficiency, and Port Utilization to facilitate analysis of distinct communication bottlenecks. Using this metric framework, we demonstrate how the symmetric, distributed switching of 3D-Torus and the centralized, hierarchical switching of Rail-Optimized architecture align with sparse or imbalanced LLM training traffic, and show that All-to-All traffic from Mixture-of-Experts models severely degrades their port utilization and routing efficiency. Our analysis also demonstrates how key design choices - such as adjusting switching resource allocation, expanding server size, adopting in-network computing, and multi-plane design - positively influence distinct facets of communication efficiency. Ultimately, the Switching Efficiency Framework provides an analytical tool for analyzing efficiency bottlenecks, thereby informing the design of future-generation AIDC networks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Switching Efficiency Framework for AI data center (AIDC) networks. Its core metric η quantifies computationally effective data throughput per unit switching capacity and is decomposed into three factors (Data, Routing Efficiency, and Port Utilization) to diagnose communication bottlenecks. The paper claims this framework shows how symmetric 3D-Torus and hierarchical Rail-Optimized architectures align with sparse/imbalanced LLM training traffic, how MoE All-to-All traffic degrades port utilization and routing efficiency, and how design choices (switching allocation, server scaling, in-network compute, multi-plane) positively affect distinct efficiency facets, ultimately providing an analytical tool to inform future AIDC network design.
Significance. If the decomposition isolates distinct, measurable bottlenecks that align with real LLM collective patterns without additional calibration, the framework could address gaps in conventional metrics by directly linking network activity to computational progress and guiding architecture choices such as torus vs. rail-optimized or in-network compute. No machine-checked proofs, reproducible code, or falsifiable predictions are present to strengthen the assessment.
major comments (2)
- [Abstract and decomposition of η] Abstract and decomposition section: the central claim that the three-factor decomposition of η accurately isolates distinct communication bottlenecks and aligns with real LLM training traffic (sparse/imbalanced vs. MoE All-to-All) lacks any derivations, packet-trace validation, error analysis, or empirical calibration; without this, demonstrations of architectural alignment or degradation remain interpretive rather than predictive, directly undermining the utility as a design-guiding analytical tool.
- [Analysis of design choices] Claims on design choices (e.g., multi-plane, server-size scaling, in-network compute): these are asserted to positively influence distinct facets of η, but no quantitative results, sensitivity analysis, or comparison against baselines are supplied to show the factors are orthogonal and load-bearing for the efficiency conclusions.
minor comments (2)
- [Introduction] Notation for η and its factors should be defined with explicit equations early in the manuscript to remove ambiguity about how 'computationally effective data throughput' is measured and normalized by switching capacity.
- [Conclusion] The manuscript would benefit from a clear statement of assumptions (e.g., traffic model parameters) and limitations of the framework to set reader expectations.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments correctly identify areas where additional rigor would strengthen the presentation of the Switching Efficiency Framework. We address each major comment below and outline the revisions we will incorporate.
read point-by-point responses
- Referee: [Abstract and decomposition of η] Abstract and decomposition section: the central claim that the three-factor decomposition of η accurately isolates distinct communication bottlenecks and aligns with real LLM training traffic (sparse/imbalanced vs. MoE All-to-All) lacks any derivations, packet-trace validation, error analysis, or empirical calibration; without this, demonstrations of architectural alignment or degradation remain interpretive rather than predictive, directly undermining the utility as a design-guiding analytical tool.
  Authors: We agree that the current manuscript presents the decomposition at a conceptual level without explicit step-by-step derivations or new empirical calibration. The three factors follow directly from rewriting η = (computationally effective throughput) / (switching capacity) by separating the numerator into data volume transferred, the fraction of traffic that follows efficient routes, and the fraction of ports actively carrying useful traffic. In the revised version we will add a dedicated subsection that derives each factor algebraically from the definition of η and maps them to observable quantities (e.g., bytes of model gradients versus total bytes on the wire). The alignment claims rest on standard traffic patterns reported in the LLM-training literature rather than new packet traces; we will cite those sources explicitly and note that the framework is intended as an analytical lens that can be validated against traces in follow-on work. This revision will make the demonstrations less purely interpretive while preserving the paper’s scope as a framework introduction. revision: partial
- Referee: [Analysis of design choices] Claims on design choices (e.g., multi-plane, server-size scaling, in-network compute): these are asserted to positively influence distinct facets of η, but no quantitative results, sensitivity analysis, or comparison against baselines are supplied to show the factors are orthogonal and load-bearing for the efficiency conclusions.
  Authors: The design-choice analysis is currently qualitative, showing directional effects on individual factors (for example, multi-plane topologies increase the routing-efficiency term by providing additional low-diameter paths). We accept that quantitative support and explicit checks for orthogonality would improve the claims. In revision we will add a short analytical section containing simplified closed-form expressions and numerical sensitivity examples that quantify the impact of each design choice on its target factor while holding the others constant. We will also compare the resulting η values against a single-plane baseline under the same traffic matrices. These additions will demonstrate that the factors are separable in the model and that the conclusions rest on the decomposition rather than on unexamined interactions. revision: partial
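As an editorial aside, a sketch of the kind of one-at-a-time sensitivity check the authors promise, again under the assumed multiplicative form of η and with purely illustrative single-plane and multi-plane factor values:

```python
def eta(data, routing, ports):
    # Assumed multiplicative decomposition of Switching Efficiency (not the paper's equations).
    return data * routing * ports

# Hypothetical factor values under the same traffic matrix; only the routing
# term is intended to differ between the single-plane baseline and the
# multi-plane variant.
single_plane = {"data": 0.90, "routing": 0.55, "ports": 0.60}
multi_plane  = {"data": 0.90, "routing": 0.75, "ports": 0.60}

# One-at-a-time sensitivity: perturb each factor of the baseline by +10%
# while holding the others fixed, and report the resulting change in eta.
for name in ("data", "routing", "ports"):
    perturbed = dict(single_plane)
    perturbed[name] = min(1.0, perturbed[name] * 1.10)
    print(f"+10% {name:8s}: eta {eta(**single_plane):.3f} -> {eta(**perturbed):.3f}")

print(f"single-plane eta = {eta(**single_plane):.3f}, multi-plane eta = {eta(**multi_plane):.3f}")
```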
Circularity Check
Switching Efficiency Framework introduced as novel definition with no circular derivation steps
full rationale
The paper presents the Switching Efficiency metric η and its decomposition into Data, Routing Efficiency, and Port Utilization as an original definitional framework for analyzing AIDC networks. No load-bearing equations, predictions, or uniqueness claims reduce to fitted inputs, self-citations, or prior ansatzes by construction. The abstract and described structure treat the decomposition as an algebraic partitioning introduced to diagnose bottlenecks, with all demonstrations following from this new definition rather than circularly presupposing the target results. This is a self-contained definitional contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Conventional metrics are inadequate because they do not directly link network activity to computational progress and lack granularity for different design patterns.
invented entities (1)
- Switching Efficiency (η): no independent evidence