pith · machine review for the scientific record

arxiv: 2604.02473 · v1 · submitted 2026-04-02 · 💻 cs.DC · cs.AR

Recognition: no theorem link

Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods


Pith reviewed 2026-05-13 20:34 UTC · model grok-4.3

classification 💻 cs.DC · cs.AR
keywords reverse address translation · TLB misses · multi-GPU · collective communication · scale-up fabrics · NVLink · all-to-all

The pith

Cold TLB misses in reverse address translation slow small collectives by up to 1.4x in multi-GPU scale-up systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies the latency cost of translating network physical addresses to system physical addresses at the receiving GPU in fabrics such as NVLink. It shows that this reverse translation step, handled by Link MMUs and TLBs, is dominated by cold misses when collectives are small and latency-sensitive. Larger collectives warm the caches quickly, so extra TLB capacity brings little further gain. The work uses cycle-accurate simulation to quantify the effect and outlines two software approaches to hide the remaining translation cost.

Core claim

Reverse address translation at the destination side of scale-up links is performed by Link MMUs and Link TLBs; cold misses in these TLBs account for the bulk of added latency on small all-to-all collectives and produce up to 1.4x slowdown, while larger transfers see diminishing returns from bigger TLBs once the working set is cached.
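
As a sanity check on the shape of that claim, here is a back-of-envelope sketch with invented component latencies (chosen so the small case reproduces the headline 1.4x; the paper does not report these per-component numbers). Cold misses add a fixed page-walk cost per first-touch page, which is large relative to a handful of requests and negligible across thousands.

```python
# Illustrative arithmetic only: round_trip_ns and walk_ns are assumed values,
# picked so the small case lands on the paper's headline 1.4x; neither number
# is reported by the paper.
round_trip_ns = 1_000   # assumed base round trip per request
walk_ns = 400           # assumed page-walk penalty per cold Link TLB miss

for pages_touched, requests in [(8, 8), (8, 8_192)]:
    cold_misses = min(pages_touched, requests)   # only first touches miss
    base = requests * round_trip_ns
    slowdown = (base + cold_misses * walk_ns) / base
    print(f"{requests:>5} requests over {pages_touched} pages -> {slowdown:.2f}x")
```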

What carries the argument

Link TLB, the destination-side cache that holds translations from Network Physical Addresses to System Physical Addresses and whose miss penalty dominates small-collective latency.
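
To make the mechanism concrete, here is a minimal toy model of such a cache, assuming a fully associative LRU design; the entry count, page size, and hit/miss latencies are placeholders rather than values from the paper:

```python
from collections import OrderedDict

class LinkTLB:
    """Toy destination-side Link TLB caching NPA-page -> SPA-page translations.
    Fully associative with LRU replacement; all sizes and latencies here are
    placeholder assumptions, not parameters taken from the paper."""

    def __init__(self, entries=64, page_bits=16, hit_ns=5, miss_ns=400):
        self.entries, self.page_bits = entries, page_bits
        self.hit_ns, self.miss_ns = hit_ns, miss_ns
        self.cache = OrderedDict()            # NPA page number -> SPA page number
        self.hits = self.misses = 0

    def translate(self, npa, page_table):
        """Return (SPA, latency_ns) for a network physical address."""
        page = npa >> self.page_bits
        if page in self.cache:
            self.cache.move_to_end(page)      # refresh LRU position
            self.hits, latency = self.hits + 1, self.hit_ns
        else:                                 # cold or capacity miss: walk the table
            self.cache[page] = page_table[page]
            if len(self.cache) > self.entries:
                self.cache.popitem(last=False)
            self.misses, latency = self.misses + 1, self.miss_ns
        offset_mask = (1 << self.page_bits) - 1
        return (self.cache[page] << self.page_bits) | (npa & offset_mask), latency

tlb = LinkTLB()
spa, ns = tlb.translate(0x12345678, page_table={0x1234: 0x9ABC})  # cold miss: 400 ns
```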

If this is right

  • Small, latency-critical collectives experience the largest slowdown from reverse translation.
  • Once the TLB working set fits in the cache, further increases in TLB size yield only marginal returns.
  • Fused pre-translation kernels that overlap address translation with computation can hide most of the overhead.
  • Software-guided TLB prefetching can proactively load likely entries before the network request arrives (both mitigations are sketched below).
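
Neither mitigation is evaluated in the paper. The sketch below only illustrates the intent of each against the toy LinkTLB above; the function names are hypothetical, and the overlap is serialized where real kernels would use concurrent streams:

```python
def software_guided_prefetch(tlb, page_table, expected_npas):
    """Warm the Link TLB with entries the next collective is likely to touch,
    paying the walk cost before the remote requests arrive."""
    for npa in expected_npas:
        tlb.translate(npa, page_table)

def fused_pretranslation(tlb, page_table, chunks, compute):
    """Overlap translation with computation: pre-translate the next chunk's
    addresses while the current chunk's compute runs. Serialized here for
    clarity; on a GPU the two halves would run on separate streams."""
    for i, chunk in enumerate(chunks):
        if i + 1 < len(chunks):
            for npa in chunks[i + 1]:         # pre-translate ahead of use
                tlb.translate(npa, page_table)
        compute(chunk)
```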

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Inference workloads that rely on many small collectives would see the largest benefit from the proposed prefetch and pre-translation techniques.
  • Designers of future scale-up fabrics may need to expose explicit translation control or larger shared TLBs to software.
  • The same reverse-translation bottleneck is likely to appear in any direct-access interconnect that uses separate network and system address spaces.

Load-bearing premise

The extended ASTRA-sim plus Omnet++ model correctly reproduces the timing and behavior of real Link MMUs and Link TLBs.

What would settle it

Hardware counter measurements of Link TLB miss rates and end-to-end latency for small all-to-all transfers on an actual multi-GPU system connected by NVLink or UALink.
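
Short of vendor counters, a first-order version of that experiment can be run today: time the first small all-to-all over freshly allocated buffers against repeats over the same, already-touched buffers. The harness below is a sketch using PyTorch's NCCL backend (one process per GPU, launched with torchrun); the cold number also absorbs unrelated first-use warm-up, so it bounds the translation cost rather than isolating it.

```python
import torch
import torch.distributed as dist

def time_all_to_all(buf_in, buf_out, iters=1):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        dist.all_to_all_single(buf_out, buf_in)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters            # milliseconds per call

dist.init_process_group("nccl")                       # one process per GPU
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
world = dist.get_world_size()

dummy = torch.zeros(world, device="cuda")
dist.all_to_all_single(torch.empty_like(dummy), dummy)  # one-time NCCL setup

for floats_per_peer in (1024, 16384, 1 << 20):        # small, latency-bound first
    x = torch.rand(floats_per_peer * world, device="cuda")  # fresh pages
    y = torch.empty_like(x)
    cold = time_all_to_all(x, y)                      # first touch of these pages
    warm = time_all_to_all(x, y, iters=50)            # same pages, warm state
    if dist.get_rank() == 0:
        print(f"{floats_per_peer:>8} floats/peer  cold {cold:.3f} ms  warm {warm:.3f} ms")
```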

Figures

Figures reproduced from arXiv:2604.02473 by Amel Fatima, Bradford M. Beckmann, Tuan Ta.

  • Figure 3: Our baseline Reverse Address Translation hierarchy
  • Figure 2: Reverse Address Translation of a Network Physical …
  • Figure 4: Performance overhead of Reverse Address Translation …
  • Figure 5: Average Reverse Address Translation latency per …
  • Figure 6: Fraction of the round trip latency per request spent …
  • Figure 7: Stacked breakdown of Reverse Address Translation …
  • Figure 9: Reverse Address Translation latency per request …
  • Figure 10: Reverse Address Translation latency per request …
  • Figure 11: Performance overhead of Reverse Address Translation …
Original abstract

Distributed ML workloads rely heavily on collective communication across multi-GPU, multi-node systems. Emerging scale-up fabrics, such as NVLink and UALink, enable direct memory access across nodes but introduce a critical destination-side translation step: translating Network Physical Addresses (NPAs) to System Physical Addresses (SPAs), which we term Reverse Address Translation. Despite its importance, the performance impact of Reverse Address Translation remains poorly understood. In this work, we present the first systematic study of Reverse Address Translation in large-scale GPU clusters. Using an extended ASTRA-sim framework with Omnet++ as the network backend, we model Link MMUs and Link TLBs and evaluate their effect on All-to-All collective communication across varying input sizes and GPU counts. Our analysis shows that cold TLB misses dominate latency for small, latency-sensitive collectives, causing up to 1.4x performance degradation, while larger collectives benefit from warmed caches and experience diminishing returns from oversized TLBs. Based on these observations, we propose two avenues for optimization: fused pre-translation kernels that overlap Reverse Address Translation with computation and software-guided TLB prefetching to proactively populate likely-needed entries. These techniques aim to hide translation latency, particularly for small collectives, improving throughput and scalability for inference workloads. Our study establishes a foundation for designing efficient destination-side translation mechanisms in large-scale multi-GPU systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents the first systematic simulation study of reverse address translation (NPA-to-SPA) overheads in multi-GPU scale-up pods. Using an extended ASTRA-sim framework with Omnet++ as the network backend, it models Link MMUs and Link TLBs and evaluates their impact on All-to-All collectives across input sizes and GPU counts. The central findings are that cold TLB misses dominate latency for small, latency-sensitive collectives (up to 1.4x degradation) while larger collectives benefit from warmed caches with diminishing returns from oversized TLBs; the authors propose fused pre-translation kernels and software-guided TLB prefetching to hide translation latency.

Significance. If the simulation model is shown to be accurate, the work would be significant as the first quantitative characterization of destination-side translation costs in emerging scale-up fabrics such as NVLink and UALink. It identifies a concrete performance bottleneck for latency-sensitive collectives and outlines actionable optimization directions, providing a useful foundation for hardware and runtime designers. The simulation-based approach allows exploration of parameter spaces not yet available in silicon, but the lack of hardware grounding currently limits the strength of the quantitative claims.

major comments (2)
  1. [Methodology and Evaluation sections] The Link TLB/MMU latency model (described in the methodology section) is not calibrated or validated against silicon measurements from real NVLink or UALink systems. Because the reported 1.4x degradation for small collectives and the diminishing-returns conclusion for larger collectives rest directly on the modeled miss penalties and cache-warming behavior, the absence of such validation makes the performance numbers sensitive to unverified parameter choices rather than observed system properties.
  2. [Abstract and Conclusion] The proposed optimizations (fused pre-translation kernels and software-guided TLB prefetching) are introduced in the abstract and conclusion but receive no quantitative evaluation within the simulation framework. Without results showing their impact on the reported TLB-miss overheads, it is unclear whether these techniques would meaningfully mitigate the identified bottlenecks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment below and describe the revisions we will make to improve the manuscript.

Point-by-point responses
  1. Referee: [Methodology and Evaluation sections] The Link TLB/MMU latency model (described in the methodology section) is not calibrated or validated against silicon measurements from real NVLink or UALink systems. Because the reported 1.4x degradation for small collectives and the diminishing-returns conclusion for larger collectives rest directly on the modeled miss penalties and cache-warming behavior, the absence of such validation makes the performance numbers sensitive to unverified parameter choices rather than observed system properties.

    Authors: We agree that the model is not calibrated against silicon measurements from deployed NVLink or UALink systems, as these fabrics are still emerging and detailed public measurements of Link MMU/TLB behavior are not yet available. Our latency and size parameters are based on architectural specifications and analogous structures reported in the literature. To address the concern, we will add a sensitivity analysis subsection that varies the TLB miss penalty and TLB capacity over plausible ranges and shows that the central conclusions (cold-miss dominance for small collectives and diminishing returns for large collectives) remain qualitatively stable. We will also expand the methodology with explicit justification and citations for each parameter choice (a toy sweep in this spirit is sketched after these responses). revision: yes

  2. Referee: [Abstract and Conclusion] The proposed optimizations (fused pre-translation kernels and software-guided TLB prefetching) are introduced in the abstract and conclusion but receive no quantitative evaluation within the simulation framework. Without results showing their impact on the reported TLB-miss overheads, it is unclear whether these techniques would meaningfully mitigate the identified bottlenecks.

    Authors: We acknowledge that the two optimizations are introduced without quantitative evaluation in the current study. In the revised version we will modify the abstract and conclusion to state clearly that these are proposed directions motivated by the observed bottlenecks, not evaluated techniques. We will add a short Future Work subsection that outlines how the optimizations could be modeled in extensions of the framework, while removing any implication of measured benefit in this manuscript. revision: yes
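
The promised sensitivity analysis can be previewed in miniature against the toy LinkTLB sketched earlier (not the authors' ASTRA-sim extension): sweep the miss penalty and TLB capacity over plausible ranges and check that average per-request latency stops improving once the working set fits.

```python
import random

def avg_translation_ns(entries, miss_ns, pages=256, requests=10_000):
    # Reuses the toy LinkTLB class sketched earlier on this page.
    tlb = LinkTLB(entries=entries, hit_ns=5, miss_ns=miss_ns)
    table = {p: p ^ 0xABCD for p in range(pages)}     # arbitrary NPA -> SPA map
    total = 0
    for _ in range(requests):
        npa = (random.randrange(pages) << 16) | random.randrange(1 << 16)
        total += tlb.translate(npa, table)[1]
    return total / requests

for miss_ns in (200, 400, 800):                       # plausible walk penalties
    for entries in (16, 64, 256, 1024):               # working set is 256 pages
        print(f"miss {miss_ns:>3} ns  entries {entries:>4}  "
              f"avg {avg_translation_ns(entries, miss_ns):6.1f} ns/request")
```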

Circularity Check

0 steps flagged

No circularity: results are direct simulation outputs from an extended framework, with no equations, fitted predictions, or self-referential derivations.

full rationale

The paper presents a simulation study using an extended ASTRA-sim + Omnet++ model to measure TLB miss effects on collectives. No mathematical derivation chain exists; performance numbers (e.g., 1.4x degradation) are reported as direct outputs of the simulator runs across input sizes and GPU counts. The modeling assumptions are stated explicitly but do not reduce to self-definition or self-citation; the central claims rest on the simulation results themselves rather than any fitted parameter renamed as prediction or ansatz smuggled via prior work. This is a standard empirical modeling paper with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claims rest on the unverified assumption that the simulation model faithfully reproduces real hardware translation latency and miss behavior.

axioms (1)
  • domain assumption The extended ASTRA-sim with Omnet++ accurately captures Link MMU and TLB behavior in NVLink/UALink fabrics.
    This assumption is required for all reported slowdown numbers and optimization suggestions but receives no validation in the abstract.

pith-pipeline@v0.9.0 · 5549 in / 1251 out tokens · 48531 ms · 2026-05-13T20:34:28.832923+00:00 · methodology


Reference graph

Works this paper leans on

115 extracted references · 115 canonical work pages · 10 internal anchors
