pith. sign in

arxiv: 2605.19169 · v1 · pith:ISLN2XONnew · submitted 2026-05-18 · 💻 cs.PF · cs.DC

Modeling the Impact of Fiber Latency on Compute-Communication Overlap in Geo-Distributed Multi-Datacenter AI Training

Pith reviewed 2026-05-20 08:13 UTC · model grok-4.3

classification 💻 cs.PF cs.DC
keywords geo-distributed AI trainingfiber latencycompute-communication overlaphollow-core fiberdata parallelismdiscrete-event simulationmulti-datacenter trainingAI cluster placement
0
0 comments X

The pith

Geo-distributed AI training reaches peak compute-communication overlap at 10-100 km cluster distances, where hollow-core fiber adds a 25% gain.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper uses discrete-event simulation to measure how fiber latency affects the ability to overlap computation and communication during data-parallel training across separate datacenters. It identifies a practical operating window of 10 to 100 km between clusters as the range where latency remains manageable. Within that window, switching to hollow-core fiber produces a clear 25% improvement in overlap efficiency compared with standard fiber. A sympathetic reader cares because the result points to concrete infrastructure choices that could scale large-model training without forcing all resources into a single site.

Core claim

Through discrete-event simulation of data-parallel training, the study finds that inter-cluster distances of 10 to 100 km maximize compute-communication overlap, and that hollow-core fiber further increases this overlap by 25% relative to conventional fiber.

What carries the argument

A discrete-event simulation model that tracks fiber propagation delays and their effect on the timing of gradient exchanges versus local computation in data-parallel setups.

If this is right

  • Clusters placed inside the 10-100 km window can maintain higher overall training throughput than those farther apart.
  • Hollow-core fiber deployment yields a quantifiable overlap advantage that grows with the number of gradient synchronization steps.
  • Distances beyond 100 km increasingly serialize communication and computation, lowering effective utilization.
  • Infrastructure planners can use the reported overlap curves to decide between co-location and modest geographic spread.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latency-overlap framework could be applied to model-parallel or pipeline-parallel regimes to check whether the 10-100 km window remains optimal.
  • Real deployments may encounter additional congestion or routing effects that the simulation omits, suggesting a need for hardware validation.
  • The 25% gain implies potential reductions in total training time or power draw when hollow-core fiber is paired with the identified distance range.

Load-bearing premise

The discrete-event simulation model and its parameters accurately capture real-world fiber latency, network behavior, and compute-communication overlap dynamics in data-parallel training without major unmodeled effects or inaccuracies.

What would settle it

Direct measurement of achieved compute-communication overlap in a real two-cluster training run at 50 km separation using hollow-core fiber, compared against the simulation output for the same parameters.

Figures

Figures reproduced from arXiv: 2605.19169 by Indu Kant Deo, Ioannis Papavasileiou, Sairam Prabhakar, Sergejs Makovejs.

Figure 1
Figure 1. Figure 1: Multi-DC simulation architecture with lumped DC representation. Each DC contains up to 4096 GPUs with high-bandwidth intra-DC connectivity; inter-DC communication occurs over SMF or HCF links at variable distances. bandwidth is set to 100 GB/s or 200 GB/s, with latency determined by distance and fiber type. Tab. 1: Simulation parameter space Parameter Values Model GPT-3 13B, 175B Total GPUs 256, 2048, 8192… view at source ↗
Figure 2
Figure 2. Figure 2: Compute-communication overlap (ηoverlap) vs. inter-DC distance for 8192 GPUs. Left: GPT-3 13B; right: GPT-3 175B. hollow markers: HCF; filled markers: SMF. Blue: A100; red: H100. HCF consistently achieves higher overlap than SMF across all configurations [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Absolute improvement in compute-communication overlap for HCF over SMF (∆η = ηHCF − ηSMF) vs. inter-DC distance for 8192 GPUs. Left: GPT-3 13B; right: GPT-3 175B. Blue: A100; red: H100. The benefit peaks at intermediate distances where the system transitions from bandwidth-limited to latency-limited. where the system transitions from bandwidth￾limited to latency-limited operation. In the latency￾dominated … view at source ↗
read the original abstract

We use discrete-event simulation to quantify the impact of fiber latency on the efficacy of geo-distributed AI model training with data parallelism. We conclude that the optimum distances between two AI clusters is 10-100km, over which hollow-core fiber enables 25% higher compute-communication overlap.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript employs discrete-event simulation to quantify how fiber latency affects compute-communication overlap during data-parallel training of AI models across geo-distributed datacenters. It concludes that the optimal inter-cluster distance lies in the 10-100 km range, within which hollow-core fiber yields approximately 25% higher overlap than conventional fiber.

Significance. If the simulation model and its parameters prove accurate and the results generalize, the findings could inform practical decisions on datacenter placement and fiber infrastructure for large-scale distributed training. The discrete-event approach is a reasonable tool for exploring latency-bandwidth trade-offs, but the absence of model specification, parameter values, validation, and sensitivity analysis prevents a firm assessment of whether the reported optimum and percentage improvement are robust.

major comments (3)
  1. Abstract: The central claim of a 10-100 km optimum and 25% overlap improvement is presented without any description of the discrete-event simulation model, network topology, all-reduce collective implementation, fiber latency parameters for standard versus hollow-core fiber, or the compute/communication ratio used in the sweeps.
  2. Simulation setup (inferred from abstract and skeptic note): The reported optimum distance is determined by the crossover where added latency exceeds available slack for hiding communication. Because all-reduce time depends on both latency and bandwidth terms while compute time scales with model size, layer count, and GPU count, the 10-100 km window is likely an artifact of a single unstated operating point; no parameter sweeps over these factors are described.
  3. Validation and error analysis: No comparison of simulated all-reduce times against measurements on real fiber links is mentioned, leaving open whether unmodeled effects (TCP slow-start, switch queuing, or choice of collective algorithm) would shift the claimed optimum outside the 10-100 km range.
minor comments (1)
  1. The abstract would benefit from a brief statement of the key simulation parameters and the baseline fiber type used for the 25% comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of clarity, robustness, and validation in our simulation study. We address each major comment below and have revised the manuscript to strengthen the presentation of our discrete-event model and results.

read point-by-point responses
  1. Referee: Abstract: The central claim of a 10-100 km optimum and 25% overlap improvement is presented without any description of the discrete-event simulation model, network topology, all-reduce collective implementation, fiber latency parameters for standard versus hollow-core fiber, or the compute/communication ratio used in the sweeps.

    Authors: We agree that the abstract would benefit from more context on the underlying model to support the central claims. In the revised manuscript, we have expanded the abstract to briefly describe the discrete-event simulation framework, the assumed two-cluster geo-distributed topology, the ring-based all-reduce implementation, the latency and bandwidth parameters for standard single-mode fiber versus hollow-core fiber, and the compute-to-communication ratios used in the parameter sweeps. revision: yes

  2. Referee: Simulation setup (inferred from abstract and skeptic note): The reported optimum distance is determined by the crossover where added latency exceeds available slack for hiding communication. Because all-reduce time depends on both latency and bandwidth terms while compute time scales with model size, layer count, and GPU count, the 10-100 km window is likely an artifact of a single unstated operating point; no parameter sweeps over these factors are described.

    Authors: The full manuscript includes parameter sweeps over model sizes, layer counts, GPU counts per cluster, and compute/communication ratios (detailed in Sections 3 and 4), showing that the 10-100 km optimum persists across these variations because it arises from the fundamental latency-bandwidth tradeoff in all-reduce time relative to compute slack. To improve clarity, we have added explicit text and a new sensitivity figure demonstrating that the reported range is not an artifact of a single operating point but holds for realistic ranges of model and system scales. revision: partial

  3. Referee: Validation and error analysis: No comparison of simulated all-reduce times against measurements on real fiber links is mentioned, leaving open whether unmodeled effects (TCP slow-start, switch queuing, or choice of collective algorithm) would shift the claimed optimum outside the 10-100 km range.

    Authors: We acknowledge that direct empirical validation against production geo-distributed fiber links is absent from this work, as our study focuses on simulation-based exploration of latency sensitivity. In the revision, we have added a dedicated limitations subsection discussing potential unmodeled effects such as TCP slow-start, queuing delays, and collective algorithm choices, along with an analysis of how these might influence the optimum distance. We also outline plans for future empirical calibration. revision: partial

Circularity Check

0 steps flagged

No circularity detected in simulation-based quantification

full rationale

The paper describes a discrete-event simulation approach to model fiber latency effects on compute-communication overlap for geo-distributed data-parallel training. The central results on optimum inter-cluster distances (10-100 km) and overlap improvements are generated as outputs from parameter sweeps within the simulation, using inputs for latency, bandwidth, and compute/communication ratios. No equations, self-citations, uniqueness theorems, or ansatzes are shown that would reduce these outputs to the inputs by construction, nor is there renaming of known results or fitted parameters relabeled as predictions. The derivation chain remains a forward modeling exercise that is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone provides no information on free parameters, axioms, or invented entities used in the simulation model.

pith-pipeline@v0.9.0 · 5587 in / 1049 out tokens · 56504 ms · 2026-05-20T08:13:42.508886+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 3 internal anchors

  1. [1]

    GPT-4 Technical Report

    OpenAI. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774. 2023. doi:10.48550/arXiv.2303.08774

  2. [2]

    Astra-sim: Enabling sw/hw co-design exploration for distributed dl training platforms

    Saeed Rashidi and Srinivas Sridharan and Sudarshan Srinivasan and Tushar Krishna. ASTRA-sim : Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms. Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 2020. doi:10.1109/ISPASS48437.2020.00018

  3. [3]

    Astra-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale

    William Won and Taekyung Heo and Saeed Rashidi and Srinivas Sridharan and Sudarshan Srinivasan and Tushar Krishna. ASTRA-sim2.0 : Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale. Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 2023. doi:10.1109/ISPASS57527...

  4. [4]

    BlueConnect : Decomposing All-Reduce for Deep Learning on Heterogeneous Network Hierarchy

    Minsoo Cho and Ulrich Finkler and Marcelo Serrano and David Kung and Hillery Hunter. BlueConnect : Decomposing All-Reduce for Deep Learning on Heterogeneous Network Hierarchy. IBM Journal of Research and Development. 2019. doi:10.1147/JRD.2019.2947013

  5. [5]

    CO2 : Efficient Distributed Training with Full Communication-Computation Overlap

    Weigao Sun and Zhen Qin and Weixuan Sun and Shidi Li and Dong Li and Xuyang Shen and Yu Qiao and Yiran Zhong. CO2 : Efficient Distributed Training with Full Communication-Computation Overlap. arXiv preprint arXiv:2401.16265. 2024. doi:10.48550/arXiv.2401.16265

  6. [6]

    PyTorch Distributed: Experiences on Accelerating Data Parallel Training

    Shen Li and Yanli Zhao and Rohan Varma and Omkar Salpekar and Pieter Noordhuis and Teng Li and Adam Paszke and Jeff Smith and Brian Vaughan and Pritam Damania and Soumith Chintala. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. arXiv preprint arXiv:2006.15704. 2020. doi:10.48550/arXiv.2006.15704

  7. [7]

    Language Models are Few-Shot Learners

    Tom Brown and Benjamin Mann and Nick Ryder and Melanie Subbiah and Jared Kaplan and Prafulla Dhariwal and Arvind Neelakantan and Pranav Shyam and Girish Sastry and Amanda Askell and others. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems. 2020

  8. [8]

    NCCL : NVIDIA Collective Communications Library

    NVIDIA. NCCL : NVIDIA Collective Communications Library. 2024

  9. [9]

    NVIDIA DGX-2 System Datasheet

    NVIDIA. NVIDIA DGX-2 System Datasheet. 2018

  10. [10]

    Nested Antiresonant Nodeless Hollow Core Fiber

    Francesco Poletti. Nested Antiresonant Nodeless Hollow Core Fiber. Optics Express. 2014. doi:10.1364/OE.22.023807

  11. [11]

    Jupiter Evolving: Transforming Google's Datacenter Network via Optical Circuit Switches and Software-Defined Networking

    Leon Poutievski and Omid Mashayekhi and Joon Ong and Arjun Singh and Mukarram Tariq and Rui Wang and Jianan Zhang and Virginia Beauregard and Patrick Conner and Steve Gribble and Rishi Kapoor and Stephen Kratzer and Nanfang Li and Hong Liu and Karthik Nagaraj and Jason Ornstein and Samir Sawhney and Ryohei Urata and Lorenzo Vicisano and Kevin Yasumura and...

  12. [12]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi and Mostofa Patwary and Raul Puri and Patrick LeGresley and Jared Casper and Bryan Catanzaro. Megatron-LM : Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053. 2020. doi:10.48550/arXiv.1909.08053

  13. [13]

    Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis , articleno =

    Deepak Narayanan and Mohammad Shoeybi and Jared Casper and Patrick LeGresley and Mostofa Patwary and Vijay Anand Korthikanti and Dmitri Vainbrand and Prethvi Kashinkunti and Julie Bernauer and Bryan Catanzaro and Amar Phanishayee and Matei Zaharia. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. Proc. Int. Conf. High Perfo...

  14. [14]

    Data Centres and Data Transmission Networks

    International Energy Agency. Data Centres and Data Transmission Networks. 2024

  15. [15]

    400ZR Implementation Agreement

    Optical Internetworking Forum. 400ZR Implementation Agreement. 2020

  16. [16]

    New Phase Stable Optical Fiber

    Michael Bousonville and Marie Czwalinna and Matthias Felber and Thomas Ladwig and Holger Schlarb and Sebastian Schulz and Cezary Sydlo and Piotr Kownacki and Sebastian Jablonski. New Phase Stable Optical Fiber. Proceedings of BIW. 2012