A New Broadcast Model for Several Network Topologies

Bernard Tenreiro; Darren Hamilton; Hongbo Lu; Junsung Hwang; Nabila Jaman Tripti; Yuefan Deng

arxiv: 2510.18058 · v2 · submitted 2025-10-20 · 💻 cs.NI · cs.DC

Pith reviewed 2026-05-18 05:32 UTC · model grok-4.3

keywords broadcast algorithm · network topologies · latency reduction · node utilization · data propagation · simulation results · balanced saturation · communication efficiency

The pith

The Broadcast by Balanced Saturation algorithm reduces broadcast latency by keeping nodes active throughout the process in various network topologies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Broadcast by Balanced Saturation (BBS) as a new general broadcast algorithm meant to improve communication in networks with different structures. It aims to maximize node utilization by keeping every node involved in sending data at each step rather than waiting idle. This targets large systems where broadcasts are slowed by topology constraints, bandwidth limits, and synchronization overhead. A sympathetic reader would care because faster broadcasts could speed up many parallel computing tasks. In the reported simulations, BBS outperforms common general broadcast algorithms across several tested topologies, often by a substantial margin.

Core claim

BBS maximizes node utilization by means of a precise communication cycle that delivers a repeatable stepwise broadcasting framework, ensuring sustained activity with nodes throughout the broadcast to enhance data propagation and significantly reduce latency, with simulation results showing consistent outperformance of common general broadcast algorithms across various topologies.

What carries the argument

The Broadcast by Balanced Saturation (BBS) algorithm and its balanced saturation mechanism that maintains continuous node participation in the broadcast cycle.

If this is right

Broadcast operations complete with lower latency in large-scale systems.
Node utilization stays high across different network topologies.
Data propagation improves without additional synchronization costs.
The stepwise framework makes broadcasts more predictable and efficient.
Performance gains appear substantial compared to standard algorithms in simulations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the balanced saturation idea holds, it could be adapted for other collective operations like all-reduce in distributed training.
Real hardware deployments might show whether the latency benefits persist under variable network conditions not captured in simulation.
Extending the model to include fault tolerance could address practical deployment in unreliable networks.

Load-bearing premise

That the simulation results on the tested topologies and traffic patterns accurately reflect performance in real-world networks.

What would settle it

A direct comparison of broadcast completion times using BBS versus a standard algorithm on a real supercomputer with one of the simulated topologies, where lack of significant latency reduction would disprove the outperformance claim.

Original abstract

We present Broadcast by Balanced Saturation (BBS), a general broadcast algorithm designed to optimize communication efficiency across diverse network topologies. BBS maximizes node utilization, addressing challenges in broadcast operations such as topology constraints, bandwidth limitations, and synchronization overhead, particularly in large-scale systems like supercomputers. The algorithm ensures sustained activity with nodes throughout the broadcast, thereby enhancing data propagation and significantly reducing latency. Through a precise communication cycle, BBS provides a repeatable, streamlined, stepwise broadcasting framework. Simulation results across various topologies demonstrate that the BBS algorithm consistently outperforms common general broadcast algorithms, often by a substantial margin. These findings suggest that BBS is a versatile and robust framework with the potential to redefine broadcast strategies across network topologies.
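To make the idle-node problem concrete, the following minimal Python sketch counts how many nodes are sending in each step of a textbook binomial-tree broadcast. It is a baseline illustration only, not BBS: the abstract does not specify the BBS communication cycle, and the function name `binomial_broadcast_activity`, the fully connected network, and the one-port, single-message model are all assumptions made for this example.

```python
def binomial_broadcast_activity(num_nodes: int) -> list[tuple[int, int]]:
    """Return (senders, informed_after) for each step of a binomial-tree broadcast.

    Assumed model (not taken from the paper): a fully connected network in
    which, per step, every node that already holds the message sends it to
    at most one node that does not yet hold it.
    """
    informed = 1  # only the root holds the message at the start
    steps = []
    while informed < num_nodes:
        # Each informed node can serve one uninformed node per step.
        senders = min(informed, num_nodes - informed)
        informed += senders
        steps.append((senders, informed))
    return steps


if __name__ == "__main__":
    n = 16
    for step, (senders, informed) in enumerate(binomial_broadcast_activity(n), start=1):
        print(f"step {step}: {senders}/{n} nodes sending, {informed} informed")
```

Under these assumptions a 16-node broadcast takes four steps, and no step has more than half of the nodes sending. This is the kind of underutilization that a scheme built around sustained node activity would try to avoid.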

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

BBS is a straightforward new broadcast algorithm that keeps nodes busy to cut latency, with simulation claims that look plausible but rest on thin details.

The main thing to know is that this paper puts forward Broadcast by Balanced Saturation as a general algorithm meant to maximize node activity during broadcasts, which the authors say reduces latency across different network topologies in large systems like supercomputers. Their simulations reportedly show consistent gains over standard methods, sometimes by a wide margin. That is the core claim worth checking.

They do a reasonable job spelling out the practical problems (topology limits, bandwidth, and idle nodes), and they frame the solution as a clean, repeatable communication cycle that could be easier to implement than ad hoc approaches. The balanced saturation idea itself is the piece that feels new, at least as presented, since it targets sustained activity rather than just tree depth or pipelining. The paper stays focused on the algorithm and its potential use in collective operations, which keeps it grounded.

The soft spot is the evidence. The performance numbers come from simulations, but the abstract gives almost no information on the exact topologies tested, the baseline algorithms, the traffic patterns, or any error bars. That leaves the outperformance claim hard to judge without the full methods section. If the full paper supplies those details and shows the mechanism is distinct from existing variants, the result strengthens. Otherwise it risks looking like a tuned version of known techniques.

This is for people working on network algorithms and high-performance computing collectives. A reader who needs practical broadcast improvements in distributed systems could pick up usable ideas here. It deserves peer review so the simulation setup and novelty relative to prior tree or pipelined work can be examined directly.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Broadcast by Balanced Saturation (BBS), a general-purpose broadcast algorithm for diverse network topologies. It claims that a balanced saturation mechanism and a precise communication cycle keep nodes active throughout the broadcast, thereby improving data propagation, reducing latency, and outperforming standard broadcast algorithms across multiple topologies as shown by simulation results.

Significance. If the simulation results are reproducible and the tested topologies are representative, BBS could provide a practical, topology-agnostic improvement in broadcast efficiency for large-scale systems such as supercomputers and data-center networks. The emphasis on sustained node utilization and a repeatable stepwise framework is a potentially useful engineering contribution, though its impact depends on the strength of the empirical evidence.

major comments (2)

[Simulation Results / Evaluation section] The central performance claims rest on simulation results whose methodology is not described in sufficient detail. No information is given on the concrete topologies (e.g., hypercube, torus, fat-tree dimensions), the baseline algorithms, the exact metrics (latency, completion time, bandwidth utilization), number of runs, or statistical measures. This absence directly undermines the assertion that BBS “consistently outperforms … often by a substantial margin.”
[Algorithm Description / BBS Mechanism] The Balanced Saturation mechanism is introduced as the key innovation, yet its formal definition, termination conditions, and interaction with the communication cycle are not specified with sufficient rigor to allow independent verification or reproduction. Without these details the claim of “sustained activity with nodes throughout the broadcast” remains an unverified assertion rather than a demonstrated property.

minor comments (2)

[Abstract] The abstract and introduction would benefit from a concise statement of the quantitative improvements (e.g., percentage latency reduction) rather than the qualitative phrase “substantial margin.”
[Preliminaries / Algorithm section] Notation for the communication cycle and saturation parameters should be introduced consistently and early; currently the text mixes descriptive language with occasional undefined symbols.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to incorporate additional details as outlined.

Referee: [Simulation Results / Evaluation section] The central performance claims rest on simulation results whose methodology is not described in sufficient detail. No information is given on the concrete topologies (e.g., hypercube, torus, fat-tree dimensions), the baseline algorithms, the exact metrics (latency, completion time, bandwidth utilization), number of runs, or statistical measures. This absence directly undermines the assertion that BBS “consistently outperforms … often by a substantial margin.”

Authors: We agree that the current description of the simulation methodology is insufficient for reproducibility. In the revised manuscript we will expand the Evaluation section with concrete topology specifications (including dimensions for hypercubes, tori, and fat-trees), the specific baseline algorithms used for comparison, the precise metrics recorded, the number of independent runs, and statistical measures such as means and standard deviations. These additions will directly support the performance claims.

Revision: yes

Referee: [Algorithm Description / BBS Mechanism] The Balanced Saturation mechanism is introduced as the key innovation, yet its formal definition, termination conditions, and interaction with the communication cycle are not specified with sufficient rigor to allow independent verification or reproduction. Without these details the claim of “sustained activity with nodes throughout the broadcast” remains an unverified assertion rather than a demonstrated property.

Authors: We acknowledge the need for greater rigor in describing the Balanced Saturation mechanism. We will revise the relevant section to provide a formal definition, explicit termination conditions, and a detailed account of how the mechanism interacts with the communication cycle to maintain node activity. Additional pseudocode will be included to enable independent verification.

Revision: yes
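The statistical reporting requested in the first exchange can be sketched in a few lines of Python. This is a hypothetical helper, not the authors' evaluation code: the function names `summarize_runs` and `relative_reduction` are invented for illustration, and the numbers in the usage block are placeholders, not results from the paper.

```python
from statistics import mean, stdev


def summarize_runs(times):
    """Return (mean, sample standard deviation) of broadcast completion times."""
    if len(times) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(times), stdev(times)


def relative_reduction(baseline_mean, candidate_mean):
    """Fraction by which the candidate's mean time is below the baseline's."""
    return (baseline_mean - candidate_mean) / baseline_mean


if __name__ == "__main__":
    # Placeholder timings for illustration only; NOT measurements from the paper.
    baseline_runs = [10.0, 10.4, 9.8, 10.2]
    candidate_runs = [7.4, 7.6, 7.5, 7.5]
    b_mean, b_sd = summarize_runs(baseline_runs)
    c_mean, c_sd = summarize_runs(candidate_runs)
    print(f"baseline:  {b_mean:.2f} +/- {b_sd:.2f}")
    print(f"candidate: {c_mean:.2f} +/- {c_sd:.2f}")
    print(f"relative latency reduction: {relative_reduction(b_mean, c_mean):.1%}")
```

Reporting a percentage reduction alongside the spread of each algorithm's runs would also address the minor comment asking for quantitative improvements instead of the phrase "substantial margin."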

Circularity Check

0 steps flagged

No significant circularity

The paper proposes the Broadcast by Balanced Saturation (BBS) algorithm as a general broadcast method optimized for diverse network topologies and validates performance claims exclusively through simulation results on various topologies. No derivation chain, equations, fitted parameters, or self-citations are described in the available text that would reduce any prediction or result to the inputs by construction. The central claims rest on algorithmic design choices and empirical outperformance metrics rather than any self-referential definitions or load-bearing internal loops, rendering the presentation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the unstated details of the BBS communication cycle and the assumption that simulation outcomes reflect practical gains; no explicit free parameters or invented physical entities are described.

axioms (1)

domain assumption: Network topologies admit a balanced saturation schedule that keeps nodes active without violating bandwidth or connectivity constraints.
Implicit foundation for the stepwise broadcasting framework described in the abstract.
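One way to read this assumption is as a set of checks that any step-by-step broadcast schedule must pass. The Python sketch below is hypothetical: the function `check_schedule`, the one-port rule standing in for the bandwidth constraint, the single indivisible message, and the 4-node ring are illustrative assumptions, not the paper's formalism.

```python
def check_schedule(edges, schedule, root):
    """Validate a broadcast schedule and return the number of senders per step.

    edges:    set of undirected links, each a frozenset({u, v})
    schedule: list of steps; each step is a list of (sender, receiver) pairs
    root:     node that holds the message before the first step

    Assumed constraints (not from the paper):
      - connectivity: a send may only use an existing link
      - causality:    a sender must already hold the message
      - one-port:     per step, a node sends at most once and receives at most once
    """
    nodes = set().union(*edges) | {root}
    informed = {root}
    senders_per_step = []
    for step, sends in enumerate(schedule):
        senders = [s for s, _ in sends]
        receivers = [r for _, r in sends]
        for s, r in sends:
            if frozenset((s, r)) not in edges:
                raise ValueError(f"step {step}: no link between {s} and {r}")
            if s not in informed:
                raise ValueError(f"step {step}: node {s} sends before receiving")
        if len(set(senders)) != len(senders) or len(set(receivers)) != len(receivers):
            raise ValueError(f"step {step}: one-port constraint violated")
        informed.update(receivers)  # receivers may forward from the next step on
        senders_per_step.append(len(senders))
    if informed != nodes:
        raise ValueError("schedule does not reach every node")
    return senders_per_step


if __name__ == "__main__":
    ring = {frozenset(p) for p in [(0, 1), (1, 2), (2, 3), (3, 0)]}
    schedule = [[(0, 1)], [(0, 3), (1, 2)]]
    print(check_schedule(ring, schedule, root=0))
```

The returned per-step sender counts give a crude utilization profile. A schedule that keeps nodes active, in the sense the axiom describes, would show counts that stay close to the number of informed nodes at every step.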

invented entities (1)

Balanced Saturation mechanism (no independent evidence)
purpose: To maximize node utilization and minimize idle time during broadcast.
Core novel concept introduced to organize the communication cycle.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: BBS maximizes node utilization... occupancy constraints... balanced solution where incoming efficiency equals constant C... BIA via multigraph edge coloring

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: Theorem 2: min T(A_cc) = min T(A_b) for balanced BBS solutions

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 1 internal anchor

[1] Almeida, F., Okon, E.: Assessing the impact of high-performance computing on digital transformation: benefits, challenges, and size-dependent differences. The Journal of Supercomputing 81, 795 (2025). https://doi.org/10.1007/s11227-025-07281-z

[2] Jia, W., Wang, H., Chen, M., Lu, D., Lin, L., Car, R., Weinan, E., Zhang, L.: Pushing the limit of molecular dynamics with ab initio accuracy to 100 million atoms with machine learning. In: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–14 (2020). https://doi.org/10.1109/SC41405.2020.00009

[3] Watkins, J., Carlson, M., Shan, K., Tezaur, I., Perego, M., Bertagna, L., Kao, C., Hoffman, M.J., Price, S.F.: Performance portable ice-sheet modeling with mali. The International Journal of High Performance Computing Applications 37(5), 600–625 (2023). https://doi.org/10.1177/10943420231183688

[4] Harvey, M.J., Giupponi, G., Fabritiis, G.D.: Acemd: Accelerating biomolecular dynamics in the microsecond time scale. Journal of Chemical Theory and Computation 5(6), 1632–1639 (2009). https://doi.org/10.1021/ct9000685. PMID: 26609855

[5] Woo, J., Choi, H., Lee, J.: Empirical performance analysis of collective communication for distributed deep learning in a many-core cpu environment. Applied Sciences 10(19) (2020). https://doi.org/10.3390/app10196717

[6] Mitra, P., Payne, D., Shuler, L., Geijn, R., Watts, J.: Fast collective communication libraries, please. Technical report, USA (1995)

[7] Eleftheriou, M., Fitch, B., Rayshubskiy, A., Ward, T.J.C., Germain, R.: Performance measurements of the 3d fft on the blue gene/l supercomputer. In: Cunha, J.C., Medeiros, P.D. (eds.) Euro-Par 2005 Parallel Processing, pp. 795–803. Springer, Berlin, Heidelberg (2005)

[8] Dongarra, J.J., Luszczek, P., Petitet, A.: The LINPACK benchmark: Past, present and future. Concurrency and Computation: Practice and Experience 15, 803–820 (2003). https://doi.org/10.1002/cpe.728

[9] Hasanov, K., Quintin, J.-N., Lastovetsky, A.: Topology-oblivious optimization of mpi broadcast algorithms on extreme-scale platforms. Simulation Modelling Practice and Theory 58, 30–39 (2015). https://doi.org/10.1016/j.simpat.2015.03.005. Special Issue on Techniques and Applications for Sustainable Ultrascale Computing Systems

[10] Sinha, K., Srimani, P.: Deterministic broadcast and gossiping algorithms for ad hoc networks. The Journal of Supercomputing 37, 115–144 (2006). https://doi.org/10.1007/s11227-006-6255-3

[11] Dorier, M., Mubarak, M., Ross, R., Li, J.K., Carothers, C.D., Ma, K.-L.: Evaluation of topology-aware broadcast algorithms for dragonfly networks. In: 2016 IEEE International Conference on Cluster Computing (CLUSTER), pp. 40–49 (2016). https://doi.org/10.1109/CLUSTER.2016.26

[12] Träff, J.L.: A simple work-optimal broadcast algorithm for message-passing parallel systems. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J. (eds.) Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp. 173–180. Springer, Berlin, Heidelberg (2004)

[13] Träff, J.L., Ripke, A.: Optimal broadcast for fully connected networks. In: Yang, L.T., Rana, O.F., Di Martino, B., Dongarra, J. (eds.) High Performance Computing and Communications, pp. 45–56. Springer, Berlin, Heidelberg (2005)

[14] Pjesivac-Grbovic, J., Angskun, T., Bosilca, G., Fagg, G.E., Gabriel, E., Dongarra, J.J.: Performance analysis of mpi collective operations. In: 19th IEEE International Parallel and Distributed Processing Symposium, p. 8 (2005). https://doi.org/10.1109/IPDPS.2005.335

[15] Silvestre, D., Hespanha, J.P., Silvestre, C.: Broadcast and gossip stochastic average consensus algorithms in directed topologies. IEEE Transactions on Control of Network Systems 6(2), 474–486 (2019). https://doi.org/10.1109/TCNS.2018.2839341

[16] Berenbrink, P., Elsaesser, R., Friedetzky, T.: Efficient randomised broadcasting in random regular networks with applications in peer-to-peer systems. In: Proceedings of the Twenty-Seventh ACM Symposium on Principles of Distributed Computing. PODC ’08, pp. 155–164. Association for Computing Machinery, New York, NY, USA (2008). https://doi.org/10.1145/...

[17] Louri, A., Weech, B., Neocleous, C.: A spanning multichannel linked hypercube: a gradually scalable optical interconnection network for massively parallel computing. IEEE Transactions on Parallel and Distributed Systems 9(5), 497–512 (1998). https://doi.org/10.1109/71.679219

[18] Kim, J., Dally, W.J., Abts, D.: Flattened butterfly: a cost-efficient topology for high-radix networks. In: Proceedings of the 34th Annual International Symposium on Computer Architecture. ISCA ’07, pp. 126–137. Association for Computing Machinery, New York, NY, USA (2007). https://doi.org/10.1145/1250662.1250679

[19] Jain, N., Bhatele, A., Howell, L.H., Böhme, D., Karlin, I., León, E.A., Mubarak, M., Wolfe, N., Gamblin, T., Leininger, M.L.: Predicting the performance impact of different fat-tree configurations. In: SC17: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–13 (2017)

[20] García, M., Vallejo, E., Beivide, R., Odriozola, M., Camarero, C., Valero, M., Rodríguez, G., Labarta, J., Minkenberg, C.: On-the-fly adaptive routing in high-radix hierarchical networks. In: 2012 41st International Conference on Parallel Processing, pp. 279–288 (2012). https://doi.org/10.1109/ICPP.2012.46

[21] Zhang, P., Deng, Y.: Design and analysis of pipelined broadcast algorithms for the all-port interlaced bypass torus networks. Parallel and Distributed Systems, IEEE Transactions on 23, 2245–2253 (2012). https://doi.org/10.1109/TPDS.2012.93

[22] Gabow, H.N., Kariv, O.: Algorithms for edge coloring bipartite graphs. In: Proceedings of the Tenth Annual ACM Symposium on Theory of Computing. STOC ’78, pp. 184–192. Association for Computing Machinery, New York, NY, USA (1978). https://doi.org/10.1145/800133.804346

[23] Deng, Y., Guo, M., Ramos, A., Huang, X., Xu, Z., Liu, W.: Optimal low-latency network topologies for cluster performance enhancement. The Journal of Supercomputing 76 (2020). https://doi.org/10.1007/s11227-020-03216-y

[24] Nuriyev, E., Rico-Gallego, J.-A., Lastovetsky, A.: Model-based selection of optimal mpi broadcast algorithms for multi-core clusters. Journal of Parallel and Distributed Computing 165, 1–16 (2022)

[25] Thakur, R., Gropp, W.: Improving the performance of collective operations in mpich. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 2840, 257–267 (2003). https://doi.org/10.1007/978-3-540-39924-7_38

[26] Casanova, H., Legrand, A., Quinson, M.: Simgrid: A generic framework for large-scale distributed experiments. In: Tenth International Conference on Computer Modeling and Simulation (uksim 2008), pp. 126–131 (2008). https://doi.org/10.1109/UKSIM.2008.28
