Characterization of Real Communication Patterns and Congestion Dynamics in HPC Interconnection Networks
Pith reviewed 2026-05-10 07:41 UTC · model grok-4.3
The pith
This paper develops an extended VEF Traces framework to characterize communication patterns and congestion from real HPC application traces.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes a methodology based primarily on the VEF Traces framework to characterize, model, and simulate the communication patterns of representative computing- and data-intensive applications. The framework is extended with tools that characterize network congestion either directly from VEF traces or via simulations. Analysis of VEF traces from runs of NEST, GROMACS, LAMMPS, and PATMOS on several supercomputers identifies potential congestion scenarios that arise in realistic network configurations when certain collective operations are performed.
What carries the argument
The VEF Traces framework extended with congestion characterization tools that process execution traces to extract traffic patterns and detect congestion points.
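As a rough illustration of what extracting traffic patterns from a trace could look like, the sketch below aggregates point-to-point records into a source-destination byte matrix and ranks the destinations that receive the most traffic. The `(src, dst, nbytes)` record layout and both function names are hypothetical; this is not the VEF trace format and not the paper's tooling.

```python
from collections import defaultdict


def traffic_matrix(records):
    """Sum bytes per (src, dst) pair from hypothetical (src, dst, nbytes) records."""
    matrix = defaultdict(int)
    for src, dst, nbytes in records:
        matrix[(src, dst)] += nbytes
    return dict(matrix)


def hottest_destinations(matrix, top=1):
    """Rank destinations by total received bytes, highest first."""
    received = defaultdict(int)
    for (_, dst), nbytes in matrix.items():
        received[dst] += nbytes
    # Sort by descending volume, then by destination id for a stable order.
    ranked = sorted(received.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top]


# Made-up records: three ranks sending mostly to rank 3.
records = [(0, 3, 1024), (1, 3, 2048), (2, 3, 512), (0, 1, 256), (0, 3, 1024)]
matrix = traffic_matrix(records)
hot = hottest_destinations(matrix, top=2)
```

A destination that dominates the ranking is only a candidate hotspot; whether it actually congests depends on the topology, routing, and timing, which this sketch ignores.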
Load-bearing premise
The selected traces from NEST, GROMACS, LAMMPS, and PATMOS on the studied supercomputers represent the communication patterns and congestion dynamics found in general HPC workloads.
What would settle it
Running the same applications on a supercomputer with a different network topology or routing algorithm and finding no congestion during the same collective operations would show that the identified scenarios are not general.
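One way to picture this test is to count, for a fixed set of flows, how many of them share the busiest link under two routing choices. The sketch below is a toy: the two-switch topology, the link tuples, and the routing functions `single_uplink` and `spread_uplinks` are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def max_link_sharing(flows, route):
    """Largest number of flows that traverse the same link under `route`."""
    load = Counter()
    for src, dst in flows:
        for link in route(src, dst):
            load[link] += 1
    return max(load.values()) if load else 0


# Toy setup (assumption): senders 0..3 sit on switch "A", receivers 4..7 on
# switch "B", and two parallel uplinks (index 0 and 1) connect A to B.
def single_uplink(src, dst):
    """Every flow is routed over uplink 0."""
    return [("A", "B", 0)]


def spread_uplinks(src, dst):
    """Flows are split across the two uplinks by source id."""
    return [("A", "B", src % 2)]


flows = [(i, i + 4) for i in range(4)]  # one pairwise exchange step
worst_single = max_link_sharing(flows, single_uplink)
worst_spread = max_link_sharing(flows, spread_uplinks)
```

Here the same four flows pile onto one link under the first routing and are halved under the second, which is the kind of difference the settling experiment would look for at full scale.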
Original abstract
The interconnection network is a key component of supercomputers and data centers, and its design must cope with the increasing communication demands of current applications and services; otherwise, it may become a system bottleneck. The most challenging network design issues are the topology, routing algorithm, flow control, and power efficiency. However, even the most efficient interconnection networks may suffer severe performance degradation due to congestion, especially under specific network traffic patterns generated by communication operations in high-performance computing (HPC), deep learning training, or online data-intensive services. In this context, characterizing and modeling these communication operations and the network traffic patterns they generate is a fundamental challenge for studying their impact on network performance. This paper presents a methodology, based primarily on the VEF Traces framework, to characterize, model, and simulate the communication patterns of representative computing- and data-intensive applications. More precisely, we have extended the VEF Traces framework with tools that enable us to characterize network congestion, either directly from VEF traces or via simulations. We have analyzed a set of VEF traces obtained from runs of NEST, GROMACS, LAMMPS, and PATMOS on several supercomputers. In these studies, we identify potential congestion scenarios that arise in realistic network configurations when certain collective operations are performed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a methodology based primarily on the VEF Traces framework to characterize, model, and simulate the communication patterns of representative computing- and data-intensive applications. It extends the framework with tools to characterize network congestion either directly from VEF traces or via simulations, analyzes traces from runs of NEST, GROMACS, LAMMPS, and PATMOS on several supercomputers, and identifies potential congestion scenarios that arise in realistic network configurations when certain collective operations are performed.
Significance. If the methodology is sound and the identified scenarios prove generalizable, the work could contribute empirical insights into real HPC communication patterns and congestion dynamics, supporting better interconnection network design. The grounding in actual supercomputer traces from multiple applications is a strength compared to purely synthetic models. However, the narrow application set and lack of quantitative validation metrics limit the potential impact to case-specific observations rather than broadly applicable findings.
major comments (2)
- [Application trace analysis section] The section describing the analyzed VEF traces from NEST, GROMACS, LAMMPS, and PATMOS: the central claim that these traces enable identification of 'potential congestion scenarios' in realistic configurations rests on an unexamined assumption of representativeness; no quantitative comparison is provided of message-size distributions, collective operation frequencies, or spatial traffic patterns against other common HPC workloads (e.g., deep-learning training collectives or irregular graph analytics), which is load-bearing for any claim of generalizable congestion dynamics.
- [Methodology and tools extension] The description of the extended VEF framework tools for congestion characterization: the methodology is outlined but supplies no concrete metrics for quantifying congestion (e.g., queue occupancy thresholds, latency inflation factors, or link utilization thresholds) nor any validation results from the trace-based or simulation-based studies, leaving the identification of scenarios without supporting evidence.
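The quantitative comparison requested in the first major comment could start from something as simple as the sketch below, which buckets message sizes into power-of-two bins and measures how far apart two workloads' distributions are. The choice of total variation distance, the bucket scheme, and the sample sizes are assumptions made for illustration; the paper does not prescribe any of them.

```python
from collections import Counter


def size_histogram(sizes):
    """Normalized histogram of message sizes over power-of-two buckets.

    Bucket k holds sizes in [2**k, 2**(k+1)); sizes must be >= 1 byte.
    """
    counts = Counter(int(s).bit_length() - 1 for s in sizes)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two histograms (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Made-up message sizes (bytes) for two hypothetical workloads.
workload_a = [64, 64, 1024, 4096]
workload_b = [64, 1024, 1024, 4096]
distance = total_variation(size_histogram(workload_a), size_histogram(workload_b))
```

The same pattern extends to collective-operation frequencies by replacing the bucket key with the operation name.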
minor comments (2)
- [Introduction] Clarify the definition and scope of 'VEF Traces framework' on first use, including any assumptions about trace fidelity to actual network behavior.
- [Results and figures] Ensure all figures showing congestion scenarios include axis labels, legends, and quantitative scales rather than qualitative descriptions alone.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and detailed comments on our manuscript. We address each major comment point by point below, providing clarifications and indicating the revisions made to strengthen the paper.
- Referee: [Application trace analysis section] The section describing the analyzed VEF traces from NEST, GROMACS, LAMMPS, and PATMOS: the central claim that these traces enable identification of 'potential congestion scenarios' in realistic configurations rests on an unexamined assumption of representativeness; no quantitative comparison is provided of message-size distributions, collective operation frequencies, or spatial traffic patterns against other common HPC workloads (e.g., deep-learning training collectives or irregular graph analytics), which is load-bearing for any claim of generalizable congestion dynamics.
Authors: We agree that representativeness is key to supporting broader claims about congestion dynamics. The applications (NEST, GROMACS, LAMMPS, PATMOS) were selected because they are established workloads in neuroscience, molecular dynamics, and particle transport, with communication patterns documented in prior HPC studies. However, our dataset did not include equivalent VEF traces from deep-learning or graph analytics workloads, precluding direct quantitative comparisons of message sizes, collectives, or spatial patterns. In the revised manuscript, we have added a dedicated paragraph in the application trace analysis section that discusses the selection rationale with supporting references, qualitatively contrasts the observed patterns (e.g., irregular point-to-point vs. known all-to-all in DL), and explicitly scopes our findings to scientific computing applications rather than claiming generalizability across all HPC workloads. revision: partial
- Referee: [Methodology and tools extension] The description of the extended VEF framework tools for congestion characterization: the methodology is outlined but supplies no concrete metrics for quantifying congestion (e.g., queue occupancy thresholds, latency inflation factors, or link utilization thresholds) nor any validation results from the trace-based or simulation-based studies, leaving the identification of scenarios without supporting evidence.
Authors: We accept this critique; the original methodology section described the tool extensions at a high level without sufficient operational detail. In the revised version, we have expanded this section to define explicit congestion metrics: link utilization >75%, average queue occupancy >50 packets, and latency inflation factor >1.2 relative to baseline. We have also incorporated validation results, including direct comparisons of trace-derived indicators against simulation outputs for the four applications, confirming that the metrics reliably flag the collective-induced contention scenarios. These additions supply the quantitative grounding and evidence previously missing. revision: yes
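To make the three thresholds in this response concrete, here is a minimal sketch of how a per-link sample might be checked against them. The threshold values come from the rebuttal text above; the `LinkSample` fields, the function name, and the decision to report each exceeded metric separately are assumptions, since the response does not say how the three metrics are combined.

```python
from dataclasses import dataclass

# Thresholds as stated in the rebuttal.
UTILIZATION_THRESHOLD = 0.75        # link utilization > 75%
QUEUE_THRESHOLD_PACKETS = 50        # average queue occupancy > 50 packets
LATENCY_INFLATION_THRESHOLD = 1.2   # latency / baseline latency > 1.2


@dataclass
class LinkSample:
    """Hypothetical per-link measurement; field names are illustrative."""
    utilization: float        # fraction of link bandwidth in use, 0..1
    avg_queue_packets: float  # average queue occupancy in packets
    latency: float            # observed latency, any consistent unit
    baseline_latency: float   # latency without contention, same unit


def exceeded_metrics(sample):
    """Return the names of the metrics whose threshold the sample exceeds."""
    flags = []
    if sample.utilization > UTILIZATION_THRESHOLD:
        flags.append("utilization")
    if sample.avg_queue_packets > QUEUE_THRESHOLD_PACKETS:
        flags.append("queue_occupancy")
    if (sample.baseline_latency > 0
            and sample.latency / sample.baseline_latency > LATENCY_INFLATION_THRESHOLD):
        flags.append("latency_inflation")
    return flags


busy = exceeded_metrics(LinkSample(0.9, 10, 1.5, 1.0))
quiet = exceeded_metrics(LinkSample(0.3, 5, 1.1, 1.0))
```

Returning the list of exceeded metrics, rather than a single yes/no verdict, leaves the any-versus-all decision to the caller.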
Circularity Check
No circularity; empirical trace analysis with no derivations or self-defined reductions
The paper presents an empirical methodology that extends the existing VEF Traces framework to collect and analyze communication traces from external application runs (NEST, GROMACS, LAMMPS, PATMOS) on supercomputers, then identifies congestion scenarios from those traces. No mathematical derivations, equations, fitted parameters, or predictions are described that could reduce to inputs by construction. The work relies on external benchmarks and trace data rather than any self-citation chain or ansatz that would make central claims tautological. This is a standard non-circular empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the VEF Traces framework accurately captures communication patterns from HPC applications