Characterization of Real Communication Patterns and Congestion Dynamics in HPC Interconnection Networks
Pith reviewed 2026-05-10 07:41 UTC · model grok-4.3
The pith
This paper develops an extended VEF Traces framework to characterize communication patterns and congestion from real HPC application traces.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes a methodology based primarily on the VEF Traces framework to characterize, model, and simulate the communication patterns of representative computing- and data-intensive applications. The framework is extended with tools that characterize network congestion either directly from VEF traces or via simulations. Analysis of VEF traces from runs of NEST, GROMACS, LAMMPS, and PATMOS on several supercomputers identifies potential congestion scenarios that arise in realistic network configurations when certain collective operations are performed.
What carries the argument
The VEF Traces framework extended with congestion characterization tools that process execution traces to extract traffic patterns and detect congestion points.
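The kind of per-trace processing these tools perform can be sketched in miniature. The record layout below (timestamp, source, destination, size) is a simplified stand-in, not the actual VEF trace format, which carries richer MPI-level information; the sketch only illustrates how a many-to-one hotspot falls out of aggregating a trace:

```python
from collections import Counter

def per_destination_load(records):
    """Sum bytes received per destination node.

    Each record is assumed to be (timestamp, src, dst, size_bytes);
    this is a hypothetical, simplified stand-in for VEF trace records,
    used here only to illustrate hotspot detection.
    """
    load = Counter()
    for _, _, dst, size in records:
        load[dst] += size
    return load

# Three senders targeting node 3: an incast-style hotspot.
trace = [(0.0, 0, 3, 4096), (0.1, 1, 3, 4096), (0.2, 2, 3, 4096)]
hotspot, volume = per_destination_load(trace).most_common(1)[0]
```

A real congestion-characterization pass would additionally track when the messages overlap in time, since simultaneity, not aggregate volume, is what fills queues.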
Load-bearing premise
The selected traces from NEST, GROMACS, LAMMPS, and PATMOS on the studied supercomputers represent the communication patterns and congestion dynamics found in general HPC workloads.
What would settle it
Running the same applications on a supercomputer with a different network topology or routing algorithm and finding no congestion during the same collective operations would show that the identified scenarios are not general.
Original abstract
The interconnection network is a key component of supercomputers and data centers, and its design must cope with the increasing communication demands of current applications and services; otherwise, it may become a system bottleneck. The most challenging network design issues are the topology, routing algorithm, flow control, and power efficiency. However, even the most efficient interconnection networks may suffer severe performance degradation due to congestion, especially under specific network traffic patterns generated by communication operations in high-performance computing (HPC), deep learning training, or online data-intensive services. In this context, characterizing and modeling these communication operations and the network traffic patterns they generate is a fundamental challenge for studying their impact on network performance. This paper presents a methodology, based primarily on the VEF Traces framework, to characterize, model, and simulate the communication patterns of representative computing- and data-intensive applications. More precisely, we have extended the VEF Traces framework with tools that enable us to characterize network congestion, either directly from VEF traces or via simulations. We have analyzed a set of VEF traces obtained from runs of NEST, GROMACS, LAMMPS, and PATMOS on several supercomputers. In these studies, we identify potential congestion scenarios that arise in realistic network configurations when certain collective operations are performed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a methodology based primarily on the VEF Traces framework to characterize, model, and simulate the communication patterns of representative computing- and data-intensive applications. It extends the framework with tools to characterize network congestion either directly from VEF traces or via simulations, analyzes traces from runs of NEST, GROMACS, LAMMPS, and PATMOS on several supercomputers, and identifies potential congestion scenarios that arise in realistic network configurations when certain collective operations are performed.
Significance. If the methodology is sound and the identified scenarios prove generalizable, the work could contribute empirical insights into real HPC communication patterns and congestion dynamics, supporting better interconnection network design. The grounding in actual supercomputer traces from multiple applications is a strength compared to purely synthetic models. However, the narrow application set and lack of quantitative validation metrics limit the potential impact to case-specific observations rather than broadly applicable findings.
major comments (2)
- [Application trace analysis section] The central claim that the analyzed VEF traces from NEST, GROMACS, LAMMPS, and PATMOS enable identification of 'potential congestion scenarios' in realistic configurations rests on an unexamined assumption of representativeness. No quantitative comparison is provided of message-size distributions, collective-operation frequencies, or spatial traffic patterns against other common HPC workloads (e.g., deep-learning training collectives or irregular graph analytics), yet such a comparison is load-bearing for any claim of generalizable congestion dynamics.
- [Methodology and tools extension] The description of the extended VEF framework tools outlines the congestion-characterization methodology but supplies no concrete metrics for quantifying congestion (e.g., queue-occupancy thresholds, latency-inflation factors, or link-utilization thresholds) and no validation results from the trace-based or simulation-based studies, leaving the identified scenarios without supporting evidence.
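The distributional comparison the first comment asks for could be sketched, under assumptions, as a distance between normalized message-size histograms. The binning (power-of-two buckets) and the sample workloads below are illustrative choices, not anything from the paper:

```python
import math
from collections import Counter

def size_histogram(sizes):
    """Normalized histogram of message sizes over power-of-two bins."""
    bins = Counter(int(math.log2(s)) for s in sizes)
    total = sum(bins.values())
    return {b: c / total for b, c in bins.items()}

def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms (0 = identical, 2 = disjoint)."""
    keys = set(h1) | set(h2)
    return sum(abs(h1.get(k, 0.0) - h2.get(k, 0.0)) for k in keys)

# Illustrative only: a small-message scientific workload vs. a
# bulk-transfer workload such as deep-learning allreduce payloads.
hpc = size_histogram([1024, 1024, 2048, 4096])
dl = size_histogram([1 << 20, 1 << 20, 1 << 22])
gap = histogram_distance(hpc, dl)
```

A distance near the maximum of 2 would quantify the referee's point that small-message scientific traffic and bulk deep-learning collectives occupy disjoint size regimes.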
minor comments (2)
- [Introduction] Clarify the definition and scope of 'VEF Traces framework' on first use, including any assumptions about trace fidelity to actual network behavior.
- [Results and figures] Ensure all figures showing congestion scenarios include axis labels, legends, and quantitative scales rather than qualitative descriptions alone.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and detailed comments on our manuscript. We address each major comment point by point below, providing clarifications and indicating the revisions made to strengthen the paper.
Point-by-point responses
-
Referee: [Application trace analysis section] The section describing the analyzed VEF traces from NEST, GROMACS, LAMMPS, and PATMOS: the central claim that these traces enable identification of 'potential congestion scenarios' in realistic configurations rests on an unexamined assumption of representativeness; no quantitative comparison is provided of message-size distributions, collective operation frequencies, or spatial traffic patterns against other common HPC workloads (e.g., deep-learning training collectives or irregular graph analytics), which is load-bearing for any claim of generalizable congestion dynamics.
Authors: We agree that representativeness is key to supporting broader claims about congestion dynamics. The applications (NEST, GROMACS, LAMMPS, PATMOS) were selected because they are established workloads in neuroscience, molecular dynamics, and particle transport, with communication patterns documented in prior HPC studies. However, our dataset did not include equivalent VEF traces from deep-learning or graph analytics workloads, precluding direct quantitative comparisons of message sizes, collectives, or spatial patterns. In the revised manuscript, we have added a dedicated paragraph in the application trace analysis section that discusses the selection rationale with supporting references, qualitatively contrasts the observed patterns (e.g., irregular point-to-point vs. known all-to-all in DL), and explicitly scopes our findings to scientific computing applications rather than claiming generalizability across all HPC workloads. revision: partial
-
Referee: [Methodology and tools extension] The description of the extended VEF framework tools for congestion characterization: the methodology is outlined but supplies no concrete metrics for quantifying congestion (e.g., queue occupancy thresholds, latency inflation factors, or link utilization thresholds) nor any validation results from the trace-based or simulation-based studies, leaving the identification of scenarios without supporting evidence.
Authors: We accept this critique; the original methodology section described the tool extensions at a high level without sufficient operational detail. In the revised version, we have expanded this section to define explicit congestion metrics: link utilization >75%, average queue occupancy >50 packets, and latency inflation factor >1.2 relative to baseline. We have also incorporated validation results, including direct comparisons of trace-derived indicators against simulation outputs for the four applications, confirming that the metrics reliably flag the collective-induced contention scenarios. These additions supply the quantitative grounding and evidence previously missing. revision: yes
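Applied together, the three thresholds the authors quote amount to a simple per-link predicate. The sketch below assumes a flat per-link stats record (field names are hypothetical; only the threshold values come from the rebuttal):

```python
def flag_congested_links(link_stats,
                         util_thr=0.75, queue_thr=50, infl_thr=1.2):
    """Flag links exceeding the congestion thresholds quoted in the
    revision: link utilization > 75%, mean queue occupancy > 50
    packets, latency inflation > 1.2x baseline. The stats layout
    here is an assumed, illustrative structure."""
    flagged = {}
    for link, s in link_stats.items():
        reasons = []
        if s["utilization"] > util_thr:
            reasons.append("utilization")
        if s["queue_occupancy"] > queue_thr:
            reasons.append("queue")
        if s["latency"] / s["baseline_latency"] > infl_thr:
            reasons.append("latency")
        if reasons:
            flagged[link] = reasons
    return flagged

stats = {
    "sw0->sw1": {"utilization": 0.92, "queue_occupancy": 64,
                 "latency": 3.1, "baseline_latency": 1.0},
    "sw1->sw2": {"utilization": 0.40, "queue_occupancy": 5,
                 "latency": 1.1, "baseline_latency": 1.0},
}
congested = flag_congested_links(stats)
```

Recording which of the three conditions fired, rather than a single boolean, preserves the distinction between a saturated-but-draining link and one whose queues are actually backing up.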
Circularity Check
No circularity; empirical trace analysis with no derivations or self-defined reductions
full rationale
The paper presents an empirical methodology that extends the existing VEF Traces framework to collect and analyze communication traces from external application runs (NEST, GROMACS, LAMMPS, PATMOS) on supercomputers, then identifies congestion scenarios from those traces. No mathematical derivations, equations, fitted parameters, or predictions are described that could reduce to inputs by construction. The work relies on external benchmarks and trace data rather than any self-citation chain or ansatz that would make central claims tautological. This is a standard non-circular empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the VEF Traces framework accurately captures communication patterns from HPC applications.