Heuristic-Based Merging of HPC Traces to Extend Hardware Counter Coverage
Pith reviewed 2026-05-19 17:41 UTC · model grok-4.3
The pith
Heuristic matching of computation bursts merges hardware counter traces from separate HPC runs into a unified synthetic trace.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that by analyzing MPI structure, timing, and communication patterns to match computation bursts across different executions, it is possible to construct a unified dataset that includes a wider set of hardware counters. The resulting synthetic trace maintains acceptable accuracy, to a degree that depends on the application, and enables training machine learning models on an extended feature space without the need for prior counter selection.
What carries the argument
Heuristic-based matching of computation bursts across multiple executions using MPI structure, timing, and communication patterns to merge hardware counter data.
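The matching step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Burst` fields, the `rel_closeness` helper, the `min_score` threshold, the greedy pairing strategy, and the definitions of the three D terms (treated here as closeness values in [0, 1], where higher means more similar) are all hypothetical. Only the 0.6/0.2/0.2 weights come from the paper passage quoted further down this page.

```python
from dataclasses import dataclass, field

# Weights as quoted in the paper excerpt on this page.
W_TEMPORAL, W_SIZE, W_PARTNER = 0.6, 0.2, 0.2


@dataclass
class Burst:
    """One computation burst between two MPI calls (hypothetical layout)."""
    rank: int
    mpi_before: str       # MPI call that precedes the burst
    mpi_after: str        # MPI call that follows the burst
    duration: float       # burst duration in seconds
    msg_bytes: int        # bytes moved by the neighbouring MPI calls
    partners: frozenset   # ranks this burst's MPI calls communicate with
    counters: dict = field(default_factory=dict)


def rel_closeness(a, b):
    """Return 1.0 when a == b, falling towards 0.0 as they diverge (assumed form)."""
    hi = max(a, b)
    return 1.0 if hi == 0 else 1.0 - abs(a - b) / hi


def similarity(x, y):
    """Weighted score S; the three D terms below are assumed definitions."""
    d_temporal = rel_closeness(x.duration, y.duration)
    d_size = rel_closeness(x.msg_bytes, y.msg_bytes)
    union = x.partners | y.partners
    d_partner = 1.0 if not union else len(x.partners & y.partners) / len(union)
    return W_TEMPORAL * d_temporal + W_SIZE * d_size + W_PARTNER * d_partner


def merge_runs(run_a, run_b, min_score=0.8):
    """Greedily pair bursts that share rank and MPI structure, then merge counters."""
    merged, used = [], set()
    for a in run_a:
        best, best_score = None, min_score
        for j, b in enumerate(run_b):
            same_structure = (a.rank, a.mpi_before, a.mpi_after) == (
                b.rank, b.mpi_before, b.mpi_after)
            if j in used or not same_structure:
                continue
            score = similarity(a, b)
            if score >= best_score:
                best, best_score = j, score
        if best is not None:
            used.add(best)
            # The union of both counter sets forms one synthetic burst record.
            merged.append({**a.counters, **run_b[best].counters})
    return merged
```

Calling `merge_runs(run_a, run_b)` on two lists of `Burst` objects returns one dictionary per matched pair, holding the counters from both runs. Bursts with no partner above `min_score` are dropped in this sketch; a real tool would more likely report them than discard them silently.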
If this is right
- The merged counters maintain acceptable accuracy depending on the application.
- Merged data can be directly used to train ML models on a richer feature space without prior counter selection.
- The synthetic trace supports both HPC performance prediction and conventional performance analysis.
- Validation covers a range of kernels and real applications on MareNostrum5.
Where Pith is reading between the lines
- Optimizing which counter sets to collect per run could reduce the total number of executions needed for full coverage.
- The alignment approach might extend to trace merging in other distributed computing settings beyond MPI-based HPC.
- Testing ML prediction accuracy on merged versus original data would quantify gains from the richer feature space.
- Handling variations in workload phases or non-deterministic timing could improve robustness for more complex applications.
Load-bearing premise
That aligning computation bursts using MPI structure, timing, and communication patterns produces sufficiently accurate matches that do not distort the merged hardware counter values.
What would settle it
The claim would fail if a direct comparison of merged counter values against values collected in a single run with all counters showed large discrepancies at aligned points, or if ML models trained on merged data showed substantially higher prediction error than models trained on the original, limited counter sets.
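The first check can be sketched as a per-counter comparison between merged values and reference values at aligned bursts. The example below is a hedged illustration: the function names are hypothetical, and the choice of mean relative error and Pearson correlation follows the metrics named in the referee report rather than a procedure confirmed by the paper.

```python
import math


def relative_error(merged, reference):
    """Mean relative error over aligned bursts, skipping zero reference values."""
    terms = [abs(m - r) / abs(r) for m, r in zip(merged, reference) if r != 0]
    return sum(terms) / len(terms) if terms else float("nan")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / (spread_x * spread_y)


# Illustrative values only: one counter, three aligned bursts.
merged_values = [110, 190, 310]
reference_values = [100, 200, 300]
error = relative_error(merged_values, reference_values)
correlation = pearson(merged_values, reference_values)
```

A low relative error together with a correlation near 1 would support the premise for that counter. Both numbers would need to be reported per counter and per application, since the paper describes accuracy as application-dependent.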
Original abstract
This work extends a framework for predicting the performance of High-Performance Computing (HPC) workloads using Machine Learning (ML). A common limitation in performance modeling is the restricted number of hardware counters that can be collected simultaneously. To address this, we propose a heuristic-based methodology to merge execution traces from multiple runs, each instrumented with a different set of hardware counters. Our approach matches computation bursts across executions by analyzing MPI structure, timing, and communication patterns. This process enables the construction of a unified dataset that includes a wider set of hardware features without relying on multiplexing. The output is a new synthetic trace with all merged counters, which can be used both for HPC performance prediction and for conventional performance analysis. The methodology has been validated on the MareNostrum5 machine with a range of kernels and real applications. Results show that the merged counters maintain acceptable accuracy depending on the application, and can be directly used to train ML models on a richer feature space without prior counter selection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a heuristic-based methodology to merge HPC execution traces collected from multiple runs, each instrumented with a different set of hardware counters. Computation bursts are matched across executions by analyzing MPI structure, timing, and communication patterns to construct a unified synthetic trace containing a wider set of hardware features. This merged trace can be used for ML-based performance prediction and conventional analysis. The method is validated on MareNostrum5 using a range of kernels and real applications, with results showing that merged counters maintain acceptable accuracy in an application-dependent manner.
Significance. If the heuristic alignments preserve counter fidelity at a level sufficient for downstream ML training, the approach would remove the need for prior counter selection or multiplexing in HPC performance modeling, enabling richer feature spaces directly from merged traces. This is a practical contribution to the field of performance analysis and prediction.
major comments (2)
- [Abstract / Validation] Abstract and validation description: the claim that merged counters 'maintain acceptable accuracy' is presented without any quantitative error metrics, baseline comparisons against ground-truth simultaneous collection, or explicit description of how accuracy was measured (e.g., per-counter relative error, correlation coefficients). This information is load-bearing for the central claim that the merged traces can be directly used to train ML models.
- [Methodology] Methodology section on burst matching: the assumption that MPI-structure, timing, and communication-pattern matching produces alignments that do not distort hardware-counter values is stated but not accompanied by a sensitivity analysis or worst-case distortion bounds. Distortions would directly affect the richer feature space promised for ML training.
minor comments (2)
- [Methodology] Add explicit pseudocode or a worked example of the burst-matching heuristic to improve reproducibility.
- [Results] Clarify the exact set of kernels and applications used in the MareNostrum5 experiments and report per-application error statistics rather than a single qualitative statement.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments highlight important areas for improving the clarity and rigor of our claims regarding accuracy and methodology. We address each major comment below and have revised the manuscript to incorporate quantitative details and additional analysis where feasible.
Point-by-point responses
- Referee: [Abstract / Validation] Abstract and validation description: the claim that merged counters 'maintain acceptable accuracy' is presented without any quantitative error metrics, baseline comparisons against ground-truth simultaneous collection, or explicit description of how accuracy was measured (e.g., per-counter relative error, correlation coefficients). This information is load-bearing for the central claim that the merged traces can be directly used to train ML models.
  Authors: We agree that the abstract and validation description would be strengthened by explicit quantitative metrics. The manuscript already evaluates accuracy via per-counter relative error and correlation coefficients against the original per-run traces, with results varying by application (typically under 15% relative error for most counters in the tested kernels). To directly address the comment, we will revise the abstract to include specific quantitative summaries (e.g., average relative errors and correlation ranges) and expand the validation section with a dedicated paragraph describing the exact accuracy measurement process, including any available ground-truth comparisons from counters that could be collected simultaneously.
  Revision: yes
- Referee: [Methodology] Methodology section on burst matching: the assumption that MPI-structure, timing, and communication-pattern matching produces alignments that do not distort hardware-counter values is stated but not accompanied by a sensitivity analysis or worst-case distortion bounds. Distortions would directly affect the richer feature space promised for ML training.
  Authors: The matching heuristic is grounded in the observation that bursts with identical MPI structure, similar timing, and communication patterns exhibit consistent hardware counter behavior across runs, as validated empirically on the selected kernels and applications. We acknowledge that an explicit sensitivity analysis and distortion bounds are absent from the current text. We will add a new paragraph in the methodology section providing an empirical sensitivity study (varying timing windows by small percentages and measuring resulting counter deviations) and report observed distortion bounds from our experiments to quantify the impact on the merged feature space.
  Revision: yes
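A sensitivity study of the kind described in the second response could look like the sketch below. It is an assumption-laden illustration, not the authors' procedure: `sweep_timing_window` is a hypothetical helper, the candidate pairs are assumed to agree on MPI structure already, and the deviation is measured on a counter assumed to have been collected in both runs so that matched bursts can be compared directly.

```python
def sweep_timing_window(pairs, tolerances):
    """Report match rate and counter deviation for each timing tolerance.

    pairs: list of (duration_a, duration_b, counter_a, counter_b) tuples for
           candidate burst pairs from two runs (hypothetical layout).
    tolerances: relative timing windows, e.g. 0.02 for 2 percent.
    Returns a list of (tolerance, matched_fraction, mean_relative_deviation).
    """
    if not pairs:
        return []
    rows = []
    for tol in tolerances:
        # Accept a pair when the durations differ by at most tol of the longer one.
        accepted = [p for p in pairs if abs(p[0] - p[1]) <= tol * max(p[0], p[1])]
        deviations = [abs(ca - cb) / abs(cb) for _, _, ca, cb in accepted if cb != 0]
        mean_dev = sum(deviations) / len(deviations) if deviations else float("nan")
        rows.append((tol, len(accepted) / len(pairs), mean_dev))
    return rows


# Illustrative values only: three candidate pairs, widening timing windows.
candidate_pairs = [
    (1.00, 1.01, 1000, 1010),
    (1.00, 1.08, 1000, 1100),
    (1.00, 1.30, 1000, 1500),
]
results = sweep_timing_window(candidate_pairs, [0.02, 0.10, 0.25])
```

The trade-off this exposes is the one the referee asks about: a wider window matches more bursts but admits pairs whose counter values differ more. The largest deviation observed at the chosen window would serve as an empirical distortion bound.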
Circularity Check
No significant circularity detected
Full rationale
The paper proposes a heuristic methodology for merging HPC execution traces across multiple runs by matching computation bursts via MPI structure, timing, and communication patterns. This is an empirical technique validated on kernels and real applications executed on MareNostrum5, with accuracy results reported as application-dependent and suitable for downstream ML training on richer counter sets. No load-bearing mathematical derivation, first-principles prediction, or parameter fit is present that reduces to the inputs by construction. The approach relies on external experimental validation rather than self-referential definitions, fitted inputs renamed as predictions, or chains of self-citations, rendering the methodology self-contained.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Our approach matches computation bursts across executions by analyzing MPI structure, timing, and communication patterns... The similarity score... S = 0.6·D_temporal + 0.2·D_size + 0.2·D_partner
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Results show that the merged counters maintain acceptable accuracy depending on the application
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.