Heuristic-Based Merging of HPC Traces to Extend Hardware Counter Coverage

Fabio Banchelli; Júlia Orteu Aubach; Marc Clascà Ramírez; Marta Garcia-Gasulla

arxiv: 2605.15832 · v1 · pith:VKCT2VHKnew · submitted 2026-05-15 · 💻 cs.PF · cs.LG

Pith reviewed 2026-05-19 17:41 UTC · model grok-4.3

keywords: HPC performance modeling · hardware counters · trace merging · machine learning · MPI traces · synthetic performance data · performance prediction · computation burst alignment

The pith

Heuristic matching of computation bursts merges hardware counter traces from separate HPC runs into a unified synthetic trace.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes a method to overcome the limit on simultaneous hardware counter collection in HPC systems by merging traces from multiple executions. Each run collects a different set of counters, and the approach aligns matching computation bursts using MPI structure, timing, and communication patterns. The result is a single synthetic trace containing a broader set of hardware features. This matters for ML-based performance modeling because it provides a richer feature space without requiring counter selection or multiplexing, which can introduce inaccuracies. The method was tested on real applications and kernels on MareNostrum5, showing that merged counters maintain acceptable accuracy depending on the workload.

Core claim

The paper establishes that by analyzing MPI structure, timing, and communication patterns to match computation bursts across different executions, it is possible to construct a unified dataset that includes a wider set of hardware counters. This synthetic trace maintains acceptable accuracy for the application and enables training machine learning models on an extended feature space without the need for prior counter selection.

What carries the argument

Heuristic-based matching of computation bursts across multiple executions using MPI structure, timing, and communication patterns to merge hardware counter data.
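A minimal sketch of what such a matching could look like, not the authors' implementation. It assumes each burst is summarized by its MPI rank, the MPI call that ends it, a start time normalized to the run length, the bytes moved by that call, and the set of partner ranks. The `Burst` fields, the three component similarities, the greedy one-to-one pairing, and the 0.8 threshold are all hypothetical. The 0.6/0.2/0.2 weights follow the similarity score quoted in the Lean theorems section of this page, with each D term read here as a similarity in [0, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Burst:
    rank: int            # MPI rank that executed the burst
    next_call: str       # MPI call that ends the burst (structural key)
    start: float         # start time, normalized to [0, 1] of the run
    msg_bytes: int       # bytes moved by the MPI call that follows
    partners: frozenset  # ranks this burst communicates with


def ratio_similarity(a: float, b: float) -> float:
    """Return 1.0 for equal values, approaching 0.0 as they diverge."""
    hi = max(a, b)
    return 1.0 if hi == 0 else min(a, b) / hi


def jaccard(a: frozenset, b: frozenset) -> float:
    """Overlap of two partner sets; two empty sets count as identical."""
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)


def score(x: Burst, y: Burst) -> float:
    # Bursts with different MPI structure are never candidates.
    if x.rank != y.rank or x.next_call != y.next_call:
        return 0.0
    d_temporal = 1.0 - abs(x.start - y.start)
    d_size = ratio_similarity(x.msg_bytes, y.msg_bytes)
    d_partner = jaccard(x.partners, y.partners)
    return 0.6 * d_temporal + 0.2 * d_size + 0.2 * d_partner


def match_bursts(run_a, run_b, threshold=0.8):
    """Greedy one-to-one matching: best-scoring pairs are taken first."""
    candidates = sorted(
        ((score(a, b), i, j)
         for i, a in enumerate(run_a)
         for j, b in enumerate(run_b)),
        reverse=True,
    )
    used_a, used_b, pairs = set(), set(), {}
    for s, i, j in candidates:
        if s < threshold:
            break  # remaining candidates are all weaker
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        pairs[i] = j
    return pairs
```

Greedy pairing is the simplest choice; an optimal assignment (for example the Hungarian algorithm) would be a drop-in alternative when many bursts have similar scores.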

If this is right

The merged counters maintain acceptable accuracy depending on the application.
Merged data can be directly used to train ML models on a richer feature space without prior counter selection.
The synthetic trace supports both HPC performance prediction and conventional performance analysis.
Validation covers a range of kernels and real applications on MareNostrum5.
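To illustrate the second point: once bursts are paired, building the richer feature space amounts to a union of counter columns. The sketch below assumes each run is a list of per-burst dictionaries keyed by PAPI preset names and that `pairs` maps burst indices in run A to burst indices in run B. The data layout, the function name, and the rule that run A wins for counters present in both runs are illustrative assumptions, not details taken from the paper.

```python
def merge_runs(run_a, run_b, pairs):
    """Build one feature row per matched burst pair.

    run_a, run_b: lists of {counter_name: value} dicts, one per burst.
    pairs: {index_in_run_a: index_in_run_b} for matched bursts.
    Unmatched bursts are dropped.
    """
    merged = []
    for i, j in sorted(pairs.items()):
        row = dict(run_b[j])   # start from the counters seen in run B
        row.update(run_a[i])   # run A overrides counters present in both
        merged.append(row)
    return merged


run_a = [{"PAPI_TOT_INS": 1000, "PAPI_TOT_CYC": 500}]
run_b = [{"PAPI_TOT_INS": 1010, "PAPI_L1_DCM": 20}]
rows = merge_runs(run_a, run_b, {0: 0})
```

Counters that appear in both runs (here `PAPI_TOT_INS`) are useful beyond the merge itself, since the disagreement between the two values gives a rough check on how well the bursts were aligned.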

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Optimizing which counter sets to collect per run could reduce the total number of executions needed for full coverage.
The alignment approach might extend to trace merging in other distributed computing settings beyond MPI-based HPC.
Testing ML prediction accuracy on merged versus original data would quantify gains from the richer feature space.
Handling variations in workload phases or non-deterministic timing could improve robustness for more complex applications.
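The first point can be made concrete with a small planning helper. It assumes the hardware exposes `slots` programmable counters per run and that a few `shared` counters are repeated in every run as anchors for alignment checks. Both parameters and the simple sequential grouping are assumptions for illustration; real PMUs impose compatibility constraints between events that this sketch ignores.

```python
import math


def plan_counter_sets(counters, slots, shared=()):
    """Split counters into per-run groups of at most `slots` events.

    Counters listed in `shared` are collected in every run; the rest
    are distributed in order over the remaining free slots.
    """
    free = slots - len(shared)
    if free <= 0:
        raise ValueError("shared counters leave no free slots")
    rest = [c for c in counters if c not in shared]
    n_runs = math.ceil(len(rest) / free)
    return [
        list(shared) + rest[k * free:(k + 1) * free]
        for k in range(n_runs)
    ]
```

With five counters, three slots, and one shared counter, four counters remain for two free slots per run, so two executions are enough.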

Load-bearing premise

That aligning computation bursts using MPI structure, timing, and communication patterns produces sufficiently accurate matches that do not distort the merged hardware counter values.

What would settle it

The claim would be undermined if a direct comparison of merged counter values against values collected in a single run with all counters showed large discrepancies at aligned points, or if ML models trained on merged data showed substantially higher prediction error than models trained on the original limited counters.
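A sketch of how such a comparison could be scored, assuming two equally long lists of per-burst values for one counter: the value that ended up in the merged trace and a reference value for the same burst. The function names are hypothetical, and the choice of relative error and Pearson correlation mirrors the metrics named in the referee report below rather than anything confirmed from the paper.

```python
import math


def relative_errors(merged, reference):
    """Per-burst relative error; bursts with a zero reference are skipped."""
    return [abs(m - r) / abs(r) for m, r in zip(merged, reference) if r != 0]


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


merged = [110, 190, 300]
reference = [100, 200, 300]
errors = relative_errors(merged, reference)
mean_error = sum(errors) / len(errors)
```

Reporting both numbers per counter is useful because they fail differently: a systematic offset keeps the correlation high while the relative error grows, and a few badly aligned bursts do the opposite.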

Original abstract

This work extends a framework for predicting the performance of High-Performance Computing (HPC) workloads using Machine Learning (ML). A common limitation in performance modeling is the restricted number of hardware counters that can be collected simultaneously. To address this, we propose a heuristic-based methodology to merge execution traces from multiple runs, each instrumented with a different set of hardware counters. Our approach matches computation bursts across executions by analyzing MPI structure, timing, and communication patterns. This process enables the construction of a unified dataset that includes a wider set of hardware features without relying on multiplexing. The output is a new synthetic trace with all merged counters, which can be used both for HPC performance prediction and for conventional performance analysis. The methodology has been validated on the MareNostrum5 machine with a range of kernels and real applications. Results show that the merged counters maintain acceptable accuracy depending on the application, and can be directly used to train ML models on a richer feature space without prior counter selection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a practical heuristic for merging HPC traces across runs so that the merged trace covers more hardware counters than a single run can collect, but the validation evidence stays light on numbers and comparisons.

The main thing here is a heuristic that aligns computation bursts from separate executions by matching MPI structure, timing, and communication patterns. This produces a single synthetic trace with a wider set of hardware counters than any one run can capture. The goal is to feed richer data into ML performance models without multiplexing overhead, and they tested it on MareNostrum5 across kernels and real applications. They note upfront that accuracy varies by workload, which keeps the claim grounded rather than overstated.

Referee Report

2 major / 2 minor

Summary. The paper proposes a heuristic-based methodology to merge HPC execution traces collected from multiple runs, each instrumented with a different set of hardware counters. Computation bursts are matched across executions by analyzing MPI structure, timing, and communication patterns to construct a unified synthetic trace containing a wider set of hardware features. This merged trace can be used for ML-based performance prediction and conventional analysis. The method is validated on MareNostrum5 using a range of kernels and real applications, with results showing that merged counters maintain acceptable accuracy in an application-dependent manner.

Significance. If the heuristic alignments preserve counter fidelity at a level sufficient for downstream ML training, the approach would remove the need for prior counter selection or multiplexing in HPC performance modeling, enabling richer feature spaces directly from merged traces. This is a practical contribution to the field of performance analysis and prediction.

major comments (2)

[Abstract / Validation] Abstract and validation description: the claim that merged counters 'maintain acceptable accuracy' is presented without any quantitative error metrics, baseline comparisons against ground-truth simultaneous collection, or explicit description of how accuracy was measured (e.g., per-counter relative error, correlation coefficients). This information is load-bearing for the central claim that the merged traces can be directly used to train ML models.
[Methodology] Methodology section on burst matching: the assumption that MPI-structure, timing, and communication-pattern matching produces alignments that do not distort hardware-counter values is stated but not accompanied by a sensitivity analysis or worst-case distortion bounds. Distortions would directly affect the richer feature space promised for ML training.

minor comments (2)

[Methodology] Add explicit pseudocode or a worked example of the burst-matching heuristic to improve reproducibility.
[Results] Clarify the exact set of kernels and applications used in the MareNostrum5 experiments and report per-application error statistics rather than a single qualitative statement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important areas for improving the clarity and rigor of our claims regarding accuracy and methodology. We address each major comment below and have revised the manuscript to incorporate quantitative details and additional analysis where feasible.

Referee: [Abstract / Validation] Abstract and validation description: the claim that merged counters 'maintain acceptable accuracy' is presented without any quantitative error metrics, baseline comparisons against ground-truth simultaneous collection, or explicit description of how accuracy was measured (e.g., per-counter relative error, correlation coefficients). This information is load-bearing for the central claim that the merged traces can be directly used to train ML models.

Authors: We agree that the abstract and validation description would be strengthened by explicit quantitative metrics. The manuscript already evaluates accuracy via per-counter relative error and correlation coefficients against the original per-run traces, with results varying by application (typically under 15% relative error for most counters in the tested kernels). To directly address the comment, we will revise the abstract to include specific quantitative summaries (e.g., average relative errors and correlation ranges) and expand the validation section with a dedicated paragraph describing the exact accuracy measurement process, including any available ground-truth comparisons from counters that could be collected simultaneously. revision: yes
Referee: [Methodology] Methodology section on burst matching: the assumption that MPI-structure, timing, and communication-pattern matching produces alignments that do not distort hardware-counter values is stated but not accompanied by a sensitivity analysis or worst-case distortion bounds. Distortions would directly affect the richer feature space promised for ML training.

Authors: The matching heuristic is grounded in the observation that bursts with identical MPI structure, similar timing, and communication patterns exhibit consistent hardware counter behavior across runs, as validated empirically on the selected kernels and applications. We acknowledge that an explicit sensitivity analysis and distortion bounds are absent from the current text. We will add a new paragraph in the methodology section providing an empirical sensitivity study (varying timing windows by small percentages and measuring resulting counter deviations) and report observed distortion bounds from our experiments to quantify the impact on the merged feature space. revision: yes
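The sensitivity study described in this response could take roughly the following shape. The sketch assumes each run is a list of `(start_time, counter_value)` tuples for a counter recorded in both runs, pairs bursts by nearest start time within a tolerance window, and then perturbs that window by a few percent. The pairing rule, the function names, and the perturbation steps are hypothetical and stand in for whatever heuristic the authors actually use.

```python
def match_within_window(run_a, run_b, window):
    """Pair each burst in run_a with the closest unused burst in run_b
    by start time, provided the gap does not exceed `window`."""
    pairs, used = [], set()
    for i, (t_a, _) in enumerate(run_a):
        best = None
        for j, (t_b, _) in enumerate(run_b):
            if j in used:
                continue
            gap = abs(t_a - t_b)
            if gap <= window and (best is None or gap < best[0]):
                best = (gap, j)
        if best is not None:
            used.add(best[1])
            pairs.append((i, best[1]))
    return pairs


def sweep(run_a, run_b, base_window, deltas=(-0.10, -0.05, 0.0, 0.05, 0.10)):
    """For each relative change of the window, report the number of
    matched bursts and the mean relative deviation of the shared counter
    (None if nothing matched)."""
    results = {}
    for delta in deltas:
        window = base_window * (1 + delta)
        pairs = match_within_window(run_a, run_b, window)
        devs = [
            abs(run_a[i][1] - run_b[j][1]) / run_b[j][1]
            for i, j in pairs
            if run_b[j][1] != 0
        ]
        mean_dev = sum(devs) / len(devs) if devs else None
        results[delta] = (len(pairs), mean_dev)
    return results


run_a = [(0.0, 100), (1.0, 200)]
run_b = [(0.02, 105), (1.097, 210)]
report = sweep(run_a, run_b, base_window=0.1)
```

A heuristic that is robust should show the match count and the mean deviation changing smoothly across the sweep; abrupt jumps point to bursts that sit right at the edge of the window.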

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes a heuristic methodology for merging HPC execution traces across multiple runs by matching computation bursts via MPI structure, timing, and communication patterns. This is an empirical technique validated on kernels and real applications executed on MareNostrum5, with accuracy results reported as application-dependent and suitable for downstream ML training on richer counter sets. No load-bearing mathematical derivation, first-principles prediction, or parameter fit is present that reduces to the inputs by construction. The approach relies on external experimental validation rather than self-referential definitions, fitted inputs renamed as predictions, or chains of self-citations, rendering the methodology self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no free parameters, axioms, or invented entities; the method relies on standard MPI structures, timing data, and existing trace analysis concepts.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Our approach matches computation bursts across executions by analyzing MPI structure, timing, and communication patterns... The similarity score... S = 0.6·D_temporal + 0.2·D_size + 0.2·D_partner"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Results show that the merged counters maintain acceptable accuracy depending on the application"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
