CARM Tool: Cache-Aware Roofline Model Automatic Benchmarking and Application Analysis

Aleksandar Ilic; Jos\'e Morgado; Leonel Sousa

arxiv: 2605.29740 · v1 · pith:7JZFKPSGnew · submitted 2026-05-28 · 💻 cs.DC

CARM Tool: Cache-Aware Roofline Model Automatic Benchmarking and Application Analysis

Jos\'e Morgado , Leonel Sousa , Aleksandar Ilic This is my paper

Pith reviewed 2026-06-29 05:39 UTC · model grok-4.3

classification 💻 cs.DC

keywords Cache-Aware Roofline Modelautomatic benchmarkingperformance modelingCPU architecturesapplication analysismicrobenchmarksISA extensionsHPC optimization

0 comments

The pith

An automated tool constructs accurate Cache-Aware Roofline Models for Intel, AMD, ARM, and RISC-V CPUs using custom microbenchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Complex modern CPUs make it hard to optimize applications for peak performance. The Cache-Aware Roofline Model visualizes bottlenecks but previously lacked automatic construction tools for several major architectures. This work introduces the CARM Tool, which deploys tailored assembly microbenchmarks to measure roofs across compute units and memory levels on x86, ARM, and RISC-V systems, then adds performance-counter and binary-instrumentation support for application analysis. Experiments confirm the resulting roofs stay within 1 percent of known architectural maximums. Developers can therefore obtain consistent optimization guidance on a broader set of platforms.

Core claim

The paper presents the CARM Tool, an automated framework that builds Cache-Aware Roofline Models by executing architecture-specific assembly microbenchmarks covering scalar through all supported vector ISA extensions for both computational units and every level of the memory hierarchy, while integrating application characterization via performance counters and dynamic binary instrumentation; the constructed roofs deviate less than 1 percent from independently verified architectural maximums across tested systems.

What carries the argument

Architecture-specific assembly microbenchmarks that saturate computational units and memory hierarchy levels, executed inside an automated benchmarking and analysis framework to populate CARM roofs.

If this is right

CARM optimization guidance becomes available for AMD, ARM, and RISC-V CPUs where no comparable automated tools existed.
Application analysis is integrated directly into the CARM framework through performance counters and dynamic binary instrumentation.
Roof values remain within 1 percent of architectural maximums across the tested systems.
Microbenchmarks span the full spectrum from scalar to all supported vector ISA extensions for each architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same microbenchmark approach could be applied to future ISA extensions or new CPU designs as they appear.
Coupling the tool with compiler feedback loops might allow automatic code transformations guided by the generated roofs.
Comparison of roofs across vendors could highlight systematic differences in memory hierarchy behavior.

Load-bearing premise

The custom assembly microbenchmarks fully saturate every computational unit and memory level without hidden bottlenecks or measurement artifacts.

What would settle it

Independent measurement of a known peak performance on any tested architecture that differs by more than 1 percent from the roof produced by the tool's microbenchmarks.

Figures

Figures reproduced from arXiv: 2605.29740 by Aleksandar Ilic, Jos\'e Morgado, Leonel Sousa.

**Figure 2.** Figure 2: CARM tool modules included and supported. III. CARM TOOL: HIGH-LEVEL OVERVIEW The proposed CARM tool comprises a set of independent modules to provide a complete CARM-based profiling ecosystem as shown in [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: CARM Tool GUI Overview or compilation from source, and Intel SDE through its current release. The proposed CARM tool also incorporates specific profiling capabilities, which include Region of Interest (ROI) code profiling. This functionality is facilitated by a header file which contains the API functions for the ROI instrumentation code (carm roi start(), carm roi end()) necessary for ROI DBI application… view at source ↗

**Figure 5.** Figure 5: Memory curve benchmark results IPC. To verify this the FP CARM benchmarks were executed for both AVX512 and Scalar on Venus, in this case, results showed IPC counts of 1.88 and 1.98 for AVX512 and scalar respectively, indicating that while not perfectly achievable we are able to get close to the theoretical 2 FP IPC, especially for the scalar instructions. On Cara, two FP IPC were accurately reached, match… view at source ↗

**Figure 6.** Figure 6: Cara mixed benchmark results CPU to reach near the limits set by the CARM benchmarking. The Zen3 CPU with an optimal ratio of two loads per store, will execute various mixed benchmarks ranging from 0.0417 to 0.25 AI for addition (green dots) and 0.0833 to 0.5 AI for FMA instructions (blue dots). Mixed benchmarks targeting the L1 cache incrementally increased the FP instruction ratio to a maximum of 12 per… view at source ↗

**Figure 8.** Figure 8: , compared with the CARM Tool results (in black) using a two-load-per-store ratio on the memory benchmarks. Since Intel Advisor also tests the single precision arithmetic FP performance of AVX512 by default, these lines were also included in the CARM graph for comparison. As can be observed in [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗

**Figure 9.** Figure 9: ERT and CARM Tool Venus AVX512 results differences in performance of the Eigen SpMV implementation in these distinct architectures. The results of this application analysis using the hugetrace-00020 [4] matrix can be seen in [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗

read the original abstract

In recent years, HPC systems and CPU architectures as their central components, have become increasingly complex, making application development and optimization quite challenging. In this respect, intuitive performance models like the Cache-aware Roofline Model (CARM) offer effective guidance by providing insights into bottlenecks that limit the application's ability to reach the system's maximum performance. To fully exploit the benefits of CARM optimization guidance for application development, automatic tools for cross-architecture model construction and in-depth application characterization are absolutely essential. Given a plethora of existing CPU architectures, the current landscape of CARM-enabled tools covers either vendor-specific (Intel Advisor), not sufficiently developed (ARM) or simply non-existing (AMD, RISC-V) tools. This is a particular gap that this work intends to close by bringing automatic CARM support to all major CPU architectures and ISAs, i.e., x86 (Intel, AMD), ARM, and RISC-V, by developing assembly microbenchmarks specifically tailored to cover a full performance spectrum of modern CPUs (from scalar to all supported vector ISA extensions) for both computational units and all memory hierarchy levels. Additionally, this work integrates application analysis within the CARM framework using performance counters and dynamic binary instrumentation. Experimental results show that the CARM roofs constructed with the proposed automated framework provide less than a 1% deviation across various tested architectural maximums.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper automates CARM roofline construction for AMD, ARM, and RISC-V via custom assembly microbenchmarks and adds analysis hooks, filling a tooling gap even if the core model is not new.

read the letter

The paper's main contribution is a set of architecture-specific assembly microbenchmarks that let users automatically build Cache-Aware Roofline Models on AMD, ARM, and RISC-V, plus an integrated way to characterize applications with performance counters and dynamic binary instrumentation. Prior tools were limited to Intel or had weak coverage on the other ISAs, so this closes a practical hole for people who need roofline guidance on those platforms.

It does the job of covering scalar through vector extensions and all cache levels in one framework, and the integration with application analysis is a sensible addition that could save users time.

The soft spot is the reported <1% deviation from architectural maximums. That number rests entirely on the microbenchmarks actually saturating every unit and memory level without contention or measurement artifacts. The abstract gives no cross-checks against vendor peak numbers or separate counter-based saturation tests, so if any benchmark falls short on a particular ISA extension the roofs would be understated and the error claim would not hold. Minor details like how the dynamic instrumentation interacts with the roofs are also left for the full text to clarify.

This is for HPC performance engineers and tool builders who work on non-Intel CPUs and want ready-made microbenchmarks rather than a theoretical advance. A reader who needs to model ARM or RISC-V code would find the benchmarks and framework worth examining.

It deserves peer review because the tooling fills a real gap and the claims are testable once the validation data are shown.

Referee Report

2 major / 1 minor

Summary. The paper presents the CARM Tool, an automated framework for constructing Cache-Aware Roofline Models across x86 (Intel, AMD), ARM, and RISC-V CPUs. It develops custom assembly microbenchmarks tailored to scalar through vector ISA extensions and all memory hierarchy levels, then integrates application characterization via performance counters and dynamic binary instrumentation. The central empirical claim is that roofs built with this framework deviate by less than 1% from the tested architectures' maximums.

Significance. If the saturation accuracy of the microbenchmarks is independently verified, the work would fill a genuine tooling gap by delivering open CARM support for non-Intel ISAs, directly aiding performance analysis and optimization on a broad range of modern CPUs. The multi-architecture microbenchmark approach and integration of dynamic instrumentation are practical strengths that could see adoption if the <1% result is shown to be robust.

major comments (2)

[Abstract] Abstract (and results section): the quantitative claim of <1% deviation from architectural maximums is load-bearing for the paper's contribution, yet the manuscript supplies no description of the independent reference values used for the maximums, no error analysis, and no saturation verification procedure (e.g., performance-counter utilization rates or comparison against vendor peak FLOPS/bandwidth specifications).
[Microbenchmark design] Microbenchmark design section: the central assumption that the custom assembly kernels saturate every computational unit and cache level without hidden contention or measurement artifacts is not supported by any cross-validation experiments; if saturation is incomplete for any ISA extension, the reported roofs would systematically understate the true maxima and the <1% figure would not hold.

minor comments (1)

[Abstract] The first sentence of the abstract contains an awkward phrasing ("HPC systems and CPU architectures as their central components") that could be clarified for readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential value of the CARM Tool for multi-architecture support. We address each major comment below and commit to revisions that directly strengthen the empirical claims.

read point-by-point responses

Referee: [Abstract] Abstract (and results section): the quantitative claim of <1% deviation from architectural maximums is load-bearing for the paper's contribution, yet the manuscript supplies no description of the independent reference values used for the maximums, no error analysis, and no saturation verification procedure (e.g., performance-counter utilization rates or comparison against vendor peak FLOPS/bandwidth specifications).

Authors: We agree that the manuscript currently lacks explicit documentation of the reference values, error analysis, and saturation verification. In the revised version we will add a new subsection (in both the methods and results) that (i) lists the independent reference values (vendor peak FLOPS/bandwidth specifications together with our own microbenchmark-derived maxima), (ii) reports the error analysis performed, and (iii) details the saturation verification procedure, including performance-counter utilization rates and direct comparisons against vendor peaks for each ISA extension and memory level. These additions will make the <1% claim fully traceable. revision: yes
Referee: [Microbenchmark design] Microbenchmark design section: the central assumption that the custom assembly kernels saturate every computational unit and cache level without hidden contention or measurement artifacts is not supported by any cross-validation experiments; if saturation is incomplete for any ISA extension, the reported roofs would systematically understate the true maxima and the <1% figure would not hold.

Authors: We acknowledge that the current manuscript does not present explicit cross-validation experiments for saturation. We will extend the microbenchmark design section with additional validation results: achieved versus vendor peak FLOPS/bandwidth for every supported ISA extension on each architecture, together with performance-counter utilization data demonstrating that computational units and memory hierarchy levels reach saturation without measurable contention. These experiments will be reported for all four ISAs (Intel, AMD, ARM, RISC-V). revision: yes

Circularity Check

0 steps flagged

No circularity; empirical benchmarking results are independent of fitted inputs or self-citations

full rationale

The paper describes development of custom assembly microbenchmarks to automatically construct CARM roofs for multiple ISAs and architectures, followed by experimental measurement of <1% deviation from architectural maxima. No equations, derivations, or predictions are presented that reduce to inputs by construction; the central claim rests on direct empirical comparison rather than self-referential fitting, renaming, or load-bearing self-citations. The methodology is self-contained against external architectural specifications and performance measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated premise that microbenchmarks can be written to saturate all relevant hardware units.

pith-pipeline@v0.9.1-grok · 5778 in / 1083 out tokens · 18077 ms · 2026-06-29T05:39:36.870064+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 3 canonical work pages

[1]

(2024) Amd uprof user guide

AMD. (2024) Amd uprof user guide. [Online]. Avail- able: https://www.amd.com/content/dam/amd/en/documents/developer/ uprof-v4.0-gaGA-user-guide.pdf

2024
[2]

Aergia: leveraging heterogeneity in federated learning systems,

B. Cox, L. Y . Chen, and J. Decouchant, “Aergia: leveraging heterogeneity in federated learning systems,” inProceedings of the 23rd ACM/IFIP International Middleware Conference, ser. Middleware ’22. New York, NY , USA: Association for Computing Machinery, 2022, p. 107–120. [Online]. Available: https://doi.org/10.1145/3528535.3565238

work page doi:10.1145/3528535.3565238 2022
[3]

Reducing the bandwidth of sparse symmetric matrices,

E. Cuthill and J. McKee, “Reducing the bandwidth of sparse symmetric matrices,” inProceedings of the 1969 24th National Conference, ser. ACM ’69. New York, NY , USA: Association for Computing Machinery, 1969, p. 157–172. [Online]. Available: https://doi.org/10.1145/800195.805928

work page doi:10.1145/800195.805928 1969
[4]

T. Davis. Sparse matrix collection. Accessed on 5th October 2023. [Online]. Available: https://sparse.tamu.edu/

2023
[5]

The new linux’perf’tools,

A. C. De Melo, “The new linux’perf’tools,” inSlides from Linux Kongress, vol. 18, 2010, pp. 1–42

2010
[6]

An instruction roofline model for gpus,

N. Ding and S. Williams, “An instruction roofline model for gpus,” in 2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 2019, pp. 7–18

2019
[7]

(2024) Dynamorio webpage

DynamoRIO. (2024) Dynamorio webpage. [Online]. Available: https: //dynamorio.org/

2024
[8]

(2024) Eigen library main page

Eigen. (2024) Eigen library main page. [Online]. Available: https: //eigen.tuxfamily.org/index.php?title=Main Page

2024
[9]

Cache-aware roofline model: Upgrading the loft,

A. Ilic, F. Pratas, and L. Sousa, “Cache-aware roofline model: Upgrading the loft,”IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 21–24, 2013

2013
[10]

(2023) Intel advisor carm overview

Intel. (2023) Intel advisor carm overview. [Online]. Available: https://www.intel.com/content/www/us/en/developer/articles/ technical/integrated-roofline-model-with-intel-advisor.html

2023
[11]

(2024) Intel 64 and ia-32 architectures optimization reference manual volume 1

——. (2024) Intel 64 and ia-32 architectures optimization reference manual volume 1. [Online]. Available: https://www.intel.com/content/www/us/en/content-details/671488/ intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1. html

2024
[12]

(2024) Intel sde overview

——. (2024) Intel sde overview. [Online]. Avail- able: https://www.intel.com/content/www/us/en/developer/articles/tool/ software-development-emulator.html

2024
[13]

A novel multi-level integrated roofline model approach for performance characterization,

T. Koskela, Z. Matveev, C. Yang, A. Adedoyin, R. Belenov, P. Thierry, Z. Zhao, R. Gayatri, H. Shan, L. Oliker, J. Deslippe, R. Green, and S. Williams, “A novel multi-level integrated roofline model approach for performance characterization,” inHigh Performance Computing, R. Yokota, M. Weiland, D. Keyes, and C. Trinitis, Eds. Cham: Springer International P...

2018
[14]

B. Lab. (2024) Empirical roofline tool bitbucket repository. [Online]. Available: https://bitbucket.org/berkeleylab/cs-roofline-toolkit/ src/master/

2024
[15]

Ai-enabling workloads on large-scale gpu-accelerated system: Characterization, op- portunities, and implications,

B. Li, R. Arora, S. Samsi, T. Patel, W. Arcand, D. Bestor, C. Byun, R. B. Roy, B. Bergeron, J. Holodnak, M. Houle, M. Hubbell, M. Jones, J. Kepner, A. Klein, P. Michaleas, J. McDonald, L. Milechin, J. Mullen, A. Prout, B. Price, A. Reuther, A. Rosa, M. Weiss, C. Yee, D. Edelman, A. Vanterpool, A. Cheng, V . Gadepally, and D. Tiwari, “Ai-enabling workloads...

2022
[16]

Application-driven cache-aware roofline model,

D. Marques, A. Ilic, Z. A. Matveev, and L. Sousa, “Application-driven cache-aware roofline model,”Future Generation Computer Systems, vol. 107, pp. 257–273, 2020

2020
[17]

Papi: A portable interface to hardware performance counters,

P. Mucci, S. Moore, C. Deane, and G. Ho, “Papi: A portable interface to hardware performance counters,” 01 1999

1999
[18]

(2024) Xuantie c910-c920 usermanual

Sophgo. (2024) Xuantie c910-c920 usermanual. [Online]. Avail- able: https://github.com/sophgo/sophgo-doc/blob/main/SG2042/T-Head/ XuanTie-C910-C920-UserManual.pdf

2024
[19]

Likwid: A lightweight performance-oriented tool suite for x86 multicore environments,

J. Treibig, G. Hager, and G. Wellein, “Likwid: A lightweight performance-oriented tool suite for x86 multicore environments,” in 2010 39th International Conference on Parallel Processing Workshops, 2010, pp. 207–216

2010
[20]

Roofline: an insightful visual performance model for multicore architectures,

S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,”Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009

2009
[21]

High-performance computing environment: A review of twenty years of experiments in china,

Z. Xu, X. Chi, and N. Xiao, “High-performance computing environment: A review of twenty years of experiments in china,”National Science Review, vol. 3, p. nww001, 01 2016

2016
[22]

An empirical roofline methodology for quantitatively assessing per- formance portability,

C. Yang, R. Gayatri, T. Kurth, P. Basu, Z. Ronaghi, A. Adetokunbo, B. Friesen, B. Cook, D. Doerfler, L. Oliker, J. Deslippe, and S. Williams, “An empirical roofline methodology for quantitatively assessing per- formance portability,” in2018 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), 2018, pp. 14–23. APPEND...

work page doi:10.5281/zenodo.12805280 2018
[23]

Users need to clone this repository for the artifact evaluation

How to access:The tool can be accessed via its GitHub repository. Users need to clone this repository for the artifact evaluation
[24]

Hardware dependencies:A system that contains an x86- 64 CPU (Intel Skylake-X) is ideal for reproducing most results from the paper, however, other CPUs can be used, in which case an A VX-512 capable CPU allows for more comparable results
[25]

Software dependencies:The tool has been mostly tested under Linux Ubuntu or Cent OS, however, any Linux distri- bution should also work. For the tool itself, to generate SVG memory curve graphs the following Python packages are required: plotly; numpy; For the Graphical User Interface some form of browser is required and the following Python packages: das...

[1] [1]

(2024) Amd uprof user guide

AMD. (2024) Amd uprof user guide. [Online]. Avail- able: https://www.amd.com/content/dam/amd/en/documents/developer/ uprof-v4.0-gaGA-user-guide.pdf

2024

[2] [2]

Aergia: leveraging heterogeneity in federated learning systems,

B. Cox, L. Y . Chen, and J. Decouchant, “Aergia: leveraging heterogeneity in federated learning systems,” inProceedings of the 23rd ACM/IFIP International Middleware Conference, ser. Middleware ’22. New York, NY , USA: Association for Computing Machinery, 2022, p. 107–120. [Online]. Available: https://doi.org/10.1145/3528535.3565238

work page doi:10.1145/3528535.3565238 2022

[3] [3]

Reducing the bandwidth of sparse symmetric matrices,

E. Cuthill and J. McKee, “Reducing the bandwidth of sparse symmetric matrices,” inProceedings of the 1969 24th National Conference, ser. ACM ’69. New York, NY , USA: Association for Computing Machinery, 1969, p. 157–172. [Online]. Available: https://doi.org/10.1145/800195.805928

work page doi:10.1145/800195.805928 1969

[4] [4]

T. Davis. Sparse matrix collection. Accessed on 5th October 2023. [Online]. Available: https://sparse.tamu.edu/

2023

[5] [5]

The new linux’perf’tools,

A. C. De Melo, “The new linux’perf’tools,” inSlides from Linux Kongress, vol. 18, 2010, pp. 1–42

2010

[6] [6]

An instruction roofline model for gpus,

N. Ding and S. Williams, “An instruction roofline model for gpus,” in 2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 2019, pp. 7–18

2019

[7] [7]

(2024) Dynamorio webpage

DynamoRIO. (2024) Dynamorio webpage. [Online]. Available: https: //dynamorio.org/

2024

[8] [8]

(2024) Eigen library main page

Eigen. (2024) Eigen library main page. [Online]. Available: https: //eigen.tuxfamily.org/index.php?title=Main Page

2024

[9] [9]

Cache-aware roofline model: Upgrading the loft,

A. Ilic, F. Pratas, and L. Sousa, “Cache-aware roofline model: Upgrading the loft,”IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 21–24, 2013

2013

[10] [10]

(2023) Intel advisor carm overview

Intel. (2023) Intel advisor carm overview. [Online]. Available: https://www.intel.com/content/www/us/en/developer/articles/ technical/integrated-roofline-model-with-intel-advisor.html

2023

[11] [11]

(2024) Intel 64 and ia-32 architectures optimization reference manual volume 1

——. (2024) Intel 64 and ia-32 architectures optimization reference manual volume 1. [Online]. Available: https://www.intel.com/content/www/us/en/content-details/671488/ intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1. html

2024

[12] [12]

(2024) Intel sde overview

——. (2024) Intel sde overview. [Online]. Avail- able: https://www.intel.com/content/www/us/en/developer/articles/tool/ software-development-emulator.html

2024

[13] [13]

A novel multi-level integrated roofline model approach for performance characterization,

T. Koskela, Z. Matveev, C. Yang, A. Adedoyin, R. Belenov, P. Thierry, Z. Zhao, R. Gayatri, H. Shan, L. Oliker, J. Deslippe, R. Green, and S. Williams, “A novel multi-level integrated roofline model approach for performance characterization,” inHigh Performance Computing, R. Yokota, M. Weiland, D. Keyes, and C. Trinitis, Eds. Cham: Springer International P...

2018

[14] [14]

B. Lab. (2024) Empirical roofline tool bitbucket repository. [Online]. Available: https://bitbucket.org/berkeleylab/cs-roofline-toolkit/ src/master/

2024

[15] [15]

Ai-enabling workloads on large-scale gpu-accelerated system: Characterization, op- portunities, and implications,

B. Li, R. Arora, S. Samsi, T. Patel, W. Arcand, D. Bestor, C. Byun, R. B. Roy, B. Bergeron, J. Holodnak, M. Houle, M. Hubbell, M. Jones, J. Kepner, A. Klein, P. Michaleas, J. McDonald, L. Milechin, J. Mullen, A. Prout, B. Price, A. Reuther, A. Rosa, M. Weiss, C. Yee, D. Edelman, A. Vanterpool, A. Cheng, V . Gadepally, and D. Tiwari, “Ai-enabling workloads...

2022

[16] [16]

Application-driven cache-aware roofline model,

D. Marques, A. Ilic, Z. A. Matveev, and L. Sousa, “Application-driven cache-aware roofline model,”Future Generation Computer Systems, vol. 107, pp. 257–273, 2020

2020

[17] [17]

Papi: A portable interface to hardware performance counters,

P. Mucci, S. Moore, C. Deane, and G. Ho, “Papi: A portable interface to hardware performance counters,” 01 1999

1999

[18] [18]

(2024) Xuantie c910-c920 usermanual

Sophgo. (2024) Xuantie c910-c920 usermanual. [Online]. Avail- able: https://github.com/sophgo/sophgo-doc/blob/main/SG2042/T-Head/ XuanTie-C910-C920-UserManual.pdf

2024

[19] [19]

Likwid: A lightweight performance-oriented tool suite for x86 multicore environments,

J. Treibig, G. Hager, and G. Wellein, “Likwid: A lightweight performance-oriented tool suite for x86 multicore environments,” in 2010 39th International Conference on Parallel Processing Workshops, 2010, pp. 207–216

2010

[20] [20]

Roofline: an insightful visual performance model for multicore architectures,

S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,”Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009

2009

[21] [21]

High-performance computing environment: A review of twenty years of experiments in china,

Z. Xu, X. Chi, and N. Xiao, “High-performance computing environment: A review of twenty years of experiments in china,”National Science Review, vol. 3, p. nww001, 01 2016

2016

[22] [22]

An empirical roofline methodology for quantitatively assessing per- formance portability,

C. Yang, R. Gayatri, T. Kurth, P. Basu, Z. Ronaghi, A. Adetokunbo, B. Friesen, B. Cook, D. Doerfler, L. Oliker, J. Deslippe, and S. Williams, “An empirical roofline methodology for quantitatively assessing per- formance portability,” in2018 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), 2018, pp. 14–23. APPEND...

work page doi:10.5281/zenodo.12805280 2018

[23] [23]

Users need to clone this repository for the artifact evaluation

How to access:The tool can be accessed via its GitHub repository. Users need to clone this repository for the artifact evaluation

[24] [24]

Hardware dependencies:A system that contains an x86- 64 CPU (Intel Skylake-X) is ideal for reproducing most results from the paper, however, other CPUs can be used, in which case an A VX-512 capable CPU allows for more comparable results

[25] [25]

Software dependencies:The tool has been mostly tested under Linux Ubuntu or Cent OS, however, any Linux distri- bution should also work. For the tool itself, to generate SVG memory curve graphs the following Python packages are required: plotly; numpy; For the Graphical User Interface some form of browser is required and the following Python packages: das...