pith. sign in

arxiv: 2606.15490 · v2 · pith:V2IARYV7new · submitted 2026-06-13 · 💻 cs.DC · astro-ph.IM

Is RISC-V Ready for Massively Parallel Astrophysical Codes?

Pith reviewed 2026-06-27 03:33 UTC · model grok-4.3

classification 💻 cs.DC astro-ph.IM
keywords RISC-Vastrophysical codesperformance evaluationportabilitymemory bandwidthvectorizationMPI OpenMPparallel computing
0
0 comments X

The pith

RISC-V runs three major astrophysical codes correctly but three to six times slower than x86.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks three established astrophysical production codes on a RISC-V processor and compares results to x86 and ARM systems. It establishes that the codes produce identical numerical outputs across all platforms, confirming portability, while documenting consistent performance gaps. The gaps trace to hardware limits on memory bandwidth, cache sharing, vector width, and clock speed plus weaker compiler vectorization. A reader would care because the results show what concrete improvements in hardware and software would be needed for an open architecture to handle demanding parallel scientific workloads.

Core claim

The central claim is that the three codes (iPIC3D for memory-bound work, PLUTO for compute-bound work, and OpenGadget3 for hybrid work) execute with full numerical correctness on RISC-V, yet exhibit slowdowns of roughly 3-6 times versus x86 and 5-9 times versus ARM. The performance difference is driven primarily by limited memory bandwidth that saturates early, contention in shared caches, 128-bit vector units, lower clock frequency, and less mature auto-vectorization in the GNU compiler. Memory-bound kernels suffer most at high thread counts, while hybrid MPI+OpenMP runs reach their best results at intermediate thread counts that balance memory pressure against communication cost. The autho

What carries the argument

Direct performance comparison of memory-bound, compute-bound, and hybrid astrophysical codes across RISC-V, x86, and ARM platforms, with measurements isolating bandwidth saturation, cache contention, vector-unit width, clock frequency, and compiler auto-vectorization effects.

If this is right

  • Memory-bound kernels reach bandwidth limits earliest and therefore show the largest scalability drop at high thread counts.
  • Hybrid MPI+OpenMP configurations achieve peak performance only at intermediate thread counts where memory contention and communication overhead are balanced.
  • Compiler improvements focused on auto-vectorization would reduce the performance gap more than hardware changes alone.
  • No source-code modifications are required for numerical portability across the three architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other memory-intensive scientific domains such as computational fluid dynamics or climate modeling would likely encounter similar bandwidth-driven slowdowns on current RISC-V hardware.
  • RISC-V designs that prioritize higher memory bandwidth and wider vector units could close most of the gap without requiring application rewrites.
  • Compiler teams working on RISC-V support could use these three codes as regression tests to measure progress in vectorization quality.

Load-bearing premise

The three tested codes and the specific RISC-V, x86, and ARM configurations used are representative of the memory and compute demands typical in massively parallel astrophysical applications.

What would settle it

A repeat of the same three codes on a later RISC-V system that doubles sustained memory bandwidth, widens vector units to 256 bits, raises clock frequency, and uses a compiler with mature auto-vectorization, yielding performance within 1.5 times of the x86 baseline, would falsify the claim that the observed gaps are structural.

Figures

Figures reproduced from arXiv: 2606.15490 by Andrea Bartolini, Elisabetta Boella, Emanuele Venieri, Federico Ficarelli, Geray S. Karademir, Giacomo Madella, Jenny Lynn Almerol, Nitin Shukla.

Figure 1
Figure 1. Figure 1: Magnitude of the magnetic field on the simulation plane (shown in color) with field lines overlaid in white, during an iPIC3D simulation of the GEM reconnection challenge (values are normalized to reference quantities). Left: run performed on SG2044. Right: run performed on GH200. 1  1 [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: iPIC3D strong-scaling results for a 256 × 256 × 1 grid and 9216 particles per cell on x86, ARM, and RISC-V platforms. Left: wall clock time versus MPI rank count (lower is better). Right: parallel efficiency normalized to the single-rank execution time on each platform. of the RISC-V node, and equivalently configured on the ARM and x86 nodes for cross-architecture comparison. The baseline iPIC3D applicatio… view at source ↗
Figure 3
Figure 3. Figure 3: iPIC3D wall clock time breakdown by kernel phase across three particle memory layouts (sorted AoS, unsorted AoS, unsorted SoA) on ARM, x86, and RISC-V at 64 MPI ranks [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: PLUTO strong-scaling results for the Orszag–Tang vortex test case on the same platforms of [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: OpenGadget3 per-kernel wall clock time breakdown for the production cosmological test case on RISC-V, ARM, and x86 across various MPI+OpenMP configurations over four steps across 10 runs [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
read the original abstract

We present a performance and portability evaluation of three well-established astrophysical production codes, namely iPIC3D, PLUTO, and OpenGadget3, on a Sophgo SG2044 RISC-V processor (part of the Monte Cimone cluster), with comparisons to AMD EPYC 9554 (x86) and NVIDIA GH200 Grace (ARM) systems. These applications represent memory-bound, compute-bound, and hybrid workloads, respectively. Numerical correctness is verified across all platforms, confirming portability. RISC-V shows consistently lower performance, with slowdowns of about $3-6\times$ relative to x86 and $5-9\times$ relative to ARM. The gap is mainly due to limited memory bandwidth, shared cache constraints, narrower 128-bit vector units, and lower clock frequency, but also less-mature auto-vectorization capability of the GNU compiler suite. Memory-bound kernels are the most affected, where early bandwidth saturation and L2 cache contention reduce scalability at higher thread counts. Hybrid MPI+OpenMP configurations reveal a trade-off between memory contention and communication overhead, with intermediate configurations achieving the best performance. These results suggest that RISC-V is capable of supporting scientific workloads; however, additional improvements in both hardware and compiler technology, particularly in auto-vectorization, are required to achieve competitive performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates portability and performance of three astrophysical production codes (iPIC3D, PLUTO, OpenGadget3) representing memory-bound, compute-bound, and hybrid workloads on a Sophgo SG2044 RISC-V processor, with direct comparisons to AMD EPYC 9554 (x86) and NVIDIA GH200 Grace (ARM) systems. It reports verified numerical correctness across platforms and consistent slowdowns on RISC-V of 3-6× versus x86 and 5-9× versus ARM, attributes the gaps primarily to limited memory bandwidth, shared L2 cache, 128-bit vectors, clock frequency, and GNU compiler auto-vectorization maturity, and notes that memory-bound kernels suffer most from early bandwidth saturation and cache contention while hybrid MPI+OpenMP shows configuration trade-offs.

Significance. If the reported timings and causal attributions are substantiated, the work supplies concrete empirical benchmarks for RISC-V in production scientific HPC, quantifying scalability limits in representative astrophysical kernels and identifying concrete hardware/compiler targets (vectorization, bandwidth) whose improvement would directly benefit memory-bound and hybrid codes.

major comments (2)
  1. [Abstract] Abstract: the claim that observed slowdowns are 'mainly due to' limited memory bandwidth, shared cache constraints, 128-bit vector units, clock frequency, and GNU auto-vectorization maturity is stated without citation to isolating measurements (performance counters, achieved bandwidth, vectorization reports, or controlled microbenchmarks) from the three codes; the attribution therefore rests on correlation with published hardware specifications rather than run-time data.
  2. [Results] Results section (implied by abstract description of thread scaling and MPI+OpenMP trade-offs): the absence of error bars, repeated-run statistics, or full profiling data (e.g., memory bandwidth utilization, cache miss rates) prevents verification that the listed factors dominate over other variables such as memory-controller implementation or library versions.
minor comments (2)
  1. [Methods] Methods: the manuscript should supply the exact compiler flags, library versions, and thread/MPI binding policies used on each platform to allow reproduction of the reported timings.
  2. [Figures] Figures: scaling plots would benefit from explicit indication of the number of runs and any observed variability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on strengthening the empirical basis for our performance attributions and on improving statistical rigor. We address each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that observed slowdowns are 'mainly due to' limited memory bandwidth, shared cache constraints, 128-bit vector units, clock frequency, and GNU auto-vectorization maturity is stated without citation to isolating measurements (performance counters, achieved bandwidth, vectorization reports, or controlled microbenchmarks) from the three codes; the attribution therefore rests on correlation with published hardware specifications rather than run-time data.

    Authors: We agree that the abstract's phrasing attributes the slowdowns to specific factors without direct isolating measurements such as performance counters or vectorization reports from the application runs. Our conclusions draw from the observed differential scaling (memory-bound kernels saturate earliest) together with the documented hardware characteristics of the SG2044 (shared L2, 128-bit vectors, bandwidth limits) and the known state of GNU auto-vectorization. We will revise the abstract to qualify the attribution as 'primarily inferred from workload-specific scaling behavior and published hardware specifications' and will add a short discussion of this inference basis in the results section. If the Monte Cimone cluster provides accessible counters, we will include a limited set of bandwidth and cache-miss figures in a revision; otherwise the language will remain qualified. revision: partial

  2. Referee: [Results] Results section (implied by abstract description of thread scaling and MPI+OpenMP trade-offs): the absence of error bars, repeated-run statistics, or full profiling data (e.g., memory bandwidth utilization, cache miss rates) prevents verification that the listed factors dominate over other variables such as memory-controller implementation or library versions.

    Authors: We acknowledge that the presented results lack error bars, repeated-run statistics, and detailed profiling data, which limits independent verification that bandwidth and cache effects dominate other possible variables. The reported timings reflect single representative executions, although we observed stable behavior across preliminary runs. We will add error bars derived from at least three repeated executions per configuration, include a brief statement on run-to-run variability, and incorporate any available high-level profiling output (e.g., reported memory bandwidth from the codes) to support the causal discussion. These additions will be placed in the results section. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with no derivations or fitted predictions

full rationale

The paper consists of direct runtime measurements of three astrophysical codes on RISC-V, x86, and ARM hardware, with numerical correctness checks and performance comparisons. No equations, parameters fitted to subsets of data, predictions, or first-principles derivations appear in the provided text. Attributions to hardware factors (bandwidth, cache, vector width, frequency, compiler) are interpretive statements correlated with known specs and observed scaling behavior, but they do not reduce any claimed result to its own inputs by construction. Self-citations are absent from the abstract and central claims. This is a standard empirical evaluation whose central results stand independently of any internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark study with no free parameters, axioms, or invented entities; all claims rest on direct runtime measurements.

pith-pipeline@v0.9.1-grok · 5824 in / 1184 out tokens · 92033 ms · 2026-06-27T03:33:22.524425+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

34 extracted references · 20 canonical work pages

  1. [1]

    In: Proceedings of the SC ’25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis

    Almerol, J.L., et al.: Accelerating gravitational N-body simulations using the RISC-V-based Tenstorrent Wormhole. In: Proceedings of the SC ’25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 1729–1735. Association for Computing Machinery, New York, NY, USA (2025). https://doi.org/10.1145/373...

  2. [2]

    Astronomy and Computing p

    Almerol, J.L., et al.: Assessing performance and porting strategies for gravitational n-body sim- ulations on the risc-v-based tenstorrent wormhole. Astronomy and Computing p. 101121 (2026). https://doi.org/https://doi.org/10.1016/j.ascom.2026.101121

  3. [3]

    Nature 324(6096), 446–449 (1986)

    Barnes, J., Hut, P.: A hierarchical o(n log n) force-calculation algorithm. Nature 324(6096), 446–449 (1986). https://doi.org/10.1038/324446a0

  4. [4]

    In: 2022 IEEE 35th International System-on-Chip Conference (SOCC)

    Bartolini, A., et al.: Monte cimone: paving the road for the first generation of risc-v high-performance computers. In: 2022 IEEE 35th International System-on-Chip Conference (SOCC). pp. 1–6. IEEE (2022)

  5. [5]

    Journal of Computational Physics 499, 112701 (2024)

    Berta, V., et al.: A fourth-order accurate finite volume method for ideal classical and special relativis- tic mhd based on pointwise reconstructions. Journal of Computational Physics 499, 112701 (2024). https://doi.org/10.1016/j.jcp.2023.112701

  6. [6]

    Journal of Geophysical Research: Space Physics 106(A3), 3715–3719 (2001)

    Birn, J., et al.: Geospace environmental modeling (gem) magnetic re- connection challenge. Journal of Geophysical Research: Space Physics 106(A3), 3715–3719 (2001). https://doi.org/https://doi.org/10.1029/1999JA900449, https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/1999JA900449

  7. [7]

    Boella, E., et al.: Accelerating the particle-in-cell code ecsim with openacc (2026), https://arxiv.org/abs/2603.16624

  8. [8]

    arXiv preprint arXiv:2307.14774 (2023)

    Bramas, B.: Spc5: An efficient spmv framework vectorized using arm sve and x86 avx-512. arXiv preprint arXiv:2307.14774 (2023)

  9. [10]

    In: ISC High Performance 2024 Workshops

    Brown, N., Jamieson, M.: Performance characterisation of the 64-core sg2042 risc-v cpu for hpc. In: ISC High Performance 2024 Workshops. Lecture Notes in Computer Science, vol. 15058. Springer (2025)

  10. [11]

    In: Proceedings of the SC’23 Workshops (2023)

    Brown, N., et al.: Initial rajaperf evaluation on sophon sg2042. In: Proceedings of the SC’23 Workshops (2023)

  11. [12]

    Brown, N.: Is risc-v ready for high performance computing? an evaluation of the sophon sg2044 (2025), https://arxiv.org/abs/2508.13840

  12. [13]

    arXiv preprint arXiv:2309.06530 (2023)

    Diehl, P., Daiß, G., et al.: Evaluating hpx and kokkos on risc-v using an astrophysics application (octo- tiger). arXiv preprint arXiv:2309.06530 (2023)

  13. [14]

    arXiv preprint arXiv:2407.00026 (2024)

    Diehl, P., et al.: Preparing for hpc on risc-v: Examining vectorization and distributed perfor- mance of an astrophysics application with hpx and kokkos. arXiv preprint arXiv:2407.00026 (2024). https://doi.org/10.48550/arXiv.2407.00026

  14. [15]

    ACM Transactions on Quantum Computing 6(3), 1–46 (2025)

    Elsharkawy, A., et al.: Integration of quantum accelerators with high performance computinga review of quantum programming tools. ACM Transactions on Quantum Computing 6(3), 1–46 (2025)

  15. [16]

    Monthly Notices of the Royal Astronomical Society 526(1), 616–644 (2023)

    Groth, F., et al.: The cosmological simulation code opengadget3: Implementation of mesh- less finite mass. Monthly Notices of the Royal Astronomical Society 526(1), 616–644 (2023). https://doi.org/10.1093/mnras/stad2717

  16. [17]

    Hennessy, J.L., Patterson, D.A.: A new golden age for computer architecture. Commun. ACM 62(2), 4860 (Jan 2019). https://doi.org/10.1145/3282307, https://doi.org/10.1145/3282307

  17. [18]

    Mathematics and Computers in Simu- lation 80(7), 1509–1519 (2010)

    Markidis, S., et al.: Multi-scale simulations of plasma with ipic3d. Mathematics and Computers in Simu- lation 80(7), 1509–1519 (2010). https://doi.org/10.1016/j.matcom.2009.08.038

  18. [19]

    The Astrophysical Journal Supplement Series 170(1), 228–242 (2007)

    Mignone, A., et al.: Pluto: A numerical code for computational astrophysics. The Astrophysical Journal Supplement Series 170(1), 228–242 (2007). https://doi.org/10.1086/513421 14 J. L. Almerol et al

  19. [20]

    Journal of Computational Physics 229, 5896–5920 (2010)

    Mignone, A., et al.: High-order conservative finite difference glm-mhd schemes for cell-centered mhd. Journal of Computational Physics 229, 5896–5920 (2010). https://doi.org/10.1016/j.jcp.2010.04.013

  20. [21]

    Ristov, S., et al.: Superlinear speedup in hpc systems: Why and when? In: 2016 federated conference on computer science and information systems (fedcsis). pp. 889–898. IEEE (2016)

  21. [22]

    In: 2025 IEEE In- ternational Conference on Quantum Computing and Engineering (QCE)

    Rocco, R., et al.: Dynamic solutions for hybrid quantum-hpc resource allocation. In: 2025 IEEE In- ternational Conference on Quantum Computing and Engineering (QCE). vol. 02, pp. 34–40 (2025). https://doi.org/10.1109/QCE65121.2025.10289

  22. [23]

    Journal of Parallel and Distributed Computing 205, 105156 (2025)

    Rocco, R., et al.: To repair or not to repair: Assessing fault resilience in mpi stencil applications. Journal of Parallel and Distributed Computing 205, 105156 (2025)

  23. [24]

    Astronomy and Computing 55, 101076 (2026)

    Rossazza, M., et al.: The pluto code on gpus: A first look at eulerian mhd methods. Astronomy and Computing 55, 101076 (2026). https://doi.org/10.1016/j.parco.2016.01.001

  24. [25]

    In: Proceedings of the 22nd ACM International Conference on Comput- ing Frontiers Workshops and Special Sessions

    Shukla, N., et al.: Eurohpc space coe: Redesigning scalable parallel astrophysical codes for ex- ascale (invited paper). In: Proceedings of the 22nd ACM International Conference on Comput- ing Frontiers Workshops and Special Sessions. pp. 177–184. CF ’25 Companion, ACM (Jul 2025). https://doi.org/10.1145/3706594.3728892

  25. [26]

    Procedia Computer Science pp

    Shukla, N., et al.: Towards exascale computing for astrophysical simulation leveraging the leonardo eurohpc system. Procedia Computer Science pp. 112–123 (2025). https://doi.org/10.1016/j.procs.2025.08.238267

  26. [27]

    Nature Astronomy (2026)

    Shukla, N., et al.: Exascale computing to accelerate discoveries in astrophysics and space plasma physics. Nature Astronomy (2026). https://doi.org/10.1038/s41550-026-02807-8

  27. [28]

    ACM Computing Surveys 57(11), 1–39 (2025)

    Silvano, C., et al.: A survey on deep learning hardware accelerators for heterogeneous hpc platforms. ACM Computing Surveys 57(11), 1–39 (2025)

  28. [29]

    In: SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis

    Simsek, O.S., et al.: Increasing energy efficiency of astrophysics simulations through gpu frequency scaling. In: SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 1826–1834. IEEE (2024)

  29. [30]

    Sishtla, C., et al.: Multi-GPU Acceleration of the iPIC3D Implicit Particle-in-Cell Code, p. 612618. Springer International Publishing (2019)

  30. [31]

    Astronomy and Computing 55, 101088 (2026)

    Suriano, A., et al.: The pluto code on gpus: Offloading lagrangian particle methods. Astronomy and Computing 55, 101088 (2026). https://doi.org/https://doi.org/10.1016/j.ascom.2026.101088

  31. [32]

    In: Neuwirth, S., Paul, A.K., Weinzierl, T., Carson, E.C

    Venieri, E., et al.: Monte cimone v2: Hpc risc-v cluster evaluation and optimization. In: Neuwirth, S., Paul, A.K., Weinzierl, T., Carson, E.C. (eds.) High Performance Computing. pp. 576–585. Springer Nature Switzerland, Cham (2026)

  32. [33]

    In: Neuwirth, S., et al

    Venieri, E., Manoni, S., et al.: Monte cimone v2: Hpc risc-v cluster evaluation and optimization. In: Neuwirth, S., et al. (eds.) High Performance Computing, Lecture Notes in Computer Science, vol. 16091, pp. 576–585. Springer, Cham (2026). https://doi.org/10.1007/978-3-032-07612-0_44

  33. [34]

    In: 2025 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W)

    Viviani, P., et al.: Assessing the elephant in the room in scheduling for current hybrid hpc-qc clusters. In: 2025 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W). pp. 184–187 (2025). https://doi.org/10.1109/DSN-W65791.2025.00059

  34. [35]

    In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’23) (2023)

    Williams, J.J., et al.: Characterizing the performance of the implicit massively parallel particle-in-cell ipic3d code. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’23) (2023). https://doi.org/10.48550/arXiv.2408.01983, arXiv:2408.01983