arxiv: 2605.22831 · v1 · pith:2WJNGJPCnew · submitted 2026-04-22 · 💻 cs.DC

Monte Cimone v3: Where RISC-V Stands in High-Performance Computing

Pith reviewed 2026-05-25 00:22 UTC · model grok-4.3

classification 💻 cs.DC
keywords RISC-V · High-Performance Computing · Monte Cimone · SG2044 · Energy Efficiency · HPL · STREAM · Cluster Benchmarking

The pith

The SG2044 RISC-V processor in Monte Cimone v3 more than doubles single-core performance and delivers 3.08 GFLOPs/W efficiency comparable to x86 and Arm servers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Monte Cimone v3, the latest RISC-V HPC cluster built around the SOPHGO Sophon SG2044 processor as an evolution of the SG2042 used in the prior version. It evaluates the system with the HPL and STREAM benchmarks plus power measurements, directly comparing results to an Intel Xeon Platinum 8480+ Sapphire Rapids server and an NVIDIA Grace CPU Superchip. The central finding is that single-core performance has more than doubled, scalability has improved, and energy efficiency has reached 3.08 GFLOPs/W—a tenfold gain over the first Monte Cimone cluster—while normalized vector performance at peak efficiency reaches 46 percent of the Intel platform and 91 percent of the NVIDIA platform.

Core claim

The SG2044 more than doubles single-core performance and improves scalability compared to the SG2042. MCv3 achieves an energy efficiency of 3.08 GFLOPs/W, a 10x improvement over MCv1 and within the range of x86-64 and Arm servers. In raw performance normalized to SIMD/vector length, MCv3 at its peak-efficiency point (16 cores) achieves 46% of the performance of the Intel Sapphire Rapids server and 91% of the performance of the NVIDIA Grace CPU Superchip.

What carries the argument

The Monte Cimone v3 cluster built on the SOPHGO Sophon SG2044 processor, measured via HPL and STREAM benchmarks paired with power instrumentation for cross-architecture comparison.

If this is right

  • RISC-V processors can now deliver energy efficiency in the same band as established x86-64 and Arm server CPUs on dense linear algebra and memory bandwidth workloads.
  • Single-core gains from SG2042 to SG2044 translate into better cluster scalability under the same HPL and STREAM conditions.
  • When normalized to vector length, RISC-V reaches over 90 percent of NVIDIA Grace performance at the 16-core efficiency point.
  • Iterative hardware generations in open testbeds like Monte Cimone can close the absolute performance gap with proprietary server chips.
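
The vector-length normalization behind the 46 percent and 91 percent figures can be illustrated with a minimal sketch. The paper's exact normalization procedure is not reproduced on this page, so the code assumes one plausible reading: throughput is scaled linearly by SIMD width in bits. The function names, the 128-bit reference width, and every number in the usage lines are hypothetical placeholders, not values from the paper.

```python
def normalize_by_vector_width(gflops: float, vector_bits: int,
                              reference_bits: int = 128) -> float:
    """Scale throughput to a common SIMD width (assumed linear scaling)."""
    if vector_bits <= 0 or reference_bits <= 0:
        raise ValueError("vector widths must be positive")
    return gflops * reference_bits / vector_bits


def normalized_ratio(cand_gflops: float, cand_bits: int,
                     ref_gflops: float, ref_bits: int) -> float:
    """Candidate performance as a fraction of the reference after normalization."""
    cand = normalize_by_vector_width(cand_gflops, cand_bits)
    ref = normalize_by_vector_width(ref_gflops, ref_bits)
    return cand / ref


# Placeholder inputs only: a 128-bit machine at 100 GFLOPs
# against a 512-bit machine at 400 GFLOPs.
raw_ratio = 100.0 / 400.0
norm_ratio = normalized_ratio(100.0, 128, 400.0, 512)
```

With these placeholder inputs the raw ratio is one quarter, while the normalized ratio is parity. This is why a normalized figure says more about per-lane efficiency than about absolute throughput.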

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the reported efficiency holds under production compilers and larger node counts, RISC-V clusters could become practical for power-limited HPC installations.
  • The normalization step highlights that remaining gaps are largely in vector unit width and software maturity rather than fundamental architectural inefficiency.
  • Extending the same measurement protocol to additional RISC-V chips would create a public performance trajectory that vendors could target.

Load-bearing premise

The HPL and STREAM benchmarks together with the chosen power measurement methodology provide a representative and architecture-fair comparison of HPC-relevant performance and efficiency across the RISC-V, x86, and Arm platforms tested.

What would settle it

Running the same HPL and STREAM workloads on the SG2044 hardware with an independent power meter or an expanded benchmark suite that produces efficiency or normalized performance figures outside the reported ranges relative to the Intel and NVIDIA references.

Figures

Figures reproduced from arXiv: 2605.22831 by Andrea Bartolini, Emanuele Venieri, Federico Ficarelli, Federico Proverbio, Giacomo Madella, Luca Benini, Simone Manoni.

Figure 1: Comprehensive view of Monte Cimone v3. The SLURM partition Peak includes the two SG2044 nodes, while the Blade partition refers to the MCv2 compute nodes based on SG2042.
Figure 2: STREAM Triad bandwidth scaling on SG2044 with different OpenMP thread pinning strategies compared against MCv2 and MCv1 nodes.
Figure 3: Cross-architecture comparison of STREAM Triad bandwidth (GB/s), scaling with the number of OpenMP threads (OMP_NUM_THREADS), for the SG2044, the NVIDIA Grace CPU Superchip, and the Intel Xeon Platinum 8480+.
Figure 4: HPL performance comparison scaling with the number of MPI processes.
Original abstract

The Monte Cimone project provides a RISC-V testbed for High-Performacne Computing cluster. This paper presents Monte Cimone v3 (MCv3), the third iteration of the Monte Cimone RISC-V HPC cluster, integrating the SOPHGO Sophon SG2044 processor, an evolution of the SG2042 used in MCv2. We characterize MCv3 using HPL and STREAM benchmarks coupled with power measurements, and compare it against two reference platforms: the Intel Xeon Platinum 8480+(Sapphire Rapids) and the NVIDIA Grace CPU Superchip. Our results show that the SG2044 more than doubles single-core performance and improves scalability compared to SG2042. MCv3 achieves an energy efficiency of 3.08GFLOPs/W which improves of 10x w.r.t. MCv1 and is in the range of x86-64 and Arm servers. On pure performance when normalized on the SIMD/Vector length MCv3 on its peak efficiency point (16 cores) achieves 46% performance of Intel Sapphire Rapids server and 91% performance of NVIDIA Grace CPU superchip.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents Monte Cimone v3, the third iteration of a RISC-V HPC cluster testbed now using the SOPHGO SG2044 processor (successor to the SG2042 in MCv2). It reports HPL and STREAM benchmark results together with power measurements, claiming that SG2044 more than doubles single-core performance and improves scalability versus SG2042, that MCv3 reaches 3.08 GFLOPs/W (10× improvement over MCv1 and comparable to x86-64/Arm servers), and that, after SIMD/vector-length normalization, MCv3 at its 16-core peak-efficiency point attains 46 % of an Intel Sapphire Rapids server and 91 % of an NVIDIA Grace CPU superchip.

Significance. If the cross-platform measurement protocols prove architecture-fair, the work supplies concrete empirical data on RISC-V HPC progress, documenting a substantial efficiency gain and relative performance standings that can inform both hardware development and procurement decisions.

major comments (3)
  1. [Methods / Experimental Setup] The power-measurement methodology (instrumentation, sampling rate, and measurement point—wall, package, or node) is not described with sufficient detail to verify that the reported 3.08 GFLOPs/W efficiency and the 10× improvement claim rest on comparable quantities across MCv1–MCv3 and the Intel/Arm reference platforms.
  2. [Results / Benchmark Configuration] The normalized performance ratios (46 % of Sapphire Rapids, 91 % of Grace) presuppose equivalent optimization effort for HPL and STREAM on all three architectures; the manuscript does not report compiler flags, vector-extension usage, or problem-size choices that would allow an independent assessment of optimization parity.
  3. [Results] Error bars, run-to-run variability, or statistical justification for the single-core doubling and scalability claims versus SG2042 are absent, making it impossible to judge whether the reported gains exceed measurement uncertainty.
minor comments (2)
  1. [Abstract] Abstract contains the typo “High-Performacne”.
  2. [Figures] Figure captions and axis labels should explicitly state the power-measurement domain (e.g., “CPU package power”) to avoid reader misinterpretation.
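
Major comment 1 turns on how a sampled power trace becomes a single efficiency number. The following is a minimal sketch of one common approach, not the authors' method: integrate the trace with the trapezoidal rule to get energy, then divide the floating-point work done by that energy. The function names and the sample trace are hypothetical. Whether the samples are taken at the wall, the node, or the CPU package changes the result without changing the code, which is the ambiguity the referee asks the authors to resolve.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Trapezoidal integration of a sampled power trace (watts over seconds)."""
    if len(timestamps_s) != len(power_w) or len(timestamps_s) < 2:
        raise ValueError("need at least two matching samples")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def gflops_per_watt(total_gflop: float, timestamps_s: Sequence[float],
                    power_w: Sequence[float]) -> float:
    """Efficiency as work per unit energy; GFLOP per joule equals GFLOPs/W."""
    return total_gflop / energy_joules(timestamps_s, power_w)


# Hypothetical trace: five samples one second apart, not measured data.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
p = [100.0, 120.0, 120.0, 120.0, 100.0]
joules = energy_joules(t, p)
efficiency = gflops_per_watt(920.0, t, p)
```

A coarse sampling rate can miss short power excursions, so two platforms sampled at different rates are not strictly comparable even when the measurement point is the same.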

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments that highlight areas where additional detail will strengthen the manuscript. We address each major comment below and commit to revisions that provide the requested information without altering the core claims.

  1. Referee: [Methods / Experimental Setup] The power-measurement methodology (instrumentation, sampling rate, and measurement point—wall, package, or node) is not described with sufficient detail to verify that the reported 3.08 GFLOPs/W efficiency and the 10× improvement claim rest on comparable quantities across MCv1–MCv3 and the Intel/Arm reference platforms.

    Authors: We agree that the power-measurement methodology requires more detail to support the efficiency numbers and cross-platform comparisons. In the revised manuscript we will add a dedicated subsection specifying the instrumentation (power meters or on-board sensors), sampling rates, and measurement points (wall, package, or node) used for MCv1–MCv3 as well as the Intel Sapphire Rapids and NVIDIA Grace platforms. revision: yes

  2. Referee: [Results / Benchmark Configuration] The normalized performance ratios (46 % of Sapphire Rapids, 91 % of Grace) presuppose equivalent optimization effort for HPL and STREAM on all three architectures; the manuscript does not report compiler flags, vector-extension usage, or problem-size choices that would allow an independent assessment of optimization parity.

    Authors: The observation is correct: the manuscript currently omits compiler flags, vector-extension details, and problem-size choices. We will revise the Results section to report these parameters explicitly for each architecture and benchmark, enabling readers to evaluate optimization parity for the normalized 46 % and 91 % figures. revision: yes

  3. Referee: [Results] Error bars, run-to-run variability, or statistical justification for the single-core doubling and scalability claims versus SG2042 are absent, making it impossible to judge whether the reported gains exceed measurement uncertainty.

    Authors: We acknowledge that the current text lacks error bars or run-to-run statistics. The revised manuscript will include multiple-run data, standard deviations or error bars, and a brief statistical justification for the single-core performance doubling and scalability improvements relative to SG2042. revision: yes
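
The multiple-run statistics promised in response 3 could be summarized along the following lines. This is a hedged sketch under stated assumptions: runs are independent, the speedup is taken as a ratio of means, and its uncertainty is propagated to first order from the standard errors. The function name and all run values are hypothetical and do not come from the paper.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def speedup_with_uncertainty(new_runs: Sequence[float],
                             old_runs: Sequence[float]) -> Tuple[float, float]:
    """Return (ratio of means, first-order propagated standard error)."""
    if len(new_runs) < 2 or len(old_runs) < 2:
        raise ValueError("need at least two runs per platform")
    m_new, m_old = mean(new_runs), mean(old_runs)
    # Standard error of each mean from the sample standard deviation.
    se_new = stdev(new_runs) / sqrt(len(new_runs))
    se_old = stdev(old_runs) / sqrt(len(old_runs))
    ratio = m_new / m_old
    # Relative errors add in quadrature for a quotient.
    rel_err = sqrt((se_new / m_new) ** 2 + (se_old / m_old) ** 2)
    return ratio, ratio * rel_err


# Hypothetical single-core scores in arbitrary units, not measured data.
old = [1.0, 1.1, 0.9]
new = [2.2, 2.4, 2.3]
ratio, err = speedup_with_uncertainty(new, old)
```

A claim such as "more than doubles" is supported when the ratio minus a chosen multiple of the propagated error still exceeds 2.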

Circularity Check

0 steps flagged

No circularity: empirical benchmark results with no derivations or self-referential reductions

The paper reports direct empirical measurements from HPL and STREAM benchmarks plus power figures on SG2044, SG2042, Sapphire Rapids, and Grace platforms. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the abstract or described claims. All reported values (e.g., 3.08 GFLOPs/W, 46% normalized performance) are stated as outcomes of running standard benchmarks, not reductions by construction. This matches the default expectation for measurement papers; the reader's score of 1.0 is consistent with absence of any load-bearing circular step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper reports empirical benchmark results with no mathematical derivations, new theoretical constructs, or postulated entities.
