
arxiv: 2605.10860 · v2 · pith:6IEJVZRMnew · submitted 2026-05-11 · 💻 cs.DC

Closer in the Gap: Towards Portable Performance on RISC-V Vector Processors

Pith reviewed 2026-05-25 05:53 UTC · model grok-4.3

classification 💻 cs.DC
keywords: RISC-V Vector Extension · autovectorization · GCC · LLVM · HPC workloads · machine learning · performance counters · LMUL selection

The pith

GCC 15 outperforms LLVM 21 in four of six HPC and ML proxy applications on real RISC-V vector hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates compiler support for the RISC-V Vector Extension on actual hardware using assembly microbenchmarks and proxy applications. It finds that GCC 15 generally produces faster code than LLVM 21 for these workloads, except in matrix multiplication kernels where LLVM reduces instructions more aggressively. The work also identifies specific performance bottlenecks like predication overhead and stride loads that compilers do not yet model well, and shows that default vector length multipliers are near optimal. This matters because RISC-V vector processors aim for portable high performance in scientific computing and machine learning, but current tools leave gaps that limit adoption.

Core claim

Through calibrated performance counters on RVV 1.0 hardware and a suite of assembly microbenchmarks, the authors establish that GCC 15 outperforms LLVM 21 in four of six proxy applications from HPC and ML domains. LLVM's wins in SGEMM and DGEMM stem from greater instruction reduction. Default LMUL choices perform close to optimal, while predication and stride loads remain challenges. Evaluation of Qsim reveals compiler immaturity for complex memory patterns even with manual intrinsics.
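The interaction between LMUL and tail handling mentioned above can be illustrated with a small model. This is a minimal sketch, not the paper's benchmark: it assumes a VLEN of 256 bits and 32-bit elements purely for illustration, ignores fractional LMUL, and models `vsetvli` as returning min(remaining, VLMAX). The RVV 1.0 specification allows an implementation to pick a smaller value when the remaining count lies between VLMAX and 2×VLMAX, so real hardware may split the last two iterations differently. The function names `vlmax` and `strip_mine` are hypothetical.

```python
def vlmax(vlen_bits: int, sew_bits: int, lmul: int) -> int:
    """Elements per vector register group: VLMAX = LMUL * VLEN / SEW."""
    return lmul * vlen_bits // sew_bits


def strip_mine(n: int, vlen_bits: int = 256, sew_bits: int = 32, lmul: int = 1) -> list[int]:
    """Return the vl used on each iteration of a strip-mined loop over n elements.

    Assumption: vl = min(remaining, VLMAX) on every iteration, so the tail is
    handled by a shorter final vl rather than by an explicit mask.
    """
    limit = vlmax(vlen_bits, sew_bits, lmul)
    remaining = n
    vls = []
    while remaining > 0:
        vl = min(remaining, limit)
        vls.append(vl)
        remaining -= vl
    return vls


if __name__ == "__main__":
    # Larger LMUL means fewer loop iterations, at the cost of fewer
    # architecturally distinct register groups.
    for lmul in (1, 2, 4, 8):
        vls = strip_mine(1000, lmul=lmul)
        print(f"LMUL={lmul}: {len(vls)} iterations, final vl={vls[-1]}")
```

Under these assumptions, raising LMUL only reduces the iteration count; whether that translates into speedup depends on register pressure and memory behaviour, which is what the measurements in the paper address.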

What carries the argument

Assembly microbenchmarks designed to establish performance ceilings and calibrate performance counters on RVV hardware.
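One way to think about calibrating a counter with an assembly microbenchmark is to compare the counter reading against the instruction count that the hand-written loop is known to execute. The sketch below shows only that arithmetic. The helper names, the 1% tolerance, and the sample numbers are assumptions for illustration and do not come from the paper.

```python
def expected_vector_instructions(n_elements: int, vl: int, vector_ops_per_iter: int) -> int:
    """Vector instructions a hand-written strip-mined loop should retire.

    n_elements: total elements processed
    vl: elements handled per iteration (assumed constant except for the tail)
    vector_ops_per_iter: vector instructions in the loop body
    """
    iterations = -(-n_elements // vl)  # ceiling division
    return iterations * vector_ops_per_iter


def relative_error(measured: int, expected: int) -> float:
    """Relative deviation of a counter reading from the known count."""
    return abs(measured - expected) / expected


def counter_is_calibrated(measured: int, expected: int, tolerance: float = 0.01) -> bool:
    """Treat the counter as trustworthy if it is within the (assumed) tolerance."""
    return relative_error(measured, expected) <= tolerance


if __name__ == "__main__":
    expected = expected_vector_instructions(n_elements=1000, vl=8, vector_ops_per_iter=3)
    measured = 378  # hypothetical counter reading, not a value from the paper
    print(expected, relative_error(measured, expected), counter_is_calibrated(measured, expected))
```

A counter that passes this kind of check on simple kernels can then be used with more confidence on compiler-generated code, where the true instruction count is not known in advance.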

Load-bearing premise

The six proxy applications and microbenchmarks adequately represent the challenges in real scientific and machine learning workloads on RVV 1.0 hardware.

What would settle it

A direct comparison of generated assembly code or runtime on a broader set of applications or on different RVV implementations would confirm whether GCC's advantage holds and whether LLVM's instruction reduction generalizes.

Figures

Figures reproduced from arXiv: 2605.10860 by Ivy Peng, Maya Gokhale, Pei-Hung Lin, Ruimin Shi, Xavier Teruel.

Figure 1: Thus, we design the assembly benchmark for each variant to identify
Figure 3: Compare the performance of tailing elements via setvl and mask operations on BPI-F3 and Jupiter
Figure 4: The peak throughput of selected vector and scalar arithmetic instructions
Figure 5: The performance by GCC 15 and Clang 21 autovectorization across six
Figure 6: The breakdown load/store instructions in BPI-F3
Figure 7: The impact of LMULs selection on Jupiter, normalized by GCC 15 nonvec
Figure 8: Yolov3 profiling analysis on the impact of LMULs (plot labels: Runtime/s and # instructions; gcc15, clang21; nonvec, autovec, rvv intrinsics)
Figure 9: The comparison of Qsim across 3 versions using 8 cores
Adjacent body text: … LMUL up to LMUL = 4, reaching approximately 2.0× and 1.6× respectively. One hypothesis is that the conservative unrolled and vectorized loop strategy in GCC 15 allows it to better tolerate the higher register pressure caused by larger LMUL. Stream and SpMV remain near or below 1.0× across all LMUL values for both compilers. This is expected because the…
Original abstract

The RISC-V Vector Extension (RVV) is a cornerstone for supporting compute throughout in scientific and machine learning workloads. Yet compiler support and performance monitoring on real RVV 1.0 hardware are still evolving. In this work, we design a suite of assembly microbenchmarks to establish performance ceilings and calibrate performance counters on RVV hardware. Leveraging the assembly benchmarks, we find that predication overhead and stride load pose performance challenges that current compiler cost models do not yet fully address. Moreover, we present the first evaluation of GCC 15 and LLVM 21 autovectorization in HPC and ML proxy applications. GCC 15 outperforms LLVM 21 in four out of six applications. LLVM 21 only outperforms GCC 15 in SGEMM and DGEMM, driven by more aggressive instruction reduction confirmed through validated perf counters on the RVV hardware. We further show that the default LMUL selection in compilers performs close to the optimal. To study the RVV support for product-level application, we also evaluate the state-vector quantum simulator, Google's Qsim, with both manual RVV intrinsics and compiler auto-vectorization, revealing immaturity in current RVV compiler for complicated memory access pattern.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper designs assembly microbenchmarks to establish performance ceilings and calibrate perf counters on real RVV 1.0 hardware, identifies predication overhead and stride loads as compiler cost-model gaps, evaluates GCC 15 versus LLVM 21 autovectorization on six HPC/ML proxy applications (GCC outperforming in four, LLVM in SGEMM/DGEMM via greater instruction reduction), shows default LMUL selection is near-optimal, and evaluates Google's Qsim to demonstrate remaining compiler immaturity for irregular memory patterns.

Significance. If the empirical results hold, the work supplies rare direct hardware measurements on RVV 1.0 silicon that calibrate counters and isolate specific compiler weaknesses (predication, stride loads, LMUL), offering concrete targets for cost-model improvements. The GCC/LLVM ranking and Qsim findings are directly relevant to portable performance in scientific and ML codes targeting RISC-V vectors.
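One plausible contributor to the cost of stride loads is memory footprint: the same number of elements can span many more cache lines than a unit-stride access. The sketch below is a general illustration of that effect, not the paper's analysis. The 64-byte line size and the function name `cache_lines_touched` are assumptions.

```python
def cache_lines_touched(n_elements: int, elem_bytes: int, stride_bytes: int,
                        line_bytes: int = 64, base: int = 0) -> int:
    """Count distinct cache lines covered by one vector load.

    Element i occupies bytes [base + i*stride, base + i*stride + elem_bytes).
    line_bytes=64 is an assumed cache line size, not a measured property.
    """
    lines = set()
    for i in range(n_elements):
        start = base + i * stride_bytes
        end = start + elem_bytes - 1
        lines.update(range(start // line_bytes, end // line_bytes + 1))
    return len(lines)


if __name__ == "__main__":
    # Eight 4-byte elements: contiguous versus one element per 256-byte row.
    print("unit stride:", cache_lines_touched(8, 4, 4))
    print("stride 256 :", cache_lines_touched(8, 4, 256))
```

A cost model that charges a strided load the same as a unit-stride load would miss this difference, which is consistent with the gap the report describes, although the actual penalty depends on the microarchitecture.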

major comments (2)
  1. [Evaluation of GCC 15 and LLVM 21 on proxy applications] Proxy application results: the central claim that GCC 15 outperforms LLVM 21 in four of six applications (and LLVM's edge in SGEMM/DGEMM via instruction reduction) is presented without error bars, repetition counts, or statistical tests; this directly affects verifiability of the performance ranking that underpins the portable-performance narrative.
  2. [Qsim evaluation] Qsim evaluation and proxy representativeness: the manuscript itself notes that Qsim's irregular memory patterns expose compiler immaturity not captured by the six proxies; because the headline GCC/LLVM comparison rests on those proxies, the lack of explicit discussion of selection criteria or generalization limits weakens the load-bearing claim that the observed gaps are broadly representative of scientific/ML workloads.
minor comments (2)
  1. [Abstract] Abstract: 'compute throughout' is presumably a typo for 'compute throughput'.
  2. [Proxy application evaluation] The six proxy applications are referenced but never listed with their access-pattern characteristics or selection rationale; a summary table would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below.

  1. Referee: Proxy application results: the central claim that GCC 15 outperforms LLVM 21 in four of six applications (and LLVM's edge in SGEMM/DGEMM via instruction reduction) is presented without error bars, repetition counts, or statistical tests; this directly affects verifiability of the performance ranking that underpins the portable-performance narrative.

    Authors: We agree that the absence of error bars and explicit repetition counts limits verifiability. In the revised manuscript we will report the number of repetitions performed for each application and include error bars (standard deviation) on the performance figures. This will directly strengthen the evidence for the GCC/LLVM ranking.
    revision: yes

  2. Referee: Qsim evaluation and proxy representativeness: the manuscript itself notes that Qsim's irregular memory patterns expose compiler immaturity not captured by the six proxies; because the headline GCC/LLVM comparison rests on those proxies, the lack of explicit discussion of selection criteria or generalization limits weakens the load-bearing claim that the observed gaps are broadly representative of scientific/ML workloads.

    Authors: We accept the point that an explicit discussion of proxy selection and generalization limits is needed. The six proxies were chosen to cover representative regular-access HPC/ML kernels; Qsim was included precisely to illustrate the remaining gaps on irregular patterns. In revision we will add a short subsection stating the selection criteria and noting that the observed compiler weaknesses may not generalize to highly irregular workloads.
    revision: yes
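The repetition counts and standard-deviation error bars discussed in the first response could be computed along these lines. This is a minimal sketch using Python's standard `statistics` module. The runtimes are hypothetical placeholders, and the overlap test is a rough screening heuristic chosen for illustration, not a substitute for a proper statistical test.

```python
import statistics


def summarize(runtimes: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) for repeated runs.

    statistics.stdev needs at least two measurements.
    """
    return statistics.mean(runtimes), statistics.stdev(runtimes)


def intervals_overlap(a: list[float], b: list[float], k: float = 1.0) -> bool:
    """True if the mean +/- k*stdev bands of the two samples overlap."""
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    return abs(mean_a - mean_b) <= k * (sd_a + sd_b)


if __name__ == "__main__":
    # Hypothetical runtimes in seconds; not measurements from the paper.
    gcc_runs = [10.2, 10.4, 10.1, 10.3, 10.2]
    llvm_runs = [11.0, 11.3, 10.9, 11.1, 11.2]
    for name, runs in (("gcc", gcc_runs), ("llvm", llvm_runs)):
        mean, sd = summarize(runs)
        print(f"{name}: n={len(runs)} mean={mean:.2f}s stdev={sd:.2f}s")
    print("bands overlap:", intervals_overlap(gcc_runs, llvm_runs))
```

If the bands overlap, the ranking between the two compilers for that application would need more repetitions or a formal test before it could be stated with confidence.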

Circularity Check

0 steps flagged

No circularity: purely empirical hardware measurements with no derivations or fitted predictions


The paper reports direct performance measurements on RVV 1.0 hardware using assembly microbenchmarks and six proxy applications. Claims (GCC outperforming LLVM in 4/6 apps, instruction counts via perf counters, LMUL selection) rest on observed execution times and validated counters rather than any equations, parameter fits, or predictions that reduce to prior inputs. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the derivation chain. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical study with no free parameters or invented entities; relies on standard domain assumptions about hardware counter accuracy and benchmark representativeness.

axioms (1)
  • domain assumption: Performance counters on RVV hardware accurately reflect instruction counts and execution behavior for validating compiler output.
    Invoked to confirm LLVM's instruction reduction advantage in SGEMM/DGEMM.

pith-pipeline@v0.9.0 · 5756 in / 1310 out tokens · 26643 ms · 2026-05-25T05:53:04.720347+00:00 · methodology

