Closer in the Gap: Towards Portable Performance on RISC-V Vector Processors
Pith reviewed 2026-05-25 05:53 UTC · model grok-4.3
The pith
GCC 15 outperforms LLVM 21 in four of six HPC and ML proxy applications on real RISC-V vector hardware.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through calibrated performance counters on RVV 1.0 hardware and a suite of assembly microbenchmarks, the authors establish that GCC 15 outperforms LLVM 21 in four of six proxy applications from HPC and ML domains. LLVM's wins in SGEMM and DGEMM stem from greater instruction reduction. Default LMUL choices perform close to optimal, while predication and stride loads remain challenges. Evaluation of Qsim reveals compiler immaturity for complex memory patterns even with manual intrinsics.
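For context on the LMUL finding: in RVV 1.0 the maximum number of elements one vector instruction can process is VLMAX = LMUL × VLEN / SEW, so a larger LMUL means fewer loop iterations but also fewer usable register groups. The sketch below only works through that arithmetic. The 256-bit VLEN, the 32-bit element width, and the array length are illustrative assumptions, not values taken from the paper.

```python
from fractions import Fraction


def vlmax(vlen_bits: int, sew_bits: int, lmul: Fraction) -> int:
    """Elements per vector register group: VLMAX = LMUL * VLEN / SEW."""
    return int(lmul * vlen_bits / sew_bits)


def strip_mine_iterations(n: int, vlen_bits: int, sew_bits: int, lmul: Fraction) -> int:
    """Loop iterations needed to cover n elements at up to VLMAX per iteration."""
    v = vlmax(vlen_bits, sew_bits, lmul)
    return -(-n // v)  # integer ceiling division


# Assumed values for illustration only (not from the paper).
VLEN = 256   # vector register width in bits
SEW = 32     # element width in bits (single precision)
N = 1000     # number of array elements

for lmul in (Fraction(1), Fraction(2), Fraction(4), Fraction(8)):
    groups = 32 // int(lmul)  # RVV has 32 architectural vector registers
    print(f"LMUL={lmul}: VLMAX={vlmax(VLEN, SEW, lmul)}, "
          f"iterations={strip_mine_iterations(N, VLEN, SEW, lmul)}, "
          f"register groups={groups}")
```

The trade-off visible here (iteration count falls as LMUL rises, while the number of register groups shrinks) is the reason LMUL selection is a tuning question at all; which setting is fastest on real hardware is what the paper measures.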
What carries the argument
Assembly microbenchmarks designed to establish performance ceilings and calibrate performance counters on RVV hardware.
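One way to read "calibrating" a counter against an assembly microbenchmark: a hand-written loop has an instruction count known by construction, so any gap between that count and the counter reading estimates the counter's error. The following is a minimal sketch of that comparison. The function names, the loop shape, and the sample readings are all hypothetical, and the paper's actual procedure may differ.

```python
def expected_instructions(iterations: int, per_iteration: int, overhead: int = 0) -> int:
    """Instruction count known by construction for a hand-written loop."""
    return iterations * per_iteration + overhead


def relative_error(expected: int, measured: int) -> float:
    """Signed relative deviation of a counter reading from the known count."""
    return (measured - expected) / expected


# Hypothetical microbenchmark: 1,000,000 iterations of a 6-instruction loop body
# plus 4 setup instructions. The readings are made-up numbers, not measurements.
expected = expected_instructions(1_000_000, 6, overhead=4)
readings = [6_000_310, 6_000_298, 6_000_305]

for r in readings:
    print(f"measured={r} expected={expected} error={relative_error(expected, r):+.6%}")
```

A small, stable deviation across repetitions would suggest fixed measurement overhead, whereas a deviation that grows with the iteration count would suggest the counter is miscounting some instruction class.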
Load-bearing premise
The six proxy applications and microbenchmarks adequately represent the challenges in real scientific and machine learning workloads on RVV 1.0 hardware.
What would settle it
A direct comparison of generated assembly code or runtime on a broader set of applications, or on different RVV implementations, would show whether GCC's advantage holds and whether LLVM's instruction reduction generalizes.
Original abstract
The RISC-V Vector Extension (RVV) is a cornerstone for supporting compute throughout in scientific and machine learning workloads. Yet compiler support and performance monitoring on real RVV 1.0 hardware are still evolving. In this work, we design a suite of assembly microbenchmarks to establish performance ceilings and calibrate performance counters on RVV hardware. Leveraging the assembly benchmarks, we find that predication overhead and stride load pose performance challenges that current compiler cost models do not yet fully address. Moreover, we present the first evaluation of GCC 15 and LLVM 21 autovectorization in HPC and ML proxy applications. GCC 15 outperforms LLVM 21 in four out of six applications. LLVM 21 only outperforms GCC 15 in SGEMM and DGEMM, driven by more aggressive instruction reduction confirmed through validated perf counters on the RVV hardware. We further show that the default LMUL selection in compilers performs close to the optimal. To study the RVV support for product-level application, we also evaluate the state-vector quantum simulator, Google's Qsim, with both manual RVV intrinsics and compiler auto-vectorization, revealing immaturity in current RVV compiler for complicated memory access pattern.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper designs assembly microbenchmarks to establish performance ceilings and calibrate perf counters on real RVV 1.0 hardware, identifies predication overhead and stride loads as compiler cost-model gaps, evaluates GCC 15 versus LLVM 21 autovectorization on six HPC/ML proxy applications (GCC outperforming in four, LLVM in SGEMM/DGEMM via greater instruction reduction), shows default LMUL selection is near-optimal, and evaluates Google's Qsim to demonstrate remaining compiler immaturity for irregular memory patterns.
Significance. If the empirical results hold, the work supplies rare direct hardware measurements on RVV 1.0 silicon that calibrate counters and isolate specific compiler weaknesses (predication, stride loads, LMUL), offering concrete targets for cost-model improvements. The GCC/LLVM ranking and Qsim findings are directly relevant to portable performance in scientific and ML codes targeting RISC-V vectors.
major comments (2)
- [Evaluation of GCC 15 and LLVM 21 on proxy applications] Proxy application results: the central claim that GCC 15 outperforms LLVM 21 in four of six applications (and LLVM's edge in SGEMM/DGEMM via instruction reduction) is presented without error bars, repetition counts, or statistical tests; this directly affects verifiability of the performance ranking that underpins the portable-performance narrative.
- [Qsim evaluation] Qsim evaluation and proxy representativeness: the manuscript itself notes that Qsim's irregular memory patterns expose compiler immaturity not captured by the six proxies; because the headline GCC/LLVM comparison rests on those proxies, the lack of explicit discussion of selection criteria or generalization limits weakens the load-bearing claim that the observed gaps are broadly representative of scientific/ML workloads.
minor comments (2)
- [Abstract] Abstract: 'compute throughout' is presumably a typo for 'compute throughput'.
- [Proxy application evaluation] The six proxy applications are referenced but never listed with their access-pattern characteristics or selection rationale; a summary table would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below.
-
Referee: Proxy application results: the central claim that GCC 15 outperforms LLVM 21 in four of six applications (and LLVM's edge in SGEMM/DGEMM via instruction reduction) is presented without error bars, repetition counts, or statistical tests; this directly affects verifiability of the performance ranking that underpins the portable-performance narrative.
Authors: We agree that the absence of error bars and explicit repetition counts limits verifiability. In the revised manuscript we will report the number of repetitions performed for each application and include error bars (standard deviation) on the performance figures. This will directly strengthen the evidence for the GCC/LLVM ranking.
Revision: yes
-
Referee: Qsim evaluation and proxy representativeness: the manuscript itself notes that Qsim's irregular memory patterns expose compiler immaturity not captured by the six proxies; because the headline GCC/LLVM comparison rests on those proxies, the lack of explicit discussion of selection criteria or generalization limits weakens the load-bearing claim that the observed gaps are broadly representative of scientific/ML workloads.
Authors: We accept the point that an explicit discussion of proxy selection and generalization limits is needed. The six proxies were chosen to cover representative regular-access HPC/ML kernels; Qsim was included precisely to illustrate the remaining gaps on irregular patterns. In revision we will add a short subsection stating the selection criteria and noting that the observed compiler weaknesses may not generalize to highly irregular workloads.
Revision: yes
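The first response above commits to reporting repetition counts and standard-deviation error bars. As a rough illustration of what that reporting involves, the sketch below summarizes repeated runtimes for two compilers and derives a speedup from the means. All runtimes are made-up numbers and the helper names are hypothetical; nothing here reproduces the paper's data.

```python
from statistics import mean, stdev


def summarize(runtimes):
    """Return (mean, sample standard deviation, repetition count)."""
    return mean(runtimes), stdev(runtimes), len(runtimes)


def speedup(baseline, candidate):
    """Ratio of mean runtimes; a value above 1 means the candidate is faster."""
    return mean(baseline) / mean(candidate)


# Made-up runtimes in seconds for one application, five repetitions each.
gcc_runs = [1.00, 1.02, 0.98, 1.01, 0.99]
llvm_runs = [1.10, 1.12, 1.08, 1.11, 1.09]

for name, runs in (("GCC", gcc_runs), ("LLVM", llvm_runs)):
    m, s, n = summarize(runs)
    print(f"{name}: {m:.3f} s ± {s:.3f} s (n={n})")

print(f"GCC speedup over LLVM: {speedup(llvm_runs, gcc_runs):.2f}x")
```

When the gap between the two means is much larger than either standard deviation, as in this invented data, the ranking is easy to defend; when the intervals overlap, a formal statistical test of the kind the referee asks for becomes necessary.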
Circularity Check
No circularity: purely empirical hardware measurements with no derivations or fitted predictions
The paper reports direct performance measurements on RVV 1.0 hardware using assembly microbenchmarks and six proxy applications. Claims (GCC outperforming LLVM in 4/6 apps, instruction counts via perf counters, LMUL selection) rest on observed execution times and validated counters rather than any equations, parameter fits, or predictions that reduce to prior inputs. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the derivation chain. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Performance counters on RVV hardware accurately reflect instruction counts and execution behavior for validating compiler output.