Closer in the Gap: Towards Portable Performance on RISC-V Vector Processors

Ivy Peng; Maya Gokhale; Pei-Hung Lin; Ruimin Shi; Xavier Teruel

arxiv: 2605.10860 · v2 · pith:6IEJVZRMnew · submitted 2026-05-11 · 💻 cs.DC

Ruimin Shi, Maya Gokhale, Pei-Hung Lin, Xavier Teruel, Ivy Peng

Pith reviewed 2026-05-25 05:53 UTC · model grok-4.3

classification 💻 cs.DC

keywords RISC-V Vector Extension · autovectorization · GCC · LLVM · HPC workloads · machine learning · performance counters · LMUL selection

The pith

GCC 15 outperforms LLVM 21 in four of six HPC and ML proxy applications on real RISC-V vector hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates compiler support for the RISC-V Vector Extension on actual hardware using assembly microbenchmarks and proxy applications. It finds that GCC 15 generally produces faster code than LLVM 21 for these workloads, except in matrix multiplication kernels where LLVM reduces instructions more aggressively. The work also identifies specific performance bottlenecks like predication overhead and stride loads that compilers do not yet model well, and shows that default vector length multipliers are near optimal. This matters because RISC-V vector processors aim for portable high performance in scientific computing and machine learning, but current tools leave gaps that limit adoption.

Core claim

Through calibrated performance counters on RVV 1.0 hardware and a suite of assembly microbenchmarks, the authors establish that GCC 15 outperforms LLVM 21 in four of six proxy applications from HPC and ML domains. LLVM's wins in SGEMM and DGEMM stem from greater instruction reduction. Default LMUL choices perform close to optimal, while predication and stride loads remain challenges. Evaluation of Qsim reveals compiler immaturity for complex memory patterns even with manual intrinsics.
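As a rough illustration of the tail-handling trade-off behind the predication finding, the sketch below models in plain Python (not RVV intrinsics) how a strip-mined loop can cover n elements either by shrinking vl on the final iteration, as a vsetvl-based loop does, or by running every iteration at full width under a mask. The function names, the VLEN of 256 bits, and the 32-bit element width are assumptions chosen for illustration and are not taken from the paper; only the relation VLMAX = LMUL × VLEN / SEW comes from the RVV 1.0 specification. The model tracks element coverage only and says nothing about the cycle costs measured on real hardware.

```python
def vlmax(vlen_bits: int, sew_bits: int, lmul: int) -> int:
    """VLMAX = LMUL * VLEN / SEW (RVV 1.0); integer LMUL only in this sketch."""
    return (vlen_bits * lmul) // sew_bits


def strip_mine_setvl(n: int, vlen_bits: int = 256, sew_bits: int = 32, lmul: int = 1):
    """Return the vl used on each iteration when the tail is handled by vsetvl.

    Simplification (an assumption of this sketch): vl = min(remaining, VLMAX).
    The specification also permits other vl values when
    VLMAX < remaining < 2 * VLMAX.
    """
    limit = vlmax(vlen_bits, sew_bits, lmul)
    lengths = []
    remaining = n
    while remaining > 0:
        vl = min(remaining, limit)
        lengths.append(vl)
        remaining -= vl
    return lengths


def strip_mine_masked(n: int, vlen_bits: int = 256, sew_bits: int = 32, lmul: int = 1):
    """Return one lane mask per iteration when every iteration runs at VLMAX
    and out-of-range lanes in the tail are predicated off."""
    limit = vlmax(vlen_bits, sew_bits, lmul)
    masks = []
    for base in range(0, n, limit):
        masks.append([base + lane < n for lane in range(limit)])
    return masks


if __name__ == "__main__":
    print(strip_mine_setvl(20))           # per-iteration vl at LMUL = 1
    print(strip_mine_setvl(20, lmul=4))   # per-iteration vl at LMUL = 4
    print([sum(m) for m in strip_mine_masked(20)])  # active lanes per iteration
```

Under these assumed parameters, 20 elements take three iterations at LMUL = 1 and a single iteration at LMUL = 4. The masked variant issues the same number of iterations as the vsetvl variant but leaves four lanes inactive in the last one, which is the work a predicated tail still has to carry.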

What carries the argument

assembly microbenchmarks designed to establish performance ceilings and calibrate performance counters on RVV hardware
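One way to picture counter calibration with an assembly microbenchmark: because the loop body is written by hand, the number of retired instructions is known in advance and can be compared with what the counter reports. The sketch below is an assumption about how such a check might look, not the authors' procedure; the function names, the 1% tolerance, and the numbers in the usage lines are hypothetical.

```python
def expected_instructions(iterations: int, body_instructions: int,
                          overhead_instructions: int = 0) -> int:
    """Instruction count implied by a hand-written loop: a fixed body repeated
    `iterations` times plus a fixed setup/teardown overhead."""
    return iterations * body_instructions + overhead_instructions


def relative_error(measured: int, expected: int) -> float:
    """Relative deviation of the counter reading from the known count."""
    return abs(measured - expected) / expected


def is_calibrated(measured: int, expected: int, tolerance: float = 0.01) -> bool:
    """Treat the counter as trustworthy if it is within `tolerance` of the
    known count (the 1% default is an illustrative choice)."""
    return relative_error(measured, expected) <= tolerance


if __name__ == "__main__":
    # Hypothetical loop: 1000 iterations of a 4-instruction body, 10 setup instructions.
    known = expected_instructions(1000, 4, 10)
    # Hypothetical counter readings, not data from the paper.
    for reading in (4050, 4500):
        print(reading, round(relative_error(reading, known), 4), is_calibrated(reading, known))
```

A counter that passes this kind of check on simple kernels can then be used with more confidence on compiler-generated code, which is the role the validated counters play in the SGEMM and DGEMM comparison.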

Load-bearing premise

The six proxy applications and microbenchmarks adequately represent the challenges in real scientific and machine learning workloads on RVV 1.0 hardware.

What would settle it

A direct comparison of generated assembly code or runtime on a broader set of applications or different RVV implementations would confirm if GCC's advantage holds or if LLVM's instruction reduction generalizes.

Figures

Figures reproduced from arXiv: 2605.10860 by Ivy Peng, Maya Gokhale, Pei-Hung Lin, Ruimin Shi, Xavier Teruel.

**Figure 3.** Compare the performance of tailing elements via setvl and mask operations on BPI-F3 and Jupiter

**Figure 4.** The peak throughput of selected vector and scalar arithmetic instructions

**Figure 5.** The performance by GCC 15 and Clang 21 autovectorization across six

**Figure 6.** The breakdown load/store instructions in BPI-F3

**Figure 7.** The impact of LMULs selection on Jupiter, normalized by GCC 15 nonvec

**Figure 8.** Yolov3 profiling analysis on the impact of LMULs (plot residue captured with this caption: axes Runtime/s and # instructions; series nonvec, autovec, rvv intrinsics; compilers gcc15, clang21)

**Figure 9.** The comparison of Qsim across 3 versions using 8 cores

Body text captured with the Figure 9 caption: "… LMUL up to LMUL = 4, reaching approximately 2.0× and 1.6× respectively. One hypothesis is that the conservative unrolled and vectorized loop strategy in GCC 15 allows it to better tolerate the higher register pressure caused by larger LMUL. Stream and SpMV remain near or below 1.0× across all LMUL values for both compilers. This is expected because the…"

Original abstract

The RISC-V Vector Extension (RVV) is a cornerstone for supporting compute throughout in scientific and machine learning workloads. Yet compiler support and performance monitoring on real RVV 1.0 hardware are still evolving. In this work, we design a suite of assembly microbenchmarks to establish performance ceilings and calibrate performance counters on RVV hardware. Leveraging the assembly benchmarks, we find that predication overhead and stride load pose performance challenges that current compiler cost models do not yet fully address. Moreover, we present the first evaluation of GCC 15 and LLVM 21 autovectorization in HPC and ML proxy applications. GCC 15 outperforms LLVM 21 in four out of six applications. LLVM 21 only outperforms GCC 15 in SGEMM and DGEMM, driven by more aggressive instruction reduction confirmed through validated perf counters on the RVV hardware. We further show that the default LMUL selection in compilers performs close to the optimal. To study the RVV support for product-level application, we also evaluate the state-vector quantum simulator, Google's Qsim, with both manual RVV intrinsics and compiler auto-vectorization, revealing immaturity in current RVV compiler for complicated memory access pattern.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

First side-by-side GCC 15 vs LLVM 21 runs on real RVV 1.0 hardware plus microbenchmarks that flag predication and stride costs.

This paper delivers the first reported comparison of GCC 15 and LLVM 21 autovectorization on actual RVV 1.0 silicon, backed by assembly microbenchmarks that measure performance ceilings and validate perf counters. It shows GCC ahead in four of the six HPC/ML proxies, LLVM pulling ahead on SGEMM and DGEMM through lower instruction counts, and default LMUL choices landing close to optimal. The Qsim run is a useful addition because it surfaces compiler weakness on irregular memory patterns that the proxies do not stress as hard. The hardware measurements and counter validation are the parts that hold up best; they give concrete numbers instead of simulation-only claims.

The main soft spot is representativeness. The six proxies plus microbenchmarks cover some costs, but the paper itself notes that Qsim's more complex accesses expose gaps the proxies miss, so the observed ranking and cost-model shortfalls may not generalize to broader scientific or ML codes. The abstract also omits error bars and selection criteria, which leaves the central claims plausible but harder to assess without the full methods section.

Readers working on RVV compiler development or early hardware porting will find the data points worth checking. The work is narrow but grounded enough in real measurements to deserve a serious referee rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper designs assembly microbenchmarks to establish performance ceilings and calibrate perf counters on real RVV 1.0 hardware, identifies predication overhead and stride loads as compiler cost-model gaps, evaluates GCC 15 versus LLVM 21 autovectorization on six HPC/ML proxy applications (GCC outperforming in four, LLVM in SGEMM/DGEMM via greater instruction reduction), shows default LMUL selection is near-optimal, and evaluates Google's Qsim to demonstrate remaining compiler immaturity for irregular memory patterns.

Significance. If the empirical results hold, the work supplies rare direct hardware measurements on RVV 1.0 silicon that calibrate counters and isolate specific compiler weaknesses (predication, stride loads, LMUL), offering concrete targets for cost-model improvements. The GCC/LLVM ranking and Qsim findings are directly relevant to portable performance in scientific and ML codes targeting RISC-V vectors.

major comments (2)

[Evaluation of GCC 15 and LLVM 21 on proxy applications] Proxy application results: the central claim that GCC 15 outperforms LLVM 21 in four of six applications (and LLVM's edge in SGEMM/DGEMM via instruction reduction) is presented without error bars, repetition counts, or statistical tests; this directly affects verifiability of the performance ranking that underpins the portable-performance narrative.
[Qsim evaluation] Qsim evaluation and proxy representativeness: the manuscript itself notes that Qsim's irregular memory patterns expose compiler immaturity not captured by the six proxies; because the headline GCC/LLVM comparison rests on those proxies, the lack of explicit discussion of selection criteria or generalization limits weakens the load-bearing claim that the observed gaps are broadly representative of scientific/ML workloads.

minor comments (2)

[Abstract] Abstract: 'compute throughout' is presumably a typo for 'compute throughput'.
[Proxy application evaluation] The six proxy applications are referenced but never listed with their access-pattern characteristics or selection rationale; a summary table would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below.

Referee: Proxy application results: the central claim that GCC 15 outperforms LLVM 21 in four of six applications (and LLVM's edge in SGEMM/DGEMM via instruction reduction) is presented without error bars, repetition counts, or statistical tests; this directly affects verifiability of the performance ranking that underpins the portable-performance narrative.

Authors: We agree that the absence of error bars and explicit repetition counts limits verifiability. In the revised manuscript we will report the number of repetitions performed for each application and include error bars (standard deviation) on the performance figures. This will directly strengthen the evidence for the GCC/LLVM ranking.
Revision: yes

Referee: Qsim evaluation and proxy representativeness: the manuscript itself notes that Qsim's irregular memory patterns expose compiler immaturity not captured by the six proxies; because the headline GCC/LLVM comparison rests on those proxies, the lack of explicit discussion of selection criteria or generalization limits weakens the load-bearing claim that the observed gaps are broadly representative of scientific/ML workloads.

Authors: We accept the point that an explicit discussion of proxy selection and generalization limits is needed. The six proxies were chosen to cover representative regular-access HPC/ML kernels; Qsim was included precisely to illustrate the remaining gaps on irregular patterns. In revision we will add a short subsection stating the selection criteria and noting that the observed compiler weaknesses may not generalize to highly irregular workloads.
Revision: yes
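The repetition counts and standard-deviation error bars promised in the first response could be produced along the following lines. This is a minimal sketch using Python's standard statistics module; the function names and the runtimes in the usage lines are hypothetical placeholders, not measurements from the paper.

```python
import statistics


def summarize(runtimes):
    """Return repetition count, mean, and sample standard deviation of a
    list of runtimes (seconds). Needs at least two repetitions."""
    return {
        "n": len(runtimes),
        "mean": statistics.mean(runtimes),
        "stdev": statistics.stdev(runtimes),
    }


def speedup(baseline, candidate):
    """Ratio of mean baseline runtime to mean candidate runtime.
    A value above 1.0 means the candidate is faster on average."""
    return statistics.mean(baseline) / statistics.mean(candidate)


if __name__ == "__main__":
    # Hypothetical runtimes for one application under two compilers.
    compiler_a = [2.10, 2.05, 2.12, 2.08, 2.09]
    compiler_b = [2.40, 2.38, 2.45, 2.41, 2.39]
    print("compiler A:", summarize(compiler_a))
    print("compiler B:", summarize(compiler_b))
    print("speedup of A over B:", speedup(compiler_b, compiler_a))
```

Reporting n alongside the mean and standard deviation lets a reader judge whether a gap between two compilers exceeds run-to-run noise, which is the verifiability concern raised in the first major comment.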

Circularity Check

0 steps flagged

No circularity: purely empirical hardware measurements with no derivations or fitted predictions

The paper reports direct performance measurements on RVV 1.0 hardware using assembly microbenchmarks and six proxy applications. Claims (GCC outperforming LLVM in 4/6 apps, instruction counts via perf counters, LMUL selection) rest on observed execution times and validated counters rather than any equations, parameter fits, or predictions that reduce to prior inputs. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the derivation chain. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical study with no free parameters or invented entities; relies on standard domain assumptions about hardware counter accuracy and benchmark representativeness.

axioms (1)

Domain assumption: Performance counters on RVV hardware accurately reflect instruction counts and execution behavior for validating compiler output.
Invoked to confirm LLVM's instruction reduction advantage in SGEMM/DGEMM.

pith-pipeline@v0.9.0 · 5756 in / 1310 out tokens · 26643 ms · 2026-05-25T05:53:04.720347+00:00 · methodology
