CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

Caiwen Ding; Haoyang Chen; Mattia Fazzini; Shiyang Li

arxiv: 2605.08455 · v2 · pith:65VHDTLInew · submitted 2026-05-08 · 💻 cs.LG · cs.PL · cs.SE

Shiyang Li, Haoyang Chen, Mattia Fazzini, Caiwen Ding

Pith reviewed 2026-05-12 01:40 UTC · model grok-4.3

keywords: CUDA debugging · LLM evaluation · benchmark · performance preservation · automated code repair · GPU programming · code degeneration · protocol-conditional metric

The pith

A benchmark for LLM CUDA debugging shows that models often pass tests by degenerating optimized code into slower versions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CUDABEAVER, a benchmark built from 213 real failing CUDA workspaces generated by LLMs, each supplying the broken code, build and test commands, raw errors, and one editable file. It argues that standard correctness checks are inadequate because fixers can succeed by abandoning the original performance structure and producing a slower but passing program. The work introduces the pass@k(M,C,A) metric, which makes the model, corpus, and protocol axes explicit, including performance preservation requirements. Across seven frontier LLMs, this protocol-aware approach reveals that loose performance tolerance inflates success rates, while even modest tightening can drop measured success by as much as 40 percentage points. This matters because growing GPU use in scientific and ML workloads makes reliable automated repair essential, yet current evaluations risk overestimating true debugging skill.

Core claim

CUDABEAVER supplies tasks drawn from actual LLM-generated failing CUDA workspaces together with native build and test commands and error evidence. It evaluates fixers not only on whether they restore correctness but also on whether they preserve performance, reporting results by failure category, debugging trajectory, and stagnation mode. The protocol-conditional metric pass@k(M,C,A) applied to seven LLMs demonstrates that high tolerance for performance loss makes fixers appear stronger, whereas stricter performance requirements sharply reduce success scores by up to 40 percentage points.
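A minimal sketch of how a performance-gated pass@k could be computed for one task. It assumes the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), and the gate the paper is quoted as using (pass = correctness ∧ speedup ≥ p). The names `Attempt`, `gated_pass`, `pass_at_k`, and `task_pass_at_k`, and the reading of speedup as reference runtime divided by repaired runtime, are illustrative assumptions rather than the paper's implementation.

```python
from dataclasses import dataclass
from math import comb


@dataclass
class Attempt:
    """One repair attempt on one task (hypothetical record layout)."""
    correct: bool    # build and tests pass
    speedup: float   # assumed: reference runtime / repaired runtime


def gated_pass(attempt: Attempt, p: float) -> bool:
    """The attempt counts only if it is correct AND speedup >= p."""
    return attempt.correct and attempt.speedup >= p


def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimate from n samples with c successes."""
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def task_pass_at_k(attempts: list[Attempt], k: int, p: float) -> float:
    """pass@k for one task, counting only attempts that clear the gate."""
    c = sum(gated_pass(a, p) for a in attempts)
    return pass_at_k(len(attempts), c, k)


# Invented attempts for illustration; not benchmark data.
attempts = [
    Attempt(correct=True, speedup=0.40),   # passes tests but is much slower
    Attempt(correct=True, speedup=0.95),
    Attempt(correct=False, speedup=0.0),
    Attempt(correct=True, speedup=1.02),
]
loose = task_pass_at_k(attempts, k=1, p=0.0)    # correctness only
strict = task_pass_at_k(attempts, k=1, p=0.9)   # must keep 90% of the speed
print(loose, strict)
```

Setting p = 0 reduces the gate to a plain correctness check, so the same function covers both the loose and the strict protocol.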

What carries the argument

The CUDABEAVER benchmark of 213 tasks from real failing workspaces and the pass@k(M,C,A) metric that conditions evaluation on model, corpus, and explicit protocol axes including performance preservation.

If this is right

Evaluations of LLM code repair in performance-critical domains must incorporate explicit performance checks or risk overestimating capability.
Small adjustments to performance-loss tolerance produce large shifts in reported debugging success.
Detailed breakdowns by failure category and stagnation mode expose specific LLM weaknesses in handling CUDA memory and execution subtleties.
True repair requires restoring both correctness and the original optimization structure rather than substituting a slower safe variant.
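The sensitivity to performance tolerance described in the points above can be illustrated with a small threshold sweep. The sketch below uses invented per-task values, not results from the benchmark, and the helper name `success_rate` is hypothetical. It assumes each task is summarized by whether its best repair passes tests and by that repair's speedup relative to the reference.

```python
# Toy data: (tests_pass, speedup) for the best repair found on each task.
# These values are invented for illustration; they are not benchmark results.
repairs = [
    (True, 1.00), (True, 0.97), (True, 0.85), (True, 0.60),
    (True, 0.30), (True, 0.92), (False, 0.00), (True, 0.55),
    (True, 0.99), (False, 0.00),
]


def success_rate(repairs, p):
    """Fraction of tasks whose repair is correct and has speedup >= p."""
    solved = sum(1 for ok, speedup in repairs if ok and speedup >= p)
    return solved / len(repairs)


# Sweep the gate from "correctness only" (p = 0) to a strict requirement.
thresholds = [0.0, 0.5, 0.8, 0.9, 0.95]
sweep = {p: success_rate(repairs, p) for p in thresholds}

for p, rate in sweep.items():
    print(f"speedup >= {p:.2f}: {rate:.0%} of tasks count as repaired")
```

Reporting the whole curve rather than a single threshold makes it visible how much of a fixer's apparent success depends on tolerating slower code.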

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar degeneration risks likely exist in benchmarks for other performance-sensitive languages such as OpenCL or HIP.
Extending the benchmark to multi-file projects or runtime profiling data could surface additional failure modes not captured by single-file edits.
Protocol-aware metrics may become standard for any automated repair task where speed matters as much as correctness.

Load-bearing premise

The 213 tasks drawn from LLM-generated failing workspaces are representative of real-world CUDA debugging needs and the chosen performance preservation metric correctly identifies degeneration without missing other failure modes.

What would settle it

Apply the same seven LLMs to a fresh collection of human-authored failing CUDA programs and measure whether the 40-point swing in success rate between loose and strict performance protocols still appears.

Figures

Figures reproduced from arXiv: 2605.08455 by Caiwen Ding, Haoyang Chen, Mattia Fazzini, Shiyang Li.

**Figure 1.** Repair by degeneration. Here, the racecar denotes an optimized GPU kernel and the bicycle a correct-but-slow fallback; a useful repair preserves optimization structure, but LLMs often simplify candidates into slower correct programs. Existing benchmarks cannot expose repair by degeneration behavior of LLMs. On these benchmarks, evaluation starts from specifications and discards the failing intermediate p…

**Figure 2.** Corpus coverage across workload size (lines of code, log …

**Figure 3.** Iterative vs. repeated debugging pipelines (axis …

**Figure 4.** (Left) Per-model degeneration evidence on the corpus’s eventually-passing tasks: median …

**Figure 5.** Feedback richness A3 (left) and conversation-history depth A4 (right). Per-model pass@k trajectories; star marks each model’s best setting, dashed grey line marks the default. We restrict to tasks each model solves at p = 0 and compare the model’s own iter-N pass to its own iter-1 pass. Such comparison controls for both reference-baseline strength and cross-model generation gap, leaving within-model tempor…

Original abstract

Debugging CUDA programs has long been challenging because failures often arise from subtle interactions among hardware behavior, compiler decisions, memory hierarchy, and asynchronous execution. More importantly, with the rapid expansion of GPU usage across scientific computing, machine learning, graphics, and systems workloads, CUDA debugging has become more challenging than ever. Current evaluations of LLM-based CUDA programming largely miss this setting: a model can pass correctness tests with repair by degeneration, simplifying the CUDA code into a safer but slower program that abandons the original optimization structure. We introduce CUDABEAVER, a benchmark for CUDA debugging from real failing workspaces produced during LLM-based CUDA generation. Each task provides the broken candidate, native build/test commands, raw error evidence, and a single editable file. CUDABEAVER evaluates whether a fixer truly repairs the failing CUDA code or merely finds a slower test-passing replacement, reporting results by failure category, debugging trajectory, stagnation mode, and performance preservation. We further propose pass@k(M,C,A), a protocol-conditional CUDA debugging metric that makes the fixer M, corpus C, and protocol axes A explicit. Using this metric across 213 tasks and seven frontier LLMs, we show that protocol-aware evaluation gives a more faithful view of CUDA debugging ability: when performance-loss tolerance is high, fixers appear much stronger, but even a slightly stricter performance requirement can sharply reduce measured success, shifting scores by up to 40 percentage points.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CUDABeaver shows performance preservation cuts LLM CUDA fix rates sharply, but the LLM-generated tasks limit how far the results generalize.

The main point is that letting models pass tests by degenerating CUDA code inflates success rates, and this benchmark quantifies the drop: up to 40 points when you tighten the performance bar. They built 213 tasks from real failing LLM workspaces, each with build commands, error output, and one editable file, then ran seven frontier models through a new conditional pass@k that tracks model, corpus, and acceptance rules including speed checks. Results are broken down by failure category and trajectory, which makes the sensitivity to tolerance clear and useful for anyone evaluating repair tools in performance-sensitive domains.

The work does a solid job naming the gap in prior CUDA LLM evaluations, where test passage alone misses whether the original optimization intent survived. The tasks and metric are presented as new and not just another coding benchmark.

The limitation is that all 213 cases come from LLM-generated broken code, so the bug distribution leans toward common model errors like indexing or synchronization slips rather than the hardware quirks, compiler edge cases, or legacy maintenance issues that dominate actual CUDA work. This narrows the claim that the protocol gives a broadly more faithful view. Task validation details and threshold choices for performance also need close checking to confirm the numbers hold.

This is worth attention for researchers working on LLM code repair for GPUs or scientific computing. It deserves peer review because the core protocol idea is worth stress-testing and extending, even if the current task set needs more diversity to strengthen the conclusions.

Referee Report

3 major / 2 minor

Summary. The paper introduces CUDABeaver, a benchmark of 213 tasks constructed from LLM-generated failing CUDA workspaces, each providing a broken candidate, build/test commands, and error evidence. It proposes the protocol-conditional metric pass@k(M,C,A) that conditions success on model M, corpus C, and evaluation axes A (including performance-loss tolerance). Experiments across seven frontier LLMs show that relaxing performance preservation inflates apparent debugging success, with measured pass rates shifting by up to 40 percentage points under stricter tolerance, arguing that standard evaluations overestimate true repair capability by permitting degenerate but test-passing simplifications.

Significance. If the benchmark tasks prove representative, the work provides a concrete demonstration that protocol design materially affects measured LLM debugging performance on CUDA, highlighting a previously under-examined failure mode (performance-degrading repair). The explicit conditioning on performance axes and reporting by failure category and stagnation mode are useful contributions for future empirical studies in LLM code repair.

major comments (3)

[Benchmark Construction] Benchmark construction section: the paper states that the 213 tasks originate exclusively from LLM generation runs but provides no validation that the resulting bug distribution (indexing errors, synchronization issues, etc.) matches the hardware-specific, compiler-interaction, or legacy-code pathologies typical of production CUDA maintenance; without such evidence or a comparison to real-world bug corpora, the claim that protocol-aware evaluation yields a 'more faithful view' of debugging ability rests on an untested representativeness assumption.
[Evaluation Protocol] Evaluation and metric definition: the performance-preservation component of pass@k(M,C,A) is described at a high level but lacks precise operationalization (e.g., exact runtime thresholds relative to the original optimized code, hardware platform used for timing, handling of non-deterministic kernels); this ambiguity directly affects the reported 40pp swings and prevents readers from reproducing or extending the sensitivity analysis.
[Experimental Results] Results and statistical controls: while results are broken down by failure category and trajectory, the manuscript does not report confidence intervals, multiple-run variance, or controls for prompt sensitivity and temperature; given that the central empirical claim concerns large metric shifts, absence of these controls leaves the magnitude and reliability of the 40pp effect only partially supported.

minor comments (2)

[Metric Definition] Notation for pass@k(M,C,A) is introduced without an explicit equation; adding a formal definition would improve clarity.
[Abstract and Introduction] The abstract and introduction use 'real failing workspaces' without immediate qualification that these are LLM-generated; a parenthetical clarification on first use would prevent misreading.
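One way the explicit definition requested in the first minor comment might look, assuming the standard unbiased pass@k estimator averaged over tasks and the performance gate quoted from the paper (pass = correctness ∧ speedup ≥ p). The symbols n, c_t, and r_{t,i}, and the choice to average over the corpus, are assumptions made for illustration and are not the authors' definition:

```latex
\[
\mathrm{pass@}k(M, C, A)
  = \frac{1}{|C|} \sum_{t \in C}
    \left[ 1 - \frac{\binom{n - c_t(M, A)}{k}}{\binom{n}{k}} \right],
\qquad
c_t(M, A) = \sum_{i=1}^{n}
  \mathbf{1}\!\left[ \mathrm{correct}(r_{t,i}) \wedge \mathrm{speedup}(r_{t,i}) \ge p \right]
\]
```

Here r_{t,i} denotes the i-th of n repairs that fixer M produces for task t in corpus C, and the threshold p is one of the protocol axes in A. Written this way, a change in p alters only the count c_t, so the protocol dependence of the score is visible in the formula.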

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, indicating where revisions will be made to strengthen the manuscript.

Referee: [Benchmark Construction] Benchmark construction section: the paper states that the 213 tasks originate exclusively from LLM generation runs but provides no validation that the resulting bug distribution (indexing errors, synchronization issues, etc.) matches the hardware-specific, compiler-interaction, or legacy-code pathologies typical of production CUDA maintenance; without such evidence or a comparison to real-world bug corpora, the claim that protocol-aware evaluation yields a 'more faithful view' of debugging ability rests on an untested representativeness assumption.

Authors: We agree that the benchmark's construction from LLM-generated failures means its bug distribution has not been explicitly validated against production CUDA corpora. Our design choice targets the specific use case of repairing errors produced by LLMs themselves, which is directly relevant to automated debugging pipelines. We will revise the benchmark construction and limitations sections to explicitly state this scope, add a qualitative comparison of observed bug types (e.g., indexing, synchronization) to those reported in open CUDA repositories where possible, and frame the 'more faithful view' claim as conditional on this LLM-centric setting rather than claiming broad representativeness of all CUDA maintenance. revision: partial
Referee: [Evaluation Protocol] Evaluation and metric definition: the performance-preservation component of pass@k(M,C,A) is described at a high level but lacks precise operationalization (e.g., exact runtime thresholds relative to the original optimized code, hardware platform used for timing, handling of non-deterministic kernels); this ambiguity directly affects the reported 40pp swings and prevents readers from reproducing or extending the sensitivity analysis.

Authors: We acknowledge the need for precise operational details. In the revised manuscript we will expand the metric definition to specify: performance tolerance as no more than 10% increase in kernel runtime relative to the original optimized version; hardware as NVIDIA A100 GPUs with CUDA 12.4; timing via CUDA events averaged over 5 warm-up and 10 measurement runs; and non-determinism handling by fixing seeds for random operations and reporting median runtime across runs. These additions will make the 40pp sensitivity results fully reproducible. revision: yes
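The measurement protocol in this simulated response can be sketched as follows. This is a host-side stand-in that assumes `time.perf_counter` in place of CUDA events so that it runs without a GPU. The names `median_runtime`, `preserves_performance`, and the two toy workloads are hypothetical; the warm-up count, run count, and 10% tolerance are the values stated in the response.

```python
import statistics
import time


def median_runtime(fn, warmup=5, runs=10):
    """Median wall-clock time of fn() over `runs` timed calls after warm-up.

    Stand-in for CUDA-event timing: a real harness would record GPU events
    around the kernel launch and synchronize before reading elapsed time.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def preserves_performance(t_reference, t_repaired, tolerance=0.10):
    """True if the repaired runtime is at most (1 + tolerance) x reference."""
    return t_repaired <= t_reference * (1.0 + tolerance)


def reference_workload():
    """Toy stand-in for the original optimized kernel."""
    return sum(i * i for i in range(20_000))


def degenerate_workload():
    """Toy stand-in for a slower repair: same result, about 5x the work."""
    total = 0
    for _ in range(5):
        total = sum(i * i for i in range(20_000))
    return total


t_ref = median_runtime(reference_workload)
t_slow = median_runtime(degenerate_workload)
print("reference vs itself:", preserves_performance(t_ref, t_ref))
print("reference vs slower repair:", preserves_performance(t_ref, t_slow))
```

Using the median rather than the mean keeps a single slow outlier run from flipping the pass/fail decision near the tolerance boundary.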
Referee: [Experimental Results] Results and statistical controls: while results are broken down by failure category and trajectory, the manuscript does not report confidence intervals, multiple-run variance, or controls for prompt sensitivity and temperature; given that the central empirical claim concerns large metric shifts, absence of these controls leaves the magnitude and reliability of the 40pp effect only partially supported.

Authors: We agree that additional statistical controls would strengthen the central claim. We will add bootstrap-derived 95% confidence intervals to all pass@k(M,C,A) figures, include a sensitivity table showing variance across three temperatures (0.0, 0.2, 0.5) and two prompt phrasings, and report the standard deviation of the observed metric shifts. These controls will be computed from additional evaluation runs performed during revision. revision: yes
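A minimal sketch of the bootstrap interval mentioned in the last response, assuming a percentile bootstrap over per-task binary outcomes. The function name `bootstrap_ci`, the resample count, and the toy outcome vector are assumptions for illustration; only the corpus size of 213 comes from the paper.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of per-task outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample tasks with replacement and record the resampled pass rate.
        resample = [outcomes[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    alpha = (1.0 - level) / 2.0
    lo = means[int(alpha * n_resamples)]
    hi = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lo, hi


# Toy per-task outcomes (1 = repaired under the gate, 0 = not).
# The split is invented; only the total of 213 tasks matches the corpus.
outcomes = [1] * 120 + [0] * 93
point = sum(outcomes) / len(outcomes)
low, high = bootstrap_ci(outcomes)
print(f"pass rate {point:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Resampling at the task level treats tasks as the unit of variation; variance across temperatures or prompt phrasings would need separate runs, as the response notes.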

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark with direct measurements

The paper introduces an empirical benchmark (CUDABEAVER with 213 tasks) and a conditional metric pass@k(M,C,A), then reports measured success rates under varying performance tolerances. All claims rest on experimental outcomes from held-out tasks rather than any derivation, fitted parameter, or self-referential prediction. No equations, uniqueness theorems, or ansatzes appear; the central observation (up to 40pp score shift) is a direct count of pass/fail under different protocol axes. Self-citations, if present, are not load-bearing for the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model or derivation is present; the contribution rests on the empirical construction of the benchmark and the definition of the new metric.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We formalise that pass@k itself is protocol-conditional... pass@k(M, C, A)... performance-gate threshold p in pass=correctness∧speedup≥p"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "repair by degeneration... abandoning the original kernel’s optimized structure"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.