CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
Pith reviewed 2026-05-12 01:40 UTC · model grok-4.3
The pith
A benchmark for LLM CUDA debugging shows that models often pass tests by degenerating optimized code into slower versions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CUDABEAVER supplies tasks drawn from actual LLM-generated failing CUDA workspaces together with native build and test commands and error evidence. It evaluates fixers not only on whether they restore correctness but also on whether they preserve performance, reporting results by failure category, debugging trajectory, and stagnation mode. The protocol-conditional metric pass@k(M,C,A) applied to seven LLMs demonstrates that high tolerance for performance loss makes fixers appear stronger, whereas stricter performance requirements sharply reduce success scores by up to 40 percentage points.
What carries the argument
The CUDABEAVER benchmark of 213 tasks from real failing workspaces and the pass@k(M,C,A) metric that conditions evaluation on model, corpus, and explicit protocol axes including performance preservation.
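A minimal Python sketch of how a protocol-conditional pass@k could be computed for one task. It assumes the gate pass = correctness ∧ speedup ≥ p quoted from the paper elsewhere on this page, takes speedup to mean original runtime divided by repaired runtime, and borrows the standard unbiased pass@k estimator 1 - C(n-c, k) / C(n, k). The estimator the paper actually uses is not stated here, and every function name and sample value below is hypothetical.

```python
from math import comb


def passes(correct: bool, speedup: float, p: float) -> bool:
    """Protocol-conditional success: tests pass AND speedup >= p."""
    return correct and speedup >= p


def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k: 1 - C(n - c, k) / C(n, k)."""
    if not 0 < k <= n:
        raise ValueError("require 0 < k <= n")
    if n - c < k:
        # Fewer than k failures: any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def protocol_pass_at_k(attempts, k: int, p: float) -> float:
    """attempts: list of (correct, speedup) pairs for a single task."""
    n = len(attempts)
    c = sum(passes(ok, s, p) for ok, s in attempts)
    return pass_at_k(n, c, k)


# Hypothetical attempts: (tests passed?, speedup vs. the original kernel)
attempts = [(True, 1.0), (True, 0.6), (False, 1.2), (True, 0.95)]

loose = protocol_pass_at_k(attempts, k=1, p=0.5)   # tolerant protocol
strict = protocol_pass_at_k(attempts, k=1, p=0.9)  # strict protocol
print(loose, strict)  # prints 0.75 0.5
```

The second attempt is correct but slow, so it counts under the tolerant gate and is rejected under the strict one. That is the mechanism by which the same set of fixes can score differently under different protocol axes.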
If this is right
- Evaluations of LLM code repair in performance-critical domains must incorporate explicit performance checks or risk overestimating capability.
- Small adjustments to performance-loss tolerance produce large shifts in reported debugging success.
- Detailed breakdowns by failure category and stagnation mode expose specific LLM weaknesses in handling CUDA memory and execution subtleties.
- True repair requires restoring both correctness and the original optimization structure rather than substituting a slower safe variant.
Where Pith is reading between the lines
- Similar degeneration risks likely exist in benchmarks for other performance-sensitive languages such as OpenCL or HIP.
- Extending the benchmark to multi-file projects or runtime profiling data could surface additional failure modes not captured by single-file edits.
- Protocol-aware metrics may become standard for any automated repair task where speed matters as much as correctness.
Load-bearing premise
The 213 tasks drawn from LLM-generated failing workspaces are representative of real-world CUDA debugging needs, and the chosen performance-preservation metric correctly identifies degeneration without missing other failure modes.
What would settle it
Apply the same seven LLMs to a fresh collection of human-authored failing CUDA programs and measure whether the 40-point swing in success rate between loose and strict performance protocols still appears.
Original abstract
Debugging CUDA programs has long been challenging because failures often arise from subtle interactions among hardware behavior, compiler decisions, memory hierarchy, and asynchronous execution. More importantly, with the rapid expansion of GPU usage across scientific computing, machine learning, graphics, and systems workloads, CUDA debugging has become more challenging than ever. Current evaluations of LLM-based CUDA programming largely miss this setting: a model can pass correctness tests with repair by degeneration, simplifying the CUDA code into a safer but slower program that abandons the original optimization structure. We introduce CUDABEAVER, a benchmark for CUDA debugging from real failing workspaces produced during LLM-based CUDA generation. Each task provides the broken candidate, native build/test commands, raw error evidence, and a single editable file. CUDABEAVER evaluates whether a fixer truly repairs the failing CUDA code or merely finds a slower test-passing replacement, reporting results by failure category, debugging trajectory, stagnation mode, and performance preservation. We further propose pass@k(M,C,A), a protocol-conditional CUDA debugging metric that makes the fixer M, corpus C, and protocol axes A explicit. Using this metric across 213 tasks and seven frontier LLMs, we show that protocol-aware evaluation gives a more faithful view of CUDA debugging ability: when performance-loss tolerance is high, fixers appear much stronger, but even a slightly stricter performance requirement can sharply reduce measured success, shifting scores by up to 40 percentage points.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CUDABeaver, a benchmark of 213 tasks constructed from LLM-generated failing CUDA workspaces, each providing a broken candidate, build/test commands, and error evidence. It proposes the protocol-conditional metric pass@k(M,C,A) that conditions success on model M, corpus C, and evaluation axes A (including performance-loss tolerance). Experiments across seven frontier LLMs show that relaxing performance preservation inflates apparent debugging success, with measured pass rates shifting by up to 40 percentage points under stricter tolerance, arguing that standard evaluations overestimate true repair capability by permitting degenerate but test-passing simplifications.
Significance. If the benchmark tasks prove representative, the work provides a concrete demonstration that protocol design materially affects measured LLM debugging performance on CUDA, highlighting a previously under-examined failure mode (performance-degrading repair). The explicit conditioning on performance axes and reporting by failure category and stagnation mode are useful contributions for future empirical studies in LLM code repair.
major comments (3)
- [Benchmark Construction] Benchmark construction section: the paper states that the 213 tasks originate exclusively from LLM generation runs but provides no validation that the resulting bug distribution (indexing errors, synchronization issues, etc.) matches the hardware-specific, compiler-interaction, or legacy-code pathologies typical of production CUDA maintenance; without such evidence or a comparison to real-world bug corpora, the claim that protocol-aware evaluation yields a 'more faithful view' of debugging ability rests on an untested representativeness assumption.
- [Evaluation Protocol] Evaluation and metric definition: the performance-preservation component of pass@k(M,C,A) is described at a high level but lacks precise operationalization (e.g., exact runtime thresholds relative to the original optimized code, hardware platform used for timing, handling of non-deterministic kernels); this ambiguity directly affects the reported 40pp swings and prevents readers from reproducing or extending the sensitivity analysis.
- [Experimental Results] Results and statistical controls: while results are broken down by failure category and trajectory, the manuscript does not report confidence intervals, multiple-run variance, or controls for prompt sensitivity and temperature; given that the central empirical claim concerns large metric shifts, absence of these controls leaves the magnitude and reliability of the 40pp effect only partially supported.
minor comments (2)
- [Metric Definition] Notation for pass@k(M,C,A) is introduced without an explicit equation; adding a formal definition would improve clarity.
- [Abstract and Introduction] The abstract and introduction use 'real failing workspaces' without immediate qualification that these are LLM-generated; a parenthetical clarification on first use would prevent misreading.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below, indicating where revisions will be made to strengthen the manuscript.
- Referee: [Benchmark Construction] Benchmark construction section: the paper states that the 213 tasks originate exclusively from LLM generation runs but provides no validation that the resulting bug distribution (indexing errors, synchronization issues, etc.) matches the hardware-specific, compiler-interaction, or legacy-code pathologies typical of production CUDA maintenance; without such evidence or a comparison to real-world bug corpora, the claim that protocol-aware evaluation yields a 'more faithful view' of debugging ability rests on an untested representativeness assumption.
Authors: We agree that the benchmark's construction from LLM-generated failures means its bug distribution has not been explicitly validated against production CUDA corpora. Our design choice targets the specific use case of repairing errors produced by LLMs themselves, which is directly relevant to automated debugging pipelines. We will revise the benchmark construction and limitations sections to explicitly state this scope, add a qualitative comparison of observed bug types (e.g., indexing, synchronization) to those reported in open CUDA repositories where possible, and frame the 'more faithful view' claim as conditional on this LLM-centric setting rather than claiming broad representativeness of all CUDA maintenance. revision: partial
- Referee: [Evaluation Protocol] Evaluation and metric definition: the performance-preservation component of pass@k(M,C,A) is described at a high level but lacks precise operationalization (e.g., exact runtime thresholds relative to the original optimized code, hardware platform used for timing, handling of non-deterministic kernels); this ambiguity directly affects the reported 40pp swings and prevents readers from reproducing or extending the sensitivity analysis.
Authors: We acknowledge the need for precise operational details. In the revised manuscript we will expand the metric definition to specify: performance tolerance as no more than 10% increase in kernel runtime relative to the original optimized version; hardware as NVIDIA A100 GPUs with CUDA 12.4; timing via CUDA events averaged over 5 warm-up and 10 measurement runs; and non-determinism handling by fixing seeds for random operations and reporting median runtime across runs. These additions will make the 40pp sensitivity results fully reproducible. revision: yes
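The timing protocol in the response above can be sketched in Python as a simple performance gate. This is an illustration under stated assumptions, not the authors' harness: it assumes the first 5 samples of each run list are warm-ups that get discarded, that the median of the remaining samples is the reported runtime, and that a repair passes when its median is at most 10% above the original's. The function names and sample timings are hypothetical, and collecting the timings themselves (for example with CUDA events) is outside this sketch.

```python
from statistics import median


def median_runtime(samples_ms, warmup=5):
    """Discard warm-up samples, then take the median of the rest."""
    measured = samples_ms[warmup:]
    if not measured:
        raise ValueError("need at least one measurement run after warm-up")
    return median(measured)


def within_tolerance(baseline_ms, candidate_ms, tolerance=0.10, warmup=5):
    """True if the candidate is at most `tolerance` slower than baseline."""
    base = median_runtime(baseline_ms, warmup)
    cand = median_runtime(candidate_ms, warmup)
    return cand <= base * (1.0 + tolerance)


# Hypothetical kernel timings in milliseconds: 5 warm-ups + 10 measurements.
baseline = [3.0] * 5 + [2.0] * 10
repaired_fast = [3.0] * 5 + [2.1] * 10   # 5% slower than baseline
repaired_slow = [3.0] * 5 + [2.5] * 10   # 25% slower than baseline

print(within_tolerance(baseline, repaired_fast))  # prints True
print(within_tolerance(baseline, repaired_slow))  # prints False
```

Using the median rather than the mean keeps a single outlier run from flipping the gate, which matters when the threshold is as tight as 10%.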
- Referee: [Experimental Results] Results and statistical controls: while results are broken down by failure category and trajectory, the manuscript does not report confidence intervals, multiple-run variance, or controls for prompt sensitivity and temperature; given that the central empirical claim concerns large metric shifts, absence of these controls leaves the magnitude and reliability of the 40pp effect only partially supported.
Authors: We agree that additional statistical controls would strengthen the central claim. We will add bootstrap-derived 95% confidence intervals to all pass@k(M,C,A) figures, include a sensitivity table showing variance across three temperatures (0.0, 0.2, 0.5) and two prompt phrasings, and report the standard deviation of the observed metric shifts. These controls will be computed from additional evaluation runs performed during revision. revision: yes
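As a rough illustration of the bootstrap intervals mentioned in the response above, the following Python sketch computes a percentile bootstrap 95% confidence interval over per-task outcomes. It assumes resampling at the task level with replacement, which the response does not specify. The function name, the resample count, and the outcome vector (120 successes out of 213 tasks) are hypothetical and are not results from the paper.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 task outcomes."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(outcomes)
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return rates[lo_idx], rates[hi_idx]


# Hypothetical outcomes: 1 = task repaired under the protocol, 0 = not.
outcomes = [1] * 120 + [0] * 93
point = sum(outcomes) / len(outcomes)
lo, hi = bootstrap_ci(outcomes)
print(f"pass rate {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

With intervals like this computed under both the loose and the strict protocol, a reader can judge whether a reported shift exceeds sampling noise.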
Circularity Check
No circularity: pure empirical benchmark with direct measurements
The paper introduces an empirical benchmark (CUDABEAVER with 213 tasks) and a conditional metric pass@k(M,C,A), then reports measured success rates under varying performance tolerances. All claims rest on experimental outcomes from held-out tasks rather than any derivation, fitted parameter, or self-referential prediction. No equations, uniqueness theorems, or ansatzes appear; the central observation (up to 40pp score shift) is a direct count of pass/fail under different protocol axes. Self-citations, if present, are not load-bearing for the reported results.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: We formalise that pass@k itself is protocol-conditional... pass@k(M, C, A)... performance-gate threshold p in pass = correctness ∧ speedup ≥ p
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: repair by degeneration... abandoning the original kernel’s optimized structure
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.