{"paper":{"title":"CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A benchmark for LLM CUDA debugging shows that models often pass tests by degenerating optimized code into slower versions.","cross_cats":["cs.PL","cs.SE"],"primary_cat":"cs.LG","authors_text":"Caiwen Ding, Haoyang Chen, Mattia Fazzini, Shiyang Li","submitted_at":"2026-05-08T20:24:32Z","abstract_excerpt":"Debugging CUDA programs has long been challenging because failures often arise from subtle interactions among hardware behavior, compiler decisions, memory hierarchy, and asynchronous execution. More importantly, with the rapid expansion of GPU usage across scientific computing, machine learning, graphics, and systems workloads, CUDA debugging has become more challenging than ever. Current evaluations of LLM-based CUDA programming largely miss this setting: a model can pass correctness tests with repair by degeneration, simplifying the CUDA code into a safer but slower program that abandons th"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"protocol-aware evaluation gives a more faithful view of CUDA debugging ability: when performance-loss tolerance is high, fixers appear much stronger, but even a minor stricter performance requirement can sharply reduce measured success, shifting scores by up to 40 percentage points.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The 213 tasks drawn from LLM-generated failing workspaces are representative of real-world CUDA debugging needs and that the chosen performance preservation metric correctly identifies degeneration without missing other failure modes.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A benchmark for LLM CUDA debugging shows that models often pass tests by degenerating optimized code into slower versions.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"dc9181254e6d1d37b553e41c4ebd5d5d0ad73e9a59d35c88071845f847d2677a"},"source":{"id":"2605.08455","kind":"arxiv","version":2},"verdict":{"id":"9c743154-5756-4ac5-a50f-18faddb2b4bd","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-12T01:40:19.840437Z","strongest_claim":"protocol-aware evaluation gives a more faithful view of CUDA debugging ability: when performance-loss tolerance is high, fixers appear much stronger, but even a minor stricter performance requirement can sharply reduce measured success, shifting scores by up to 40 percentage points.","one_line_summary":"CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The 213 tasks drawn from LLM-generated failing workspaces are representative of real-world CUDA debugging needs and that the chosen performance preservation metric correctly identifies degeneration without missing other failure modes.","pith_extraction_headline":"A benchmark for LLM CUDA debugging shows that models often pass tests by degenerating optimized code into slower versions."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.08455/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"claim_evidence","ran_at":"2026-05-20T09:22:02.079069Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-20T04:35:35.287816Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_title_agreement","ran_at":"2026-05-19T15:01:17.824597Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T11:07:38.508974Z","status":"completed","version":"1.0.0","findings_count":0}],"snapshot_sha256":"29035b091c4154b70ee5b2671f5f5d6f514b608b24c738d961e649977cdccc04"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"984c5fd2f49e7d854cb4344eb0453ca38580f17135c704629650718248cdd05b"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}