AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents
Pith reviewed 2026-05-19 21:25 UTC · model grok-4.3
The pith
A new benchmark shows that AI coding agents reach mean speedups of up to 6.89x on GPU kernels, but kernels translated from PyTorch to HIP lose substantial correctness on input shapes the agent never saw.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgentKernelArena is an open benchmark containing 196 tasks that measures full agent workflows through isolated workspaces, gated compilation and correctness checks, performance measurement, and an unseen-configuration protocol. Across agents such as Cursor Agent, Claude Code, and Codex Agent, the strongest setups reach mean speedups of up to 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton tasks. HIP-to-HIP and Triton-to-Triton optimizations largely transfer to unseen shapes, while PyTorch-to-HIP shows substantial correctness drops on them.
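The gated checks described above can be pictured roughly as follows. This is a minimal Python sketch under stated assumptions, not the benchmark's actual harness: `GateResult`, `evaluate_submission`, and the callables passed to it are hypothetical names, and real compilation and GPU timing are stood in for by plain Python functions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class GateResult:
    compiled: bool
    correct: bool
    speedup: Optional[float]  # None unless both earlier gates passed


def evaluate_submission(
    compile_fn: Callable[[], bool],
    candidate: Callable[[Sequence[float]], Sequence[float]],
    reference: Callable[[Sequence[float]], Sequence[float]],
    inputs: Sequence[Sequence[float]],
    time_fn: Callable[[Callable, Sequence[float]], float],
    tol: float = 1e-6,
) -> GateResult:
    # Gate 1: a submission that does not compile is never run.
    if not compile_fn():
        return GateResult(False, False, None)

    # Gate 2: the candidate must match the reference on every input.
    for x in inputs:
        try:
            got = candidate(x)
        except Exception:
            return GateResult(True, False, None)
        want = reference(x)
        if len(got) != len(want):
            return GateResult(True, False, None)
        if any(abs(g - w) > tol for g, w in zip(got, want)):
            return GateResult(True, False, None)

    # Gate 3: performance is measured only for correct candidates.
    baseline_time = sum(time_fn(reference, x) for x in inputs)
    candidate_time = sum(time_fn(candidate, x) for x in inputs)
    return GateResult(True, True, baseline_time / candidate_time)
```

The point of the ordering is that a speedup is only ever reported for a kernel that has already compiled and matched the reference, so a fast but wrong kernel cannot inflate the mean.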
What carries the argument
The unseen-configuration generalization protocol that tests whether agent-generated optimizations continue to work on input shapes the agent never encountered during the task.
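A toy illustration of what such a protocol can catch, written as a hedged Python sketch rather than the paper's procedure. Everything here is an assumption made for illustration: the tile width `BLOCK`, the functions `reference_sum`, `hardcoded_sum`, and `pass_rate`, and the two shape lists. The candidate only handles full tiles, so it is correct on every shape it was developed against and wrong on most others.

```python
from typing import Callable, Sequence

BLOCK = 4  # hypothetical tile width baked into the candidate


def reference_sum(x: Sequence[float]) -> float:
    return sum(x)


def hardcoded_sum(x: Sequence[float]) -> float:
    # Processes full tiles only; a trailing partial tile is silently dropped.
    total = 0.0
    for start in range(0, len(x) - BLOCK + 1, BLOCK):
        total += sum(x[start:start + BLOCK])
    return total


def pass_rate(
    kernel: Callable[[Sequence[float]], float],
    shapes: Sequence[int],
    tol: float = 1e-9,
) -> float:
    # Fraction of input lengths on which the kernel matches the reference.
    passed = 0
    for n in shapes:
        x = [float(i) for i in range(1, n + 1)]
        if abs(kernel(x) - reference_sum(x)) <= tol:
            passed += 1
    return passed / len(shapes)


seen_shapes = [4, 8, 16]     # lengths available while optimizing
unseen_shapes = [5, 12, 30]  # lengths held back for the generalization check

report = {
    "seen": pass_rate(hardcoded_sum, seen_shapes),
    "unseen": pass_rate(hardcoded_sum, unseen_shapes),
}
print(report)
```

In this toy case the candidate passes all seen lengths and only one of the three held-out lengths (12, which happens to be a multiple of the tile width), which is the kind of gap between seen and unseen correctness the protocol is meant to expose.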
Load-bearing premise
The 196 tasks and the specific unseen-configuration protocol are representative enough of real production kernel optimization work that measured agent performance and generalization behavior will predict usefulness outside the benchmark.
What would settle it
Running the same agents on a fresh collection of kernel tasks drawn from production codebases and finding markedly lower speedups or higher generalization failure rates would show that the benchmark results do not transfer.
Original abstract
GPU kernel optimization is increasingly critical for efficient deep learning systems, but writing high-performance kernels still requires substantial low-level expertise. Recent AI coding agents can iteratively read code, invoke compilers and profilers, and refine implementations, yet existing kernel benchmarks evaluate single LLM calls rather than full agent workflows, and none include both kernel-to-kernel optimization and unseen-configuration generalization testing. We present AgentKernelArena, an open-source benchmark for measuring AI coding agents on GPU kernel optimization. The benchmark contains 196 tasks spanning HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation, and evaluates complete agent workflows in isolated workspaces using gated compilation, correctness, and performance checks, centralized scoring and an unseen-configuration generalization protocol that tests whether optimizations transfer to input configurations the agent never observed. Across production agents including Cursor Agent, Claude Code, and Codex Agent, we find near-perfect compilation and high correctness rates on most task categories, with the strongest configurations achieving mean speedups of up to 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton tasks. Our unseen-configuration evaluation shows that HIP-to-HIP and Triton-to-Triton optimizations largely transfer to unseen input shapes, while PyTorch-to-HIP exhibits substantial correctness drops, indicating that agents generating kernels from scratch frequently hardcode shape-specific assumptions. AgentKernelArena is designed as a modular, extensible framework for rigorous evaluation of agentic GPU kernel optimization across agents, tasks, and hardware targets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AgentKernelArena, an open-source benchmark with 196 tasks spanning HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation. It evaluates full AI coding agent workflows (including Cursor Agent, Claude Code, and Codex Agent) in isolated workspaces using gated compilation, correctness, and performance checks, plus a centralized scoring system and an unseen-configuration protocol that tests transfer to input shapes never observed during optimization. Reported results include near-perfect compilation rates, high correctness on most categories, mean speedups reaching 6.89x (PyTorch-to-HIP), 6.69x (HIP-to-HIP), and 2.13x (Triton-to-Triton), with HIP-to-HIP and Triton-to-Triton optimizations largely transferring to unseen shapes while PyTorch-to-HIP shows substantial correctness drops.
Significance. If the 196 tasks and unseen-configuration protocol prove representative of production kernel workloads, the benchmark supplies a modular, extensible framework that fills a gap between single-LLM-call kernel benchmarks and full agentic workflows. It supplies concrete, falsifiable measurements of compilation success, speedup, and generalization that can guide iterative improvement of agents for low-level GPU code generation.
major comments (2)
- [Benchmark Construction] The task construction section provides no quantitative statistics on the 196 kernels (dimensionality, memory access patterns, fusion complexity, or baseline optimization headroom) and no explicit mapping or comparison to workloads drawn from PyTorch, JAX, or vendor libraries. This directly affects the load-bearing claim that measured agent performance and differential generalization will predict usefulness outside the benchmark.
- [Experimental Evaluation] The experimental protocol reports aggregate speedups and correctness but supplies no details on task selection criteria, number of independent runs per task, or statistical significance testing. Without these, it is impossible to determine whether post-hoc filtering or narrow task coverage influences the headline numbers (e.g., the 6.89x PyTorch-to-HIP figure).
minor comments (2)
- [Abstract] Clarify in the abstract and §4 whether the reported speedups are geometric means, arithmetic means, or medians, and over which exact subset of tasks.
- [Unseen-Configuration Protocol] The unseen-configuration protocol is described at a high level; a short pseudocode or diagram in §3.3 would improve reproducibility.
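The first minor comment matters because the three candidate aggregates can diverge sharply when a few tasks have very large speedups. A small Python sketch follows; the `speedups` values are invented for illustration and are not taken from the paper.

```python
import statistics

# Hypothetical per-task speedups; one outlier dominates the arithmetic mean.
speedups = [0.8, 1.0, 1.25, 2.0, 20.0]

arithmetic = statistics.fmean(speedups)
geometric = statistics.geometric_mean(speedups)
median = statistics.median(speedups)

print(f"arithmetic mean: {arithmetic:.2f}x")
print(f"geometric mean:  {geometric:.2f}x")
print(f"median:          {median:.2f}x")
```

On this made-up sample the arithmetic mean is about 5.0x while the median is 1.25x, so a headline figure is hard to interpret without knowing which aggregate it is and which tasks it covers.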
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating the revisions we will incorporate to improve the paper's rigor and transparency.
Point-by-point responses
- Referee: [Benchmark Construction] The task construction section provides no quantitative statistics on the 196 kernels (dimensionality, memory access patterns, fusion complexity, or baseline optimization headroom) and no explicit mapping or comparison to workloads drawn from PyTorch, JAX, or vendor libraries. This directly affects the load-bearing claim that measured agent performance and differential generalization will predict usefulness outside the benchmark.
  Authors: We agree that additional quantitative details on the kernel tasks and explicit linkages to production workloads would better support claims of representativeness and predictive value. In the revised manuscript we will add a new subsection to the benchmark construction section that reports aggregate statistics on kernel dimensionality, memory access patterns, fusion complexity, and baseline optimization headroom across the 196 tasks. We will also include a mapping table that aligns a representative subset of tasks with equivalent operations from PyTorch, JAX, and vendor libraries (e.g., cuBLAS and the ROCm libraries). These additions will directly address the concern about external validity.
  Revision: yes
- Referee: [Experimental Evaluation] The experimental protocol reports aggregate speedups and correctness but supplies no details on task selection criteria, number of independent runs per task, or statistical significance testing. Without these, it is impossible to determine whether post-hoc filtering or narrow task coverage influences the headline numbers (e.g., the 6.89x PyTorch-to-HIP figure).
  Authors: We acknowledge that the current description of the experimental protocol is insufficient for full reproducibility and for ruling out selection effects. We will expand the experimental evaluation section to explicitly state the task selection criteria used to assemble the 196 tasks, report the number of independent runs performed per task (noting that each agent-task combination was executed once owing to the substantial compute cost of full agent workflows), and add statistical significance measures such as 95% confidence intervals around the reported mean speedups. These clarifications will allow readers to assess whether the headline figures are robust.
  Revision: partial
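One way the promised confidence intervals could be computed is a percentile bootstrap over per-task speedups. The Python sketch below is an assumption about method, not something the authors specify; `bootstrap_mean_ci` and the sample values are hypothetical. Because each agent-task combination was executed once, an interval built this way reflects variation across tasks only and says nothing about run-to-run variance of a given agent on a given task.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_mean_ci(
    values: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    # Percentile bootstrap: resample tasks with replacement, recompute the
    # mean each time, and read the interval off the sorted resampled means.
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-task speedups, invented for illustration only.
task_speedups = [0.8, 1.0, 1.25, 2.0, 20.0, 1.1, 1.4, 0.9, 3.0, 1.6]
low, high = bootstrap_mean_ci(task_speedups)
print(f"mean {statistics.fmean(task_speedups):.2f}x, 95% CI [{low:.2f}, {high:.2f}]")
```

With a single large outlier among the tasks, the interval becomes wide and asymmetric, which is exactly the information a bare mean speedup hides.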
Circularity Check
No circularity; empirical benchmark results are independent measurements on externally defined tasks.
full rationale
The paper defines a new benchmark consisting of 196 tasks in three categories (HIP-to-HIP, Triton-to-Triton, PyTorch-to-HIP) together with an unseen-configuration protocol, then runs external commercial agents on those tasks and reports measured compilation rates, correctness, and speedups. No equations, fitted parameters, self-citations, or ansatzes are present that would reduce the reported speedups or generalization observations to quantities constructed inside the paper. The central claims are direct empirical outcomes of executing the agents in the defined workspaces; the benchmark itself is the input, and the performance numbers are the output with no reduction by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Compilation success, numerical correctness, and wall-clock timing on the provided test harness are reliable proxies for kernel quality.
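To make the timing part of this assumption concrete, here is a minimal CPU-only Python sketch of a wall-clock measurement with warmup and repeated samples. The function `median_wall_time` and its defaults are hypothetical and are not the benchmark's harness. On a real GPU, kernel launches are asynchronous, so a harness also has to synchronize the device before reading the clock; that step is omitted here.

```python
import statistics
import time
from typing import Callable


def median_wall_time(
    fn: Callable[[], object],
    warmup: int = 3,
    repeats: int = 11,
) -> float:
    # Warmup runs absorb one-time costs such as caching or lazy initialization.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    # The median is less sensitive to occasional scheduling noise than the mean.
    return statistics.median(samples)


elapsed = median_wall_time(lambda: sum(range(10_000)))
print(f"median wall time: {elapsed * 1e6:.1f} microseconds")
```

How many warmup and measured runs a harness uses, and whether it reports a median or a mean, directly affects how trustworthy the resulting speedup ratios are.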