
arxiv: 2605.16819 · v1 · pith:BOO7GNKCnew · submitted 2026-05-16 · 💻 cs.CL · cs.AI · cs.LG

AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

Pith reviewed 2026-05-19 21:25 UTC · model grok-4.3

classification: 💻 cs.CL · cs.AI · cs.LG
keywords: GPU kernel optimization · AI coding agents · benchmark · generalization · HIP · Triton · PyTorch · speedup

The pith

A new benchmark reveals AI agents deliver up to 6.89x speedups on GPU kernels but show major generalization failures when translating from PyTorch to HIP.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AgentKernelArena to evaluate complete AI agent workflows on GPU kernel optimization rather than single code generations. It supplies 196 tasks covering HIP-to-HIP, Triton-to-Triton, and PyTorch-to-HIP work, plus an unseen-configuration protocol that checks whether performance gains survive new input shapes. A sympathetic reader would care because kernel efficiency directly affects deep-learning runtimes, and production agents are already being deployed without standardized tests of their full iterative process. The evaluation finds near-perfect compilation, high correctness, and large mean speedups, yet notes that PyTorch-to-HIP kernels often embed shape-specific assumptions that break on unseen inputs.
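To make "shape-specific assumptions" concrete, here is a minimal sketch in plain Python. It is an illustration only, not code from the paper or from any agent: `TILE`, `tuned_row_sum`, and the chosen sizes are hypothetical. The "optimized" routine is correct on the shape it was tuned for and silently wrong on a shape with a ragged tail.

```python
# Illustrative stand-in for a kernel; all names and sizes are hypothetical.

TILE = 64  # tile width assumed to have been picked while tuning on one shape


def reference_row_sum(row):
    """Shape-agnostic reference: sums a row of any length."""
    return sum(row)


def tuned_row_sum(row):
    """'Optimized' version that assumes len(row) is a multiple of TILE."""
    total = 0
    full_tiles = len(row) // TILE  # any tail shorter than TILE is dropped
    for t in range(full_tiles):
        start = t * TILE
        total += sum(row[start:start + TILE])
    return total


def matches_reference(n):
    """Compare the tuned routine with the reference on a row of length n."""
    row = list(range(n))
    return tuned_row_sum(row) == reference_row_sum(row)


if __name__ == "__main__":
    print(matches_reference(256))  # length seen during tuning
    print(matches_reference(300))  # unseen length with a ragged tail
```

A correctness check restricted to the tuning shape would pass this routine, which is the failure mode an unseen-configuration test is meant to expose.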

Core claim

AgentKernelArena is an open benchmark containing 196 tasks that measures full agent workflows through isolated workspaces, gated compilation and correctness checks, performance measurement, and an unseen-configuration protocol. Across agents such as Cursor Agent, Claude Code, and Codex Agent, the strongest setups reach mean speedups of up to 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton tasks. HIP-to-HIP and Triton-to-Triton optimizations transfer well to unseen shapes, while PyTorch-to-HIP shows substantial correctness drops.
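One way to read "gated" checks is that each stage runs only if the previous one passed, so a kernel that fails to compile is never tested and an incorrect kernel is never timed. The sketch below encodes that reading. It is an assumption about the general pattern, not the benchmark's actual evaluator; `EvalResult`, `gated_evaluate`, and the callback names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalResult:
    compiled: bool
    correct: bool
    speedup: Optional[float]  # stays None unless both gates pass


def gated_evaluate(
    compile_fn: Callable[[], bool],
    correctness_fn: Callable[[], bool],
    time_baseline_fn: Callable[[], float],
    time_optimized_fn: Callable[[], float],
) -> EvalResult:
    """Run compile -> correctness -> timing, stopping at the first failed gate."""
    if not compile_fn():
        return EvalResult(compiled=False, correct=False, speedup=None)
    if not correctness_fn():
        return EvalResult(compiled=True, correct=False, speedup=None)
    baseline = time_baseline_fn()
    optimized = time_optimized_fn()
    # Speedup is baseline time divided by optimized time (higher is better).
    return EvalResult(compiled=True, correct=True, speedup=baseline / optimized)


if __name__ == "__main__":
    # Hypothetical timings in milliseconds.
    print(gated_evaluate(lambda: True, lambda: True, lambda: 2.0, lambda: 0.5))
```

Gating in this order keeps a fast but wrong kernel from contributing a speedup to any aggregate.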

What carries the argument

The argument rests on the unseen-configuration generalization protocol, which tests whether agent-generated optimizations continue to work on input shapes the agent never encountered during the task.
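The Figure 3 caption names four correctness quadrants for unseen configurations: both pass, opt improvement, both fail, and opt regression. A minimal sketch of that bookkeeping follows. The mapping of quadrant names to baseline and optimized outcomes, and the definition of conditional correctness as the share of baseline-passing cases where the optimized kernel also passes, are assumptions made here for illustration; the paper's exact definitions are not shown on this page.

```python
from collections import Counter


def quadrant(baseline_ok: bool, optimized_ok: bool) -> str:
    """Classify one unseen-configuration result (assumed quadrant meanings)."""
    if baseline_ok and optimized_ok:
        return "both_pass"
    if baseline_ok:
        return "opt_regression"   # baseline passes, optimized kernel fails
    if optimized_ok:
        return "opt_improvement"  # baseline fails, optimized kernel passes
    return "both_fail"


def summarize(results):
    """Count quadrants and compute an assumed conditional-correctness rate."""
    counts = Counter(quadrant(b, o) for b, o in results)
    baseline_pass = counts["both_pass"] + counts["opt_regression"]
    conditional = counts["both_pass"] / baseline_pass if baseline_pass else None
    return counts, conditional


if __name__ == "__main__":
    # Hypothetical (baseline_ok, optimized_ok) outcomes on unseen shapes.
    demo = [(True, True), (True, True), (True, False), (False, False), (False, True)]
    counts, conditional = summarize(demo)
    print(dict(counts), conditional)
```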

Load-bearing premise

The 196 tasks and the specific unseen-configuration protocol are representative enough of real production kernel optimization work that measured agent performance and generalization behavior will predict usefulness outside the benchmark.

What would settle it

Running the same agents on a fresh collection of kernel tasks drawn from production codebases and finding markedly lower speedups or higher generalization failure rates would show that the benchmark results do not transfer.

Figures

Figures reproduced from arXiv: 2605.16819 by Dong Li, Emad Barsoum, Hao Li, Ji Liu, Mehdi Rezagholizadeh, Sharareh Younesian, Sharon Zhou, Sina Rafati, Vikram Appia, Wenwen Ouyang, Yuchen Yang, Yue Liu, Zhenyu Gu, Ziqiong Liu.

Figure 1: AgentKernelArena evaluation pipeline. Top: task source files, optional cheatsheets, and agent configuration are inputs. Middle: the workspace is set up, the original kernel is baselined, and the agent iteratively optimizes the kernel – prompted to produce up to max iterations successive versions (default 3). Bottom: after the agent session ends, a centralized evaluator independently runs gated compilation,…
Figure 2: Per-test-case execution time comparison for the fused moe gptq awq kernel (Triton-to-Triton, Claude Code / Opus 4.6). Each bar pair shows baseline vs. optimized execution time for a different parameter configuration (M=tokens, E=experts, K/N=matrix dimensions, grp=quantization group size). The agent achieves 1.55–2.40× speedup, with larger gains at higher expert counts and matrix dimensions.
Figure 3: Unseen-configuration generalization: quadrant breakdown. Each horizontal bar shows the fraction of tasks in each correctness quadrant (both pass, opt improvement, both fail, opt regression). Conditional correctness (%) is annotated on the right. To prevent contamination of future evaluations, we do not release the unseen configurations; we do release the generation script so that the protocol is fully repr…
Figure 4: Unseen vs. original-run mean speedup, per agent/model and per task category. Marker color encodes the model and marker shape encodes the agent platform. The dashed line is y = x (perfect transfer); points in the green region above the diagonal generalize better on unseen configurations than on original ones, while points in the red region below the diagonal lose speedup on unseen inputs. The vertical dista…
Figure 5: Task directory layout and config.yaml for the fused moe kernel Triton-to-Triton task. The agent receives the source file to optimize, while evaluation scripts run independently after the agent session ends.
Figure 6: Complete prompt assembled for the fused moe kernel Triton-to-Triton task on MI300X. The eight sections are: (1) task-type role, (2) source files and target functions, (3) GPU architecture pre-check, (4) task-specific optimization instructions, (5) completion directive, (6) hardware and language cheatsheets, (7) workspace path, and (8) an iteration directive appended by the agent launcher when max iteration…
Original abstract

GPU kernel optimization is increasingly critical for efficient deep learning systems, but writing high-performance kernels still requires substantial low-level expertise. Recent AI coding agents can iteratively read code, invoke compilers and profilers, and refine implementations, yet existing kernel benchmarks evaluate single LLM calls rather than full agent workflows, and none include both kernel-to-kernel optimization and unseen-configuration generalization testing. We present AgentKernelArena, an open-source benchmark for measuring AI coding agents on GPU kernel optimization. The benchmark contains 196 tasks spanning HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation, and evaluates complete agent workflows in isolated workspaces using gated compilation, correctness, and performance checks, centralized scoring and an unseen-configuration generalization protocol that tests whether optimizations transfer to input configurations the agent never observed. Across production agents including Cursor Agent, Claude Code, and Codex Agent, we find near-perfect compilation and high correctness rates on most task categories, with the strongest configurations achieving mean speedups of up to 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton tasks. Our unseen-configuration evaluation shows that HIP-to-HIP and Triton-to-Triton optimizations largely transfer to unseen input shapes, while PyTorch-to-HIP exhibits substantial correctness drops, indicating that agents generating kernels from scratch frequently hardcode shape-specific assumptions. AgentKernelArena is designed as a modular, extensible framework for rigorous evaluation of agentic GPU kernel optimization across agents, tasks, and hardware targets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AgentKernelArena, an open-source benchmark with 196 tasks spanning HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation. It evaluates full AI coding agent workflows (including Cursor Agent, Claude Code, and Codex Agent) in isolated workspaces using gated compilation, correctness, and performance checks, plus a centralized scoring system and an unseen-configuration protocol that tests transfer to input shapes never observed during optimization. Reported results include near-perfect compilation rates, high correctness on most categories, mean speedups reaching 6.89x (PyTorch-to-HIP), 6.69x (HIP-to-HIP), and 2.13x (Triton-to-Triton), with HIP-to-HIP and Triton-to-Triton optimizations largely transferring to unseen shapes while PyTorch-to-HIP shows substantial correctness drops.

Significance. If the 196 tasks and unseen-configuration protocol prove representative of production kernel workloads, the benchmark supplies a modular, extensible framework that fills a gap between single-LLM-call kernel benchmarks and full agentic workflows. It supplies concrete, falsifiable measurements of compilation success, speedup, and generalization that can guide iterative improvement of agents for low-level GPU code generation.

major comments (2)
  1. [Benchmark Construction] The task construction section provides no quantitative statistics on the 196 kernels (dimensionality, memory access patterns, fusion complexity, or baseline optimization headroom) and no explicit mapping or comparison to workloads drawn from PyTorch, JAX, or vendor libraries. This directly affects the load-bearing claim that measured agent performance and differential generalization will predict usefulness outside the benchmark.
  2. [Experimental Evaluation] The experimental protocol reports aggregate speedups and correctness but supplies no details on task selection criteria, number of independent runs per task, or statistical significance testing. Without these, it is impossible to determine whether post-hoc filtering or narrow task coverage influences the headline numbers (e.g., the 6.89x PyTorch-to-HIP figure).
minor comments (2)
  1. [Abstract] Clarify in the abstract and §4 whether the reported speedups are geometric means, arithmetic means, or medians, and over which exact subset of tasks.
  2. [Unseen-Configuration Protocol] The unseen-configuration protocol is described at a high level; a short pseudocode or diagram in §3.3 would improve reproducibility.
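The first minor comment matters because the three summary statistics can diverge sharply when a few tasks dominate. The sketch below uses made-up speedups (not data from the paper) and only Python's standard `statistics` module to show the gap.

```python
import statistics

# Hypothetical per-task speedups: three tasks unchanged, one large win.
speedups = [1.0, 1.0, 1.0, 16.0]

arithmetic = statistics.mean(speedups)           # pulled up by the outlier
geometric = statistics.geometric_mean(speedups)  # multiplicative average
median = statistics.median(speedups)             # typical task

if __name__ == "__main__":
    print(f"arithmetic mean: {arithmetic:.2f}x")
    print(f"geometric mean:  {geometric:.2f}x")
    print(f"median:          {median:.2f}x")
```

Here the arithmetic mean is 4.75x, the geometric mean is about 2x, and the median is 1x, so a headline "mean speedup" is hard to interpret without knowing which statistic was used and over which tasks.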

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating the revisions we will incorporate to improve the paper's rigor and transparency.

  1. Referee: [Benchmark Construction] The task construction section provides no quantitative statistics on the 196 kernels (dimensionality, memory access patterns, fusion complexity, or baseline optimization headroom) and no explicit mapping or comparison to workloads drawn from PyTorch, JAX, or vendor libraries. This directly affects the load-bearing claim that measured agent performance and differential generalization will predict usefulness outside the benchmark.

    Authors: We agree that additional quantitative details on the kernel tasks and explicit linkages to production workloads would better support claims of representativeness and predictive value. In the revised manuscript we will add a new subsection to the benchmark construction section that reports aggregate statistics on kernel dimensionality, memory access patterns, fusion complexity, and baseline optimization headroom across the 196 tasks. We will also include a mapping table that aligns a representative subset of tasks with equivalent operations from PyTorch, JAX, and vendor libraries (e.g., cuBLAS and the ROCm libraries). These additions will directly address the concern about external validity. revision: yes

  2. Referee: [Experimental Evaluation] The experimental protocol reports aggregate speedups and correctness but supplies no details on task selection criteria, number of independent runs per task, or statistical significance testing. Without these, it is impossible to determine whether post-hoc filtering or narrow task coverage influences the headline numbers (e.g., the 6.89x PyTorch-to-HIP figure).

    Authors: We acknowledge that the current description of the experimental protocol is insufficient for full reproducibility and for ruling out selection effects. We will expand the experimental evaluation section to explicitly state the task selection criteria used to assemble the 196 tasks, report the number of independent runs performed per task (noting that each agent-task combination was executed once owing to the substantial compute cost of full agent workflows), and add statistical significance measures such as 95% confidence intervals around the reported mean speedups. These clarifications will allow readers to assess whether the headline figures are robust. revision: partial
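As a rough illustration of what a 95% interval around a mean speedup could look like, the sketch below computes a percentile bootstrap over per-task speedups using only the standard library. This is one common choice, not a method the rebuttal commits to; `bootstrap_mean_ci`, its defaults, and the sample values are hypothetical.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the mean of per-task speedups."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    # Resample tasks with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(values), lo, hi


if __name__ == "__main__":
    # Hypothetical per-task speedups for one agent configuration.
    mean, lo, hi = bootstrap_mean_ci([1.0, 2.0, 3.0, 4.0, 10.0])
    print(f"mean {mean:.2f}x, 95% CI [{lo:.2f}, {hi:.2f}]")
```

With a single run per agent-task pair, an interval of this kind reflects variation across tasks only; it says nothing about run-to-run variance of the agents themselves.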

Circularity Check

0 steps flagged

No circularity; empirical benchmark results are independent measurements on externally defined tasks.


The paper defines a new benchmark consisting of 196 tasks in three categories (HIP-to-HIP, Triton-to-Triton, PyTorch-to-HIP) together with an unseen-configuration protocol, then runs external commercial agents on those tasks and reports measured compilation rates, correctness, and speedups. No equations, fitted parameters, self-citations, or ansatzes are present that would reduce the reported speedups or generalization observations to quantities constructed inside the paper. The central claims are direct empirical outcomes of executing the agents in the defined workspaces; the benchmark itself is the input, and the performance numbers are the output with no reduction by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is an empirical benchmark paper; it introduces no new mathematical axioms, fitted constants, or postulated physical entities. Evaluation relies on standard compiler and profiler outputs plus the assumption that the chosen tasks exercise representative optimization patterns.

axioms (1)
  • domain assumption: Compilation success, numerical correctness, and wall-clock timing on the provided test harness are reliable proxies for kernel quality.
    Invoked when the benchmark declares a kernel successful or reports speedup; appears in the description of gated checks.

pith-pipeline@v0.9.0 · 5871 in / 1438 out tokens · 75279 ms · 2026-05-19T21:25:38.745504+00:00 · methodology

