Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization

Bernie Wang; George Karypis; Haoyang Fang; Huzefa Rangwala; Jiading Gai; Jin Huang; Kaj Bostrom; Shuai Zhang; Vihang Patil

arxiv: 2606.26453 · v1 · pith:4HYE3WKWnew · submitted 2026-06-24 · 💻 cs.LG

Pith reviewed 2026-06-26 01:10 UTC · model grok-4.3

classification 💻 cs.LG

keywords: CUDA kernel optimization · LLM agents · micro-profiling tools · GPU performance · MCTS search · KernelBench · energy efficiency · CuTe code generation

The pith

Micro-profiling tools convert hardware metrics into natural-language guidance that lets LLMs optimize CUDA kernels to expert levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

KernelPro is a closed-loop multi-agent system that pairs LLM code generation with pluggable micro-profiling tools to iteratively optimize GPU kernels. These tools encode expert heuristics by turning raw profiler outputs from ncu, SASS, and nsys into actionable feedback after roofline-based filtering. The system also uses a domain-adapted MCTS search and direct CuTe generation. On KernelBench it delivers geometric mean speedups of 2.42x, 4.69x, and 5.30x across difficulty levels while producing a from-scratch Hopper WGMMA kernel that beats hand-tuned Triton by 1.23x and reduces energy by 11.6 percent at matched speed. A reader cares because the results show structured tool feedback can substitute for human expertise in a domain where small code changes produce large performance differences.

Core claim

KernelPro shows that a semantic feedback operator, which maps raw hardware metrics to natural-language guidance via pluggable micro-profiling tools, combined with roofline-filtered two-stage tool invocation and a domain-adapted MCTS that includes progressive widening, dead-end pruning, and search memory, enables an LLM agent to generate optimized CUDA and CuTe kernels. These kernels achieve state-of-the-art geometric mean speedups on all three levels of KernelBench and surpass expert-optimized Triton kernels on VeOmni MoE workloads while also improving energy efficiency, with ablations confirming that each added component contributes measurable gains.
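The search ingredients named in this claim can be illustrated with a small sketch. It is not KernelPro's implementation. It assumes the textbook progressive-widening rule (a node may add a child only while its child count is below c · N^alpha), standard UCT selection, and log(speedup) as the reward. The `Node` class, the constants `c`, `alpha`, and `explore`, and the `dead` flag standing in for dead-end pruning are all hypothetical names chosen for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One candidate kernel in the search tree (hypothetical structure)."""
    visits: int = 0
    total_reward: float = 0.0
    dead: bool = False  # set when a candidate fails to compile or validate
    children: List["Node"] = field(default_factory=list)


def log_reward(speedup: float) -> float:
    """Log-scaled reward: a 2x speedup and a 2x slowdown are symmetric around 0."""
    return math.log(speedup)


def may_expand(node: Node, c: float = 1.0, alpha: float = 0.5) -> bool:
    """Progressive widening: permit a new child while |children| < c * N**alpha."""
    return len(node.children) < c * max(node.visits, 1) ** alpha


def uct_select(node: Node, explore: float = 1.4) -> Optional[Node]:
    """Pick the live child with the highest UCT score; skip pruned dead ends."""
    live = [ch for ch in node.children if not ch.dead]
    if not live:
        return None
    parent_visits = max(node.visits, 1)

    def score(ch: Node) -> float:
        if ch.visits == 0:
            return math.inf  # always try an unvisited child first
        mean = ch.total_reward / ch.visits
        bonus = explore * math.sqrt(math.log(parent_visits) / ch.visits)
        return mean + bonus

    return max(live, key=score)
```

With `alpha = 0.5`, a node visited four times may hold at most two children, so the tree widens only as evidence accumulates. Marking a node `dead` removes it from selection without deleting its statistics.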

What carries the argument

The semantic feedback operator carries the argument: it encodes expert heuristics as pluggable micro-profiling tools that transform raw hardware metrics into actionable natural-language guidance for the LLM.
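A minimal sketch of what such an operator might look like follows. It assumes each tool is a function from a metrics dictionary to an optional piece of advice. The metric keys (`achieved_occupancy_pct`, `global_load_sectors_per_request`), the thresholds, and the tool names are hypothetical and do not come from the paper or from any profiler's actual output schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Metrics = Dict[str, float]


@dataclass
class MicroTool:
    """One pluggable heuristic: inspects metrics, returns advice or None."""
    name: str
    check: Callable[[Metrics], Optional[str]]


def low_occupancy(m: Metrics) -> Optional[str]:
    # Hypothetical key and threshold, for illustration only.
    occ = m.get("achieved_occupancy_pct")
    if occ is not None and occ < 40.0:
        return (f"Achieved occupancy is {occ:.0f}%. Consider reducing "
                "registers per thread or shared memory per block.")
    return None


def scattered_loads(m: Metrics) -> Optional[str]:
    # Assumption: many 32-byte sectors per warp request indicates scattered access.
    spr = m.get("global_load_sectors_per_request")
    if spr is not None and spr > 8.0:
        return (f"Global loads touch {spr:.1f} sectors per request. "
                "Check that consecutive threads read consecutive addresses.")
    return None


TOOLS: List[MicroTool] = [
    MicroTool("occupancy", low_occupancy),
    MicroTool("coalescing", scattered_loads),
]


def semantic_feedback(metrics: Metrics, tools: List[MicroTool]) -> List[str]:
    """Run every tool and collect the natural-language findings that fired."""
    findings = []
    for tool in tools:
        message = tool.check(metrics)
        if message:
            findings.append(f"[{tool.name}] {message}")
    return findings
```

Because each heuristic is a separate object, a new expert rule can be added by appending one `MicroTool` to the list without touching the loop that builds the prompt text.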

If this is right

Micro-profiling tools produce statistically significant gains over raw metrics (p < 0.0001).
Domain-adapted MCTS yields 26 percent higher geometric mean performance than greedy search (p = 0.004).
Proactive tool orchestration adds 23 percent improvement (p = 0.035).
The same system can generate raw-CUDA plus CuTe kernels that exceed hand-tuned Triton on production MoE training workloads.
Energy efficiency can be optimized jointly with speed, producing an 11.6 percent measured reduction at matched performance.
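Two quantities recur in the list above: the geometric mean of per-kernel speedups and the energy reduction at matched speed. The sketch below shows how each could be computed, assuming energy is approximated as average power times runtime. The function names and the numbers in the usage lines are illustrative placeholders, not measurements from the paper.

```python
import math
from typing import Sequence


def geometric_mean(speedups: Sequence[float]) -> float:
    """Geometric mean via the mean of logs; requires positive inputs."""
    if not speedups or any(s <= 0 for s in speedups):
        raise ValueError("speedups must be a non-empty sequence of positive values")
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


def energy_joules(avg_power_watts: float, runtime_seconds: float) -> float:
    """Energy approximated as average power multiplied by runtime."""
    return avg_power_watts * runtime_seconds


def energy_reduction_pct(baseline_joules: float, optimized_joules: float) -> float:
    """Percentage of baseline energy saved by the optimized kernel."""
    return 100.0 * (baseline_joules - optimized_joules) / baseline_joules


# Illustrative placeholder values: identical runtime, lower average power.
baseline = energy_joules(avg_power_watts=200.0, runtime_seconds=0.002)
optimized = energy_joules(avg_power_watts=180.0, runtime_seconds=0.002)
print(geometric_mean([1.0, 4.0]), energy_reduction_pct(baseline, optimized))
```

The geometric mean is the usual choice for averaging ratios because a 2x gain on one kernel and a 2x loss on another cancel to 1x. At matched runtime, any energy reduction under this approximation comes entirely from lower average power.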

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same tool-encoding pattern could be applied to other code-generation domains such as CPU vectorization or FPGA design if equivalent profiling interfaces exist.
Search memory and cross-iteration learning may reduce the number of LLM calls needed on repeated optimization tasks for similar kernel families.
Direct source-level generation over an existing library like CUTLASS/CuTe may generalize to other domain-specific codebases that expose composable primitives.
Adding explicit energy or power metrics into the MCTS reward function could further tilt the search toward power-efficient solutions without separate post-processing.

Load-bearing premise

The micro-profiling tools and roofline classifier must correctly encode expert heuristics so the natural-language feedback supplied to the LLM is not systematically misleading or hardware-specific.

What would settle it

Running KernelPro on KernelBench with the micro-profiling tools disabled and measuring whether geometric mean speedups fall to the level of prior LLM-only baselines or raw-metric prompting.
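One simple way to analyze such an ablation is a paired comparison over the same kernels. The sketch below uses an exact two-sided sign test, chosen here only because it needs nothing beyond the standard library. This page does not say which statistical test the paper uses, so this is a stand-in, and the function name and arguments are hypothetical.

```python
from math import comb
from typing import Sequence


def sign_test_p(with_tools: Sequence[float], without_tools: Sequence[float]) -> float:
    """Exact two-sided sign test on paired per-kernel speedups (ties are dropped)."""
    if len(with_tools) != len(without_tools):
        raise ValueError("inputs must be paired, one entry per kernel")
    wins = sum(a > b for a, b in zip(with_tools, without_tools))
    losses = sum(a < b for a, b in zip(with_tools, without_tools))
    n = wins + losses
    if n == 0:
        return 1.0  # every pair tied, so there is no evidence either way
    k = min(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

A small p-value says only that the tool-enabled runs win on more kernels than chance would explain. It does not say by how much, so it would be read alongside the geometric-mean speedups of the two configurations.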

Figures

Figures reproduced from arXiv: 2606.26453 by Bernie Wang, George Karypis, Haoyang Fang, Huzefa Rangwala, Jiading Gai, Jin Huang, Kaj Bostrom, Shuai Zhang, Vihang Patil.

**Figure 1.** KernelPro agentic optimization workflow. Stage 1 (Benchmarking Agent) performs one-time roofline-based bottleneck classification. The iterative loop comprises search, programming, compilation, correctness validation, profiling, and semantic feedback. The Search Orchestrator (MCTS or greedy) maintains an expansion tree over candidate solutions, while the Stage 2 Profiling Agent invokes bottleneck-filtered m…

**Figure 2.** Multi-language input pipeline: all source languages are normalized to a unified format before being routed through language-specific prompts to produce an optimized CUDA kernel. While prior CUDA agents accept only PyTorch input (via KernelBench) or only existing CUDA code, KernelPro natively supports PyTorch, Triton, and CUDA inputs, enabling optimization of kernels at any stage of the development pipeline (…

Original abstract

We present KernelPro, a closed-loop multi-agent system that automatically generates, profiles, and iteratively optimizes GPU kernel code by integrating large language model (LLM) code generation with hardware profiler feedback and pluggable bottleneck detection tools. KernelPro introduces four contributions: (1) a semantic feedback operator that encodes expert heuristics as pluggable micro-profiling tools, transforming raw hardware metrics into actionable natural language guidance; (2) a two-stage tool invocation architecture where roofline-based bottleneck classification filters which specialized analysis tools execute, combining kernel-level (ncu), instruction-level (SASS), and system-level (nsys) profiling; (3) a domain-adapted MCTS with progressive widening, asymmetric branching, log-reward calibration, dead-end pruning, and search memory for cross-iteration learning; and (4) direct CuTe source-level code generation via autonomous code search over the CUTLASS/CuTe codebase. On KernelBench, KernelPro achieves geometric mean speedups of 2.42x/4.69x/5.30x on Levels 1/2/3, establishing state-of-the-art performance across all difficulty levels. On VeOmni's expert-optimized MoE training kernels, KernelPro achieves 1.23x over hand-tuned Triton by generating a from-scratch raw-CUDA+CuTe Hopper WGMMA kernel. Ablation studies demonstrate that each design component independently and significantly improves optimization quality: micro-profiling tools (p < 0.0001 vs raw metrics), MCTS search (26% higher geometric mean vs greedy, p = 0.004), and proactive tool orchestration (23% improvement, p = 0.035). Finally, KernelPro is the first CUDA kernel coding agent to optimize energy efficiency beyond the speed-only focus of prior systems, demonstrating an 11.6% measured energy reduction at matched speed.
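The two-stage invocation in contribution (2) can be sketched as follows, under the assumption of a basic two-class roofline: a kernel is memory-bound when its arithmetic intensity (FLOP per byte) falls below the ridge point, peak FLOP/s divided by peak bandwidth, and compute-bound otherwise. The tool names in the registry and the device peaks in the usage lines are hypothetical placeholders. The paper's classifier may use more classes or different inputs.

```python
from typing import Dict, List


def classify_bottleneck(flops: float, bytes_moved: float,
                        peak_flops: float, peak_bandwidth: float) -> str:
    """Stage 1: compare arithmetic intensity against the roofline ridge point."""
    if bytes_moved <= 0:
        return "compute"  # no recorded memory traffic, so memory is not the limiter
    intensity = flops / bytes_moved      # FLOP per byte moved
    ridge = peak_flops / peak_bandwidth  # intensity where the two roofs meet
    return "memory" if intensity < ridge else "compute"


# Hypothetical registry: which analysis tools are worth running per class.
TOOLS_BY_BOTTLENECK: Dict[str, List[str]] = {
    "memory": ["coalescing_check", "cache_hit_rate", "bank_conflicts"],
    "compute": ["instruction_mix", "register_pressure", "tensor_core_usage"],
}


def select_tools(flops: float, bytes_moved: float,
                 peak_flops: float, peak_bandwidth: float) -> List[str]:
    """Stage 2 filter: only tools matching the classified bottleneck execute."""
    label = classify_bottleneck(flops, bytes_moved, peak_flops, peak_bandwidth)
    return TOOLS_BY_BOTTLENECK[label]


# Placeholder device: 1e12 FLOP/s and 1e11 B/s give a ridge of 10 FLOP/B.
print(select_tools(flops=1e9, bytes_moved=1e9, peak_flops=1e12, peak_bandwidth=1e11))
```

Filtering this way keeps the feedback short: a memory-bound kernel never receives instruction-mix advice that would distract the model from the actual limiter.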

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

KernelPro wires LLMs to micro-profilers and a tuned MCTS to chase CUDA speedups, with the main open question being whether the tool feedback actually tracks expert judgment.

KernelPro's main advance is the closed loop that turns raw profiler output into natural-language guidance for an LLM agent, using a roofline classifier to pick which micro-profilers to run and then feeding the results back for iterative edits. The adapted MCTS with progressive widening and cross-iteration memory is the other concrete piece that looks new relative to earlier LLM kernel agents.

The paper shows the components add up: micro-profiling beats raw metrics, the search beats greedy, and the full system reaches the reported geometric-mean speedups on KernelBench levels plus the 1.23x win on the VeOmni MoE kernel. The energy measurement is a small but real addition. Ablations with p-values are helpful and give some confidence that the gains are not noise.

The soft spot is the untested mapping from metrics to text. The premise that these tools act as expert surrogates requires that the classifier and the natural-language translators correctly surface the bottlenecks a human would care about on Hopper WGMMA kernels. The speedups alone do not prove that; a mismatch would let the LLM chase the wrong things without the numbers flagging it. The abstract also leaves the KernelBench construction and baseline details thin, which makes the SOTA claim harder to weigh.

This is for readers who work on LLM agents for systems code or on automated performance engineering. It has enough new architecture and quantitative results to merit a serious referee, even if the feedback validation needs tightening in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript presents KernelPro, a closed-loop multi-agent LLM system for GPU kernel optimization that integrates semantic feedback from pluggable micro-profiling tools (roofline classification followed by ncu/SASS/nsys analysis), a domain-adapted MCTS with progressive widening and dead-end pruning, and direct CuTe source generation. It claims geometric-mean speedups of 2.42×/4.69×/5.30× on KernelBench Levels 1/2/3 (SOTA across difficulty levels), 1.23× over hand-tuned Triton on VeOmni MoE kernels via a from-scratch Hopper WGMMA kernel, an 11.6% energy reduction at matched speed, and statistically significant gains from each component (micro-profiling p<0.0001, MCTS p=0.004, orchestration p=0.035).

Significance. If the central performance claims hold after addressing experimental transparency, the work is significant for demonstrating a practical, reproducible path to LLM-driven kernel optimization that exceeds prior systems and hand-tuned baselines while extending to energy efficiency. The pluggable micro-profiling design and MCTS adaptations represent concrete engineering contributions that could be adopted in future automated performance tools.

major comments (2)

[Abstract and results section] The geometric-mean speedups (2.42×/4.69×/5.30×) and SOTA claim rest on unreviewed experimental choices; the manuscript supplies no baseline details, error bars, or description of how KernelBench levels were constructed, preventing independent assessment of the reported gains.
[§3.2 and ablation studies] The central 'expert surrogate' premise, that the roofline classifier plus micro-profilers faithfully encode expert heuristics and avoid systematic mis-routing (e.g., on Hopper WGMMA register-pressure vs. bandwidth trade-offs), is load-bearing but unsupported by direct validation against independent expert judgments on the same kernels; the reported p<0.0001 improvement over raw metrics does not address fidelity of the natural-language mapping.

minor comments (2)

[Abstract] The energy-efficiency result (11.6% reduction) is presented without corresponding baseline energy numbers or measurement methodology, which should be clarified for completeness.
[MCTS description] The MCTS description mentions 'log-reward calibration' and 'search memory' without explicit equations or pseudocode; adding these would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting issues of experimental transparency and validation of the expert-surrogate components. We address each comment below and will revise the manuscript to improve clarity and reproducibility while preserving the core claims supported by the existing ablation data.

Referee: [Abstract and results section] The geometric-mean speedups (2.42×/4.69×/5.30×) and SOTA claim rest on unreviewed experimental choices; the manuscript supplies no baseline details, error bars, or description of how KernelBench levels were constructed, preventing independent assessment of the reported gains.

Authors: We agree that the current manuscript lacks sufficient detail for independent verification. In the revised version we will add: (1) explicit baseline configurations including library versions, compilation flags, and hardware setup; (2) error bars or standard deviations accompanying all geometric-mean speedups; and (3) a dedicated paragraph describing the construction of KernelBench Levels 1–3, including the criteria used for difficulty assignment. These additions will appear in a new “Experimental Setup” subsection and the results section.

Revision: yes

Referee: [§3.2 and ablation studies] The central 'expert surrogate' premise, that the roofline classifier plus micro-profilers faithfully encode expert heuristics and avoid systematic mis-routing (e.g., on Hopper WGMMA register-pressure vs. bandwidth trade-offs), is load-bearing but unsupported by direct validation against independent expert judgments on the same kernels; the reported p<0.0001 improvement over raw metrics does not address fidelity of the natural-language mapping.

Authors: The reported p<0.0001 quantifies the performance benefit of the full tool pipeline versus raw metrics, which indirectly supports the utility of the encoded heuristics. The roofline classifier and subsequent profilers follow well-established expert practices. Nevertheless, we acknowledge that direct side-by-side comparison with independent expert judgments on the same kernels would provide stronger evidence of mapping fidelity, particularly for cases such as Hopper WGMMA register-pressure versus bandwidth decisions. We will revise §3.2 to expand the description of the natural-language mapping rules, include a brief discussion of potential mis-routing scenarios, and add a limitations paragraph noting the absence of direct expert validation. We view this as a partial revision because a new expert study is outside the scope of the current experiments.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark-driven system with independent evaluations

The paper describes an engineering system (KernelPro) whose central claims are geometric-mean speedups on external benchmarks (KernelBench Levels 1-3 and VeOmni MoE kernels) and ablation p-values. These results are measured outcomes, not quantities derived from equations or parameters fitted inside the paper. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the architecture description or evaluation chain; the roofline classifier and micro-profilers are presented as pluggable components whose correctness is assessed by downstream speedups rather than by internal consistency alone. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical engineering paper; the central claims rest on the assumption that the chosen profilers and roofline model faithfully represent expert reasoning, plus standard assumptions about benchmark representativeness. No new physical entities or fitted constants are introduced.

axioms (1)

Domain assumption: The roofline model correctly classifies kernel bottlenecks for the purpose of deciding which micro-profiling tools to invoke.
Invoked in the two-stage tool invocation architecture described in contribution (2).

[1] [1]

GPU kernel scientist: An LLM -driven framework for iterative kernel optimization

Martin Andrews and Sam Witteveen. GPU kernel scientist: An LLM -driven framework for iterative kernel optimization. In ES-FoMo III Workshop at ICML, 2025. URL https://arxiv.org/abs/2506.20807

arXiv 2025

[2] [2]

Continuous upper confidence trees with polynomial exploration -- consistency

David Auger, Adrien Couetoux, and Olivier Teytaud. Continuous upper confidence trees with polynomial exploration -- consistency. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), pp.\ 194--209, 2013. doi:10.1007/978-3-642-40988-2_13

work page doi:10.1007/978-3-642-40988-2_13 2013

[3] [3]

LongBench v2 : Towards deeper understanding and reasoning on realistic long-context multitasks

Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench v2 : Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv preprint arXiv:2412.15204, 2024. URL https://arxiv.org/abs/2412.15204

Pith/arXiv arXiv 2024

[4] [4]

Ahmed, and Ali Jannesari

Arijit Bhattacharjee, Heng Ping, Son Vu Le, Paul Bogdan, Nesreen K. Ahmed, and Ali Jannesari. OptiML : An end-to-end framework for program synthesis and CUDA kernel optimization. arXiv preprint arXiv:2602.12305, 2026. URL https://arxiv.org/abs/2602.12305

arXiv 2026

[5] [5]

Architecting an energy-efficient DRAM system for GPUs

Niladrish Chatterjee, Mike O'Connor, Donghyuk Lee, Daniel R Johnson, Stephen W Keckler, Minsoo Rhu, and William J Dally. Architecting an energy-efficient DRAM system for GPUs . In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp.\ 73--84. IEEE, 2017

2017

[6] [6]

AVO : Agentic variation operators for autonomous evolutionary search

Terry Chen, Zhifan Ye, Bing Xu, Zihao Ye, Timmy Liu, Ali Hassani, Tianqi Chen, Andrew Kerr, Haicheng Wu, Yang Xu, Yu-Jung Chen, Hanfeng Chen, Aditya Kane, Ronny Krashinsky, Ming-Yu Liu, Vinod Grover, Luis Ceze, Roger Bringmann, John Tran, Wei Liu, Fung Xie, Michael Lightstone, and Humphrey Shi. AVO : Agentic variation operators for autonomous evolutionary...

arXiv 2026

[7] [7]

cuPilot : A strategy-coordinated multi-agent framework for CUDA kernel evolution

Yongchao Chen, Yueying Li, Yue Zhang, Shreyas Singh, Tian Lan, and Yongle Zhang. cuPilot : A strategy-coordinated multi-agent framework for CUDA kernel evolution. In arXiv preprint arXiv:2512.16465, 2025. URL https://arxiv.org/abs/2512.16465

arXiv 2025

[8] [8]

KernelBlaster : Continual cross-task CUDA optimization via memory-augmented in-context reinforcement learning

Kris Shengjun Dong, Sahil Modi, Dima Nikiforov, Sana Damani, Edward Lin, Siva Kumar Sastry Hari, and Christos Kozyrakis. KernelBlaster : Continual cross-task CUDA optimization via memory-augmented in-context reinforcement learning. In arXiv preprint arXiv:2602.14293, 2026. URL https://arxiv.org/abs/2602.14293

arXiv 2026

[9] Yuxuan Han, Meng-Hao Guo, Zhengning Liu, Wenguang Chen, and Shi-Min Hu. Making LLMs optimize multi-scenario CUDA kernels like experts. arXiv preprint arXiv:2603.07169, 2026. URL https://arxiv.org/abs/2603.07169

[10] Myles Hollander, Douglas A. Wolfe, and Eric Chicken. Nonparametric Statistical Methods. John Wiley & Sons, 3rd edition, 2014

[11] Sunpyo Hong and Hyesoon Kim. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA), pp. 280–289, 2010

[12] Mark Horowitz. 1.1 Computing's energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14. IEEE, 2014

[13] Zhenyu Hou, Ziniu Hu, Yujiang Li, Rui Lu, Jie Tang, and Yuxiao Dong. TreeRL: LLM reinforcement learning with on-policy tree search. arXiv preprint arXiv:2506.11902, 2025. URL https://arxiv.org/abs/2506.11902

[14] International Energy Agency. Electricity 2024: Analysis and forecast to 2026. Technical report, International Energy Agency (IEA), Paris, 2024. URL https://www.iea.org/reports/electricity-2024

[15] Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, and Liaoni Wu. Tree search for LLM agent reinforcement learning. arXiv preprint arXiv:2509.21240, 2025. URL https://arxiv.org/abs/2509.21240

[16] Vijay Kandiah, Scott Peverelle, Mahmoud Khairy, Junrui Pan, Amogh Manjunath, Timothy G Rogers, Tor M Aamodt, and Nikos Hardavellas. AccelWattch: A power modeling framework for modern GPUs. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 738–753, 2021

[17] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning (ECML), pp. 282–293, 2006. doi:10.1007/11871842_29

[18] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP), 2023. URL https://arxiv.org/abs/2309.06180

[19] Jingwen Leng, Tayler Hetherington, Ahmed ElTantawy, Syed Gilani, Nam Sung Kim, Tor M Aamodt, and Vijay Janapa Reddi. GPUWattch: Enabling energy optimizations in GPGPUs. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA), pp. 487–498, 2013

[20] Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong, and Caiwen Ding. StitchCUDA: An automated multi-agents end-to-end GPU programing framework with rubric-based agentic reinforcement learning. In Proceedings of the 43rd International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research. PMLR, 2026. URL https://arxiv....

[21] Ajay Nayak and Arkaprava Basu. Over-synchronization in GPU programs. In 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2024

[22] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for scientific an... arXiv, 2025

[23] NVIDIA. CUTLASS: CUDA templates for linear algebra subroutines. https://github.com/NVIDIA/cutlass, 2023

[24] NVIDIA Corporation. CUDA C++ Best Practices Guide, 2024. URL https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/

[25] OpenAI. OpenAI Agents SDK. https://github.com/openai/openai-agents-python, 2025

[26] Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. KernelBench: Can LLMs write efficient GPU kernels? arXiv preprint arXiv:2502.10517, 2025. URL https://arxiv.org/abs/2502.10517

[27] Alireza Salemi, Chang Zeng, Atharva Nijasure, Jui-Hui Chung, Razieh Rahimi, Fernando Diaz, and Hamed Zamani. GrepSeek: Training search agents for direct corpus interaction. arXiv preprint arXiv:2605.29307, 2026. URL https://arxiv.org/abs/2605.29307

[28] Benjamin Spector, Aaryan Singhal, Simran Arora, and Chris Re. GPUs Go Brrr, 2024. URL https://hazyresearch.stanford.edu/blog/2024-05-12-tk. Hazy Research Blog, introducing ThunderKittens

[29] Ansor Team and Meta. KernelEvolve: Scaling agentic kernel coding for heterogeneous AI accelerators at Meta. arXiv preprint arXiv:2512.23236, 2025. URL https://arxiv.org/abs/2512.23236

[30] Philippe Tillet, H. T. Kung, and David Cox. Triton: An intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL), 2019

[31] Nina Wiedemann, Quentin Leboutet, Michael Paulitsch, Diana Wofk, and Benjamin Ummenhofer. KernelFoundry: Hardware-aware evolutionary GPU kernel optimization. arXiv preprint arXiv:2603.12440, 2026. URL https://arxiv.org/abs/2603.12440

[32] Fang Wu, Weihao Xuan, Heli Qi, Ximing Lu, Aaron Tu, Li Erran Li, and Yejin Choi. DeepSearch: Overcome the bottleneck of reinforcement learning with verifiable rewards via Monte Carlo tree search. arXiv preprint arXiv:2509.25454, 2025. URL https://arxiv.org/abs/2509.25454

[33] Charlene Yang, Thorsten Kurth, and Samuel Williams. Hierarchical roofline analysis for GPUs: Accelerating performance optimization for the NERSC-9 Perlmutter system. Concurrency and Computation: Practice and Experience, 32(20):e5547, 2020. doi:10.1002/cpe.5547

[34] Zeyu Yang, Karel Adámek, and Wesley Armour. Part-time power measurements: nvidia-smi's lack of attention. arXiv preprint arXiv:2312.02741, 2024

[35] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2210.03629

[36] Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, Wanli Ouyang, and Dongzhan Zhou. Accessing GPT-4 level mathematical olympiad solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B. arXiv preprint arXiv:2406.07394, 2024. URL https://arxiv.org/abs/2406.07394

[37] Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, and Caiwen Ding. CudaForge: An agent framework with hardware feedback for CUDA kernel optimization. arXiv preprint arXiv:2511.01884, 2025. URL https://arxiv.org/abs/2511.01884

[38] Zijian Zhang, Shiyang Li, Rong Wang, Yuebo Luo, Mingyi Hong, and Caiwen Ding. CUDA Agent: Large-scale agentic RL for high-performance CUDA kernel generation. arXiv preprint arXiv:2602.24286, 2026. URL https://arxiv.org/abs/2602.24286

[39] Keren Zhou, Xiaozhu Meng, Ryuichi Sai, and John Mellor-Crummey. GPA: A GPU performance advisor based on instruction sampling. In Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 115–125, 2021. URL https://arxiv.org/abs/2009.04061