CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

Anne Ouyang; Fan Long; Jikai Jason Li; Tara Saba; Xujie Si; Zhiyang Chen

arxiv: 2604.01489 · v2 · pith:NCKZNFL3new · submitted 2026-04-01 · cs.LG · cs.AI · cs.DC · cs.PF · cs.SE

Tara Saba, Zhiyang Chen, Jikai Jason Li, Anne Ouyang, Xujie Si, Fan Long

keywords: cutegen, kernel, agentic, cute, generation, kernels, framework, high-performance

High-performance GPU kernels are critical to modern machine learning systems, yet developing them remains a manual, expert-driven process. Recent work has explored using LLMs to automate kernel generation, but generated kernels still fall short of carefully tuned references on standardized benchmarks. We present CuTeGen, an agentic GPU kernel synthesis framework that treats kernel development as a structured generate-test-refine workflow over the CuTe abstraction layer. Two design choices distinguish CuTeGen from prior work. First, it targets CuTe rather than raw CUDA, which exposes performance-critical structures such as tiling and data movement while remaining stable enough for iterative refinement. Second, it uses a delayed profiling schedule that withholds low-level performance feedback until the kernel's high-level structure has stabilized. On the 209 tasks of KernelBench Level-1 and Level-2, CuTeGen achieves an average speedup of 1.71$\times$ over PyTorch and outperforms the prior agentic baseline CudaForge (0.89$\times$) at comparable per-task generation cost. Code is available at https://github.com/taratt/cutegen.git.
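The generate-test-refine workflow with delayed profiling described in the abstract can be illustrated with a minimal sketch. This is not the CuTeGen implementation: the names `refine_loop`, `Feedback`, `generate`, `check`, `profile`, and `profile_after` are hypothetical, and the rule "profile only after a fixed number of steps" is an assumed stand-in for the paper's criterion that the kernel's high-level structure has stabilized.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Feedback:
    """What the agent sees after each attempt (hypothetical structure)."""
    correct: bool
    message: str
    profile: Optional[str] = None  # low-level profiler report, withheld early on


def refine_loop(
    generate: Callable[[Optional[Feedback]], str],
    check: Callable[[str], Tuple[bool, str]],
    profile: Callable[[str], str],
    max_iters: int = 8,
    profile_after: int = 4,
) -> Tuple[Optional[str], List[Feedback]]:
    """Run a generate-test-refine loop with a delayed profiling schedule.

    Assumption: profiling feedback is unlocked after `profile_after` steps,
    a simple proxy for "the high-level structure has stabilized".
    Returns the most recent correct candidate and the feedback history.
    """
    feedback: Optional[Feedback] = None
    last_correct: Optional[str] = None
    history: List[Feedback] = []

    for step in range(max_iters):
        # Generate: propose a kernel given the previous feedback (None at first).
        kernel = generate(feedback)

        # Test: compile and compare against a reference (stubbed by `check`).
        ok, message = check(kernel)

        # Refine: attach profiler output only for correct kernels, and only
        # once the early structural iterations are over.
        report = profile(kernel) if ok and step >= profile_after else None

        feedback = Feedback(correct=ok, message=message, profile=report)
        history.append(feedback)
        if ok:
            last_correct = kernel

    return last_correct, history
```

In a real system, `generate` would wrap an LLM call, `check` would compile the candidate and compare its output against the PyTorch reference, and `profile` would summarize a profiler report. The sketch only shows how withholding the profile in early iterations keeps the first rounds of feedback focused on correctness and structure.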
