FastKernels: Benchmarking GPU Kernel Generation in Production
Pith reviewed 2026-05-25 05:18 UTC · model grok-4.3
The pith
Even the strongest LLM agents for GPU kernel generation reach only a 0.94× aggregate speedup over production baselines, meaning their kernels run slower than the production code they would replace, on a benchmark built to match real systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FastKernels supplies 46 representative architectures whose kernels collectively subsume those of 96.2 percent (409 of 425) of HuggingFace Transformers architectures and whose task interfaces mirror the corresponding modules in current production libraries, allowing direct deployment of generated kernels. The same framework runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving workloads and exceeds upstream references on under-served architectures. Evaluating current kernel-generation agents on this setup shows the strongest reaching only a 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53×.
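The arithmetic behind these headline numbers can be checked in a few lines. The sketch below assumes the usual definition of speedup (baseline time divided by candidate time) and assumes the aggregate is a geometric mean over tasks; the page does not state the aggregation method, and the helper names are illustrative rather than part of FastKernels.

```python
import math


def coverage(covered: int, total: int) -> float:
    """Fraction of architectures whose kernels are subsumed by the benchmark."""
    return covered / total


def geomean(speedups):
    """Geometric mean; ASSUMED aggregation, not confirmed by the source."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


def runtime_change_percent(speedup: float) -> float:
    """Relative run-time change implied by speedup = t_baseline / t_candidate."""
    return (1.0 / speedup - 1.0) * 100.0


print(f"coverage: {coverage(409, 425):.1%}")  # coverage: 96.2%
for s in (0.94, 0.78, 0.53):
    # A speedup below 1.0 means the generated kernels take longer than the baseline.
    print(f"{s:.2f}x speedup -> {runtime_change_percent(s):+.1f}% run time")
```

Under that definition, a 0.94× speedup corresponds to roughly 6.4% more run time than the production baseline, and 0.53× to roughly 88.7% more.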
What carries the argument
The FastKernels benchmark itself: 46 architectures with production-matched interfaces, packaged so that the benchmark doubles as a minimalistic inference framework.
If this is right
- Kernel agents can be evaluated and improved using tasks whose outputs drop directly into existing production codebases without interface changes.
- Benchmarks that ignore compilation stacks reward kernels that introduce silent correctness issues or compatibility failures in real systems.
- Reward signals in current agent training favor replication of known optimizations rather than discovery of new ones.
- Field progress requires benchmarks whose measured gains translate into production throughput rather than sandbox scores.
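The point about silent correctness issues can be made concrete with a small acceptance gate. This is a minimal sketch, assuming a combined absolute-plus-relative tolerance; the function names and tolerance values are hypothetical and are not taken from the FastKernels code.

```python
import math


def max_scaled_error(candidate, reference, atol=1e-3, rtol=1e-3):
    """Largest |c - r| / (atol + rtol * |r|); a value <= 1.0 is within tolerance."""
    if len(candidate) != len(reference):
        raise ValueError("outputs must have the same length")
    worst = 0.0
    for c, r in zip(candidate, reference):
        # Treat NaN or infinite values as an automatic failure.
        if not (math.isfinite(c) and math.isfinite(r)):
            return math.inf
        worst = max(worst, abs(c - r) / (atol + rtol * abs(r)))
    return worst


def accept(candidate, reference, speedup, atol=1e-3, rtol=1e-3):
    """Accept a kernel only if it is both numerically close and actually faster."""
    return max_scaled_error(candidate, reference, atol, rtol) <= 1.0 and speedup > 1.0


reference = [1.0, 2.0, 3.0]
fast_but_wrong = [1.0, 2.0, 3.05]  # drifts on the last element
print(accept(fast_but_wrong, reference, speedup=1.4))  # False
print(accept(reference, reference, speedup=1.4))       # True
```

Gating on correctness before counting any speedup keeps a fast but numerically degraded kernel from being scored as an improvement.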
Where Pith is reading between the lines
- If the 96.2 percent coverage holds across evolving model families, work on these 46 tasks could improve the majority of deployed transformers without additional architecture-specific engineering.
- Extending the benchmark to include multi-GPU and distributed serving patterns would test whether the same alignment principle applies beyond single-device inference.
- Agents that succeed on FastKernels could serve as stronger starting points for production tuning loops that currently begin from hand-written kernels.
Load-bearing premise
The 46 selected architectures and their interfaces are sufficient to represent the production workloads that matter.
What would settle it
Demonstrating that kernels optimized on FastKernels produce measurable throughput gains when inserted into unmodified vLLM or SGLang codebases would support the alignment claim; the opposite result would undermine it.
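A test of this kind comes down to timing the same workload with and without the candidate kernel. The following is a minimal CPU-only sketch of such a comparison; `baseline_step` and `candidate_step` are hypothetical stand-ins for a serving step, not vLLM or SGLang calls, and a real GPU measurement would also need to synchronize the device before reading the timer because kernel launches are asynchronous.

```python
import statistics
import time


def median_time(fn, repeats=10, inner=20, warmup=3):
    """Median wall-clock time of `inner` back-to-back calls to fn."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(inner):
            fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def measure_speedup(baseline, candidate, repeats=10):
    """Speedup = baseline time / candidate time; values below 1.0 are regressions."""
    return median_time(baseline, repeats) / median_time(candidate, repeats)


N = 2_000


def baseline_step():
    # Stand-in workload: sum of squares computed with an explicit loop.
    return sum(i * i for i in range(N))


def candidate_step():
    # Stand-in "optimized" version: closed form for the same sum.
    return (N - 1) * N * (2 * N - 1) // 6


# Check equivalence first, then compare timings.
assert baseline_step() == candidate_step()
print(f"measured speedup: {measure_speedup(baseline_step, candidate_step):.1f}x")
```

Using the median over several repeats makes the ratio less sensitive to occasional scheduling noise than a single timed run.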
Original abstract
LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under-served architectures; each task's interface mirrors the corresponding module in the state-of-the-art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53×, confirming that benchmark-production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake-AI-Research/fastkernels
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FastKernels, a benchmark and minimalistic production-grade inference framework for evaluating LLM-based agents on GPU kernel generation. It selects 46 architectures across 8 categories whose kernels cover 96.2% (409/425) of HuggingFace Transformers architectures, with interfaces matching SOTA library modules. The framework runs at parity with vLLM and SGLang on mainstream serving and exceeds references on under-served architectures. Evaluation of SOTA kernel agents shows the strongest achieves only 0.94× aggregate speedup over production baselines (weaker agents at 0.78× and 0.53×), leading to the claim that benchmark-production misalignment is a critical bottleneck. The benchmark and code (https://github.com/Snowflake-AI-Research/fastkernels) are released to enable direct deployment of agent-generated kernels.
Significance. If the representativeness claim holds, the work is significant for the kernel-generation agent field by supplying a benchmark whose gains are intended to translate to production throughput, along with a reusable framework. Explicit credit is due for the code release, the parity demonstration with hardened systems (vLLM/SGLang), and the empirical agent evaluation that produces falsifiable speedup numbers rather than synthetic metrics.
major comments (1)
- [Benchmark construction] Benchmark construction (abstract and associated section): the central claim that misalignment is a critical bottleneck rests on FastKernels being a faithful production proxy, yet representativeness is established only by architecture count (96.2% coverage of 425 HF models) and interface mirroring. No analysis is supplied on whether the 46 architectures capture performance-critical production dimensions such as realistic tensor shapes, batch/sequence length distributions, KV-cache behavior, or memory-vs-compute bounds under serving loads. This gap directly affects whether the reported 0.94×/0.78×/0.53× speedups demonstrate general misalignment or benchmark idiosyncrasies.
minor comments (1)
- [Abstract] Abstract: the statement that the framework 'substantially exceeds upstream references on under-served architectures' lacks quantitative detail (speedup factors or specific architectures); adding one concrete figure would substantiate the claim without noticeably lengthening the abstract.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation of the work's significance, the credit for the code release and parity results, and the constructive comment on benchmark construction. We respond to the major comment below.
Referee: [Benchmark construction] Benchmark construction (abstract and associated section): the central claim that misalignment is a critical bottleneck rests on FastKernels being a faithful production proxy, yet representativeness is established only by architecture count (96.2% coverage of 425 HF models) and interface mirroring. No analysis is supplied on whether the 46 architectures capture performance-critical production dimensions such as realistic tensor shapes, batch/sequence length distributions, KV-cache behavior, or memory-vs-compute bounds under serving loads. This gap directly affects whether the reported 0.94×/0.78×/0.53× speedups demonstrate general misalignment or benchmark idiosyncrasies.
Authors: We thank the referee for this point. Representativeness is grounded in two elements: (1) the 46 architectures were chosen because their kernels subsume those used by 409/425 HF models, and (2) the framework achieves measured parity with vLLM and SGLang on mainstream serving workloads. Parity with hardened production systems necessarily implies that the evaluated workloads reflect realistic batch/sequence distributions, KV-cache behavior, and memory-vs-compute regimes encountered in deployment; otherwise the throughput numbers would diverge. Interface mirroring further ensures that any kernel optimizations are directly deployable. While the manuscript does not include an explicit appendix tabulating shape histograms or load distributions, the parity result supplies empirical evidence against idiosyncrasy. We will add a short subsection in the revised manuscript that reports the concrete serving configurations (batch sizes, sequence lengths, and model shapes) used for the vLLM/SGLang parity experiments and the agent evaluations.
revision: partial
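The memory-vs-compute question raised in this exchange can be illustrated with a standard roofline estimate for a single matrix multiply. The sketch below assumes 2-byte elements, counts each operand and the output as moved once, and uses made-up device peaks (1000 TFLOP/s, 3 TB/s) that do not describe any specific GPU; none of the shapes come from the paper.

```python
def gemm_arithmetic_intensity(m: int, n: int, k: int, bytes_per_element: int = 2) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) matmul, ignoring cache reuse."""
    flops = 2 * m * n * k                                 # one multiply and one add per term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def regime(m: int, n: int, k: int, peak_flops: float, peak_bandwidth: float) -> str:
    """Classify the matmul against the roofline ridge point of a device."""
    ridge = peak_flops / peak_bandwidth                   # FLOPs per byte at the ridge
    ai = gemm_arithmetic_intensity(m, n, k)
    return "compute-bound" if ai >= ridge else "memory-bound"


# HYPOTHETICAL device peaks, for illustration only.
PEAK_FLOPS = 1000e12      # FLOP/s
PEAK_BANDWIDTH = 3e12     # bytes/s

for label, m in (("decode, batch 1", 1), ("prefill, 4096 tokens", 4096)):
    ai = gemm_arithmetic_intensity(m, 4096, 4096)
    print(f"{label}: {ai:.1f} FLOP/byte, {regime(m, 4096, 4096, PEAK_FLOPS, PEAK_BANDWIDTH)}")
```

With these assumptions the batch-1 decode shape lands near 1 FLOP/byte and is memory-bound, while the 4096-token prefill shape exceeds 1000 FLOP/byte and is compute-bound, which is why the serving shapes a benchmark uses can change which kernel optimizations matter.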
Circularity Check
No circularity in derivation chain
The paper is an empirical benchmark release that constructs a set of 46 architectures, measures their coverage by direct count against HuggingFace models, implements a production-grade framework, and reports measured speedups from agent evaluations on that benchmark. No equations, fitted parameters, or predictions are defined in terms of the target results; the 0.94×/0.78×/0.53× figures are experimental outcomes, not reductions by construction. The coverage statement is a factual enumeration rather than a derived claim, and the misalignment conclusion follows directly from those independent measurements. No self-citation chains, ansatzes, or renamings appear in the load-bearing steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The selected 46 architectures and 8 categories collectively represent the kernel workloads that dominate production LLM inference.
- domain assumption: Running at parity with vLLM and SGLang on mainstream models is a sufficient test of production-grade behavior.