
arxiv: 2605.23215 · v1 · pith:NONTTJ6Gnew · submitted 2026-05-22 · 💻 cs.LG · cs.AI · cs.CL

FastKernels: Benchmarking GPU Kernel Generation in Production

Pith reviewed 2026-05-25 05:18 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords GPU kernel generation · LLM agents · benchmark alignment · production inference · HuggingFace Transformers · kernel optimization · vLLM · SGLang

The pith

Even the strongest LLM agents for GPU kernel generation reach only a 0.94× aggregate speedup over production baselines, that is, slightly slower than the code they aim to replace, on a benchmark built to match real systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing benchmarks for LLM-based GPU kernel agents evaluate on a single GPU with synthetic inputs and ignore the surrounding compilation stack, so agents learn to produce kernels that fail to integrate into real inference frameworks. FastKernels counters this with a minimal set of 46 architectures spanning eight categories, whose kernels cover 96.2 percent (409 of 425) of HuggingFace Transformers architectures and whose task interfaces mirror the corresponding modules in state-of-the-art libraries. The benchmark also functions as a production-grade inference framework that runs at parity with vLLM and SGLang on mainstream serving workloads. When state-of-the-art agents are tested on it, the best achieves only a 0.94× aggregate speedup over production baselines, which is a net slowdown, while weaker agents reach 0.78× and 0.53×. The authors read this gap as confirmation that benchmark-production misalignment is a critical bottleneck in translating kernel improvements into actual throughput gains.

Core claim

FastKernels supplies 46 representative architectures whose kernels collectively subsume those of 96.2 percent (409 of 425) of HuggingFace Transformers architectures and whose task interfaces mirror the corresponding modules in current production libraries, allowing direct deployment of generated kernels. The same framework runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving workloads and exceeds upstream references on under-served architectures. Evaluation of current kernel-generation agents on this setup shows the strongest reaching only a 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53×.
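To make the headline number concrete, here is a minimal sketch of how an aggregate speedup below 1.0 arises. It assumes speedup is defined as baseline latency divided by candidate latency and that per-task speedups are combined with a geometric mean; the abstract does not state the aggregation rule, so both are assumptions, and the task names and timings are invented for illustration.

```python
import math

def speedup(baseline_ms, candidate_ms):
    """Speedup of a candidate kernel over its baseline; values below 1.0 are slowdowns."""
    return baseline_ms / candidate_ms

def geomean(values):
    """Geometric mean, a common (assumed) way to aggregate ratios across tasks."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical (baseline_ms, candidate_ms) pairs; not taken from the paper.
timings = {
    "attention": (1.00, 0.90),   # candidate is faster
    "moe_router": (0.40, 0.55),  # candidate is slower
    "rmsnorm": (0.10, 0.10),     # no change
}

per_task = {name: speedup(b, c) for name, (b, c) in timings.items()}
aggregate = geomean(per_task.values())

for name, s in per_task.items():
    print(f"{name}: {s:.2f}x")
print(f"aggregate: {aggregate:.2f}x")
```

Under this definition, one clear regression can pull the aggregate below 1.0 even when other tasks improve, which is why an aggregate of 0.94× means the agent's kernels are slower overall than the production baseline.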

What carries the argument

FastKernels benchmark of 46 architectures with production-matched interfaces that doubles as a minimalistic inference framework.

If this is right

  • Kernel agents can be evaluated and improved using tasks whose outputs drop directly into existing production codebases without interface changes.
  • Benchmarks that ignore compilation stacks reward kernels that introduce silent correctness issues or compatibility failures in real systems.
  • Reward signals in current agent training favor replication of known optimizations rather than discovery of new ones.
  • Field progress requires benchmarks whose measured gains translate into production throughput rather than sandbox scores.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the 96.2 percent coverage holds across evolving model families, work on these 46 tasks could improve the majority of deployed transformers without additional architecture-specific engineering.
  • Extending the benchmark to include multi-GPU and distributed serving patterns would test whether the same alignment principle applies beyond single-device inference.
  • Agents that succeed on FastKernels could serve as stronger starting points for production tuning loops that currently begin from hand-written kernels.

Load-bearing premise

The 46 selected architectures and their interfaces are sufficient to represent the production workloads that matter.
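The coverage figure behind this premise can be read as a set-subsumption count: an architecture is covered when every kernel it needs is already supplied by some representative architecture. The sketch below illustrates that reading with invented architecture and kernel names; it is an assumption about how the count could be computed, not the authors' procedure.

```python
def kernel_coverage(representatives, population):
    """Return (fraction covered, covered names) under a subset-of-union rule."""
    available = set().union(*representatives.values())
    covered = sorted(
        name for name, kernels in population.items() if kernels <= available
    )
    return len(covered) / len(population), covered

# Hypothetical kernel inventories; names are illustrative only.
representatives = {
    "dense_decoder": {"rmsnorm", "rope", "paged_attention", "swiglu_mlp"},
    "moe_decoder": {"rmsnorm", "rope", "paged_attention", "topk_router", "grouped_gemm"},
}
population = {
    "arch_a": {"rmsnorm", "rope", "paged_attention", "swiglu_mlp"},
    "arch_b": {"rmsnorm", "topk_router", "grouped_gemm"},
    "arch_c": {"layernorm", "conv1d"},  # needs kernels no representative supplies
}

fraction, covered = kernel_coverage(representatives, population)
print(covered, round(fraction, 3))

# The reported figure is consistent with the stated counts: 409 / 425 rounds to 96.2%.
print(round(100 * 409 / 425, 1))
```

A count of this kind says which kernels are present, not which shapes or load patterns they run under, which is the gap the referee report raises.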

What would settle it

Demonstrating that kernels optimized on FastKernels produce measurable throughput gains when inserted into unmodified vLLM or SGLang codebases would support the alignment claim; the opposite result would undermine it.

Figures

Figures reproduced from arXiv: 2605.23215 by Gabriele Oliaro, Hao Zhang, Junli Wang, May Jiang, Owen Lu, Samyam Rajbhandari, Yichao Fu, Zhihao Jia.

Figure 1: FASTKERNELS overview. Three benchmark–production misalignments (left) motivate three design pillars (center), which converge into a unified benchmark-as-framework design (right). Optimized kernels flow through a virtuous cycle (bottom): optimization, validation, integration as new baselines, and release.

Figure 2: Composition of FASTKERNELS: 46 end-to-end architectures across 8 categories. Per-architecture details are in the Appendix.

Figure 3: Per-architecture speedup across all benchmark categories. Each bar shows the speedup of FASTKERNELS over the strongest production or upstream reference for one architecture. Panels group architectures by category; the dashed line in each panel marks the average speedup within that category.

Figure 4: Input capture changes data-dependent execution. First MoE layer of Qwen3-VL-30B-A3B-Instruct-FP8 (128 experts, top-8 routing) under three inputs: real WildChat requests, random token IDs (matched lengths), and random tensors injected at the gate. Random tensors look near-uniform; random tokens skew differently from real requests, with only 4 of the top-16 hot experts in common.
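The Figure 4 comparison can be sketched as a count of how many of the most heavily loaded experts two input distributions share under top-k routing. The code below is an illustrative sketch in plain Python: the function names are hypothetical, the gate logits are synthetic, and nothing here reproduces the paper's measurement; only the 128-expert, top-8, top-16 settings come from the caption.

```python
import heapq
import random
from collections import Counter

def top_k_experts(gate_logits, k):
    """Indices of the k largest gate logits for one token."""
    return heapq.nlargest(k, range(len(gate_logits)), key=gate_logits.__getitem__)

def expert_load(batch_logits, k):
    """Number of tokens routed to each expert under top-k routing."""
    load = Counter()
    for logits in batch_logits:
        load.update(top_k_experts(logits, k))
    return load

def hot_overlap(load_a, load_b, n):
    """How many of the n most loaded experts the two workloads have in common."""
    hot_a = {expert for expert, _ in load_a.most_common(n)}
    hot_b = {expert for expert, _ in load_b.most_common(n)}
    return len(hot_a & hot_b)

# Synthetic stand-ins: a biased gate (skewed load) versus pure noise (near-uniform load).
rng = random.Random(0)
num_experts, k, tokens = 128, 8, 256
bias = [rng.gauss(0.0, 2.0) for _ in range(num_experts)]
skewed = [[b + rng.gauss(0.0, 1.0) for b in bias] for _ in range(tokens)]
uniform = [[rng.gauss(0.0, 1.0) for _ in range(num_experts)] for _ in range(tokens)]

print(hot_overlap(expert_load(skewed, k), expert_load(uniform, k), 16))
```

A low overlap means a kernel tuned on one input distribution sees a different per-expert load on the other, which is the caption's argument for capturing real requests.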
Original abstract

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under-served architectures; each task's interface mirrors the corresponding module in the state-of-the-art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53×, confirming that benchmark-production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake-AI-Research/fastkernels

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces FastKernels, a benchmark and minimalistic production-grade inference framework for evaluating LLM-based agents on GPU kernel generation. It selects 46 architectures across 8 categories whose kernels cover 96.2% (409/425) of HuggingFace Transformers architectures, with interfaces matching SOTA library modules. The framework runs at parity with vLLM and SGLang on mainstream serving and exceeds references on under-served architectures. Evaluation of SOTA kernel agents shows the strongest achieves only 0.94× aggregate speedup over production baselines (weaker agents at 0.78× and 0.53×), leading to the claim that benchmark-production misalignment is a critical bottleneck. The benchmark and code (https://github.com/Snowflake-AI-Research/fastkernels) are released to enable direct deployment of agent-generated kernels.

Significance. If the representativeness claim holds, the work is significant for the kernel-generation agent field by supplying a benchmark whose gains are intended to translate to production throughput, along with a reusable framework. Explicit credit is due for the code release, the parity demonstration with hardened systems (vLLM/SGLang), and the empirical agent evaluation that produces falsifiable speedup numbers rather than synthetic metrics.

major comments (1)
  1. [Benchmark construction] Benchmark construction (abstract and associated section): the central claim that misalignment is a critical bottleneck rests on FastKernels being a faithful production proxy, yet representativeness is established only by architecture count (96.2% coverage of 425 HF architectures) and interface mirroring. No analysis is supplied on whether the 46 architectures capture performance-critical production dimensions such as realistic tensor shapes, batch/sequence length distributions, KV-cache behavior, or memory-vs-compute bounds under serving loads. This gap directly affects whether the reported 0.94×/0.78×/0.53× speedups demonstrate general misalignment or benchmark idiosyncrasies.
minor comments (1)
  1. [Abstract] Abstract: the statement that the framework 'substantially exceeds upstream references on under-served architectures' lacks quantitative detail (speedup factors or specific architectures); adding this would clarify the parity claim without lengthening the abstract.
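The major comment's point about memory-vs-compute bounds can be illustrated with a roofline-style estimate: the same matmul changes regime as the number of tokens in the batch grows. This is a minimal sketch that ignores cache reuse and kernel fusion, and the peak throughput and bandwidth figures describe a hypothetical accelerator rather than any specific GPU.

```python
def gemm_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte moved for an (m x k) @ (k x n) matmul, ignoring cache reuse."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def regime(m, k, n, peak_flops, peak_bandwidth, bytes_per_elem=2):
    """Classify a matmul by comparing its intensity with the hardware ridge point."""
    ridge = peak_flops / peak_bandwidth  # FLOPs per byte at which the two limits meet
    intensity = gemm_intensity(m, k, n, bytes_per_elem)
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Hypothetical accelerator: 1e15 FLOP/s peak and 3e12 B/s memory bandwidth.
PEAK_FLOPS = 1.0e15
PEAK_BANDWIDTH = 3.0e12

# m is the number of tokens processed together; k and n are a 4096 x 4096 weight.
for batch_tokens in (1, 64, 4096):
    print(batch_tokens, regime(batch_tokens, 4096, 4096, PEAK_FLOPS, PEAK_BANDWIDTH))
```

Because small decode batches and large prefill batches fall on opposite sides of the ridge point, a kernel's measured speedup depends on which shapes the benchmark exercises, which is why the referee asks for the shape and load distributions.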

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation of the work's significance, the credit for the code release and parity results, and the constructive comment on benchmark construction. We respond to the major comment below.

  1. Referee: [Benchmark construction] Benchmark construction (abstract and associated section): the central claim that misalignment is a critical bottleneck rests on FastKernels being a faithful production proxy, yet representativeness is established only by architecture count (96.2% coverage of 425 HF architectures) and interface mirroring. No analysis is supplied on whether the 46 architectures capture performance-critical production dimensions such as realistic tensor shapes, batch/sequence length distributions, KV-cache behavior, or memory-vs-compute bounds under serving loads. This gap directly affects whether the reported 0.94×/0.78×/0.53× speedups demonstrate general misalignment or benchmark idiosyncrasies.

    Authors: We thank the referee for this point. Representativeness is grounded in two elements: (1) the 46 architectures were chosen because their kernels subsume those used by 409/425 HF architectures, and (2) the framework achieves measured parity with vLLM and SGLang on mainstream serving workloads. Parity with hardened production systems necessarily implies that the evaluated workloads reflect realistic batch/sequence distributions, KV-cache behavior, and memory-vs-compute regimes encountered in deployment; otherwise the throughput numbers would diverge. Interface mirroring further ensures that any kernel optimizations are directly deployable. While the manuscript does not include an explicit appendix tabulating shape histograms or load distributions, the parity result supplies empirical evidence against idiosyncrasy. We will add a short subsection in the revised manuscript that reports the concrete serving configurations (batch sizes, sequence lengths, and model shapes) used for the vLLM/SGLang parity experiments and the agent evaluations.

    Revision: partial

Circularity Check

0 steps flagged

No circularity in derivation chain


The paper is an empirical benchmark release that constructs a set of 46 architectures, measures their coverage by direct count against HuggingFace models, implements a production-grade framework, and reports measured speedups from agent evaluations on that benchmark. No equations, fitted parameters, or predictions are defined in terms of the target results; the 0.94×/0.78×/0.53× figures are experimental outcomes, not reductions by construction. The coverage statement is a factual enumeration rather than a derived claim, and the misalignment conclusion follows directly from those independent measurements. No self-citation chains, ansatzes, or renamings appear in the load-bearing steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the design choice of 46 architectures as representative and on the assumption that interface parity with existing libraries is the right production proxy; no free parameters are fitted to data and no new entities are postulated.

axioms (2)
  • domain assumption The selected 46 architectures and 8 categories collectively represent the kernel workloads that dominate production LLM inference.
    Invoked in the benchmark construction paragraph to justify the 96.2% coverage claim.
  • domain assumption Running at parity with vLLM and SGLang on mainstream models is a sufficient test of production-grade behavior.
    Used to position the framework as directly deployable.


