pith. sign in

arxiv: 2606.17518 · v1 · pith:REYUUESQnew · submitted 2026-06-16 · 💻 cs.DC

SpecGen: Accelerating Agentic Kernel Optimization with Speculative Generation

Pith reviewed 2026-06-26 23:33 UTC · model grok-4.3

classification 💻 cs.DC
keywords speculative generationagentic kernel optimizationLLM reasoningGPU kernel tuningparallel profilingearly terminationresource utilizationKV cache reuse
0
0 comments X

The pith

SpecGen accelerates agentic kernel optimization by forking non-reasoning generations during LLM reasoning traces to enable early termination and parallel profiling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that agentic kernel optimization, which relies on iterative LLM generation and feedback for GPU kernels, suffers from long reasoning latencies, sparse profiling data, and idle validation resources. SpecGen counters this by forking simpler non-reasoning generations at selected points in the ongoing reasoning trace, running those candidates through validation and profiling while the main trace continues. When any forked kernel meets the stop condition, the system halts the slower reasoning generation. It also reallocates GPU pools dynamically and reuses spare memory for KV caches to handle the burstier load. If these steps work as described, optimization loops finish faster and return higher-speedup kernels within the same time or token limits.

Core claim

SpecGen introduces speculative generation that forks non-reasoning generations at well-chosen trigger points in the reasoning trace to produce additional candidate kernels. These kernels are validated and profiled in parallel with the continuing reasoning generation, which allows the system to terminate the reasoning process early once a kernel satisfies the termination criterion. The approach is augmented by dynamic reallocation of validation and profiling GPU pools according to arrival rates and by using spare memory on those GPUs as remote KV cache storage to avoid prefix recomputation. Experiments on H200 hardware with two reasoning LLMs demonstrate reductions in end-to-end time relative

What carries the argument

Speculative generation by forking non-reasoning generations at trigger points in the reasoning trace, which supplies extra candidates for concurrent validation and enables early stopping of the main reasoning process.

If this is right

  • End-to-end optimization time decreases compared with three baseline systems.
  • The volume of profiling feedback per iteration increases.
  • GPU resource utilization rises during the generation phase.
  • Kernel speedups improve when total time or token budget is held constant.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same forking pattern could be applied to other long-horizon agentic tasks that mix reasoning with fast verification steps.
  • Dynamic pool reallocation might be generalized to heterogeneous accelerator clusters beyond H200 GPUs.
  • The remote KV cache reuse technique could be tested as a standalone method for reducing recomputation in memory-constrained LLM serving.

Load-bearing premise

The assumption that forking non-reasoning generations at chosen trigger points will reliably produce kernels meeting the termination criterion often enough to justify early stopping and yield net speedup.

What would settle it

A controlled run that records the exact fraction of iterations where a forked kernel triggers early termination and measures whether the resulting wall-clock savings exceed the added overhead of forking and parallel scheduling.

Figures

Figures reproduced from arXiv: 2606.17518 by Dahua Lin, Jihu Guo, Sitian Lu, Tenghui Ma, Wei Gao, Xingcheng Zhang, Zhisheng Ye.

Figure 1
Figure 1. Figure 1: The three phases of agentic kernel optimization iteration: generation, validation, and profiling. Across KernelBench Level 1/2/3 tasks with two reasoning LLMs on H200, SpecGen consistently improves the efficiency of agentic kernel optimization against CudaForge [40], Al￾phaEvolve [27], and KernelAgent [26]. Compared with these baselines, SpecGen reduces end-to-end execution time by 1.68–1.82× and increases… view at source ↗
Figure 2
Figure 2. Figure 2: Per-iteration generation time share across ten KernelBench kernel optimization tasks per model. sharing would distort both the speedup measured against the reference kernel and the profiled performance metrics. Due to this exclusivity, existing frameworks statically dedicate one GPU per kernel to run its validation and profiling [20, 40], a partitioning we refer to as “one GPU per kernel”. Harness Engineer… view at source ↗
Figure 3
Figure 3. Figure 3: Distribution of iteration status across 100 iterations per kernel with two models. An iteration is Success iff its emitted kernel compiles, runs, and matches the reference. 75th-percentile (P75) of this share falls between 92–99%, with most matmul variants clustered near 98%. For DeepSeek-V4- Pro, P75 spans 70%–98%, with HingeLoss (T1) as a low outlier at 70% and the remaining matmul workloads at P75 betwe… view at source ↗
Figure 4
Figure 4. Figure 4: In-flight request counts for the three pipeline phases (generation, validation, profiling) over the first 10,000 seconds of execution, with 10 agent workflows and 10 exclusive-mode validation/profiling GPUs. Idle 1 0 0 1 0 1 2 1 1 2 2 3 4 3 3 4 4 Reasoning Generation Validation Profiling Non-reasoning Generation Dispatch K Generation (K = 2) 5 6 5 5 6 6 Prefix Speculative Generation Termination Point Termi… view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of existing techniques in agentic ker￾nel optimization and speculative generation. 0 25 50 75 100 0 25 50 75 100 CDF (%) GLM-5.1 0 25 50 75 100 DeepSeek-v4-Pro Reasoning prefix length (%) T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: CDF of the shortest reasoning prefix length for generating a kernel with higher speedups than the average speedup of previously generated kernels across both models. 4.2 Insight: Speculative Generation Our key insight is that the reasoning process exposes a win￾dow for producing additional candidate kernels with reason￾ing prefix conditioning. Since non-reasoning generations are faster than reasoning gener… view at source ↗
Figure 8
Figure 8. Figure 8: A reasoning trace of GLM-5.1 on the 4D tensor Matmul task. Trigger signals are highlighted in red. It forks speculative generations with this speculative prompt. As each fork emits a candidate kernel, SpecController dis￾patches the kernel to ElasticScheduler for validation and pro￾filing. SpecController maintains a history of kernels and their speedups to evaluate the termination criterion. (5) Elastic sch… view at source ↗
Figure 9
Figure 9. Figure 9: Reducing GPU idle with reallocation. generated by GLM-5.1 and DeepSeek-V4-Pro, and implement the parser with regular expressions. Once SpecController detects these trigger signals, it con￾catenates the iteration’s original prompt and the reasoning trace prefix to construct a speculative prompt. Then SpecCon￾troller forks non-reasoning generations with this speculative prompt. To avoid speculative-generatio… view at source ↗
Figure 10
Figure 10. Figure 10: End-to-end execution time of 100 iterations per task on GLM-5.1 and DeepSeek-V4-Pro. T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 0 20 40 60 80 100 Profiling feedback GLM-5.1 T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 DeepSeek-v4-Pro CudaForge AlphaEvolve KernelAgent SpecGen [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Profiling feedback of 100 iterations per task on GLM-5.1 and DeepSeek-V4-Pro. 70.3, compared with 36.3 for CudaForge, 42.5 for AlphaE￾volve, and 42.4 for KernelAgent. This corresponds to ge￾omean lifts of 2.13×, 1.78×, and 1.77×, respectively. On DeepSeek￾V4-Pro, SpecGen reaches 66.4, compared with 40.7 for Cud￾aForge, 45.1 for AlphaEvolve, and 49.4 for KernelAgent. The corresponding geomean lifts are 1.8… view at source ↗
Figure 12
Figure 12. Figure 12: In-flight request counts per phase over the first 10,000 seconds of SpecGen on GLM-5.1. systems leave most validation/profiling capacity idle. On GLM-5.1, their utilization stays between 4.2% and 5.0%; on DeepSeek-V4-Pro, it stays between 11.3% and 17.6%. SpecGen without ElasticScheduler already raises utilization to 56.2% on GLM-5.1 and 74.7% on DeepSeek-V4-Pro via emitting additional speculative candida… view at source ↗
Figure 13
Figure 13. Figure 13: Detailed time-to-speedup over the KernelBench reference across 100 iterations per task on GLM-5.1 [PITH_FULL_IMAGE:figures/full_fig_p011_13.png] view at source ↗
read the original abstract

Agentic kernel optimization automates manual GPU kernel tuning via iterative generation, validation, and profiling with reasoning LLMs, casting the optimization task as feedback-guided search. However, our workload characterization reveals three system-level inefficiencies that limit search efficiency: (1) long generation latency due to LLM reasoning, (2) insufficient profiling feedback, and (3) underutilized validation/profiling resources. Our key insight is that the ongoing reasoning generation exposes a window for producing additional candidate kernels before it completes, allowing the system to terminate reasoning early once a satisfactory kernel appears. We present SpecGen, an agentic kernel optimization system with \emph{speculative generation}. First, SpecGen forks non-reasoning generations at well-chosen trigger points in the reasoning trace to yield kernels, increasing the candidate kernel count per iteration. These kernels are validated and profiled in parallel with the ongoing reasoning, increasing profiling feedback, and keeping resources busy during generation. When a kernel meets the termination criterion, SpecGen terminates the reasoning generation early to reduce the generation latency. Second, SpecGen dynamically reallocates validation and profiling GPU pools based on the arrival rate and prioritizes requests to reduce profiling feedback latency under bursty speculative generation load. Furthermore, SpecGen utilizes spare memory of the validation/profiling GPUs as remote KV cache storage to eliminate prefix recomputation of speculative generations under limited memory budget. Experiments with two reasoning LLMs on H200 show that SpecGen reduces end-to-end time over three baseline systems, while producing more profiling feedback, increasing resource utilization, and improving kernel speedup under a fixed time and token budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces SpecGen, a system for accelerating agentic kernel optimization with reasoning LLMs. It addresses three inefficiencies (long generation latency, insufficient profiling feedback, underutilized resources) by forking non-reasoning generations at trigger points in the reasoning trace, validating and profiling them in parallel, and terminating the main generation early when a kernel meets the termination criterion. Additional mechanisms include dynamic reallocation of validation/profiling GPU pools and using spare memory as remote KV cache to avoid prefix recomputation. Experiments with two reasoning LLMs on H200 GPUs claim reduced end-to-end time versus three baselines, more profiling feedback, higher resource utilization, and improved kernel speedups under fixed time/token budgets.

Significance. If the empirical results hold with reproducible details on success rates and baselines, the work could meaningfully advance automated GPU kernel tuning by adapting speculative execution ideas to LLM-driven search. The combination of parallel speculative forking, early termination, and resource management under bursty loads addresses practical bottlenecks in agentic optimization systems.

major comments (2)
  1. [Abstract] Abstract: the central claim of end-to-end time reduction over three baselines rests on the empirical frequency of successful early terminations from non-reasoning speculative forks. No data, tables, or discussion of this success rate, number of early stops, or how often forks meet the termination criterion are provided, leaving the net speedup (after forking/validation overhead) unverifiable.
  2. [Abstract] Abstract: the experimental results claim wins on H200 but supply no details on the three baseline systems, statistical significance, variance across runs, how termination criteria were selected, or the specific trigger points used. This prevents assessment of whether gains are attributable to the speculative mechanism.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments highlighting the need for greater empirical detail in the abstract. We respond to each major comment below and propose targeted revisions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of end-to-end time reduction over three baselines rests on the empirical frequency of successful early terminations from non-reasoning speculative forks. No data, tables, or discussion of this success rate, number of early stops, or how often forks meet the termination criterion are provided, leaving the net speedup (after forking/validation overhead) unverifiable.

    Authors: We agree that the abstract would be strengthened by including quantitative support for the early-termination mechanism. The manuscript body reports these metrics in the evaluation section. We will revise the abstract to include a concise statement summarizing the observed early-termination frequency and its contribution to net speedup after overhead. revision: yes

  2. Referee: [Abstract] Abstract: the experimental results claim wins on H200 but supply no details on the three baseline systems, statistical significance, variance across runs, how termination criteria were selected, or the specific trigger points used. This prevents assessment of whether gains are attributable to the speculative mechanism.

    Authors: We acknowledge that the abstract omits these experimental details. The three baselines, statistical tests, variance, termination criteria, and trigger points are defined in Sections 3 and 4 with results in Section 5. We will revise the abstract to name the baselines and note that results are statistically validated, with full details remaining in the body. revision: yes

Circularity Check

0 steps flagged

No circularity: system description and empirical claims are independent of inputs

full rationale

The paper presents an engineering system (SpecGen) for speculative generation in agentic kernel optimization. The abstract and description outline an architecture with forking at trigger points, parallel validation, dynamic reallocation, and KV cache reuse, followed by experimental claims of reduced end-to-end time on H200 hardware. No equations, fitted parameters, predictions, or derivations are present. No self-citations are invoked as load-bearing premises. The central claims rest on measured performance deltas against baselines rather than any reduction to self-defined quantities or prior author results. This matches the default case of a self-contained systems paper with no detectable circular steps.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The approach rests on the unstated premise that the LLM reasoning trace contains predictable points where non-reasoning forks are likely to succeed, and that validation/profiling resources can be dynamically reallocated without introducing new bottlenecks.

free parameters (2)
  • trigger points in reasoning trace
    Chosen points where speculative generations are forked; not specified how they are selected.
  • termination criterion
    Threshold for deciding when to stop reasoning early; not detailed.

pith-pipeline@v0.9.1-grok · 5838 in / 1153 out tokens · 21012 ms · 2026-06-26T23:33:11.433333+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 6 linked inside Pith

  1. [1]

    Hydra: Sequentially-dependent draft heads for medusa decoding.CoRR, abs/2402.05109, 2024

    Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christo- pher Rinard, Jonathan Ragan-Kelley, and William Brandon. Hydra: Sequentially-dependent draft heads for medusa decoding.CoRR, abs/2402.05109, 2024

  2. [2]

    Kevin: Multi-turn RL for generating CUDA kernels.CoRR, abs/2507.11948, 2025

    Carlo Baronio, Pietro Marsella, Ben Pan, Simon Guo, and Silas Al- berti. Kevin: Multi-turn RL for generating CUDA kernels.CoRR, abs/2507.11948, 2025

  3. [3]

    Dynamic depth decoding: Faster speculative decoding for llms

    Oscar Brown, Zhengjie Wang, Andrea Do, Nikhil Mathew, and Cheng Yu. Dynamic depth decoding: Faster speculative decoding for llms. CoRR, abs/2409.00142, 2024

  4. [4]

    Lee, Deming Chen, and Tri Dao

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple LLM inference acceler- ation framework with multiple decoding heads. In Ruslan Salakhut- dinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first Interna- tional Conference on ...

  5. [5]

    Gonzalez, and Ion Stoica

    Shiyi Cao, Ziming Mao, Joseph E. Gonzalez, and Ion Stoica. K-search: LLM kernel generation via co-evolving intrinsic world model.CoRR, abs/2602.19128, 2026

  6. [6]

    Accelerating large language model decoding with speculative sampling.CoRR, abs/2302.01318, 2023

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling.CoRR, abs/2302.01318, 2023

  7. [7]

    CUDA agent: Large-scale agentic RL for high- performance CUDA kernel generation.CoRR, abs/2602.24286, 2026

    Weinan Dai, Hanlin Wu, Qiying Yu, Huan-Ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu, Mingxuan Wang, Xin Liu, and Hao Zhou. CUDA agent: Large-scale agentic RL for high- performance CUDA kernel generation.CoRR, abs/2602.24286, 2026

  8. [8]

    Flashattention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. 2024

  9. [9]

    Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Pro- cessing Systems 2022, Ne...

  10. [10]

    DeepSeek-V4: Towards highly efficient million-token context intelligence.https://huggingface.co/deepseek-ai/DeepSeek- V4-Pro, 2026

    DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence.https://huggingface.co/deepseek-ai/DeepSeek- V4-Pro, 2026. Technical Report available athttps://huggingface.co/ deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf

  11. [11]

    STARK: strategic team of agents for refining kernels.CoRR, abs/2510.16996, 2025

    Juncheng Dong, Yang Yang, Tao Liu, Yang Wang, Feng Qi, Vahid Tarokh, Kaushik Rangadurai, and Shuang Yang. STARK: strategic team of agents for refining kernels.CoRR, abs/2510.16996, 2025

  12. [12]

    Ker- nelblaster: Continual cross-task CUDA optimization via memory- augmented in-context reinforcement learning.CoRR, abs/2602.14293, 2026

    Shengjun Kris Dong, Sahil Modi, Dima Nikiforov, Sana Damani, Ed- ward Lin, Siva Kumar Sastry Hari, and Christos Kozyrakis. Ker- nelblaster: Continual cross-task CUDA optimization via memory- augmented in-context reinforcement learning.CoRR, abs/2602.14293, 2026

  13. [13]

    Kernel- smith: A unified recipe for evolutionary kernel optimization.CoRR, abs/2603.28342, 2026

    He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai, Zixian Huang, Sheng Yuan, Qinxiu Cheng, Xinchen Xie, Yicheng Chen, Yining Li, Jiaxing Xie, Huanan Dong, Yaguang Wu, Xiangjun Huang, Jian Yang, Hui Wang, Bowen Zhou, Bowen Li, Qipeng Guo, and Kai Chen. Kernel- smith: A unified recipe for evolutionary kernel optimization.CoRR, abs/2603.28342, 2026

  14. [14]

    GLM-5: from vibe coding to agentic engineering.CoRR, abs/2602.15763, 2026

    GLM. GLM-5: from vibe coding to agentic engineering.CoRR, abs/2602.15763, 2026

  15. [15]

    Drtriton: Large-scale syn- thetic data reinforcement learning for triton kernel generation.CoRR, abs/2603.21465, 2026

    Siqi Guo, Ming Lin, and Tianbao Yang. Drtriton: Large-scale syn- thetic data reinforcement learning for triton kernel generation.CoRR, abs/2603.21465, 2026

  16. [16]

    Improving efficiency of GPU kernel opti- mization agents using a domain-specific language and speed-of-light guidance.CoRR, abs/2603.29010, 2026

    Siva Kumar Sastry Hari, Vignesh Balaji, Sana Damani, Qijing Huang, and Christos Kozyrakis. Improving efficiency of GPU kernel opti- mization agents using a domain-specific language and speed-of-light guidance.CoRR, abs/2603.29010, 2026

  17. [17]

    Autokernel: Autonomous GPU kernel optimization via iterative agent-driven search.CoRR, abs/2603.21331, 13 2026

    Jaber Jaber and Osama Jaber. Autokernel: Autonomous GPU kernel optimization via iterative agent-driven search.CoRR, abs/2603.21331, 13 2026

  18. [18]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Jason Flinn, Margo I. Seltzer, Peter Druschel, An- toine Kaufmann, and Jonathan Mace, editors,Proceedings of the 29th Symposium on Operating Systems...

  19. [19]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,International Conference on Machine Learn- ing, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, Proceedings of Machine Learning Res...

  20. [20]

    Tri- tonforge: Profiling-guided framework for automated triton kernel optimization.CoRR, abs/2512.09196, 2025

    Haonan Li, Keyu Man, Partha Kanuparthy, Hanning Chen, Wei Sun, Sreen Tallam, Chenguang Zhu, Kevin Zhu, and Zhiyun Qian. Tri- tonforge: Profiling-guided framework for automated triton kernel optimization.CoRR, abs/2512.09196, 2025

  21. [21]

    Tritonbench: Benchmarking large language model capabilities for generating triton operators

    Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, WangHaojie WangHaojie, Jianrong Wang, Xu Han, Zhiyuan Liu, and Maosong Sun. Tritonbench: Benchmarking large language model capabilities for generating triton operators. In Wanxi- ang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of ...

  22. [22]

    CUDA- L1: improving CUDA optimization via contrastive reinforcement learn- ing.CoRR, abs/2507.14111, 2025

    Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, and Chris Shum. CUDA- L1: improving CUDA optimization via contrastive reinforcement learn- ing.CoRR, abs/2507.14111, 2025

  23. [23]

    EAGLE: speculative sampling requires rethinking feature uncertainty

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE: speculative sampling requires rethinking feature uncertainty. pages 28935–28948, 2024

  24. [24]

    Kangaroo: Lossless self-speculative decod- ing for accelerating llms via double early exiting

    Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Duyu Tang, Kai Han, and Yunhe Wang. Kangaroo: Lossless self-speculative decod- ing for accelerating llms via double early exiting. 2024

  25. [25]

    Wei Liu, Jiawei Xu, Yingru Li, Longtao Zheng, Tianjian Li, Qian Liu, and Junxian He. Dr. kernel: Reinforcement learning done right for triton kernel generations.CoRR, abs/2602.05885, 2026

  26. [26]

    KernelAgent: Autonomous GPU ker- nel generation & optimization via deep agents, 2025

    Meta PyTorch Team. KernelAgent: Autonomous GPU ker- nel generation & optimization via deep agents, 2025. Blog post:https://pytorch.org/blog/kernelfalcon-autonomous-gpu-kernel- generation-via-deep-agents/

  27. [27]

    Alexander Novikov, Ngân Vu, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Ko- zlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abi- gail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and al...

  28. [28]

    Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini

    Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. Kernelbench: Can llms write efficient GPU kernels? In Aarti Singh, Maryam Fazel, Daniel Hsu, Si- mon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Forty-second International Conference on Ma- chine Learning, ICML...

  29. [29]

    Mooncake: Trad- ing more storage for less computation - A kvcache-centric architecture for serving LLM chatbot

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: Trad- ing more storage for less computation - A kvcache-centric architecture for serving LLM chatbot. In Haryadi S. Gunawi and Vasily Tarasov, editors,23rd USENIX Conference on File and Storage Technologies, FAST 2025, Santa Clara, CA...

  30. [30]

    Cutegen: An llm- based agentic framework for generation and optimization of high- performance GPU kernels using cute.CoRR, abs/2604.01489, 2026

    Tara Saba, Anne Ouyang, Xujie Si, and Fan Long. Cutegen: An llm- based agentic framework for generation and optimization of high- performance GPU kernels using cute.CoRR, abs/2604.01489, 2026

  31. [31]

    Flashattention-3: Fast and accurate attention with asynchrony and low-precision

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ra- mani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors,Advances in Neural Information Processing Sys- tems 38: Annual Con...

  32. [32]

    Openevolve: an open-source evolutionary coding agent, 2025

    Asankhaya Sharma. Openevolve: an open-source evolutionary coding agent, 2025

  33. [33]

    Kernelskill: A multi-agent framework for GPU kernel optimization.CoRR, abs/2603.10085, 2026

    Qitong Sun, Jun Han, Tianlin Li, Zhe Tang, Sheng Chen, Fei Yang, Aishan Liu, Xianglong Liu, and Yang Liu. Kernelskill: A multi-agent framework for GPU kernel optimization.CoRR, abs/2603.10085, 2026

  34. [34]

    Kernelevolve: Scaling agen- tic kernel coding for heterogeneous AI accelerators at meta.CoRR, abs/2512.23236, 2025

    KernelEvolve Team and Meta Platforms. Kernelevolve: Scaling agen- tic kernel coding for heterogeneous AI accelerators at meta.CoRR, abs/2512.23236, 2025

  35. [35]

    Abdelfattah

    Ali Tehrani, Yahya Emara, Essam Wissam, Wojciech Paluch, Waleed Atallah, Lukasz Dudziak, and Mohamed S. Abdelfattah. Fine-tuning GPT-5 for GPU kernel generation.CoRR, abs/2602.11000, 2026

  36. [36]

    Awad, Ryan Swann, Kesavan Ramakr- ishnan, Jeffrey Jian Ma, Keith Lowery, Ganesh Dasika, and Vijay Janapa Reddi

    Arya Tschand, Muhammad A. Awad, Ryan Swann, Kesavan Ramakr- ishnan, Jeffrey Jian Ma, Keith Lowery, Ganesh Dasika, and Vijay Janapa Reddi. Swizzleperf: Hardware-aware llms for GPU kernel performance optimization.CoRR, abs/2508.20258, 2025

  37. [37]

    Astra: A multi-agent system for GPU kernel performance optimization.CoRR, abs/2509.07506, 2025

    Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, and Alex Aiken. Astra: A multi-agent system for GPU kernel performance optimization.CoRR, abs/2509.07506, 2025

  38. [38]

    Tritonrl: Training llms to think and code triton without cheating

    Jiin Woo, Shaowei Zhu, Allen Nie, Zhen Jia, Yida Wang, and Youngsuk Park. Tritonrl: Training llms to think and code triton without cheating. CoRR, abs/2510.17891, 2025

  39. [39]

    SWIFT: on-the-fly self-speculative decoding for LLM inference acceleration

    Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, and Wenjie Li. SWIFT: on-the-fly self-speculative decoding for LLM inference acceleration. 2025

  40. [40]

    Cudaforge: An agent framework with hardware feed- back for CUDA kernel optimization.CoRR, abs/2511.01884, 2025

    Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, and Caiwen Ding. Cudaforge: An agent framework with hardware feed- back for CUDA kernel optimization.CoRR, abs/2511.01884, 2025

  41. [41]

    Cross-attention speculative decoding.CoRR, abs/2505.24544, 2025

    Wei Zhong, Manasa Bharadwaj, Yixiao Wang, Nikhil Verma, Yipeng Ji, and Chul Lee. Cross-attention speculative decoding.CoRR, abs/2505.24544, 2025

  42. [42]

    Qimeng-kernel: Macro-thinking micro-coding paradigm for llm-based high-performance GPU kernel generation

    Xinguo Zhu, Shaohui Peng, Jiaming Guo, Yunji Chen, Qi Guo, Yuanbo Wen, Hang Qin, Ruizhi Chen, Qirui Zhou, Ke Gao, Yanjun Wu, Chen Zhao, and Ling Li. Qimeng-kernel: Macro-thinking micro-coding paradigm for llm-based high-performance GPU kernel generation. pages 29168–29176, 2026. 14