pith. machine review for the scientific record. sign in

arxiv: 2604.16625 · v1 · submitted 2026-04-17 · 💻 cs.CL · cs.AI· cs.LG

Recognition: unknown

AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation

Authors on Pith no claims yet

Pith reviewed 2026-05-10 08:25 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG
keywords adaptationoptimizationadaexploreagentgenerationkernellocalsearch
0
0 comments X

The pith

AdaExplore improves correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yielding 3.12x and 1.72x speedups on KernelBench Level-2 and Level-3 within 100 steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often fail when asked to write optimized code for specialized languages like Triton because those languages are rare in training data and have strict rules plus tricky performance trade-offs. AdaExplore lets the model learn from its own mistakes by turning repeated errors into a growing set of rules that keep new attempts valid. It also searches more effectively by keeping a tree of candidate programs, sometimes making small edits and sometimes trying bigger structural changes to escape bad local solutions. On benchmark tasks the method produced kernels that ran several times faster than before and kept getting better with more search time.

Core claim

AdaExplore achieves 3.12x and 1.72x speedups on KernelBench Level-2 and Level-3, respectively, within 100 steps, and continues to improve with additional computation.

Load-bearing premise

That the synthesized tasks and the memory of validity rules extracted from failures will generalize reliably to unseen kernel problems rather than overfitting to the specific benchmark instances used during adaptation.

Figures

Figures reproduced from arXiv: 2604.16625 by Andre Wang He, Ivan Fox, Jingming Zhuo, Manupa Karunaratne, Sean Welleck, Tianqi Chen, Tim Dettmers, Weihua Du, Weiwei Sun, Yiming Yang, Yixin Dong, Zeyu Zheng.

Figure 1
Figure 1. Figure 1: Illustration of Kernel Optimization Bottlenecks. (a) Most generated kernels are invalid due to limited training coverage; (b) Kernel refinement may be stuck in local optima; (c) Our agent, AdaExplore, can learn skills from failures to prevent pitfalls, and apply diversity-preserving search for global optima exploration. a high proportion of invalid programs. Second, the performance landscape is highly non-… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of AdaExplore for Kernel Runtime Optimization. The method has two stages: Adapt: it turns failures on synthesized tasks into a cross-task memory that helps generate correct kernels. Explore: it organizes candidate kernels as a tree and alternates between local refinement and regeneration to search for higher-performing solutions. 2 Related Work LLMs for Code Generation. General-purpose code models… view at source ↗
Figure 3
Figure 3. Figure 3: Test-time Scaling and Case Study on Actions. Left: Average best-so-far speedup as the test-time budget increases, showing that AdaExplore continues to efficiently improve with more search steps. Right: Case study illustrating the roles of large and small steps. memory on another benchmark, TritonBench, and also find a 28% accuracy improvement (see Appendix C.4 for details). Cross-task Skill Memory Statisti… view at source ↗
Figure 4
Figure 4. Figure 4: Best AdaExplore-generated fused add RMSNorm kernel (1.75 [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Training Program Synthesis Prompt. C.2 Examples of Training Tasks [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Shared Search Context Prompt. Large-Step Reconstruction Prompt System: Your goal is to generate a kernel that outperforms all kernels in the pool. You are allowed to reuse, adapt, or directly build upon any kernel in the pool. You may combine ideas, modify implementations, or start from the best-performing kernel and improve it further. Objective: * Minimize runtime as much as possible * Beat the fastest k… view at source ↗
Figure 7
Figure 7. Figure 7: Large-step Reconstruction Prompt. ure 7. In practice, the agent will regenerate a structurally different kernel from those in the representative pool. Small Step. The small step performs local tuning on the current kernel. Instead of re￾generating a new structure, it first identifies concrete modifications or improvement plans 20 [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Small-step Tuning Prompt. by providing guidance, and then applies one or more code patches to improve correct￾ness or runtime performance. Each patch is specified as an old str/new str pair: old str must exactly match a unique code region in the current kernel, and new str provides the corresponding replacement block. The prompt used for this step is shown in [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗
read the original abstract

Recent large language model (LLM) agents have shown promise in using execution feedback for test-time adaptation. However, robust self-improvement remains far from solved: most approaches still treat each problem instance independently, without accumulating reusable knowledge. This limitation is particularly pronounced in domain-specific languages such as Triton, which are underrepresented in LLM pretraining data. Their strict constraints and non-linear optimization landscape further make naive generation and local refinement unreliable. We propose AdaExplore, an agent framework that enables self-improvement via accumulated execution feedback for performance-critical kernel code generation through two complementary stages: failure-driven adaptation and diversity-preserving search, jointly improving correctness and optimization performance without additional fine-tuning or external knowledge. In the adaptation stage, the agent synthesizes tasks and converts recurring failures into a reusable memory of validity rules, helping subsequent generations remain within the feasible set. In the search stage, the agent organizes candidate kernels as a tree and alternates between small local refinements and larger structural regeneration, allowing it to explore the optimization landscape beyond local optima. Experiments on kernel runtime optimization benchmarks validate these gains: AdaExplore achieves 3.12x and 1.72x speedups on KernelBench Level-2 and Level-3, respectively, within 100 steps, and continues to improve with additional computation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes AdaExplore, an LLM agent framework for generating efficient Triton kernel code via two stages: failure-driven adaptation (synthesizing tasks from execution failures to build a reusable memory of validity rules) and diversity-preserving search (organizing candidates in a tree with alternation between local refinements and structural regeneration). It claims this enables self-improvement without fine-tuning, reporting 3.12x and 1.72x speedups on KernelBench Level-2 and Level-3 within a 100-step budget, with further gains from additional computation.

Significance. If the empirical claims are supported by rigorous controls, the work would demonstrate a practical mechanism for accumulating reusable knowledge from execution feedback in underrepresented domains like Triton, advancing test-time adaptation for code generation beyond per-instance independent solving.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): The reported speedups of 3.12x and 1.72x lack any description of baselines (e.g., naive LLM generation, prior methods), number of independent trials, variance or standard deviation across runs, statistical tests, or how the 100-step budget was allocated across adaptation and search phases; without these the central performance claim cannot be evaluated.
  2. [§3.1] §3.1 (Failure-Driven Adaptation): The process of synthesizing tasks from recurring failures and converting them into a 'memory of validity rules' is described only at a high level with no concrete examples, update mechanism, or pseudocode; this makes it impossible to assess whether the rules are general or risk overfitting to the specific KernelBench instances, directly impacting the weakest assumption about generalization.
  3. [§3.2] §3.2 (Diversity-Preserving Search): The tree-based organization and alternation between local and structural changes is presented without any quantitative metric for diversity preservation, analysis of exploration vs. exploitation balance, or ablation showing the contribution of each component to the final speedups.
minor comments (1)
  1. [Abstract and §3] The abstract and method sections use terms like 'non-linear optimization landscape' and 'feasible set' without defining them in the context of Triton kernel constraints.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the framework is presented conceptually without mathematical formalization or listed assumptions.

pith-pipeline@v0.9.0 · 5575 in / 1068 out tokens · 33939 ms · 2026-05-10T08:25:46.868409+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 25 canonical work pages · 10 internal anchors

  1. [1]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl- Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei A. Zaharia, and O. Khattab. Gepa: Reflective prompt evolution can outperform reinforcement learning. ArXiv, abs/2507.19457,

  2. [2]

    arXiv preprint arXiv:2510.14150 , year =

    URL https: //www-cdn.anthropic.com/6a5fa276ac68b9aeb0c8b6af5fa36326e0e166dd.pdf. 10 Preprint. Under review. Henrique Assumpc ¸˜ao, Diego Ferreira, Leandro Campos, and Fabricio Murai. Codeevolve: an open source evolutionary coding agent for algorithm discovery and optimization. arXiv preprint arXiv:2510.14150,

  3. [3]

    Kevin: Multi-turn rl for generating cuda kernels, 2025.URL https://arxiv

    URLhttps://arxiv.org/abs/2507.11948. Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, et al. Qwen3-coder-next technical report.arXiv preprint arXiv:2603.00729,

  4. [4]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evalu- ating large language models trained on code.arXiv preprint arXiv:2107.03374,

  5. [5]

    Exploring and controlling diversity in llm-agent conversation.arXiv preprint arXiv:2412.21102,

    KuanChao Chu, Yi-Pei Chen, and Hideki Nakayama. Exploring and controlling diversity in llm-agent conversation.arXiv preprint arXiv:2412.21102,

  6. [6]

    Cuda agent: Large-scale agentic rl for high-performance cuda kernel generation.arXiv preprint arXiv:2602.24286, 2026

    URLhttps://arxiv.org/abs/2602.24286. Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186,

  7. [7]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770,

  8. [8]

    36 Jinyang Li, Binyuan Hui, Ge Qu, Binhua Li, Jiaxi Yang, Bowen Li, Bailin Wang, Bowen Qin, Rongyu Cao, Ruiying Geng, et al

    Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, et al. Tritonbench: Benchmarking large language model capabilities for generating triton operators.arXiv preprint arXiv:2502.14752, 2025a. Kefan Li, Hongyue Yu, Tingyu Guo, Shijie Cao, and Yuan Yuan. Cocoevo: Co-evolution of programs a...

  9. [9]

    Cuda-l1: Improving cuda optimization via contrastive reinforcement learning.arXiv preprint arXiv:2507.14111, 2025

    URL https://arxiv.org/abs/ 2507.14111. Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, R´emi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode.Science, 378(6624):1092–1097,

  10. [10]

    Scattered forest search: Smarter code space exploration with LLMs

    Jonathan Light, Yue Wu, Yiyou Sun, Wenchao Yu, Xujiang Zhao, Ziniu Hu, Haifeng Chen, Wei Cheng, et al. Scattered forest search: Smarter code space exploration with llms.arXiv preprint arXiv:2411.05010,

  11. [11]

    Under review

    11 Preprint. Under review. Wei Liu, Jiawei Xu, Yingru Li, Longtao Zheng, Tianjian Li, Qian Liu, and Junxian He. Dr. kernel: Reinforcement learning done right for triton kernel generations.arXiv preprint arXiv:2602.05885,

  12. [12]

    CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

    Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis.arXiv preprint arXiv:2203.13474,

  13. [13]

    Alexander Novikov, Ng ˆan V ˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and al...

  14. [14]

    Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini

    Anne Ouyang, Simon Guo, Simran Arora, Alex L Zhang, William Hu, Christopher R´e, and Azalia Mirhoseini. Kernelbench: Can llms write efficient gpu kernels?arXiv preprint arXiv:2502.10517,

  15. [15]

    Humaneval-xl: A multilingual code gen- eration benchmark for cross-lingual natural language generalization.arXiv preprint arXiv:2402.16694, 2024

    Qiwei Peng, Yekun Chai, and Xuhong Li. Humaneval-xl: A multilingual code gen- eration benchmark for cross-lingual natural language generalization.arXiv preprint arXiv:2402.16694,

  16. [16]

    Code Llama: Open Foundation Models for Code

    Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation models for code.arXiv preprint arXiv:2308.12950,

  17. [17]

    Seed-coder: Let the code model curate data for itself.arXiv preprint arXiv:2506.03524, 2025

    ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, et al. Seed-coder: Let the code model curate data for itself.arXiv preprint arXiv:2506.03524,

  18. [18]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    URLhttps://arxiv.org/abs/2303.11366. Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267,

  19. [19]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute opti- mally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314,

  20. [20]

    Kernelskill: A multi-agent framework for gpu kernel optimization

    Qitong Sun, Jun Han, Tianlin Li, Zhe Tang, Sheng Chen, Fei Yang, Aishan Liu, Xianglong Liu, and Yang Liu. Kernelskill: A multi-agent framework for gpu kernel optimization. arXiv preprint arXiv:2603.10085,

  21. [21]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291,

  22. [22]

    Astra: A multi-agent system for gpu kernel performance optimization.arXiv preprint arXiv:2509.07506, 2025

    URLhttps://arxiv.org/abs/2509.07506. Shanli Xing, Yiyan Zhai, Alexander Jiang, Yixin Dong, Yong Wu, Zihao Ye, Charlie Ruan, Yingyi Huang, Yineng Zhang, Liangsheng Yin, et al. Flashinfer-bench: Building the virtuous cycle for ai-driven llm systems.arXiv preprint arXiv:2601.00227,

  23. [23]

    WizardLM: Empowering large pre-trained language models to follow complex instructions

    URLhttps://arxiv.org/abs/2304.12244. Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. Textgrad: Automatic ”differentiation” via text.ArXiv, abs/2406.07496,

  24. [24]

    Multi-swe-bench: A multilingual benchmark for issue resolving.arXiv preprint arXiv:2504.02605, 2025

    Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, et al. Multi-swe-bench: A multilingual benchmark for issue resolving.arXiv preprint arXiv:2504.02605,

  25. [25]

    Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

    Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsid- har Reddy Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. Agentic context engineering: Evolving contexts for self-improving language models.ArXiv, abs/2510.04618, 2025b. Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin,...

  26. [26]

    vs Expert

    URL https://arxiv.org/abs/2308. 10144. A Supplementary Material We open-source the implementation of AdaExplore in https://github.com/StigLidu/ AdaExplore. 13 Preprint. Under review. B Evaluation on FlashInfer-Bench B.1 Setup KernelBench evaluates kernel runtime optimization on general-purpose PyTorch opera- tor rewrites. To further assess performance on ...

  27. [27]

    This demonstrates that LLM agents can discover hardware-efficient implementations that exceed expert-tuned CUDA for memory-bound workloads

    loads the entire hidden dimension ( H=2048) in a single tile ( BLOCK=2048), performing the fused add, variance reduction, and scaling entirely in registers. This demonstrates that LLM agents can discover hardware-efficient implementations that exceed expert-tuned CUDA for memory-bound workloads. GEMM: Fundamentally Hard Due to Heavy Human Optimization.The...

  28. [28]

    sqrt ( sumsq / tl

    inv_rms = 1.0 / tl . sqrt ( sumsq / tl . cast (N , tl . float32 ) + tl . cast ( EPS , tl . float32 ) ) w_val = tl . load ( w_ptr + offs * stride_w , mask = mask , other =0.0) y = x_f * tl . cast ( w_val , tl . float32 ) * inv_rms tl . store ( out_ptr + pid * stride_outm + offs * stride_outn , tl . cast (y , tl . bfloat16 ) , mask = mask ) Figure 4: Best A...