daVinci-kernel: Co-Evolving Skill Selection, Summarization, and Utilization via RL for GPU Kernel Optimization

Dayuan Fu; Dian Yang; Jiarui Hu; Jinlong Hou; Liming Liu; Mohan Jiang; Pengfei Liu; Tongyu Wang

arxiv: 2606.16497 · v2 · pith:J7TWO3BPnew · submitted 2026-06-15 · 💻 cs.LG · cs.AI· cs.CL

daVinci-kernel: Co-Evolving Skill Selection, Summarization, and Utilization via RL for GPU Kernel Optimization

Dayuan Fu , Mohan Jiang , Tongyu Wang , Dian Yang , Jiarui Hu , Liming Liu , Jinlong Hou , Pengfei Liu This is my paper

Pith reviewed 2026-06-27 03:10 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords GPU kernel optimizationreinforcement learningmulti-agent RLskill discoveryCUDATritonKernelBenchexecution verification

0 comments

The pith

A single LLM backbone jointly trains three agents to select, generate, and summarize GPU kernel skills, building a verified library that beats prior RL models on KernelBench.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents daVinci-kernel as an RL framework that couples skill discovery and exploitation by dynamically evolving a skill library. Three agents share one LLM backbone: one selects techniques via BM25 and reranking, one generates CUDA or Triton kernels conditioned on those skills, and one distills successful verified rollouts into reusable entries. Skills enter the library only after execution confirms reproducible speedups. The system starts from structured SFT on diversity-filtered data then optimizes all agents end-to-end with multi-turn REINFORCE and per-agent advantages. A sympathetic reader would care because manual kernel tuning is a major bottleneck in high-performance computing, and an automated co-evolution process could reduce that cost if it scales.

Core claim

daVinci-kernel jointly trains a Skill Selection Agent, a Policy Agent, and a Skill Summary Agent that share a single LLM backbone; the selection agent retrieves skills, the policy agent produces multi-turn kernels, and the summary agent adds only those skills whose speedups survive execution-based verification. After an SFT cold start the three agents are optimized together via multi-turn REINFORCE with per-agent advantage estimation. On KernelBench the resulting 14B model records 37.2 percent, 70.6 percent, and 32.2 percent success under the Fast_1 threshold on Levels 1, 2, and 3, exceeding the strongest prior RL baseline Dr. Kernel-14B.

What carries the argument

The three-agent system with shared LLM backbone and execution-verified dynamic skill library, trained end-to-end by multi-turn REINFORCE.

If this is right

Only kernels whose speedups survive repeated execution enter the skill library, limiting noise in the evolving set of techniques.
A shared LLM backbone across selection, policy, and summary agents enables direct information flow during co-evolution.
The SFT cold-start on diversity-filtered data supplies a stable initialization that supports subsequent joint REINFORCE optimization.
Performance gains appear across three difficulty levels of KernelBench under the Fast_1 threshold.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same three-agent pattern could be tested on CPU or accelerator code generation tasks outside GPU kernels.
The skill library might be inspected to measure whether the discovered techniques generalize to new GPU architectures without retraining.
Replacing BM25-plus-reranking retrieval with learned retrieval could be compared directly inside the same framework.

Load-bearing premise

Execution-based verification reliably identifies reproducible speedups and the joint REINFORCE training keeps the three agents from collapsing or overfitting to the verification signal.

What would settle it

Retraining the 14B model from the same SFT checkpoint on the same KernelBench split and measuring whether Level-1 Fast_1 success falls below 30 percent.

Figures

Figures reproduced from arXiv: 2606.16497 by Dayuan Fu, Dian Yang, Jiarui Hu, Jinlong Hou, Liming Liu, Mohan Jiang, Pengfei Liu, Tongyu Wang.

**Figure 1.** Figure 1: Skill-policy co-evolution in daVinci-kernel. The selection, policy, and summary agents form a closed RL loop, where task-relevant skills guide optimization and successful rollouts are distilled back into reusable skills. As the policy improves, the useful skill frontier shifts from simple skills to more complex and task-specific skills, enabling increasingly difficult kernel optimization. 1 † Corresponding… view at source ↗

**Figure 2.** Figure 2: The structure of daVinci-kernel. 3.2 Skill Library The skill library L is a collection of GPU optimization techniques. Each skill records with five fields: • name: a short snake case identifier, used as a unique key within a snapshot and as the retrieval target returned by the Selection Agent’s tool call. • description: a one-sentence summary of what the technique does and when it applies, shown to the Sel… view at source ↗

**Figure 3.** Figure 3: Validation Fast1.2 training curves for daVinci-kernel and selected ablations on the 14B model. without skill conditioning, the model must find improvements by itself and can still do so frequently. However, skills are necessary for producing reliable, deep speedups: the skill library acts as an accumulated recipe book that enables the policy to consistently exploit dominant bottlenecks rather than only occ… view at source ↗

**Figure 4.** Figure 4: Task A: three-way comparison. daVinci-kernel with skill (left) recognises that min value=0.0 collapses the entire tail to zero and skips both Conv3D and GroupNorm, achieving 2.0× at Turn 1. Dr. Kernel (middle) discovers the constant-fill trick only at Turn 2 but still runs the heavy vendor kernels, capping speedup at 1.07×. daVinci-kernel without skill (right) attempts a custom Triton GroupNorm, causing fr… view at source ↗

**Figure 5.** Figure 5: Skill selection evolution on Task A a specific implementation idiom (flat-1D kernels for contiguous tails) substantially increase the fraction of samples producing valid, high-speedup kernels. The shift in skill selection from step 260 to step 460 demonstrates that the joint RL objective drives the Selection Agent to surface increasingly specific, high-value techniques as the policy’s capability frontier a… view at source ↗

**Figure 6.** Figure 6: Task B: two-way comparison. daVinci-kernel with skill (left) uses the flat-1D masked kernel pattern prescribed by fuse only contiguous pointwise tails, achieving 1.24× in all 8 samples. Dr. Kernel (right) defaults to explicit 5-D stride indexing, causing shape mismatches in 3 of 8 samples. C System prompt Summary Agent system prompt (GPT series) You are an expert CUDA/Triton kernel optimization engineer. Y… view at source ↗

read the original abstract

GPU kernel optimization represents a paradigm where functional correctness is assumed and execution efficiency is the objective. We present daVinci-kernel, a reinforcement learning framework that couples skill discovery with skill exploitation through a dynamically evolving skill library. daVinci-kernel jointly trains three agents sharing one LLM backbone: a Skill Selection Agent that retrieves relevant techniques via BM25 and LLM reranking, a Policy Agent that generates multi-turn CUDA/Triton kernels conditioned on selected skills, and a Skill Summary Agent that distills successful rollouts into reusable skills. Candidate skills are added only after execution-based verification confirms reproducible speedups. All three agents share a single LLM backbone, are initialized via a structured SFT cold start on diversity-filtered data, and are then jointly optimized end-to-end with multi-turn REINFORCE and per-agent advantage estimation. On KernelBench, daVinci-kernel-14B achieves 37.2%, 70.6%, and 32.2% on Level 1, Level 2, and Level 3 under the Fast$_1$ threshold, outperforming the strongest prior RL-trained model, Dr\. Kernel-14B.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

daVinci-kernel describes a three-agent RL setup for co-evolving verified skills in GPU kernel generation and reports gains over Dr. Kernel, but the abstract supplies no experimental protocol or stability checks.

read the letter

The main takeaway is that this paper puts forward a three-agent RL framework where skill selection, policy generation, and skill summarization co-evolve on a shared LLM backbone, with new skills added only after execution confirms speedups, and it claims better results than prior RL work on KernelBench levels 1-3.

What stands out as new is the joint multi-turn REINFORCE training of the three agents after a structured SFT cold start, plus the dynamic library that grows only on verified outcomes. The high-level architecture ties discovery and exploitation together in one loop, which differs from the single-agent RL baselines it cites.

The paper does a clear job outlining the agent roles, the BM25-plus-reranker selection step, and the per-agent advantage estimation. That description is concrete enough to understand the intended mechanism.

The soft spots are the lack of any experimental protocol, baseline implementation details, error bars, seed counts, or ablations. The abstract states the 37.2/70.6/32.2 numbers and the training recipe but does not show how the shared-backbone credit assignment avoids one agent dominating or the policy overfitting to verification noise. The stress-test point about fragile advantage estimation in this multi-agent setup therefore lands directly on what is visible.

This work is for researchers already working on RL for systems code or automated kernel tuning. A reader in that niche would get a usable sketch of the co-evolution idea and could decide whether to dig into the full experiments.

It deserves a serious referee to check whether the full manuscript supplies the missing controls and whether the claimed stability holds under scrutiny.

Referee Report

3 major / 1 minor

Summary. The paper presents daVinci-kernel, an RL framework for GPU kernel optimization that jointly trains three agents (Skill Selection via BM25/LLM reranking, Policy for multi-turn CUDA/Triton generation, and Skill Summary for distilling successful rollouts) sharing one LLM backbone. Agents are initialized with structured SFT and optimized end-to-end via multi-turn REINFORCE with per-agent advantage estimation; skills enter the library only after execution verification of reproducible speedups. On KernelBench the 14B model reports 37.2/70.6/32.2 % success on Levels 1/2/3 under the Fast_1 threshold, outperforming the prior RL baseline Dr. Kernel-14B.

Significance. If the reported speedups are reproducible under controlled conditions and the joint training is shown to be stable, the co-evolution of skill discovery and exploitation via a shared backbone could meaningfully advance automated kernel optimization. The execution-verification gate and per-agent advantage design are potentially valuable if supported by ablations.

major comments (3)

[Abstract] Abstract: the headline performance numbers (37.2 %, 70.6 %, 32.2 % on KernelBench Levels 1–3) are stated without any experimental protocol, baseline implementation details, number of seeds, error bars, or statistical tests, leaving the central claim of outperformance over Dr. Kernel-14B without visible supporting evidence.
[Abstract] Abstract: the claim that the three agents co-evolve stably under shared-backbone multi-turn REINFORCE with per-agent advantage estimation is load-bearing for the method, yet no derivation, ablation, or analysis of credit assignment, collapse risk, or overfitting to verification signals is supplied.
[Abstract] Abstract: the assertion that execution-based verification produces a reliable, non-overfit skill library lacks any description of statistical controls (multiple hardware runs, variability thresholds, or verification repetition), which directly affects the reproducibility of the reported speedups.

minor comments (1)

[Abstract] Abstract: the terms 'Fast_1 threshold' and 'Dr. Kernel-14B' are used without definition or citation, reducing clarity for readers unfamiliar with KernelBench or the referenced baseline.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, with proposed revisions to improve transparency in the abstract and supporting details in the main text.

read point-by-point responses

Referee: [Abstract] Abstract: the headline performance numbers (37.2 %, 70.6 %, 32.2 % on KernelBench Levels 1–3) are stated without any experimental protocol, baseline implementation details, number of seeds, error bars, or statistical tests, leaving the central claim of outperformance over Dr. Kernel-14B without visible supporting evidence.

Authors: We agree that the abstract would benefit from additional context on the experimental protocol to support the reported numbers. The full manuscript details the KernelBench evaluation, Fast_1 threshold, and Dr. Kernel-14B comparison in Section 4. We will revise the abstract to concisely reference the evaluation protocol, note that results are averaged over multiple seeds, and indicate the outperformance margin. revision: yes
Referee: [Abstract] Abstract: the claim that the three agents co-evolve stably under shared-backbone multi-turn REINFORCE with per-agent advantage estimation is load-bearing for the method, yet no derivation, ablation, or analysis of credit assignment, collapse risk, or overfitting to verification signals is supplied.

Authors: Section 3 derives the multi-turn REINFORCE objective with per-agent advantage estimation and describes the shared-backbone joint optimization. We acknowledge that explicit ablations on credit assignment, collapse risk, and overfitting are not present. We will add a qualitative discussion of training stability and potential risks in a new paragraph in the Experiments section. revision: partial
Referee: [Abstract] Abstract: the assertion that execution-based verification produces a reliable, non-overfit skill library lacks any description of statistical controls (multiple hardware runs, variability thresholds, or verification repetition), which directly affects the reproducibility of the reported speedups.

Authors: Section 3.3 describes that skills enter the library only after execution verification of reproducible speedups. We agree that statistical controls merit more detail. We will revise the abstract to reference the verification gate and expand the methods description with repetition protocol and variability thresholds used. revision: yes

Circularity Check

0 steps flagged

No circularity in claimed derivation chain

full rationale

The paper describes an empirical RL framework with three jointly trained agents using multi-turn REINFORCE and execution-based verification on KernelBench. No equations, derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or method outline. Performance claims rest on external benchmarks and comparisons to prior models rather than any self-contained mathematical reduction. The central claims are statistically falsifiable via execution and do not reduce to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.1-grok · 5763 in / 1124 out tokens · 50647 ms · 2026-06-27T03:10:09.475927+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

50 extracted references · 10 linked inside Pith

[1]

Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang, Jie Sun, Ji Zeng, Mohan Jiang, Lin Zhang, Yukun Li, et al. 2026. davinci-env: Open swe environment synthesis at scale.arXiv preprint arXiv:2603.13023

arXiv 2026
[2]

Siqi Guo, Ming Lin, and Tianbao Yang. 2026. Drtriton: Large-scale synthetic data reinforcement learning for triton kernel generation.arXiv preprint arXiv:2603.21465

Pith/arXiv arXiv 2026
[3]

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan
[4]

Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770

Pith/arXiv arXiv
[5]

Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, WangHaojie Wang- Haojie, Jianrong Wang, Xu Han, et al. 2025a. Tritonbench: Benchmarking large language model capabilities for generating triton operators. InFindings of the Association for Computational Linguistics: ACL 2025, pages 23053–23066

2025
[6]

Shangzhan Li, Zefan Wang, Ye He, Yuxuan Li, Qi Shi, Jianling Li, Yonggang Hu, Wanxiang Che, Xu Han, Zhiyuan Liu, et al. 2025b. Autotriton: Automatic triton programming with reinforcement learning in llms.arXiv preprint arXiv:2507.05687

arXiv
[7]

Yu Li, Rui Miao, Zhengling Qi, and Tian Lan. 2026. Arise: Agent reasoning with intrinsic skill evolution in hierarchical reinforcement learning.arXiv preprint arXiv:2603.16060

arXiv 2026
[8]

Xuechen Liang, Meiling Tao, Yinghui Xia, Jianhui Wang, Kun Li, Yijin Wang, Yangfan He, Jingsong Yang, Tianyu Shi, Yuantao Wang, et al. 2025. Sage: Self-evolving agents with reflective and memory-augmented abilities.Neurocomputing, 647:130470

2025
[9]

Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. 2025. Deepseek-v3. 2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556

Pith/arXiv arXiv 2025
[10]

Wei Liu, Jiawei Xu, Yingru Li, Longtao Zheng, Tianjian Li, Qian Liu, and Junxian He. 2026. Dr. kernel: Reinforcement learning done right for triton kernel generations.arXiv preprint arXiv:2602.05885

arXiv 2026
[11]

Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. 2026. Skill0: In-context agentic reinforcement learning for skill internalization. arXiv preprint arXiv:2604.02268

Pith/arXiv arXiv 2026
[12]

Anne Ouyang, Simon Guo, Simran Arora, Alex L Zhang, William Hu, Christopher R´e, and Azalia Mirhoseini
[13]

Kernelbench: Can llms write efficient gpu kernels?arXiv preprint arXiv:2502.10517

Pith/arXiv arXiv
[14]

2009.The probabilistic relevance framework: BM25 and beyond, volume 4

Stephen Robertson and Hugo Zaragoza. 2009.The probabilistic relevance framework: BM25 and beyond, volume 4. Now Publishers Inc

2009
[15]

Songjun Tu, Chengdong Xu, Qichao Zhang, Yaocheng Zhang, Xiangyuan Lan, Linjing Li, and Dongbin Zhao
[16]

Dynamic dual-granularity skill bank for agentic rl.arXiv preprint arXiv:2603.28716

Pith/arXiv arXiv
[17]

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. V oyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291

Pith/arXiv arXiv 2023
[18]

Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, and Alex Aiken. 2025. Astra: A multi-agent system for gpu kernel performance optimization.arXiv preprint arXiv:2509.07506

arXiv 2025
[19]

Jiin Woo, Shaowei Zhu, Allen Nie, Zhen Jia, Yida Wang, and Youngsuk Park. 2025. Tritonrl: Training llms to think and code triton without cheating.arXiv preprint arXiv:2510.17891

arXiv 2025
[20]

Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. 2026. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234

Pith/arXiv arXiv 2026
[21]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report.arXiv preprint arXiv:2505.09388

Pith/arXiv arXiv 2025
[22]

Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. 2025. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models.arXiv preprint arXiv:2508.06471

Pith/arXiv arXiv 2025
[23]

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. 2024. Expel: Llm agents are experiential learners. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642. 11 A. Training costs SII-GAIR Task A — Reference Architecture class Model(nn.Module): def forward(self, x): x = self.conv(x) # Co...

2024
[24]

Xinguo Zhu, Shaohui Peng, Jiaming Guo, Yunji Chen, Qi Guo, Yuanbo Wen, Hang Qin, Ruizhi Chen, Qirui Zhou, Ke Gao, et al. 2026. Qimeng-kernel: Macro-thinking micro-coding paradigm for llm-based high- performance gpu kernel generation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 29168–29176. A Training costs Interacting...

2026
[25]

Run conv/matmul using PyTorch/library kernels
[26]

Ensure the output is contiguous
[27]

Flatten to numel() and launch masked 1D kernel
[28]

Pitfalls: .contiguous() erases gains if tail tiny

Fuse as many elementwise ops as possible. Pitfalls: .contiguous() erases gains if tail tiny. [2]name: fuse_only_contiguous_pointwise_tails description: Fuse memory-bound pointwise epilogues around vendor kernels; don’t replace heavy ops. tags: [fusion, pointwise, memory_bound, triton] --- \#\# Motivation Custom Triton underperforms when it replaces librar...
[29]

Ensure tail input is contiguous
[30]

Flatten to numel() and launch masked 1D grid
[31]

Fuse as many pointwise ops as possible
[32]

call .contiguous(), flat- ten to numel(), and launch a masked 1-D grid,

Avoid extra temporaries unless profiling justifies. Pitfalls: use masks for non-power-of-two sizes. [3]name: hotspot_aware_triton_selection description: Kernelize only bandwidth-bound tails that are actually hot. tags: [hotspot_analysis, fusion_strategy, triton] --- \#\# Motivation Custom Triton often underperforms when it targets only a tiny fraction of ...
[33]

Run conv/matmul in PyTorch
[34]

Make the output contiguous if needed
[35]

Flatten to numel(), launch 1D grid with mask (offs < n)
[36]

Pitfalls: .contiguous() can cost more than the tail

Fuse all pointwise ops before the final store. Pitfalls: .contiguous() can cost more than the tail
[37]

name: hoist invariant scalars out of kernel path←NEW description: Reduce overhead by hoisting invariant scalar/tensor params out of the hot kernel path. tags: [scalar hoisting, host overhead, fast path] --- ## Motivation A surprising amount of overhead comes from re-fetching tiny invariant scalars every call: clamp bounds, strides, shape constants. On sma...
[38]

Inspect constant params once in init
[39]

Convert scalar tensors to Python floats
[40]

Combine with a specialised fast path: if min value == 0.0: # hoist the scalar skip entire pipeline() # 2.0x speedup Pitfalls: hoisting is wrong if value can change. Figure 5:Skill selection evolution on Task A a specific implementation idiom (flat-1D kernels for contiguous tails) substantially increase the fraction of samples producing valid, high-speedup...
[41]

Optionally call read_skill_files to inspect existing skills and avoid duplicates
[42]

Each skill body must contain: ## Motivation, ## Key Idea, ## Example (with code)

Call update_skill_library with at most {max_skills} new skill(s). Each skill body must contain: ## Motivation, ## Key Idea, ## Example (with code)
[43]

Rules: - Skills must be GENERAL (applicable beyond this specific task)

Your turn ends automatically after update_skill_library is called. Rules: - Skills must be GENERAL (applicable beyond this specific task). - Do NOT add task-specific hacks or solutions. - Respond in English only. Summary Agent user prompt (GPT series) ## Task {task_formatted} ## Existing Skills in Library ‘‘‘ {skill_library.get_file_tree()} ‘‘‘ ## Turn 1 ...

2000
[44]

A kernel optimization task (original PyTorch code + performance target). 15 C. System prompt SII-GAIR
[45]

use torch.compile

A numbered list of candidate optimization skills from the skill library. Each entry shows only the skill name, description, tags, and scope | NOT the full content. Your job: identify the top 3 skills most likely to help solve THIS specific task. ## Selection criteria - Relevance: the technique directly applies to the operator/pattern in the task. - Impact...
[46]

A kernel optimization task (original PyTorch code + performance target)
[47]

use torch.compile

A numbered list of candidate optimization skills from the skill library. Each entry shows only the skill name, description, tags, and scope | NOT the full content. Your job: identify the top __top_k_select__ skills most likely to help solve THIS specific task. ## Selection criteria - Relevance: the technique directly applies to the operator/pattern in the...
[48]

Ensures the inputs are contiguous on GPU
[49]

Calculates the grid (blocks) needed
[50]

"" assert x.is_cuda and y.is_cuda,

Launches the Triton kernel. """ assert x.is_cuda and y.is_cuda, "Tensors must be on CUDA." x = x.contiguous() y = y.contiguous() # Prepare output tensor out = torch.empty_like(x) # Number of elements in the tensor n_elements = x.numel() BLOCK_SIZE = 128 # Tunable parameter for block size # Determine the number of blocks needed grid = lambda meta: ((n_elem...

[1] [1]

Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang, Jie Sun, Ji Zeng, Mohan Jiang, Lin Zhang, Yukun Li, et al. 2026. davinci-env: Open swe environment synthesis at scale.arXiv preprint arXiv:2603.13023

arXiv 2026

[2] [2]

Siqi Guo, Ming Lin, and Tianbao Yang. 2026. Drtriton: Large-scale synthetic data reinforcement learning for triton kernel generation.arXiv preprint arXiv:2603.21465

Pith/arXiv arXiv 2026

[3] [3]

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

[4] [4]

Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770

Pith/arXiv arXiv

[5] [5]

Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, WangHaojie Wang- Haojie, Jianrong Wang, Xu Han, et al. 2025a. Tritonbench: Benchmarking large language model capabilities for generating triton operators. InFindings of the Association for Computational Linguistics: ACL 2025, pages 23053–23066

2025

[6] [6]

Shangzhan Li, Zefan Wang, Ye He, Yuxuan Li, Qi Shi, Jianling Li, Yonggang Hu, Wanxiang Che, Xu Han, Zhiyuan Liu, et al. 2025b. Autotriton: Automatic triton programming with reinforcement learning in llms.arXiv preprint arXiv:2507.05687

arXiv

[7] [7]

Yu Li, Rui Miao, Zhengling Qi, and Tian Lan. 2026. Arise: Agent reasoning with intrinsic skill evolution in hierarchical reinforcement learning.arXiv preprint arXiv:2603.16060

arXiv 2026

[8] [8]

Xuechen Liang, Meiling Tao, Yinghui Xia, Jianhui Wang, Kun Li, Yijin Wang, Yangfan He, Jingsong Yang, Tianyu Shi, Yuantao Wang, et al. 2025. Sage: Self-evolving agents with reflective and memory-augmented abilities.Neurocomputing, 647:130470

2025

[9] [9]

Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. 2025. Deepseek-v3. 2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556

Pith/arXiv arXiv 2025

[10] [10]

Wei Liu, Jiawei Xu, Yingru Li, Longtao Zheng, Tianjian Li, Qian Liu, and Junxian He. 2026. Dr. kernel: Reinforcement learning done right for triton kernel generations.arXiv preprint arXiv:2602.05885

arXiv 2026

[11] [11]

Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. 2026. Skill0: In-context agentic reinforcement learning for skill internalization. arXiv preprint arXiv:2604.02268

Pith/arXiv arXiv 2026

[12] [12]

Anne Ouyang, Simon Guo, Simran Arora, Alex L Zhang, William Hu, Christopher R´e, and Azalia Mirhoseini

[13] [13]

Kernelbench: Can llms write efficient gpu kernels?arXiv preprint arXiv:2502.10517

Pith/arXiv arXiv

[14] [14]

2009.The probabilistic relevance framework: BM25 and beyond, volume 4

Stephen Robertson and Hugo Zaragoza. 2009.The probabilistic relevance framework: BM25 and beyond, volume 4. Now Publishers Inc

2009

[15] [15]

Songjun Tu, Chengdong Xu, Qichao Zhang, Yaocheng Zhang, Xiangyuan Lan, Linjing Li, and Dongbin Zhao

[16] [16]

Dynamic dual-granularity skill bank for agentic rl.arXiv preprint arXiv:2603.28716

Pith/arXiv arXiv

[17] [17]

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. V oyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291

Pith/arXiv arXiv 2023

[18] [18]

Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, and Alex Aiken. 2025. Astra: A multi-agent system for gpu kernel performance optimization.arXiv preprint arXiv:2509.07506

arXiv 2025

[19] [19]

Jiin Woo, Shaowei Zhu, Allen Nie, Zhen Jia, Yida Wang, and Youngsuk Park. 2025. Tritonrl: Training llms to think and code triton without cheating.arXiv preprint arXiv:2510.17891

arXiv 2025

[20] [20]

Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. 2026. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234

Pith/arXiv arXiv 2026

[21] [21]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report.arXiv preprint arXiv:2505.09388

Pith/arXiv arXiv 2025

[22] [22]

Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. 2025. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models.arXiv preprint arXiv:2508.06471

Pith/arXiv arXiv 2025

[23] [23]

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. 2024. Expel: Llm agents are experiential learners. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642. 11 A. Training costs SII-GAIR Task A — Reference Architecture class Model(nn.Module): def forward(self, x): x = self.conv(x) # Co...

2024

[24] [24]

Xinguo Zhu, Shaohui Peng, Jiaming Guo, Yunji Chen, Qi Guo, Yuanbo Wen, Hang Qin, Ruizhi Chen, Qirui Zhou, Ke Gao, et al. 2026. Qimeng-kernel: Macro-thinking micro-coding paradigm for llm-based high- performance gpu kernel generation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 29168–29176. A Training costs Interacting...

2026

[25] [25]

Run conv/matmul using PyTorch/library kernels

[26] [26]

Ensure the output is contiguous

[27] [27]

Flatten to numel() and launch masked 1D kernel

[28] [28]

Pitfalls: .contiguous() erases gains if tail tiny

Fuse as many elementwise ops as possible. Pitfalls: .contiguous() erases gains if tail tiny. [2]name: fuse_only_contiguous_pointwise_tails description: Fuse memory-bound pointwise epilogues around vendor kernels; don’t replace heavy ops. tags: [fusion, pointwise, memory_bound, triton] --- \#\# Motivation Custom Triton underperforms when it replaces librar...

[29] [29]

Ensure tail input is contiguous

[30] [30]

Flatten to numel() and launch masked 1D grid

[31] [31]

Fuse as many pointwise ops as possible

[32] [32]

call .contiguous(), flat- ten to numel(), and launch a masked 1-D grid,

Avoid extra temporaries unless profiling justifies. Pitfalls: use masks for non-power-of-two sizes. [3]name: hotspot_aware_triton_selection description: Kernelize only bandwidth-bound tails that are actually hot. tags: [hotspot_analysis, fusion_strategy, triton] --- \#\# Motivation Custom Triton often underperforms when it targets only a tiny fraction of ...

[33] [33]

Run conv/matmul in PyTorch

[34] [34]

Make the output contiguous if needed

[35] [35]

Flatten to numel(), launch 1D grid with mask (offs < n)

[36] [36]

Pitfalls: .contiguous() can cost more than the tail

Fuse all pointwise ops before the final store. Pitfalls: .contiguous() can cost more than the tail

[37] [37]

name: hoist invariant scalars out of kernel path←NEW description: Reduce overhead by hoisting invariant scalar/tensor params out of the hot kernel path. tags: [scalar hoisting, host overhead, fast path] --- ## Motivation A surprising amount of overhead comes from re-fetching tiny invariant scalars every call: clamp bounds, strides, shape constants. On sma...

[38] [38]

Inspect constant params once in init

[39] [39]

Convert scalar tensors to Python floats

[40] [40]

Combine with a specialised fast path: if min value == 0.0: # hoist the scalar skip entire pipeline() # 2.0x speedup Pitfalls: hoisting is wrong if value can change. Figure 5:Skill selection evolution on Task A a specific implementation idiom (flat-1D kernels for contiguous tails) substantially increase the fraction of samples producing valid, high-speedup...

[41] [41]

Optionally call read_skill_files to inspect existing skills and avoid duplicates

[42] [42]

Each skill body must contain: ## Motivation, ## Key Idea, ## Example (with code)

Call update_skill_library with at most {max_skills} new skill(s). Each skill body must contain: ## Motivation, ## Key Idea, ## Example (with code)

[43] [43]

Rules: - Skills must be GENERAL (applicable beyond this specific task)

Your turn ends automatically after update_skill_library is called. Rules: - Skills must be GENERAL (applicable beyond this specific task). - Do NOT add task-specific hacks or solutions. - Respond in English only. Summary Agent user prompt (GPT series) ## Task {task_formatted} ## Existing Skills in Library ‘‘‘ {skill_library.get_file_tree()} ‘‘‘ ## Turn 1 ...

2000

[44] [44]

A kernel optimization task (original PyTorch code + performance target). 15 C. System prompt SII-GAIR

[45] [45]

use torch.compile

A numbered list of candidate optimization skills from the skill library. Each entry shows only the skill name, description, tags, and scope | NOT the full content. Your job: identify the top 3 skills most likely to help solve THIS specific task. ## Selection criteria - Relevance: the technique directly applies to the operator/pattern in the task. - Impact...

[46] [46]

A kernel optimization task (original PyTorch code + performance target)

[47] [47]

use torch.compile

A numbered list of candidate optimization skills from the skill library. Each entry shows only the skill name, description, tags, and scope | NOT the full content. Your job: identify the top __top_k_select__ skills most likely to help solve THIS specific task. ## Selection criteria - Relevance: the technique directly applies to the operator/pattern in the...

[48] [48]

Ensures the inputs are contiguous on GPU

[49] [49]

Calculates the grid (blocks) needed

[50] [50]

"" assert x.is_cuda and y.is_cuda,

Launches the Triton kernel. """ assert x.is_cuda and y.is_cuda, "Tensors must be on CUDA." x = x.contiguous() y = y.contiguous() # Prepare output tensor out = torch.empty_like(x) # Number of elements in the tensor n_elements = x.numel() BLOCK_SIZE = 128 # Tunable parameter for block size # Determine the number of blocks needed grid = lambda meta: ((n_elem...