EGG: An Expert-Guided Agent Framework for Kernel Generation
Pith reviewed 2026-06-26 05:04 UTC · model grok-4.3
The pith
Expert-guided two-stage decomposition lets LLM agents generate correct GPU kernels with 2.13x average speedup over PyTorch.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EGG decomposes kernel generation into algorithmic structure design, which creates a high-quality computational foundation, followed by hardware-specific tuning that applies parallel mapping, tensor tiling, and memory optimization. Stage-aware multi-agent collaboration handles context across and within stages to maintain stable optimization paths. Experiments show the resulting kernels deliver a 2.13 times average speedup over PyTorch on KernelBench and real workloads while outperforming other agent-based and RL-based generators.
What carries the argument
Two-stage hierarchical decomposition of kernel generation (algorithmic structure design then hardware-specific tuning) supported by stage-aware multi-agent collaboration for context management.
If this is right
- Kernels generated by EGG run 2.13 times faster than PyTorch on average across tested workloads.
- The staged approach structures the design space so refinements build progressively from structure to hardware details.
- Multi-agent context management keeps optimization trajectories stable by sharing information within and between stages.
- The framework outperforms existing agent-based and reinforcement-learning kernel generators on the same benchmarks.
Where Pith is reading between the lines
- If the two-stage split proves essential, similar expert decompositions could be tested for generating optimized code in domains beyond GPU kernels.
- The reliance on expert principles suggests that collecting more detailed workflow traces from human kernel developers might further improve the agents.
- Performance gains on real-world workloads imply the method could reduce inference costs for large models if integrated into standard compilation pipelines.
Load-bearing premise
That splitting kernel generation into algorithmic structure followed by hardware tuning and managing it with multi-agent context will consistently produce both correct and faster kernels than prior LLM approaches.
What would settle it
A benchmark case on KernelBench where EGG produces either incorrect kernel code or performance no better than PyTorch or competing agent methods.
Figures
read the original abstract
High-performance GPU kernels are critical for reducing the exponentially growing computational costs of large language models (LLMs), but their development heavily relies on manual tuning by domain experts. While recent advances in LLM-based approaches show promise for automating kernel generation, they still struggle to achieve both correctness and high performance. This limitation primarily arises from the lack of domain-specific optimization guidance, hindering effective exploration of the optimization space. We propose EGG, an Expert-Guided Agent Framework for Kernel Generation, which incorporates expert optimization principles to guide LLMs' decisions. Inspired by expert workflows, we decompose kernel generation into two hierarchical stages: 1) algorithmic structure design, which establishes a high-quality computational structure foundation; 2) hardware-specific tuning, which performs targeted adjustments through parallel mapping, tensor tiling, and memory optimization. This staged decomposition defines explicit optimization objectives, structuring the design space to achieve progressive refinement. To this end, a stage-aware multi-agent collaboration mechanism is designed for inter and intra-stage context management, ensuring stable optimization trajectories. Experiments on KernelBench and real-world workloads show that EGG achieves a 2.13x average speedup over PyTorch, outperforming existing agent-based and RL-based approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes EGG, an Expert-Guided Agent Framework for Kernel Generation. It decomposes kernel generation into two hierarchical stages—algorithmic structure design followed by hardware-specific tuning (parallel mapping, tensor tiling, memory optimization)—and introduces a stage-aware multi-agent collaboration mechanism for context management. The central empirical claim is that EGG achieves a 2.13x average speedup over PyTorch on KernelBench and real-world workloads while outperforming existing agent-based and RL-based approaches.
Significance. If the reported speedups and correctness hold under rigorous verification, the work would offer a concrete advance in automated high-performance kernel generation by structuring LLM agent workflows around expert-inspired decomposition and stage-aware coordination. This could reduce reliance on manual expert tuning for LLM inference kernels.
major comments (2)
- [Experiments / Abstract] The manuscript provides no description of the experimental protocol (number of runs, statistical tests, hardware platform details, or how kernel correctness was verified against reference implementations). This information is load-bearing for the central 2.13x speedup claim and the comparison to baselines.
- [Experiments] No details are given on baseline implementations (e.g., exact agent/RL methods reproduced, their hyper-parameters, or whether they also received the two-stage decomposition). Without this, it is impossible to attribute performance gains specifically to the stage-aware multi-agent mechanism.
minor comments (2)
- [Introduction] The abstract and introduction use the term 'parameter-free' in passing when describing the framework; clarify whether any learned or tuned parameters remain in the agent prompts or decomposition rules.
- [Figures/Tables] Figure captions and table headers should explicitly state the number of workloads, the exact metric (e.g., mean speedup with standard deviation), and the PyTorch version used for the baseline.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on experimental transparency. We address each point below and will revise the manuscript to incorporate the requested details.
read point-by-point responses
-
Referee: [Experiments / Abstract] The manuscript provides no description of the experimental protocol (number of runs, statistical tests, hardware platform details, or how kernel correctness was verified against reference implementations). This information is load-bearing for the central 2.13x speedup claim and the comparison to baselines.
Authors: We agree that the manuscript does not currently include a dedicated description of the experimental protocol. In the revised version we will add an Experimental Setup subsection specifying the GPU hardware, number of runs performed, any statistical reporting, and the exact procedure used to verify kernel correctness (output equivalence and performance comparison) against reference implementations. revision: yes
-
Referee: [Experiments] No details are given on baseline implementations (e.g., exact agent/RL methods reproduced, their hyper-parameters, or whether they also received the two-stage decomposition). Without this, it is impossible to attribute performance gains specifically to the stage-aware multi-agent mechanism.
Authors: We acknowledge the absence of these baseline details. The revision will expand the Experiments section with precise descriptions of each reproduced baseline (agent-based and RL-based methods), the hyper-parameters employed, and explicit clarification on whether the two-stage decomposition was applied to the baselines. This will enable readers to isolate the contribution of the stage-aware collaboration mechanism. revision: yes
Circularity Check
No significant circularity
full rationale
The paper presents a proposed agent framework (EGG) for kernel generation, decomposed into algorithmic structure design and hardware-specific tuning stages, with a stage-aware multi-agent mechanism. The central claim is an empirical performance result (2.13x average speedup on KernelBench and real workloads) obtained via experiments, not a mathematical derivation or prediction. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or described structure. The result is self-contained as an empirical benchmark comparison against baselines, with no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Expert optimization principles can be effectively decomposed into algorithmic structure design and hardware-specific tuning stages that LLMs can follow.
invented entities (1)
-
Stage-aware multi-agent collaboration mechanism
no independent evidence
Reference graph
Works this paper leans on
-
[1]
"" Simple model that performs a single square matrix multiplication (C = A * B)
URL https://openreview.net/forum? id=Jb1WkNSfUB. Woo, J., Zhu, S., Nie, A., Jia, Z., Wang, Y ., and Park, Y . Tri- tonRL: Training LLMs to think and code Triton without cheating.arXiv preprint arXiv:2510.17891, 2025. Zhai, Y ., Yang, S., Pan, K., Zhang, R., Liu, S., Liu, C., Ye, Z., Ji, J., Zhao, J., Zhang, Y ., et al. Enabling tensor language model to as...
arXiv 2025
-
[2]
EVERY kernel function MUST have ‘@triton.jit‘ decorator -- MANDATORY
-
[4]
BLOCK sizes MUST be power-of-2 constexpr: 16, 32, 64, 128, 256
-
[5]
‘tl.program_id(axis)‘ only supports axis = 0, 1, 2 (max 3D grid) ## Triton Syntax Rules: - For matmul/conv/linear ops, prefer ‘tl.dot(a, b, allow_tf32=True)‘ over element-wise multiply-add - No ‘continue‘, ‘break‘, ‘return‘ inside loops -- use masking instead - No tensor indexing with loop vars: ‘x[:, i]‘ or ‘x[i, :]‘ is INVALID - No tuple unpacking insid...
-
[6]
Imports: ‘import torch, torch.nn as nn, triton, triton.language as tl‘ (and math if needed)
-
[7]
‘@triton.jit‘ kernel(s) -- MUST have this decorator
-
[8]
Wrapper function with grid calculation
-
[9]
__main__
‘class ModelNew(nn.Module)‘ -- REQUIRED Do NOT include: testing code, ‘if __name__ == "__main__"‘, get_inputs, get_init_inputs Example PyTorch: ’’’ $few_base ’’’ Example Triton: ’’’ 22 EGG: An Expert-Guided Agent Framework for Kernel Generation $few_new ’’’ Target: ‘‘‘python $kernel_src ‘‘‘ """ E.2. Stage System Prompts for Hardware-Specific Tuning During...
-
[10]
**Code Analysis **: Count kernels, identify operations, check for inefficiencies
-
[11]
**Performance Diagnosis **: Use metrics/latency to identify bottleneck type
-
[12]
worth_optimizing
**Root Cause **: Combine code + performance to find the core issue ## Optimization Categories (pick ONE if worth optimizing): ### 1. Operator Fusion Fuse consecutive ops into fewer kernels to reduce memory traffic and launch overhead. ### 2. Algorithm Replacement Replace naive algorithm with optimized variant. - For Attention: Flash Attention, online soft...
-
[13]
**Preserve correctness **: Maintain the same input/output behavior
-
[14]
**Apply the optimization **: Follow the implementation plan exactly
-
[15]
**Use valid Triton syntax **: - Every kernel MUST have ‘@triton.jit‘ decorator - Grid size MUST be > 0: use ‘triton.cdiv(N, BLOCK)‘ or ‘max(1, N // BLOCK)‘ - BLOCK sizes MUST be power-of-2: 16, 32, 64, 128, 256 - No ‘continue‘, ‘break‘, ‘return‘ inside kernels (use masking) - Prefer ‘tl.dot(a, b, allow_tf32=True)‘ for matmul operations
-
[16]
__main__
**Output format **: - Imports: ‘import torch, torch.nn as nn, triton, triton.language as tl‘ - ‘@triton.jit‘ kernel(s) - Wrapper function(s) - ‘class ModelNew(nn.Module)‘ -- REQUIRED - NO testing code, NO ‘if __name__ == "__main__"‘ Do NOT include: testing code, if __name__, get_inputs, get_init_inputs ‘‘‘python # <optimized Triton code> ‘‘‘ """ E.4.2. OP...
-
[17]
EVERY kernel function MUST have ‘@triton.jit‘ decorator
-
[18]
Grid size MUST be > 0: use ‘triton.cdiv(N, BLOCK)‘ or ‘max(1, N // BLOCK)‘
-
[19]
BLOCK sizes MUST be power-of-2: 16, 32, 64, 128, 256
-
[20]
‘tl.program_id(axis)‘ only supports axis = 0, 1, 2
-
[21]
No ‘continue‘, ‘break‘, ‘return‘ inside loops -- use masking
-
[22]
No tensor indexing with loop vars: ‘x[:, i]‘ is INVALID
-
[23]
mask shape MUST match data shape in tl.load/tl.store ## Missing Triton Functions (implement manually): - tl.tanh, tl.sigmoid, tl.gelu, tl.silu, tl.softmax, tl.mish ## OUTPUT FORMAT (STRICT):
-
[24]
Imports: torch, torch.nn, triton, triton.language as tl
-
[27]
REPAIRMODE When kernel execution fails, the code agent receives kernel code and debug analysis as input
class ModelNew(nn.Module) that calls your kernels Do NOT include: testing code, if __name__, get_inputs, get_init_inputs ‘‘‘python # <optimized Triton code> ‘‘‘ E.4.3. REPAIRMODE When kernel execution fails, the code agent receives kernel code and debug analysis as input. The prompt emphasizes strict adherence to Triton syntax rules and output format requ...
-
[28]
Imports: torch, torch.nn, triton, triton.language as tl (and math if needed)
-
[29]
@triton.jit decorated kernel function(s)
-
[30]
Wrapper function(s) for grid calculation and kernel launch
-
[31]
Debug Agent Prompt The debug agent receives error logs, PyTorch reference implementations, and broken kernel code as input
class ModelNew(nn.Module) -- REQUIRED Do NOT include: testing code, if __name__, get_inputs, get_init_inputs 28 EGG: An Expert-Guided Agent Framework for Kernel Generation ‘‘‘python # <corrected code> ‘‘‘ E.5. Debug Agent Prompt The debug agent receives error logs, PyTorch reference implementations, and broken kernel code as input. The prompt guides the a...
-
[32]
Output is wrong
**Focus on root cause **, not symptoms - Bad: "Output is wrong" - Good: "BLOCK_K loop missing, only processes first 32 elements of K dimension"
-
[33]
Memory access issue
**Be specific about WHAT and WHERE ** - Bad: "Memory access issue" - Good: "Line 45: tl.atomic_add(c_block_ptr, acc) - atomic_add requires scalar pointer, not block_ptr"
-
[34]
critical_issue
**Prioritize by impact ** - Correctness bugs > Performance issues > Style problems - Algorithm errors > Implementation details ## Output Format ‘‘‘json { "critical_issue": "<Concise description of THE root cause, max 30 words>", "modification plan": "<What needs to change (not how), max 30 words>" } ‘‘‘ Return JSON only. F. Nsight Compute Profiling Metric...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.