AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization
Pith reviewed 2026-05-17 20:00 UTC · model grok-4.3
The pith
AccelOpt shows an LLM agent can improve kernel optimization for AI accelerators over time by learning from slow-fast kernel pairs without expert hardware input.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from 49% to 61% on Trainium 1 and from 45% to 59% on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being 26× cheaper.
What carries the argument
An optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs to inform the LLM agent's iterative generation of improved kernels.
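To make the idea of a curated memory concrete, here is a minimal sketch of one way such a store could be organized. It is not the AccelOpt implementation: the names `KernelPair`, `OptimizationMemory`, `add`, and `as_prompt_context`, the capacity-based eviction rule, and the example insight strings are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class KernelPair:
    """One slow-fast pair plus a short insight (all fields are hypothetical)."""
    slow_code: str
    fast_code: str
    speedup: float  # slow latency / fast latency; > 1.0 means the new kernel is faster
    insight: str


@dataclass
class OptimizationMemory:
    """Bounded store that keeps the pairs with the largest speedups (assumed policy)."""
    capacity: int = 8
    pairs: list = field(default_factory=list)

    def add(self, pair: KernelPair) -> bool:
        # Only genuine improvements are worth remembering.
        if pair.speedup <= 1.0:
            return False
        self.pairs.append(pair)
        # Curate: keep the highest-speedup entries and drop the rest.
        self.pairs.sort(key=lambda p: p.speedup, reverse=True)
        del self.pairs[self.capacity:]
        return pair in self.pairs

    def as_prompt_context(self, k: int = 3) -> str:
        # Render the top-k insights as text that could be prepended to a prompt.
        return "\n".join(
            f"- ({p.speedup:.2f}x) {p.insight}" for p in self.pairs[:k]
        )


# Illustrative entries only; the speedups are made up.
memory = OptimizationMemory(capacity=2)
memory.add(KernelPair("slow_a", "fast_a", 1.30, "fuse the inner loop into the outer loop"))
memory.add(KernelPair("slow_b", "fast_b", 0.90, "regression, rejected"))
memory.add(KernelPair("slow_c", "fast_c", 1.80, "double the last-dimension tile size"))
memory.add(KernelPair("slow_d", "fast_d", 1.10, "hoist a redundant load"))
print(memory.as_prompt_context())
```

A bounded, ranked store is only one possible curation policy. The point of the sketch is that what gets written back into the prompt is a small, filtered set of past improvements, not the full generation history.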
If this is right
- The agent's optimization performance increases as it encounters more kernels and the memory accumulates additional slow-fast pairs.
- Open-source LLMs can deliver kernel performance on par with proprietary frontier models at substantially lower cost.
- Benchmarks built from actual LLM workloads offer a realistic measure of progress on accelerator kernel tuning.
- Kernel optimization for new accelerators can proceed without manual injection of hardware-specific rules.
Where Pith is reading between the lines
- The memory-based learning loop could extend to other code generation domains where past failures and successes provide reusable patterns.
- If memories can be transferred or merged across accelerator families, the approach might accelerate software readiness for successive hardware generations.
- Shared optimization memories across teams could compound improvements and lower the barrier to adopting new AI chips.
Load-bearing premise
That the curated optimization memory of slow-fast kernel pairs supplies sufficient transferable insight for the LLM to generate better kernels on new tasks without any hardware-specific expert knowledge being injected.
What would settle it
Applying AccelOpt to a fresh collection of unseen Trainium kernels from different workloads and checking whether the average percentage of peak throughput rises after multiple rounds of self-improvement.
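The check described above can be sketched as a small script. The function names `mean_peak_by_round` and `improves`, the `min_gain` threshold, and the numbers are all hypothetical placeholders, not measurements from the paper.

```python
from statistics import mean


def mean_peak_by_round(rounds):
    """rounds: list of rounds, each a list of per-kernel % of peak throughput."""
    return [mean(r) for r in rounds]


def improves(rounds, min_gain=0.0):
    """True if the final round's mean beats the first round's by more than min_gain points."""
    means = mean_peak_by_round(rounds)
    return means[-1] - means[0] > min_gain


# Placeholder values for three held-out kernels over three rounds; not real data.
held_out = [
    [40.0, 50.0, 60.0],
    [44.0, 53.0, 62.0],
    [47.0, 55.0, 66.0],
]
print(mean_peak_by_round(held_out))
print(improves(held_out, min_gain=1.0))
```

Setting `min_gain` above zero is a simple guard against counting run-to-run noise as improvement. A real evaluation would pair this with repeated runs per round.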
Original abstract
We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI acclerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from $49\%$ to $61\%$ on Trainium 1 and from $45\%$ to $59\%$ on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being $26\times$ cheaper. The code is open-sourced at https://github.com/zhang677/AccelOpt.
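The headline numbers are stated as percentages of peak throughput, so it helps to separate the gain in percentage points from the relative gain. The short calculation below uses only the figures quoted in the abstract; the helper name `gains` is an illustrative choice.

```python
def gains(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


for chip, before, after in [("Trainium 1", 49, 61), ("Trainium 2", 45, 59)]:
    absolute, relative = gains(before, after)
    print(f"{chip}: +{absolute} points, {relative:.1f}% relative")
# Trainium 1: +12 points, 24.5% relative
# Trainium 2: +14 points, 31.1% relative
```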
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AccelOpt, a self-improving LLM agentic system for optimizing kernels on emerging AI accelerators such as AWS Trainium. It uses an optimization memory that curates slow-fast kernel pairs from prior iterations to guide future generations, claims to remove the need for expert hardware-specific knowledge, introduces the NKIBench benchmark derived from real-world LLM workloads, and reports average peak-throughput improvements from 49% to 61% on Trainium 1 and 45% to 59% on Trainium 2. It further claims cost-effectiveness, matching Claude Sonnet 4 results at 26× lower cost with open-source models, and releases the code at https://github.com/zhang677/AccelOpt.
Significance. If the central attribution to the memory-based self-improvement loop holds after controlled experiments, the work would be a meaningful step toward automated, generalizable kernel optimization for new accelerators. The open-sourced code and the new NKIBench benchmark constitute concrete, reusable contributions that could accelerate follow-on research in LLM-driven systems optimization.
major comments (3)
- [Abstract] Abstract and Evaluation section: the headline throughput gains (49%→61% on Trainium 1, 45%→59% on Trainium 2) are reported without any mention of the number of independent runs, statistical significance testing, or controls for stochasticity in LLM sampling and prompt variation; these omissions make it impossible to assess whether the observed deltas are reliable or reproducible.
- [Evaluation] Evaluation section: the claim that performance improves specifically because of the optimization memory (slow-fast kernel pairs) is not supported by an ablation that compares the full system against an otherwise identical agent given the same number of generation attempts but without memory curation and reuse; without this isolation the causal link between the memory mechanism and the reported gains remains untested.
- [Abstract] Abstract: the assertion that AccelOpt 'eliminates the need for expert-provided hardware-specific optimization knowledge' rests on the unverified assumption that the curated memory supplies transferable, hardware-agnostic insight; no experiment demonstrates that the memory enables better kernels on new tasks than a baseline agent lacking any hardware-specific prompting or prior expert knowledge.
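As an illustration of what the first major comment asks for, the sketch below reports a mean and standard deviation over independent runs and applies an exact paired sign-flip permutation test. The choice of test, the function names, and the five-run data are assumptions for illustration; the paper does not say which procedure it would use.

```python
from itertools import product
from statistics import mean, stdev


def summarize(runs):
    """Mean and sample standard deviation of a metric across independent runs."""
    return mean(runs), stdev(runs)


def sign_flip_p_value(before, after):
    """Exact one-sided paired permutation test.

    Returns the fraction of sign assignments whose mean difference is at
    least as large as the observed mean difference.
    """
    diffs = [a - b for a, b in zip(after, before)]
    observed = mean(diffs)
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = [s * d for s, d in zip(signs, diffs)]
        if mean(flipped) >= observed - 1e-12:
            hits += 1
    return hits / total


# Placeholder % of peak throughput for five paired runs; not measurements.
before = [48.0, 50.0, 49.0, 47.0, 51.0]
after = [60.0, 63.0, 60.0, 58.0, 64.0]
print(summarize(before), summarize(after))
print(sign_flip_p_value(before, after))
```

With n paired runs the smallest attainable p-value is 1/2^n, so five runs cannot go below 1/32. That is one concrete reason the number of independent runs needs to be reported alongside the deltas.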
minor comments (2)
- [Abstract] Abstract contains a typo: 'acclerators' should read 'accelerators'.
- The description of the baseline methods used for the cost-effectiveness comparison (e.g., exact prompting strategy and iteration budget for the Claude Sonnet 4 reference) is insufficiently detailed to allow replication.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the corresponding revisions planned for the manuscript.
- Referee: [Abstract] Abstract and Evaluation section: the headline throughput gains (49%→61% on Trainium 1, 45%→59% on Trainium 2) are reported without any mention of the number of independent runs, statistical significance testing, or controls for stochasticity in LLM sampling and prompt variation; these omissions make it impossible to assess whether the observed deltas are reliable or reproducible.
  Authors: We agree that details on experimental reproducibility are necessary to evaluate the reliability of the reported gains. The original manuscript does not include this information. In the revised version, we will report the number of independent runs performed, include measures of variability such as standard deviations, and present results from statistical significance tests to address stochasticity arising from LLM sampling and prompt variations.
  revision: yes
- Referee: [Evaluation] Evaluation section: the claim that performance improves specifically because of the optimization memory (slow-fast kernel pairs) is not supported by an ablation that compares the full system against an otherwise identical agent given the same number of generation attempts but without memory curation and reuse; without this isolation the causal link between the memory mechanism and the reported gains remains untested.
  Authors: We concur that an ablation isolating the memory mechanism is required to substantiate the causal contribution. We will add this experiment to the Evaluation section in the revised manuscript, comparing the complete AccelOpt system to a control agent that executes the same number of generation attempts without memory curation or reuse.
  revision: yes
- Referee: [Abstract] Abstract: the assertion that AccelOpt 'eliminates the need for expert-provided hardware-specific optimization knowledge' rests on the unverified assumption that the curated memory supplies transferable, hardware-agnostic insight; no experiment demonstrates that the memory enables better kernels on new tasks than a baseline agent lacking any hardware-specific prompting or prior expert knowledge.
  Authors: The claim is grounded in the system's design, which begins without expert input and constructs its memory through iterative self-improvement. However, we recognize that a direct comparison to a baseline lacking hardware-specific prompting has not been performed. In the revision we will either include such an experiment or adjust the abstract wording to state that the approach substantially reduces reliance on expert knowledge rather than claiming complete elimination.
  revision: partial
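The budget-matched ablation discussed in the second response can be sketched as a small harness. Everything here is hypothetical: `run_arm`, `stub_propose`, and `ablation` are illustrative names, and the stub stands in for an LLM call with a memory benefit deliberately built in so the harness has something to measure. The only point is the experimental shape: both arms get the same number of attempts and the same seeds.

```python
import random


def run_arm(propose, budget, use_memory, seed):
    """Run one arm for exactly `budget` generation attempts; return the best speedup."""
    rng = random.Random(seed)
    memory = []  # speedups of remembered improvements (stand-in for slow-fast pairs)
    best = 1.0
    for _ in range(budget):
        speedup = propose(rng, memory if use_memory else [])
        if speedup > best:
            best = speedup
            if use_memory:
                memory.append(speedup)
    return best


def stub_propose(rng, memory):
    """Toy stand-in for an LLM call; the memory bonus is an assumption, not a finding."""
    return rng.uniform(0.8, 1.2) + 0.02 * len(memory)


def ablation(budget=20, seeds=range(5)):
    """Compare both arms under the same attempt budget and the same seeds."""
    rows = []
    for seed in seeds:
        with_memory = run_arm(stub_propose, budget, True, seed)
        without_memory = run_arm(stub_propose, budget, False, seed)
        rows.append((seed, with_memory, without_memory))
    return rows


for seed, with_memory, without_memory in ablation():
    print(f"seed={seed} with_memory={with_memory:.3f} without_memory={without_memory:.3f}")
```

Because the stub's advantage is hard-coded, the printed numbers say nothing about AccelOpt itself. In a real ablation `stub_propose` would be replaced by the actual generation step, and the per-seed pairs would feed a paired significance test.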
Circularity Check
No significant circularity; empirical gains measured on external hardware metrics
The paper describes an empirical agentic system that iterates LLM generations and curates slow-fast kernel pairs in memory to improve kernel performance on Trainium accelerators. The headline results (49% to 61% and 45% to 59% of peak throughput on NKIBench) are obtained by direct measurement against external hardware throughput, not by any internal fitted parameter, self-referential definition, or mathematical derivation that reduces to its own inputs. No equations, uniqueness theorems, or ansatzes are presented that would trigger self-definitional or fitted-input patterns. The self-improving loop is an engineering mechanism whose effectiveness is evaluated against independent benchmarks rather than being presupposed by construction. This is a standard empirical systems paper whose central claims remain falsifiable outside any internal loop.
Axiom & Free-Parameter Ledger
invented entities (1)
- optimization memory: no independent evidence