AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization

Allen Nie; Anjiang Wei; Genghan Zhang; Kunle Olukotun; Nandita Vijaykumar; Shaowei Zhu; Yida Wang; Zhen Jia; Zhenyu Song

arxiv: 2511.15915 · v2 · submitted 2025-11-19 · 💻 cs.LG · cs.CL


Pith reviewed 2026-05-17 20:00 UTC · model grok-4.3


keywords LLM agentic system · kernel optimization · AI accelerators · self-improving · optimization memory · Trainium · NKIBench · autonomous tuning


The pith

AccelOpt shows an LLM agent can improve kernel optimization for AI accelerators over time by learning from slow-fast kernel pairs without expert hardware input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AccelOpt, a self-improving LLM agentic system that autonomously optimizes kernels for emerging AI accelerators. It maintains an optimization memory of experiences drawn from slow-to-fast kernel transformations and uses it to guide iterative code generation on new problems. The system is evaluated on NKIBench, a benchmark of real AWS Trainium kernels extracted from LLM workloads, where the average percentage of peak throughput rises from 49% to 61% on Trainium 1 and from 45% to 59% on Trainium 2. This matters because kernel tuning has traditionally required scarce expert knowledge for each new hardware target. The authors further report that open-source models match the kernel improvements of Claude Sonnet 4 at 26 times lower cost.
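The headline numbers can be read either as percentage-point gains or as relative gains, and the two readings differ noticeably. The short check below uses only the four figures quoted above; the helper name `gains` is illustrative. Note that a ratio of averages is not the same as an average per-kernel speedup, so the relative figure is only a rough guide.

```python
# Average percentage of peak throughput before and after optimization,
# as quoted in the summary above.
reported = {
    "Trainium 1": (49.0, 61.0),
    "Trainium 2": (45.0, 59.0),
}

def gains(before: float, after: float) -> tuple:
    """Return (percentage-point gain, relative gain as a fraction)."""
    return after - before, after / before - 1.0

for platform, (before, after) in reported.items():
    points, relative = gains(before, after)
    print(f"{platform}: +{points:.0f} points, {relative:.1%} relative")
```

On these inputs the gains are 12 and 14 percentage points, or roughly 24% and 31% in relative terms.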

Core claim

AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from 49% to 61% on Trainium 1 and from 45% to 59% on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being 26× cheaper.

What carries the argument

An optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs to inform the LLM agent's iterative generation of improved kernels.
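A minimal sketch of such a memory, under stated assumptions. The figure captions on this page mention two parameters, ExpN and TopK: new entries go to the tail and the oldest are discarded once ExpN is reached. The caption text describing TopK is truncated, so the reading used here (admit only the K largest speedups per update) is an assumption, as are the class and field names.

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class Experience:
    summary: str    # text distilled from one slow-fast kernel pair
    speedup: float  # latency of the slow kernel divided by that of the fast one

class OptimizationMemory:
    """Bounded FIFO of experiences (hypothetical interface)."""

    def __init__(self, exp_n: int, top_k: int) -> None:
        self.top_k = top_k
        # deque(maxlen=...) drops the oldest entry when a new one is appended.
        self._items = deque(maxlen=exp_n)

    def update(self, new_experiences) -> None:
        # ASSUMPTION: TopK keeps only the K largest speedups from this update.
        ranked = sorted(new_experiences, key=lambda e: e.speedup, reverse=True)
        self._items.extend(ranked[: self.top_k])

    def as_prompt_context(self) -> list:
        """Summaries in oldest-to-newest order, ready to paste into a prompt."""
        return [e.summary for e in self._items]

memory = OptimizationMemory(exp_n=3, top_k=2)
memory.update([Experience("a", 1.1), Experience("b", 1.5), Experience("c", 1.3)])
memory.update([Experience("d", 2.0), Experience("e", 1.2)])
print(memory.as_prompt_context())
```

A larger ExpN keeps more history at the price of longer prompts, which matches the cost trade-off the caption text describes.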

If this is right

The agent's optimization performance increases as it encounters more kernels and the memory accumulates additional slow-fast pairs.
Open-source LLMs can deliver kernel performance on par with proprietary frontier models at substantially lower cost.
Benchmarks built from actual LLM workloads offer a realistic measure of progress on accelerator kernel tuning.
Kernel optimization for new accelerators can proceed without manual injection of hardware-specific rules.
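One way to track the first implication is the metric named in the Figure 13 caption: the geometric mean, across tasks, of the best speedup achieved up to a given iteration. The sketch below computes that curve from per-iteration speedups. The input data is made up, and whether the paper counts the initial kernel as iteration 0 is not stated here, so treat the indexing as an assumption.

```python
import math
from itertools import accumulate

def best_so_far(speedups):
    """Running maximum: the best speedup seen up to each iteration."""
    return list(accumulate(speedups, max))

def geomean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def geomean_best_curve(per_task):
    """per_task maps a task name to its per-iteration speedups (equal lengths)."""
    curves = [best_so_far(s) for s in per_task.values()]
    # zip(*curves) groups the values of all tasks at the same iteration.
    return [geomean(at_iteration) for at_iteration in zip(*curves)]

# Made-up speedups over the initial kernel for two hypothetical tasks.
curve = geomean_best_curve({
    "task_a": [1.0, 1.2, 1.1],
    "task_b": [1.0, 0.9, 1.5],
})
print([round(v, 3) for v in curve])
```

The geometric mean is the usual choice for averaging ratios such as speedups, because a 2x gain on one task and a 0.5x loss on another cancel out exactly.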

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The memory-based learning loop could extend to other code generation domains where past failures and successes provide reusable patterns.
If memories can be transferred or merged across accelerator families, the approach might accelerate software readiness for successive hardware generations.
Shared optimization memories across teams could compound improvements and lower the barrier to adopting new AI chips.

Load-bearing premise

That the curated optimization memory of slow-fast kernel pairs supplies sufficient transferable insight for the LLM to generate better kernels on new tasks without any hardware-specific expert knowledge being injected.

What would settle it

Applying AccelOpt to a fresh collection of unseen Trainium kernels from different workloads and checking whether the average percentage of peak throughput rises after multiple rounds of self-improvement.
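The check described above comes down to computing a percentage of peak per kernel and averaging it before and after optimization. The sketch below shows one plausible form of that calculation. The function names, the use of FLOPs as the unit of work, and all numbers are assumptions for illustration; the paper's own definition of peak is not reproduced here.

```python
def percent_of_peak(work_flops: float, latency_s: float, peak_flops_per_s: float) -> float:
    """Achieved throughput as a percentage of a stated hardware peak."""
    achieved = work_flops / latency_s
    return 100.0 * achieved / peak_flops_per_s

def average_percent_of_peak(kernels) -> float:
    """kernels is a list of (work_flops, latency_s, peak_flops_per_s) tuples."""
    return sum(percent_of_peak(*k) for k in kernels) / len(kernels)

# Made-up kernels: same work and peak, latency measured before and after tuning.
before = [(2e12, 0.050, 8e13), (1e12, 0.050, 8e13)]
after = [(2e12, 0.040, 8e13), (1e12, 0.025, 8e13)]

print(f"before: {average_percent_of_peak(before):.1f}% of peak")
print(f"after:  {average_percent_of_peak(after):.1f}% of peak")
```

Running the same calculation on kernels that never contributed to the memory is what would separate genuine transfer from gains specific to the benchmark.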

Figures

Figures reproduced from arXiv: 2511.15915 by Allen Nie, Anjiang Wei, Genghan Zhang, Kunle Olukotun, Nandita Vijaykumar, Shaowei Zhu, Yida Wang, Zhen Jia, Zhenyu Song.

**Figure 1.** At each iteration of AccelOpt, the agentic workflow shown on the right optimizes the candidate kernels with the latest optimization memory, and generates new candidate kernels, updating optimization memory with newly collected experiences. Section 2 explains the overall workflow and each component in detail.

**Figure 2.** Prompt template for each agent in the agentic workflow.

… to the tail, while the oldest entries in the memory will be discarded once ExpN is reached. Intuitively, increasing ExpN leads to higher inference costs due to more input tokens to the planner, yet the memory can retain more historical experiences that can potentially be beneficial. The TopK parameter controls how eager the memory system can be whe…

**Figure 3.** A snapshot of AccelOpt's execution trace. In the experience item, the pseudocode of the slow-fast pairs looks like the above candidate and optimized kernels, where affine_range is an NKI construct for parallel loops without carried dependency. The experience item will be stored in the optimization memory, and the optimized kernel will become a candidate for the next iteration.

**Figure 4.** NKIBench architecture. Kernels are grouped by the configuration of ML operators. The meshes represent cores of one Trainium chip; trn1.32xlarge and trn2.48xlarge are Amazon EC2 instances for Trainium 1 and 2, respectively.

… Moreover, existing accelerator kernel benchmarks typically lack information about how well a kernel is optimized relative to the hardware's theoretical peak performance. To addre…

**Figure 5.** Per-task kernel improvement achieved using Claude Sonnet 4 and AccelOpt on Trainium 1.

**Figure 6.** One core of a Trainium chip with its device memory (HBM), shown in an abstracted form applicable to multiple chip generations. For additional architecture details, refer to the NKI documentation (AWS, 2025a).

**Figure 8.** Non-local optimization discovered by AccelOpt for the fused BatchMatmul+Softmax operator. All variables are tiles of tensors, and code has been simplified to highlight the changed dimensions of allocated tensors in the loop body.

**Figure 9.** Saturating speedup with effective exploration. Same as

**Figure 13.** Geometric mean of best speedup achieved up to a certain iteration across all tasks obtained through repeated sampling, beam search, and beam search + optimization memory. As defined in Algorithm 1, B is the number of candidates and N is the number of plans for each candidate. We consider the inference cost to properly compare with repeated sampling, which does not have a notion of iterations.

**Figure 12.** The orange bars show the distribution of per-iteration speedup over the candidate kernels, while the blue bars show the speedup over the initial kernels. This plot collects the distribution of speedups from all tasks.

… 480B come from the same model family, and the larger one gets better performance. Qwen3-Coder-30B and gpt-oss-120b have the same cost per token, while gpt-oss-120b is a reasoning model. The …

**Figure 15.** Cost-benefit trade-off across different TopK and ExpN.

6 CONCLUSION

This paper presents AccelOpt, the first self-improving LLM agentic system for kernel optimization on emerging AI accelerators such as AWS Trainium that combines search with memory accumulation. We demonstrate that combining inference-time scaling with optimization memory enables LLM agents to autonomously optimize real-world Trainium ker…

**Figure 16.** An example NKI program snippet adopted from an official NKI example.

We use the single-core peak achievable hardware statistics from the Neuron Architecture documentation (AWS, 2025a;b) in …

**Figure 17.** Kernel usage in executor's user prompt template

**Figure 18.** NKI API basics

# Profile terminology
hbm_read_bytes: Total bytes of data read from HBM using the DMA engines.
hbm_write_bytes: Total bytes of data written to HBM using the DMA engines.
psum_read_bytes: Total bytes of data that are read from PSUM by compute engine instructions.
psum_write_bytes: Total bytes of data that are written to PSUM by compute engine instructions.
sbuf_read_bytes: Total size of all …

**Figure 19.** Profile terminology

**Figure 20.** Planner prompt user template

**Figure 21.** NKI programming guide

**Figure 22.** NKI programming guide (continued)

**Figure 23.** Summarizer base prompt and user template

**Figure 24.** Example of past experiences after the iteration in

**Figure 25.** Example of past experiences after the iteration in

**Figure 26.** Example of past experiences after the iteration in

**Figure 27.** Base prompt for sampling Claude Sonnet 4

**Figure 28.** Base prompt for sampling Claude Sonnet 4 (continued)

**Figure 29.** Base prompt for sampling Claude Sonnet 4. The problem code, kernel code, and profile will be replaced with the actual values
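The Figure 3 caption describes an NKI loop construct for parallel loops without carried dependency. The plain-Python illustration below (not NKI code, and not taken from the paper) shows what that property means: when each iteration touches only its own index, the iterations can run in any order and give the same result, while a loop that carries a running value between iterations cannot.

```python
def scale(xs, factor, order):
    """No loop-carried dependency: iteration i reads and writes only index i."""
    out = [0.0] * len(xs)
    for i in order:
        out[i] = xs[i] * factor
    return out

def prefix_sum(xs, order):
    """Loop-carried dependency: each iteration reads the total left by the previous one."""
    out = [0.0] * len(xs)
    total = 0.0
    for i in order:
        total += xs[i]
        out[i] = total
    return out

xs = [1.0, 2.0, 3.0, 4.0]
forward = list(range(len(xs)))
backward = list(reversed(forward))

# Reordering is harmless for the independent loop ...
print(scale(xs, 2.0, forward) == scale(xs, 2.0, backward))
# ... but changes the result when a value is carried across iterations.
print(prefix_sum(xs, forward) == prefix_sum(xs, backward))
```

Order independence is the property that lets a compiler or hardware scheduler overlap or parallelize iterations safely.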

Original abstract

We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI acclerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from $49\%$ to $61\%$ on Trainium 1 and from $45\%$ to $59\%$ on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being $26\times$ cheaper. The code is open-sourced at https://github.com/zhang677/AccelOpt.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AccelOpt shows an LLM agent improving kernel throughput on Trainium via a memory of slow-fast pairs, but the memory's contribution over plain iteration is not isolated.


The core of this paper is a loop in which an LLM generates kernels for AWS Trainium, measures them on hardware, and stores slow-fast pairs in memory to inform the next round. The reported outcome is that the average percentage of peak throughput rises from 49% to 61% on Trainium 1 and from 45% to 59% on Trainium 2 across the new NKIBench kernels, while open-source models match Claude Sonnet 4 at 26 times lower cost. The code is released, which helps anyone who wants to inspect or rerun the experiments.
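The loop can be sketched roughly as follows. This is a simplified illustration and not the released implementation. The planner, executor, and summarizer roles come from the figure captions, and `beam` and `n_plans` stand in for the B and N of the Figure 13 caption. Every signature, the rule of storing only improving pairs, and the toy agents are assumptions made so the sketch runs without an LLM or an accelerator.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class Kernel:
    source: str
    latency_ms: float  # measured on hardware; lower is better

# Hypothetical agent interfaces. Real agents would call an LLM and a profiler.
Planner = Callable[[Kernel, list, int], list]         # (candidate, memory, n_plans) -> plans
Executor = Callable[[Kernel, str], Optional[Kernel]]  # None if the rewrite fails
Summarizer = Callable[[Kernel, Kernel], str]          # (slow, fast) -> experience text

def optimize(initial, planner, executor, summarizer, iterations, beam, n_plans):
    candidates = [initial]
    memory = []
    for _ in range(iterations):
        produced = []
        for cand in candidates:
            for plan in planner(cand, memory, n_plans):
                new = executor(cand, plan)
                if new is None:
                    continue  # failed to compile or produced wrong output
                produced.append(new)
                if new.latency_ms < cand.latency_ms:
                    memory.append(summarizer(cand, new))  # keep the slow-fast pair
        # The fastest kernels seen so far become the next candidates.
        candidates = sorted(candidates + produced, key=lambda k: k.latency_ms)[:beam]
    return candidates[0], memory

# Toy stand-ins for the three agents.
def toy_planner(cand, memory, n_plans):
    return [f"plan-{i}" for i in range(n_plans)]

def toy_executor(cand, plan):
    if plan.endswith("0"):
        return None  # pretend this plan failed the correctness check
    return Kernel(cand.source + "+" + plan, cand.latency_ms * 0.9)

def toy_summarizer(slow, fast):
    return f"{slow.latency_ms:.2f} ms -> {fast.latency_ms:.2f} ms"

best, memory = optimize(Kernel("baseline", 10.0), toy_planner, toy_executor,
                        toy_summarizer, iterations=3, beam=2, n_plans=3)
print(f"best latency: {best.latency_ms:.2f} ms, experiences stored: {len(memory)}")
```

Passing an empty or frozen memory to the same loop gives the natural control for the question raised in the note above, namely how much the memory adds over plain iteration.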

Referee Report

3 major / 2 minor

Summary. The manuscript introduces AccelOpt, a self-improving LLM agentic system for optimizing kernels on emerging AI accelerators such as AWS Trainium. It uses an optimization memory that curates slow-fast kernel pairs from prior iterations to guide future generations, claims to remove the need for expert hardware-specific knowledge, introduces the NKIBench benchmark derived from real-world LLM workloads, and reports average peak-throughput improvements from 49% to 61% on Trainium 1 and 45% to 59% on Trainium 2. It further claims cost-effectiveness, matching Claude Sonnet 4 results at 26× lower cost with open-source models, and releases the code at https://github.com/zhang677/AccelOpt.

Significance. If the central attribution to the memory-based self-improvement loop holds after controlled experiments, the work would be a meaningful step toward automated, generalizable kernel optimization for new accelerators. The open-sourced code and the new NKIBench benchmark constitute concrete, reusable contributions that could accelerate follow-on research in LLM-driven systems optimization.

major comments (3)

[Abstract] Abstract and Evaluation section: the headline throughput gains (49%→61% on Trainium 1, 45%→59% on Trainium 2) are reported without any mention of the number of independent runs, statistical significance testing, or controls for stochasticity in LLM sampling and prompt variation; these omissions make it impossible to assess whether the observed deltas are reliable or reproducible.
[Evaluation] Evaluation section: the claim that performance improves specifically because of the optimization memory (slow-fast kernel pairs) is not supported by an ablation that compares the full system against an otherwise identical agent given the same number of generation attempts but without memory curation and reuse; without this isolation the causal link between the memory mechanism and the reported gains remains untested.
[Abstract] Abstract: the assertion that AccelOpt 'eliminates the need for expert-provided hardware-specific optimization knowledge' rests on the unverified assumption that the curated memory supplies transferable, hardware-agnostic insight; no experiment demonstrates that the memory enables better kernels on new tasks than a baseline agent lacking any hardware-specific prompting or prior expert knowledge.
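To make the first comment concrete, the sketch below shows one simple way to report repeated runs: the mean, the sample standard deviation, and a percentile-bootstrap confidence interval for the mean. The run values are made up, and the function name and defaults are illustrative rather than anything the paper specifies.

```python
import random
import statistics

def summarize_runs(values, n_resamples=2000, alpha=0.05, seed=0):
    """Mean, sample standard deviation, and a percentile-bootstrap CI for the mean."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(values), statistics.stdev(values), (lo, hi)

# Made-up percentages of peak throughput from five hypothetical independent runs.
runs = [60.2, 61.5, 59.8, 62.1, 61.0]
mean, std, (ci_lo, ci_hi) = summarize_runs(runs)
print(f"mean={mean:.1f}, std={std:.2f}, 95% CI=({ci_lo:.1f}, {ci_hi:.1f})")
```

With only a handful of runs the bootstrap interval is itself rough, so reporting the raw per-run values alongside it remains useful.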

minor comments (2)

[Abstract] Abstract contains a typo: 'acclerators' should read 'accelerators'.
The description of the baseline methods used for the cost-effectiveness comparison (e.g., exact prompting strategy and iteration budget for the Claude Sonnet 4 reference) is insufficiently detailed to allow replication.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the corresponding revisions planned for the manuscript.


Referee: [Abstract] Abstract and Evaluation section: the headline throughput gains (49%→61% on Trainium 1, 45%→59% on Trainium 2) are reported without any mention of the number of independent runs, statistical significance testing, or controls for stochasticity in LLM sampling and prompt variation; these omissions make it impossible to assess whether the observed deltas are reliable or reproducible.

Authors: We agree that details on experimental reproducibility are necessary to evaluate the reliability of the reported gains. The original manuscript does not include this information. In the revised version, we will report the number of independent runs performed, include measures of variability such as standard deviations, and present results from statistical significance tests to address stochasticity arising from LLM sampling and prompt variations.
revision: yes
Referee: [Evaluation] Evaluation section: the claim that performance improves specifically because of the optimization memory (slow-fast kernel pairs) is not supported by an ablation that compares the full system against an otherwise identical agent given the same number of generation attempts but without memory curation and reuse; without this isolation the causal link between the memory mechanism and the reported gains remains untested.

Authors: We concur that an ablation isolating the memory mechanism is required to substantiate the causal contribution. We will add this experiment to the Evaluation section in the revised manuscript, comparing the complete AccelOpt system to a control agent that executes the same number of generation attempts without memory curation or reuse.
revision: yes
Referee: [Abstract] Abstract: the assertion that AccelOpt 'eliminates the need for expert-provided hardware-specific optimization knowledge' rests on the unverified assumption that the curated memory supplies transferable, hardware-agnostic insight; no experiment demonstrates that the memory enables better kernels on new tasks than a baseline agent lacking any hardware-specific prompting or prior expert knowledge.

Authors: The claim is grounded in the system's design, which begins without expert input and constructs its memory through iterative self-improvement. However, we recognize that a direct comparison to a baseline lacking hardware-specific prompting has not been performed. In the revision we will either include such an experiment or adjust the abstract wording to state that the approach substantially reduces reliance on expert knowledge rather than claiming complete elimination.
revision: partial
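The budget-matched ablation promised in the second response could be analyzed with a paired test over tasks. The sketch below uses an exact two-sided sign-flip permutation test on per-task speedup differences. The data is made up, the function name is illustrative, and this particular test is a suggestion rather than something the authors commit to.

```python
from itertools import product
from statistics import fmean

def paired_permutation_pvalue(with_memory, without_memory):
    """Exact two-sided sign-flip test on the mean of paired differences."""
    diffs = [a - b for a, b in zip(with_memory, without_memory)]
    observed = abs(fmean(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis each difference is equally likely to have either sign.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = [s * d for s, d in zip(signs, diffs)]
        if abs(fmean(flipped)) >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
    return extreme / total

# Made-up best speedups per task at an equal number of generation attempts.
with_memory = [1.30, 1.45, 1.20, 1.60]
without_memory = [1.20, 1.25, 1.10, 1.35]
p_value = paired_permutation_pvalue(with_memory, without_memory)
print(f"p = {p_value:.3f}")
```

With four tasks the smallest attainable p-value is 2/16 = 0.125, so a conclusive result needs more tasks even when every difference favors the memory. The enumeration visits 2^n sign patterns, which is practical only for a few dozen tasks at most.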

Circularity Check

0 steps flagged

No significant circularity; empirical gains measured on external hardware metrics


The paper describes an empirical agentic system that iterates LLM generations and curates slow-fast kernel pairs in memory to improve kernel performance on Trainium accelerators. The headline results (49% to 61% and 45% to 59% of peak throughput on NKIBench) are obtained by direct measurement against external hardware throughput, not by any internal fitted parameter, self-referential definition, or mathematical derivation that reduces to its own inputs. No equations, uniqueness theorems, or ansatzes are presented that would trigger self-definitional or fitted-input patterns. The self-improving loop is an engineering mechanism whose effectiveness is evaluated against independent benchmarks rather than being presupposed by construction. This is a standard empirical systems paper whose central claims remain falsifiable outside any internal loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the unproven premise that an LLM can extract generalizable optimization rules from stored kernel pairs without additional hardware models or expert rules; no free parameters or invented physical entities are described.

invented entities (1)

optimization memory · no independent evidence
purpose: stores and curates experiences from previously encountered slow-fast kernel pairs to guide future generations
Introduced as the key mechanism enabling self-improvement; no independent evidence outside the system is provided in the abstract.


