arxiv: 2605.26720 · v1 · pith:N4VTAIPGnew · submitted 2026-05-26 · 💻 cs.AI

Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation

Pith reviewed 2026-06-29 17:54 UTC · model grok-4.3

classification 💻 cs.AI
keywords self-evolving LLM agents · CUDA kernel generation · feedback attribution · planning decisions · trajectory freezing · multi-feedback interactions · plan transfer
The pith

Explicit planning in LLM agents for CUDA kernel generation works only when feedback is aligned.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how self-evolving LLM agents decide on plans for generating CUDA kernels when given different kinds of feedback across generations. Standard tests mix early choices with later drift, so the authors built CUDAnalyst to freeze trajectories and inject feedback selectively. This shows planning adds value solely under aligned feedback, that planning skill comes from structured combinations of multiple feedback types, and that plans from stronger models transfer in part to weaker ones. The patterns stay stable across model backbones, workloads, and induction settings.

Core claim

CUDAnalyst is a unified analysis layer that performs controlled, generation-level attribution of planning decisions to feedback components by means of trajectory freezing and selective feedback injection. Using it, the work establishes that explicit planning is beneficial only when feedback is aligned, that effective planning emerges from structured multi-feedback interactions, and that high-level plans from stronger reasoning models can partially transfer to weaker ones. These relations hold across reference backbones, representative workloads, and reference induction regimes.

What carries the argument

CUDAnalyst, an analysis layer that freezes trajectories and injects selected feedback signals to enable stable generation-level evaluation and coalitional-style attribution of feedback effects.
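
To make the freezing idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes a recorded trajectory whose samples carry pre-computed feedback reports, and it re-evaluates one fixed generation while exposing only a chosen subset of feedback components to the planner. `FrozenSample`, `inject`, `evaluate_generation`, the component names, and the toy planner are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class FrozenSample:
    """One recorded program plus every feedback report captured for it."""
    generation: int
    program: str
    feedback: Dict[str, str]  # hypothetical component name -> report text


def inject(sample: FrozenSample, keep: FrozenSet[str]) -> str:
    """Build the planner input from the selected feedback components only."""
    parts = [
        f"[{name}] {sample.feedback[name]}"
        for name in sorted(keep)
        if name in sample.feedback
    ]
    return "\n".join(parts) if parts else "[no feedback]"


def evaluate_generation(
    samples: List[FrozenSample],
    generation: int,
    keep: FrozenSet[str],
    plan_and_run: Callable[[str, str], bool],
) -> float:
    """Success rate at one fixed generation under one feedback configuration.

    The samples themselves never change between configurations, so any
    difference in the returned rate comes from the injected feedback.
    """
    chosen = [s for s in samples if s.generation == generation]
    if not chosen:
        raise ValueError(f"no frozen samples for generation {generation}")
    successes = sum(plan_and_run(s.program, inject(s, keep)) for s in chosen)
    return successes / len(chosen)


def toy_plan_and_run(program: str, report: str) -> bool:
    # Stand-in for "the LLM plans, emits a kernel, and the kernel passes".
    return "[debugger]" in report


# Hypothetical frozen trajectory, for illustration only.
trajectory = [
    FrozenSample(0, "kernel_v0", {"debugger": "index out of bounds",
                                  "profiler": "low occupancy"}),
    FrozenSample(0, "kernel_v0b", {"profiler": "uncoalesced loads"}),
    FrozenSample(1, "kernel_v1", {"debugger": "ok", "profiler": "ok"}),
]

rate_full = evaluate_generation(
    trajectory, 0, frozenset({"debugger", "profiler"}), toy_plan_and_run)
rate_profiler_only = evaluate_generation(
    trajectory, 0, frozenset({"profiler"}), toy_plan_and_run)
```

Because both calls read the same frozen samples, the two rates differ only through the `keep` argument, which is the property the attribution relies on.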

If this is right

  • Explicit planning improves kernel generation outcomes solely when feedback signals remain aligned across iterations.
  • Planning effectiveness arises from structured interactions among heterogeneous feedback signals rather than from any single signal.
  • High-level plans produced by stronger reasoning models transfer partially to weaker models.
  • The identified feedback-to-plan structure remains consistent across different model backbones, workloads, and induction regimes.
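
The second bullet can be illustrated with a coalitional calculation. The sketch below computes standard Banzhaf values and a plain second-order interaction term over three feedback components. It is an illustration under stated assumptions: the component labels `d`, `a`, `p` and every success rate in `toy_rates` are hypothetical, and the paper's own synergy definition may include further correction terms that are not reproduced here.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Iterable, Iterator

Coalition = FrozenSet[str]


def subsets(items: Iterable[str]) -> Iterator[Coalition]:
    """Yield every subset of items, including the empty set."""
    pool = list(items)
    for size in range(len(pool) + 1):
        for combo in combinations(pool, size):
            yield frozenset(combo)


def banzhaf(players: Iterable[str],
            value: Callable[[Coalition], float]) -> Dict[str, float]:
    """Average marginal contribution of each player over all coalitions
    formed by the remaining players."""
    players = tuple(players)
    scores = {}
    for p in players:
        others = [q for q in players if q != p]
        margins = [value(s | {p}) - value(s) for s in subsets(others)]
        scores[p] = sum(margins) / len(margins)
    return scores


def pairwise_interaction(a: str, b: str,
                         value: Callable[[Coalition], float]) -> float:
    """Plain second-order difference: positive means the pair helps more
    together than the sum of the individual effects."""
    return (value(frozenset({a, b})) - value(frozenset({a}))
            - value(frozenset({b})) + value(frozenset()))


# Hypothetical success rates per feedback coalition at one generation.
toy_rates = {
    frozenset(): 0.2,
    frozenset("d"): 0.5,
    frozenset("a"): 0.3,
    frozenset("p"): 0.3,
    frozenset("da"): 0.6,
    frozenset("dp"): 0.7,
    frozenset("ap"): 0.3,
    frozenset("dap"): 0.8,
}

scores = banzhaf("dap", toy_rates.__getitem__)
synergy_dp = pairwise_interaction("d", "p", toy_rates.__getitem__)
```

With three components the full coalition table has only eight entries, so exhaustive enumeration is cheap and no sampling approximation is needed.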

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same attribution technique could be applied to self-evolving agents in other code-generation or optimization domains to test whether aligned feedback is universally required.
  • Agent designs might shift from adding more feedback to engineering explicit alignment mechanisms between signals.
  • Partial plan transfer suggests practical hierarchies in which a strong model generates plans once and weaker models execute them repeatedly.
  • The work implies that future evaluations of self-evolving agents should routinely separate planning attribution from trajectory drift.

Load-bearing premise

Trajectory freezing combined with selective feedback injection produces stable generation-level evaluations that do not themselves alter the natural planning dynamics or introduce new trajectory-dependent drift beyond what the paper controls for.

What would settle it

An experiment in which the reported benefits of aligned feedback vanish or reverse once trajectory freezing is removed while keeping all other factors fixed.
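
One way such an experiment could be scored is sketched below. It estimates the gain of aligned feedback over a no-feedback baseline and attaches a percentile bootstrap interval, so that the frozen and open settings can be compared by sign and overlap. The outcome lists are hypothetical placeholders, not results from the paper, and the function names are assumptions.

```python
import random
from typing import Sequence, Tuple


def gain(aligned: Sequence[int], baseline: Sequence[int]) -> float:
    """Difference in success rate between two lists of 0/1 outcomes."""
    return sum(aligned) / len(aligned) - sum(baseline) / len(baseline)


def bootstrap_gain_interval(
    aligned: Sequence[int],
    baseline: Sequence[int],
    resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the gain, resampling each arm
    independently with replacement."""
    rng = random.Random(seed)
    draws = []
    for _ in range(resamples):
        a = rng.choices(aligned, k=len(aligned))
        b = rng.choices(baseline, k=len(baseline))
        draws.append(gain(a, b))
    draws.sort()
    lower = draws[int((alpha / 2) * resamples)]
    upper = draws[int((1 - alpha / 2) * resamples) - 1]
    return lower, upper


# Hypothetical 0/1 execution outcomes (1 = kernel ran successfully).
frozen_aligned = [1] * 14 + [0] * 6
frozen_baseline = [1] * 8 + [0] * 12

frozen_gain = gain(frozen_aligned, frozen_baseline)
frozen_interval = bootstrap_gain_interval(frozen_aligned, frozen_baseline)
```

Running the same two functions on outcomes collected without freezing would show whether the gain keeps its sign and whether the two intervals overlap.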

Figures

Figures reproduced from arXiv: 2605.26720 by Jiaming Wu, Peng Qu, Yee Hin Chong, YouHui Zhang.

Figure 1
Figure 1: Comparison of end-to-end ablation and intervention on frozen trajectory. E2E suffers from trajectory drift, thus it is unable to present precise causal attribution.
Figure 2
Figure 2: Overview of CUDAnalyst. Structured feedback reports serve as the sole input to planning, enabling controlled intervention and attribution of feedback-to-plan decisions at fixed generations.
Figure 3
Figure 3: Per-generation execution success rate under different planning–feedback configurations. Explicit planning without feedback (P+NF) fails to improve execution outcomes, whereas feedback-grounded planning (P+F) yields stable gains across generations, particularly for weak-reasoning models.
Figure 4
Figure 4: Counterfactual controls for RQ0. DummyPlan (DP) mainly degrades weak models, while randomized feedback (P+RF) consistently harms all models, highlighting the importance of aligned planning signals.
Figure 5
Figure 5 illustrates how feedback influences planning across generations. In early generations (0-2), marginal contributions from individual components are sparse and volatile, indicating unstable guidance under early-generation pro… [Plot panels: DeepSeek-V3.2, Qwen3-Coder-30B, DeepSeek-R1-0528, Qwen3-235B-…; metrics: compiled, pass, fast.]
Figure 6
Figure 6a shows that summarized feedback (P+S) consistently improves overall success for weaker models (DeepSeek-V3.2, Qwen3-Coder-30B), while gains for stronger models (DeepSeek-R1-0528, Qwen3-235B-A22B) are smaller and less consistent. These findings indicate that summarization primarily benefits weaker models by reducing representational burden, whereas models with sufficient planning capacity derive limited a…
Figure 7
Figure 7: Effect of strong-to-weak plan distillation on code generation success. Injecting plans from strong reasoning models consistently improves weak models over their self-generated plans. Distillation within the same model family yields larger gains, suggesting improved compatibility between plan representations and downstream generation.
Figure 8
Figure 8: Pairwise tool synergies under different frozen backbone trajectories while using DeepSeek-V3.2 as the evaluator.
Figure 9
Figure 9: Generation-level execution success rates across diverse CUDA workloads for DeepSeek-V3.2.
Figure 10
Figure 10: Execution success rates for DeepSeek-V3.2 across diverse reference induction regimes. The synchronized trajectories reveal a consistent model affinity for summarized feedback (P+S), which facilitates rapid planning convergence toward high-level heuristics rather than stochastic exploration.
Figure 11
Figure 11: Generalization results on Numba N-body. Distributions of instantaneous speedup per generation. Red dashed lines indicate the 1.0× baseline.
Figure 12
Figure 12: Total success rate for each experimental run on the 3DCONV kernel after replaying frozen program samples with DeepSeek-V3.2. Lines show different k settings, and the shaded area highlights the gap between k = 5 and k = 7. Each point represents the pass rate over all samples, showing that increasing k beyond 5 provides only marginal gains.
Figure 13
Figure 13: Generation-level breakdown of RQ0 outcomes. Weak models (top row) mainly improve pass, while strong models (bottom row) improve both pass and fast.
Figure 15
Figure 15: Pairwise synergies for DeepSeek-V3.2. $$\sigma^{(g)}_{t_1 t_2} = v_g(\{t_1, t_2\}) - v_g(t_1) - v_g(t_2) + v_g(\emptyset) - \tfrac{1}{3}\,\sigma^{(g)}_{dap}, \quad (7)$$ and the three-way synergy among all tools is $\sigma^{(g)}_{dap} = v_g(dap) - \dots$
Figure 16
Figure 16: Pairwise synergies for Qwen3-Coder-30B.
Figure 17
Figure 17: Pairwise synergies for DeepSeek-R1-0528.
Figure 18
Figure 18: Pairwise synergies for Qwen3-235B-A22B.
Figure 19
Figure 19 compares the three strategies examined in Sec. 4.3 in terms of per-generation execution success.
Figure 21
Figure 21: Per-generation execution success for weak models with self-generated plans, weak models guided by strong models, and standalone strong models. Guidance improves weak models, but their overall performance generally remains below strong models. Occasional assistive amplification occurs (e.g., DeepSeek-V3.2 guided by DeepSeek-R1 or Qwen3-235B), primarily through reduced execution errors.
Figure 20
Figure 20: Generation-level breakdown of RQ2 outcomes. Summarized feedback improves both pass and fast success rates, with a particularly strong effect on fast.
Figure 23
Figure 23: Generation-level breakdown under cross-workload generalization, showing pass and fast trends across different configurations. Guided planning maintains more stable efficiency patterns across workloads without focusing on absolute performance gains.
Figure 25
Figure 25: The IntervenePipe execution model. (Top) A sample-centric, event-driven workflow where completion events trigger feedback construction, LLM prompting, and evaluation. (Bottom) A representative timeline illustrating fan-out parallel evaluation and out-of-order sample progression without global synchronization.
Figure 26
Figure 26: System throughput under different execution models, measured as programs per hour at fixed LLM concurrency (P = 16) for a total of 1500 programs (3 rounds × 100 samples × 5 repetitions). Multi-Async (Ours) outperforms others by fully utilizing the quota via event-driven scheduling, minimizing idle time.
Figure 27
Figure 27: Total inference volume B as a function of search depth D. In standard E2E ablation, depth couples with feedback space V, producing multiplicative growth. In contrast, IntervenePipe decouples attribution cost from search depth, yielding additive scaling in D.
Figure 28
Figure 28: Evolution of code similarity and kernel speedup for a ReLUAttention kernel from scratch.
Figure 29
Figure 29: Speedup relative to torch.compile on KernelBench Level 3 workloads. We compare OpenEvolve (Base), AI CUDA Engineer (Agent), CUDA-L1 (RL), and OpenEvolve with CuGEdit (Ours). The dashed line indicates parity with torch.compile.
Original abstract

Large language models (LLMs) have shown strong empirical gains as self-evolving agents for CUDA kernel generation, driven by feedback-conditioned planning across generations. However, how planning decisions attribute and combine heterogeneous feedback signals remains opaque. Standard end-to-end ablations fail to resolve this question, as iterative planning amplifies early perturbations and conflates feedback effects with trajectory-dependent drift. We introduce \texttt{CUDAnalyst}, a unified analysis layer for controlled, generation-level attribution of planning decisions to feedback components via trajectory freezing and selective feedback injection. \texttt{CUDAnalyst} enables stable generation-level evaluation and principled coalitional-style attribution of feedback effects and interactions. Our results show that explicit planning is beneficial only when feedback is aligned, that effective planning emerges from structured multi-feedback interactions, and that high-level plans from stronger reasoning models can partially transfer to weaker ones. These trends hold across reference backbones, representative workloads, and reference induction regimes, indicating that the identified feedback-to-plan structure is robust within the controlled axes studied.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces CUDAnalyst, a unified analysis layer for controlled generation-level attribution of planning decisions to heterogeneous feedback signals in self-evolving LLM agents for CUDA kernel generation. It employs trajectory freezing and selective feedback injection to enable stable evaluations and coalitional-style attribution of feedback effects and interactions. The reported results indicate that explicit planning is beneficial only when feedback is aligned, that effective planning emerges from structured multi-feedback interactions, and that high-level plans from stronger reasoning models can partially transfer to weaker ones; these trends are claimed to hold across reference backbones, representative workloads, and reference induction regimes.

Significance. If the controlled evaluations prove faithful, the work provides useful empirical insights into feedback-to-plan mechanisms in LLM agents for code generation, which could inform the design of more interpretable and effective self-evolving systems. The coalitional-style attribution approach represents a structured way to dissect multi-signal interactions that standard end-to-end ablations cannot resolve.

major comments (1)
  1. [CUDAnalyst framework description (Methods)] The central claims depend on trajectory freezing combined with selective feedback injection producing stable generation-level evaluations without altering natural planning dynamics or introducing new trajectory-dependent drift. The manuscript provides no quantitative validation (e.g., comparisons of planning statistics, decision distributions, or kernel performance between frozen and open trajectories under identical feedback regimes) to confirm this. This assumption is load-bearing for the attribution results and the reported trends on aligned feedback, multi-feedback emergence, and plan transfer.
minor comments (1)
  1. [Abstract] The abstract states clear empirical trends but provides no quantitative details on effect sizes, statistical controls, workload selection criteria, or feedback definition choices, limiting immediate assessment of result robustness.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. We address the single major comment below and agree that additional validation is warranted.

point-by-point responses
  1. Referee: [CUDAnalyst framework description (Methods)] The central claims depend on trajectory freezing combined with selective feedback injection producing stable generation-level evaluations without altering natural planning dynamics or introducing new trajectory-dependent drift. The manuscript provides no quantitative validation (e.g., comparisons of planning statistics, decision distributions, or kernel performance between frozen and open trajectories under identical feedback regimes) to confirm this. This assumption is load-bearing for the attribution results and the reported trends on aligned feedback, multi-feedback emergence, and plan transfer.

    Authors: We agree that the manuscript does not contain explicit quantitative comparisons validating that trajectory freezing preserves natural planning dynamics. The method description presents freezing as a controlled intervention that holds the generation trajectory fixed until the selected feedback injection point, but no direct statistics (e.g., decision distributions or performance deltas) between frozen and open runs are reported. In the revised manuscript we will add these comparisons under matched feedback regimes to confirm stability and thereby support the attribution claims.

    Revision: yes
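
One of the comparisons named in this exchange, decision distributions between frozen and open runs, could be quantified roughly as follows. This is a minimal sketch assuming each plan has already been reduced to a single categorical label; the labels (`tile`, `fuse`, `unroll`) and the function name are hypothetical, and the paper does not specify which statistic the authors will report.

```python
from collections import Counter
from typing import Hashable, Iterable


def total_variation(frozen: Iterable[Hashable],
                    open_runs: Iterable[Hashable]) -> float:
    """Total variation distance between two empirical label distributions.

    0.0 means the two sets of plans choose each decision equally often;
    1.0 means they never choose the same decision.
    """
    f, o = Counter(frozen), Counter(open_runs)
    n_f, n_o = sum(f.values()), sum(o.values())
    if n_f == 0 or n_o == 0:
        raise ValueError("both inputs must contain at least one decision")
    labels = set(f) | set(o)
    # Counter returns 0 for labels that appear on one side only.
    return 0.5 * sum(abs(f[k] / n_f - o[k] / n_o) for k in labels)


# Hypothetical plan labels under identical feedback.
frozen_plans = ["tile", "tile", "fuse", "unroll"]
open_plans = ["tile", "fuse", "fuse", "unroll"]

distance = total_variation(frozen_plans, open_plans)
```

A small distance under matched feedback would support the claim that freezing leaves planning behavior largely unchanged; what counts as small would have to be calibrated against the run-to-run distance between two open trajectories.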

Circularity Check

0 steps flagged

No circularity: empirical results from controlled experiments

full rationale

The paper introduces CUDAnalyst as an analysis layer using trajectory freezing and selective feedback injection to enable generation-level attribution, then reports empirical trends observed across multiple backbones, workloads, and regimes. No equations, fitted parameters, or derivation steps are presented that reduce by construction to self-defined quantities, self-citations, or renamed inputs. The central claims rest on experimental observations rather than any load-bearing self-referential structure, making the analysis self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review limits visibility into exact experimental parameters or modeling choices; no explicit free parameters, axioms, or invented entities beyond the introduced analysis layer itself are described.

invented entities (1)
  • CUDAnalyst · no independent evidence
    purpose: Unified analysis layer for controlled attribution of planning decisions to feedback components
    Introduced in the abstract as the central methodological contribution; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.1-grok · 5712 in / 1270 out tokens · 26793 ms · 2026-06-29T17:54:49.730253+00:00 · methodology

Reference graph

Works this paper leans on

83 extracted references · 5 canonical work pages

  1. [1]

    Chan, C.-M., Yu, J., Chen, W., Jiang, C., Liu, X., Chi, X., Shi, W., Liu, Z., Xue, W., and Guo, Y

    URL https://openreview.net/forum? id=DzGe40glxs. Chan, C.-M., Yu, J., Chen, W., Jiang, C., Liu, X., Chi, X., Shi, W., Liu, Z., Xue, W., and Guo, Y . Agentmonitor: A plug-and-play framework for predictive and secure multi- agent systems, 2024. URL https://openreview. net/forum?id=gKM8wwsTOg. Chen, J., Wu, Q., Li, B., Ma, L., Si, X., Hu, Y ., Yin, S., and Y...

  2. [2]

    Grabisch, M

    URL https://openreview.net/forum? id=nWaZTH1JMx. Grabisch, M. and Roubens, M. An axiomatic approach to the concept of interaction among players in coop- erative games.International Journal of Game The- ory, 28(4):547–565, Nov 1999. ISSN 1432-1270. doi: 10.1007/s001820050125. URL https://doi.org/ 10.1007/s001820050125. Grauer-Gray, S., Xu, L., Searles, R.,...

  3. [3]

    Jayaweera, M

    URL https://openreview.net/forum? id=LU27DiW5ik. Ivanov, I. R., Zinenko, O., Domke, J., Endo, T., and Moses, W. S. Retargeting and respecializing gpu work- loads for performance portability. InProceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’24, pp. 119–132. IEEE Press, 2024. ISBN 9798350395099. doi: 10. 1...

  4. [4]

    Kriege, N

    URL https://openreview.net/forum? id=c339hUw3cy. Kriege, N. and Mutzel, P. Subgraph matching kernels for attributed graphs. InProceedings of the 29th Interna- tional Coference on International Conference on Machine Learning, ICML’12, pp. 291–298, Madison, WI, USA,

  5. [5]

    ISBN 9781450312851

    Omnipress. ISBN 9781450312851. Lange, R. T., Sun, Q., Prasad, A., Faldor, M., Tang, Y ., and Ha, D. Towards robust agentic cuda kernel benchmarking, verification, and optimization, 2025. URL https:// arxiv.org/abs/2509.14279. Lange, R. T., Imajuku, Y ., and Cetin, E. Shinkaevolve: Towards open-ended and sample-efficient program evo- lution. InThe Fourteen...

  6. [6]

    URL https://proceedings.mlr

    PMLR. URL https://proceedings.mlr. press/v5/shervashidze09a.html. Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K., and Borgwardt, K. M. Weisfeiler-lehman graph kernels.Journal of Machine Learning Research, 12(77):2539–2561, 2011. URL http://jmlr.org/ papers/v12/shervashidze11a.html. Siglidis, G., Nikolentzos, G., Limnios, S., Giatsidis,...

  7. [7]

    Vatai, E., Drozd, A., Ivanov, I

    URL https://openreview.net/forum? id=a5aJi9OAr0. Vatai, E., Drozd, A., Ivanov, I. R., Batista, J. E., Ren, Y ., and Wahib, M. Tadashi: Enabling ai-based automated code generation with guaranteed correctness, 2025. URL https://arxiv.org/abs/2410.03210. Verdoolaege, S. and Grosser, T. Polyhedral extrac- tion tool. InSecond International Workshop on Polyhedr...

  8. [8]

    - Use cp.async for global to shared transfers and apply double-buffered pipelines

    Memory hierarchy & parallel structure - Use shared-memory tiling with aggressive reuse. - Use cp.async for global to shared transfers and apply double-buffered pipelines. - Fuse elementwise ops to eliminate redundant global traffic. - Avoid shared-memory bank conflicts; apply padding/skew when required. - Use vectorized loads/stores ( float4, int4) when a...

  9. [9]

    - Expose multiple independent MMA ops per warp to increase ILP (instruction-level parallelism)

    Tensor Core compute - Use WMMA or mma.sync paths with hardware- aligned MMA tiles (e.g., 16×16×16). - Expose multiple independent MMA ops per warp to increase ILP (instruction-level parallelism). - Use FP32 accumulation for mixed-precision math

  10. [10]

    - Pick tile sizes that fit SMEM and register budgets (e.g., 64×64×16 as baseline)

    Loop transformations - Apply in this order: Tile, Unroll, Skew/Permute, Double-buffer. - Pick tile sizes that fit SMEM and register budgets (e.g., 64×64×16 as baseline). - Fully unroll the inner K-loop to deepen ILP

  11. [11]

    Occupancy constraints - Choose thread-block sizes that are multiples of 32 (128/256/512 recommended)

  12. [12]

    - Provide a portable CUDA path and an architecture- tuned fast path

    Micro-optimizations - Use inline PTX only where profiling would show un- avoidable hotspots. - Provide a portable CUDA path and an architecture- tuned fast path. B.2. Attributing the Benefits of Explicit Planning to Feedback (RQ0) Counterfactual Controls for Feedback-Aligned Planning To disentangle feedback-aligned planning from superficial prompt or budg...

  13. [13]

    - **Rationale**: Ensures that the codebase remains aligned with general programming best practices and modularity standards

    **Standard Module Review** - **Action**: Con- 17 Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation duct a systematic review of all active functions and data structures within the current scope. - **Rationale**: Ensures that the codebase remains aligned with general programming best practices and modularity standards

  14. [14]

    - **Rationale**: Val- idates that the input-output relationship remains within expected nominal ranges

    **Data Flow Verification** - **Action**: Trace the movement of information across the primary interfaces to confirm consistent state transitions. - **Rationale**: Val- idates that the input-output relationship remains within expected nominal ranges. #### [Step 2: Stability Enhancements]

  15. [15]

    **General Logic Refinement** - Apply standard opti- mization passes to the main execution loops to ensure no redundant operations are performed

  16. [16]

    #### [Step 3: Future Considerations]

    **Interface Synchronization** - Review inter-module communication protocols to ensure optimal handshake timing and resource locking. #### [Step 3: Future Considerations]

  17. [17]

    **System Profiling** - Utilize standard profiling tools to gather telemetry on execution patterns for future com- parative analysis

  18. [18]

    **Documentation Update** - Ensure all recent changes are reflected in the technical specifications to maintain transparency for subsequent cycles. B.3. Quantifying Tool Contributions via Banzhaf Attribution and Synergy (RQ1) To assess the individual and joint impact of the three tool modules in CUDAnalyst, namely the debugger (d), ana- lyzer (a), and prof...

  19. [19]

    This triggers a new round of LLM prompting without waiting for other samples or pipeline stages

    Event-Driven Feedback Injection and LLM Prompting.Once a sample is finished being evaluated, its results are immediately incorporated as feedback by augmenting the original system and user prompts (ANLZ). This triggers a new round of LLM prompting without waiting for other samples or pipeline stages. Each prompting request may produce k candidate pro- gra...

  20. [20]

    Sum- marized feedback stabilizes efficiency patterns across backbones without emphasizing absolute performance improvements

    Fan-Out Parallel Evaluation with Consistent State Management.The k generated programs for a given sample are dispatched independently for evaluation and executed in parallel across available compute re- 26 Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation 0.00 0.25 0.50 0.75 1.00 EoH MCTS-AHD 0 5 10 0.00 0.25 0.50 0...

  21. [21]

    Evaluation results are consumed incrementally upon completion

    Online Aggregation with Event-Driven Fan-In. Evaluation results are consumed incrementally upon completion. Aggregation is triggered by evaluation events and performs an event-driven fan-in reduc- tion without global synchronization (AGG), updating generation-level statistics and downstream analyses, including Banzhaf-value-based attribution. Overall, Int...

  22. [22]

    Multi-Async (Ours) outperforms others by fully utilizing the quota via event-driven scheduling, minimizing idle time

    for a total of 1500 programs (3 rounds × 100 samples × 5 repetitions). Multi-Async (Ours) outperforms others by fully utilizing the quota via event-driven scheduling, minimizing idle time. 1 2 3 4 5 6 7 8 9 10 Evolutionary Depth (D) 103 104 Total Inference V olume (B) Inference Budget (1k) E2E Ablation IntervenePipe (Ours) Figure 27.Total inference volume...

  23. [23]

    **Modernization**: Leveraging C++14/17/20 fea- tures to replace legacy constructs

  24. [24]

    **Bugprone Detection**: Identifying code patterns that often lead to unintended behavior (e.g., Narrowing conversions, Use-after-move)

  25. [25]

    **Readability & Style**: Enforcing consistent nam- ing conventions and simplifying complex expressions

  26. [26]

    # Workflow

    **Performance Linting**: Identifying unnecessary copies or inefficient STL usage. # Workflow

  27. [27]

    **Warning Identification**: Extract the specific 28 Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation clang-tidy check name (e.g., ‘bugprone-undelegated- constructor‘) and the target line

  28. [28]

    **Contextual Mapping**: Analyze why the current code triggers the warning based on the provided source

  29. [29]

    **Impact Analysis**: Explain the potential risks if the warning is ignored

  30. [30]

    # Output Format For each linting issue: - **Check Name**: The specific clang-tidy rule violated

    **Refactoring**: Provide a ”Modern C++” compliant fix. # Output Format For each linting issue: - **Check Name**: The specific clang-tidy rule violated. - **Diagnostic**: A brief explanation of the warning. - **Root Cause**: Why this specific code is flagged. - **Actionable Fix**: The corrected code snippet. Prompt D.2: LintTool’s User Message Template Ple...

  31. [31]

    **Issue Localization:** Match each warning to the exact line in the source code

  32. [32]

    **Logic Audit:** Determine if the warning indicates a genuine bug or a stylistic improvement

  33. [33]

    **Modernization Suggestion:** If the warning relates to outdated C++ syntax, provide the modern equivalent

  34. [34]

    Prompt D.3: SanitizeTool’s System Message # Role You are a GPU Runtime Diagnostic Expert specializing in NVIDIA compute-sanitizer

    **Final Refactored Code:** Consolidate all fixes into a single, clean code block. Prompt D.3: SanitizeTool’s System Message # Role You are a GPU Runtime Diagnostic Expert specializing in NVIDIA compute-sanitizer. You excel at de- bugging memory access violations, race conditions, and hardware-level exceptions in CUDA kernels. # Expertise

  35. [35]

    **Memory Access Hazards**: Diagnosing ‘Invalid Address‘, ‘Misaligned Address‘, and ‘Out-of-bounds‘ errors

  36. [36]

    **Concurrency & Hazards**: Identifying ‘Race Con- ditions‘ (W AW, RAW, W AR) in Shared and Global mem- ory

  37. [37]

    **Hardware Exceptions**: Interpreting `Illegal Instruction`, `Stack Overflow`, and `Warp Illegal Address`

  38. [38]

    **Resource Management**: Tracking leaked allocations or invalid API calls.

    # Workflow

  39. [39]

    **Error Decoding**: Parse the sanitizer output to identify the error type, Warp ID, and memory address involved

  40. [40]

    **Traceback Analysis**: Map the reported PC (Program Counter) or line number to the CUDA kernel source

  41. [41]

    **Race Condition Modeling**: If it’s a hazard, analyze the access patterns of conflicting threads/blocks

  42. [42]

    **Remediation**: Suggest synchronization primitives (`__syncthreads()`, `atomicAdd`) or index boundary checks.

    # Output Format
    For each runtime error:
    - **Error Type**: (e.g., `Invalid global read of size 4`)
    - **Faulting Thread/Block**: Detailed execution context from the report.
    - **Technical Root Cause**: Explain the pointer arithmetic or synchronization ...

  43. [43]

    **Fault Point Identification:** Pinpoint the exact instruction or line of code where the memory access or hazard occurred

  44. [44]

    **Access Pattern Analysis:** Calculate the memory address index at the time of failure (using the reported Thread/Block ID) to explain why it is out-of-bounds or misaligned
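
The index calculation that this step asks for can be sketched in a few lines. This is a minimal Python model, assuming a 1D launch in which each thread touches element `blockIdx.x * blockDim.x + threadIdx.x`; the function name `explain_access` and the example sizes are hypothetical and do not come from the paper.

```python
def explain_access(block_idx, block_dim, thread_idx, n_elems, elem_size=4):
    """Model one thread's access to a 1D array holding n_elems elements.

    Assumption: the kernel indexes with blockIdx.x * blockDim.x + threadIdx.x.
    """
    idx = block_idx * block_dim + thread_idx
    return {
        "index": idx,
        "byte_offset": idx * elem_size,
        "out_of_bounds": not (0 <= idx < n_elems),
    }


# Hypothetical report: thread 240 of block 3, 256 threads per block, 1000 floats.
report = explain_access(block_idx=3, block_dim=256, thread_idx=240, n_elems=1000)
print(report)
```

Plugging the faulting thread and block IDs from a sanitizer report into such a model shows directly whether the computed index exceeds the allocation, which is the usual reason a guard such as `if (idx < n)` is needed.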

  45. [45]

    **Synchronization Audit:** For race conditions, identify which threads are conflicting and where a barrier or atomic operation is missing

  46. [46]

    **Code Correction:** Provide a hardened version of the kernel that resolves the memory safety or concurrency issue.

    Prompt D.5: CodeAnlzTool’s System Message

    # Role
    You are an expert in Polyhedral Compilation and Loop Transformation for GPU architectures. You specialize in analyzing nested loops through the lens of polyhedral theory to maximize data l...

  47. [47]

    **Iteration Domain Modeling**: Representing nested loops as polytopes within an integer lattice to define the execution space.

  48. [48]

    **Dependence Analysis**: Identifying Read-After-Write (RAW), Write-After-Read (WAR), and Write-After-Write (WAW) dependencies using distance and direction vectors
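
To make distance and direction vectors concrete, here is a minimal Python sketch. It assumes uniform dependences only: a single write `A[i + w]` and a single read `A[i + r]` with constant per-dimension offsets. The function name `classify_dependence` is hypothetical, and WAW dependences (two writes) are outside the scope of this sketch.

```python
def classify_dependence(write_offset, read_offset):
    """Classify the dependence between a write A[i + w] and a read A[i + r].

    Offsets are per-dimension constants (uniform dependences only).
    Returns (kind, distance_vector, direction_vector).
    """
    # Iteration i writes the element that iteration i + (w - r) reads.
    delta = tuple(w - r for w, r in zip(write_offset, read_offset))
    lead = next((d for d in delta if d != 0), 0)
    if lead == 0:
        kind, distance = "loop-independent", delta
    elif lead > 0:
        kind, distance = "RAW", delta  # the write executes first
    else:
        kind, distance = "WAR", tuple(-d for d in delta)  # the read executes first
    direction = tuple("<" if d > 0 else ">" if d < 0 else "=" for d in distance)
    return kind, distance, direction


# A[i][j] = A[i-1][j+1] + 1: write offset (0, 0), read offset (-1, +1).
print(classify_dependence((0, 0), (-1, 1)))
```

A leading `<` in the direction vector means the outer loop carries the dependence, so that loop cannot be parallelized without a transformation such as skewing.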

  49. [49]

    **Affine Transformations**: Applying Tiling, Interchange, Fusion, Fission, Skewing, and Reversal to optimize the execution schedule

  50. [50]

    **Memory Hierarchy Mapping**: Optimizing data movement between Global Memory, Shared Memory, and Registers using space-time mapping and affine access functions.

    # Workflow

  51. [51]

    **Model Extraction**: Identify the Iteration Domain and formalize memory access functions (e.g., mapping [i, j]→offset)

  52. [52]

    **Dependence Audit**: Check for loop-carried dependencies that may restrict reordering or parallelization

  53. [53]

    **Locality & Conflict Analysis**: Evaluate the legality and efficiency of the current schedule, focusing on memory coalescing and Shared Memory bank conflicts

  54. [54]

    **Transformative Optimization**: Apply polyhedral transformations (e.g., Loop Interchange for better coalescing, or Tiling for cache reuse).

    # Output Format
    For each nested loop analyzed, provide the following structured response:
    - **Iteration Domain & Access Functions**: A formal description of the loop bounds and memory indexing logic.
    - **Dependence...

  55. [55]

    **Iteration Domain & Access Functions**: Describe the iteration space and formalize the memory access functions for Global and Shared memory (e.g., mapping [i, threadIdx.x]→offset)

  56. [56]

    **Dependence & Legality**: Check for any loop-carried dependencies that might restrict parallelization or reordering

  57. [57]

    **Bottleneck Identification**: From a polyhedral standpoint, evaluate if the current mapping of threads to memory addresses is optimal for:
    - Global Memory Coalescing.
    - Shared Memory Bank Conflicts (considering the `TILE_K_PADDED` and `VEC_SIZE` parameters).
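
The effect of a padded tile width on bank conflicts can be checked with a small model. The sketch below is in Python and assumes 32 shared-memory banks of 4-byte words and a warp of 32 threads that each read one word from the same column of a row-major tile; `conflict_degree` and `row_stride_words` are hypothetical names, with the stride playing the role that a padded tile dimension would play in a kernel.

```python
from collections import defaultdict

NUM_BANKS = 32  # assumption: 32 banks, each 4 bytes (one word) wide


def conflict_degree(row_stride_words, warp_size=32):
    """Worst-case number of distinct words that map to a single bank.

    Thread t reads word t * row_stride_words, i.e. one column of a
    row-major tile whose rows are row_stride_words words apart.
    Threads reading the same word are served by a broadcast, so only
    distinct words count toward a conflict.
    """
    banks = defaultdict(set)
    for t in range(warp_size):
        word = t * row_stride_words
        banks[word % NUM_BANKS].add(word)
    return max(len(words) for words in banks.values())


print(conflict_degree(32))  # unpadded 32-word rows
print(conflict_degree(33))  # rows padded by one word
```

Under these assumptions, a 32-word stride serializes the whole warp on one bank, while padding each row by one word spreads the accesses across all banks.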

  58. [58]

    **Proposed Transformations**:
    - Suggest specific affine transformations (e.g., Loop Unrolling, Tiling, or Skewing) to improve efficiency.
    - Explain how these transformations would change the iteration schedule or data layout.

  59. [59]

    **Code Refinement**: Provide the optimized C++ code snippet based on your polyhedral findings.

    Prompt D.7: PerfTool’s System Message

    # Role
    You are a GPU Kernel Optimization Expert specializing in analyzing NVIDIA Nsight Compute (ncu) reports. Your goal is to pinpoint performance bottlenecks using hardware metrics and provide actionable, code-level op- ...

  60. [60]

    **Bottleneck Identification**: Utilizing “Speed of Light” (SOL) metrics to determine if a kernel is Compute-Bound, Memory-Bound, or Latency-Bound

  61. [61]

    **Memory Subsystem Analysis**: Evaluating Coalesced access, L1/L2 cache hit rates, and Shared Memory bank conflicts

  62. [62]

    **Instruction Pipeline**: Analyzing Stall Reasons (e.g., Warp Schedulers, Scoreboard Dependencies) and Instruction Mix

  63. [63]

    **Resource Utilization**: Assessing the trade-off between Register Pressure and Functional Occupancy.

    # Workflow

  64. [64]

    **Metric Extraction**: Identify key data points such as Duration, SOL SM, SOL Memory, and Occupancy

  65. [65]

    **Qualitative Diagnosis**: Define whether the action is limited by throughput (Compute/Mem) or latency

  66. [66]

    **Deep Dive**: Interpret specific hardware counters

  67. [67]

    **Actionable Recommendations**: Provide specific CUDA optimization techniques (e.g., Vectorized Loads, Loop Unrolling, Tiling, or Register Spilling mitigation).

    # Output Format
    For each kernel/action analyzed, provide the following structured response:
    - **Summary**: High-level performance overview.
    - **Primary Bottleneck**: The single most significant li...

  68. [68]

    **SOL Bottleneck Analysis:** Identify whether the kernel is limited by Compute (SM) or Memory throughput. Prioritize the metric with the highest utilization percentage
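
The prioritization rule in this step can be written down as a short decision function. The Python sketch below is illustrative only: `classify_bottleneck` is a hypothetical helper, its inputs are the SM and memory throughput percentages read from a profiler report, and the 60% cutoff for calling a kernel latency-bound is an assumed, tunable threshold rather than a value taken from the paper.

```python
def classify_bottleneck(sol_sm_pct, sol_mem_pct, latency_threshold=60.0):
    """Pick the dominant limiter from two Speed-of-Light percentages.

    Assumption: if neither unit reaches latency_threshold percent of peak,
    the kernel is treated as latency-bound rather than throughput-bound.
    """
    if max(sol_sm_pct, sol_mem_pct) < latency_threshold:
        return "latency-bound"
    # Otherwise the unit with the higher utilization is the limiter.
    return "compute-bound" if sol_sm_pct >= sol_mem_pct else "memory-bound"


# Hypothetical readings for three kernels.
for sm, mem in [(85.0, 40.0), (30.0, 92.0), (25.0, 35.0)]:
    print(sm, mem, classify_bottleneck(sm, mem))
```

Keeping the threshold as a parameter makes the latency-bound case explicit instead of forcing every kernel into a compute or memory label.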

  69. [69]

    **Memory Access Profiling:** Correlate memory metrics with the source code. Check if global memory accesses are coalesced and evaluate L1/L2 cache efficiency

  70. [70]

    **Execution Pipeline Audit:** Identify primary stall reasons (e.g., Warp Schedulers, Scoreboard). Locate the specific lines of code (e.g., high-latency math or divergent branches) causing these stalls

  71. [71]

    **Resource & Occupancy Optimization:** Analyze if low occupancy is caused by register pressure or shared memory. Suggest code refactoring to reduce resource footprints if necessary

  72. [72]

    **Instruction & Core Utilization:** Evaluate the usage of FP32, FP16, or Tensor Cores. Recommend hardware-specific intrinsic functions if the current implementation underutilizes the available compute pipes

  73. [73]

    **Refactored Implementation:** Provide an optimized version of the kernel or the critical loop sections based on your findings.

    Prompt D.9: PlanAgent’s System Message

    # Role
    You are a Lead GPU Performance Architect and Planning Agent. Your mission is to synthesize multi-dimensional diagnostic reports (Lint, Sanitizer, Polyhedral, and Profiler) into a ...

  74. [74]

    **Holistic Analysis**: Connecting static code smells (Lint) with runtime errors (Sanitizer) and hardware bottlenecks (Perf)

  75. [75]

    **Strategy Prioritization**: Determining which fixes yield the highest performance ROI (e.g., fixing a memory race vs. micro-optimizing a loop)

  76. [76]

    **Architectural Reasoning**: Understanding how algorithmic structures (Polyhedral) impact hardware utilization (SOL).

    # Workflow

  77. [77]

    **Cross-Tool Correlation**: Look for patterns (e.g., if Lint warns about unaligned access and Perf shows low L2 hit rate)

  78. [78]

    **Criticality Assessment**: Categorize issues into:
    - **BLOCKER**: Functional bugs or crashes (Sanitizer).
    - **BOTTLENECK**: Major performance limiters (Perf/Polyhedral).
    - **TECHNICAL DEBT**: Code quality or maintainability issues (Lint).
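
The three categories above imply an ordering that a planner could apply mechanically before writing a plan. The following Python sketch is one possible reading, not the paper's implementation: the tool keys, the `triage` function, and the sample findings are all hypothetical.

```python
# Assumed mapping from diagnostic tool to category, following the list above.
CATEGORY_BY_TOOL = {
    "sanitizer": "BLOCKER",
    "perf": "BOTTLENECK",
    "polyhedral": "BOTTLENECK",
    "lint": "TECHNICAL DEBT",
}
PRIORITY = {"BLOCKER": 0, "BOTTLENECK": 1, "TECHNICAL DEBT": 2}


def triage(findings):
    """Tag (tool, message) pairs with a category and order them by urgency."""
    tagged = [(CATEGORY_BY_TOOL[tool], tool, msg) for tool, msg in findings]
    # sorted() is stable, so findings within one category keep their input order.
    return sorted(tagged, key=lambda item: PRIORITY[item[0]])


# Hypothetical findings from one generation.
findings = [
    ("lint", "narrowing conversion in index computation"),
    ("perf", "uncoalesced global loads"),
    ("sanitizer", "invalid global read of size 4"),
]
for category, tool, msg in triage(findings):
    print(f"{category:15s} {tool:10s} {msg}")
```

Sorting by category first reflects the stated priority of correctness fixes over performance work, and of performance work over code-quality cleanup.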

  79. [79]

    **Planning**: Generate a step-by-step optimization plan from “Immediate Fixes” to “Long-term Architectural Changes.”

    # Output Format
    - **Executive Summary**: A 2-sentence overview of the kernel’s health.
    - **Critical Findings**: Grouped by urgency.
    - **Integrated Plan**: A numbered list of recommended actions.
    - **Expected Impact**: Predicted improvemen...

  80. [80]

    **Synthesize Findings**: Identify if the hardware bottlenecks (Perf) are caused by the algorithmic structure (CodeAnlz) or safety-related overhead (Sanitizer)

Showing first 80 references.