AgentCompile: An LLM-Guided Compiler for Direct CUDA Inference

Junhui Hou; Xuanzhe Li; Zhiyu Zhu; Ziyan Weng

arxiv: 2606.07665 · v1 · pith:3O3QSJ2Gnew · submitted 2026-06-04 · 💻 cs.PL · cs.AI

Pith reviewed 2026-06-27 23:07 UTC · model grok-4.3

keywords: LLM-guided compilation · CUDA inference · transformer optimization · code specialization · autoregressive generation · empirical validation · PyTorch comparison

The pith

AgentCompile treats LLM outputs as advisory metadata to select CUDA specializations, achieving 5.66x average speedup over PyTorch eager on Qwen3-1.7B.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

AgentCompile is a compiler for direct CUDA inference that uses LLM outputs only as advisory search metadata rather than as generated code. The compiler first produces region summaries and bounded candidate spaces from templates; the LLM then adds semantic labels, priorities, parameter hints, and risk annotations inside those bounds. The compiler next materializes candidates, enforces interface and hardware constraints, measures actual latency, and retains only the fastest profitable implementation or falls back to a safe default. Across five workloads in end-to-end autoregressive generation this yields average speedups of 5.66x on Qwen3-1.7B, 4.05x on Qwen3-4B, and 4.26x on Llama-3.2-1B-Instruct. A sympathetic reader would care because the design keeps correctness and measurement under compiler control while using the LLM only to narrow an otherwise intractable search space.
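The selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Candidate` class, its `priority` field, the `select_implementation` function, and the `min_speedup` threshold are all hypothetical names. The timing function is passed in as `measure` so the sketch stays independent of any particular GPU timing API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    name: str
    priority: float = 0.0  # advisory LLM ranking (hypothetical field)


def select_implementation(
    candidates: Sequence[Candidate],
    fallback: Candidate,
    is_valid: Callable[[Candidate], bool],
    measure: Callable[[Candidate], float],
    min_speedup: float = 1.0,
) -> Tuple[Candidate, float]:
    """Return the fastest valid candidate that beats the fallback, else the fallback."""
    baseline = measure(fallback)
    best, best_latency = fallback, baseline
    # The LLM priority only decides the order in which candidates are tried.
    for cand in sorted(candidates, key=lambda c: c.priority, reverse=True):
        if not is_valid(cand):  # stands in for interface / hardware checks
            continue
        latency = measure(cand)  # measured, never predicted
        profitable = latency * min_speedup <= baseline
        if profitable and latency < best_latency:
            best, best_latency = cand, latency
    return best, best_latency


if __name__ == "__main__":
    # Placeholder latencies in milliseconds, not measurements from the paper.
    fake_latency = {"eager_fallback": 10.0, "tmpl_a": 12.0, "tmpl_b": 4.0}
    chosen, latency = select_implementation(
        candidates=[Candidate("tmpl_a", 0.9), Candidate("tmpl_b", 0.2)],
        fallback=Candidate("eager_fallback"),
        is_valid=lambda c: True,
        measure=lambda c: fake_latency[c.name],
    )
    print(chosen.name, latency)
```

In this sketch a highly ranked but slow candidate (`tmpl_a`) loses to a lower-ranked fast one, which is the property the paragraph attributes to measurement-driven selection.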

Core claim

AgentCompile is an LLM-guided CUDA inference compiler that uses LLM outputs only as advisory search metadata. Given compiler-derived region summaries and bounded candidate spaces, the LLM proposes semantic labels, candidate priorities, parameter hints, and risk annotations; the compiler materializes CUDA candidates through templates, checks interface and hardware constraints, validates candidates empirically, selects implementations by measured latency, and falls back when specialization is unsupported or unprofitable. In end-to-end autoregressive generation, AgentCompile averages 5.66x, 4.05x, and 4.26x speedup over PyTorch eager on Qwen3-1.7B, Qwen3-4B, and Llama-3.2-1B-Instruct, respectively.

What carries the argument

LLM advisory metadata restricted to compiler-supplied bounded candidate spaces for CUDA template selection
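One way to picture "restricted to compiler-supplied bounded candidate spaces" is a filter that discards any LLM hint naming an unknown candidate, an unknown parameter, or an out-of-range value. The sketch below is an assumption about how such a filter could look; `ParamBound`, `sanitize_hints`, and the example kernel and parameter names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ParamBound:
    low: int
    high: int


def sanitize_hints(
    candidate_space: Dict[str, Dict[str, ParamBound]],
    llm_hints: Dict[str, Dict[str, object]],
) -> Dict[str, Dict[str, int]]:
    """Keep only hints that name a known candidate and parameter and sit inside its bounds."""
    accepted: Dict[str, Dict[str, int]] = {}
    for cand_name, params in llm_hints.items():
        bounds = candidate_space.get(cand_name)
        if bounds is None:
            continue  # the LLM named a candidate the compiler never offered
        kept: Dict[str, int] = {}
        for param, value in params.items():
            bound = bounds.get(param)
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if bound is not None and is_int and bound.low <= value <= bound.high:
                kept[param] = value
        accepted[cand_name] = kept
    return accepted


if __name__ == "__main__":
    # Hypothetical candidate space and hints, for illustration only.
    space = {"fused_norm": {"block_size": ParamBound(32, 1024)}}
    hints = {
        "fused_norm": {"block_size": 256, "unroll": 4},
        "made_up_kernel": {"block_size": 64},
    }
    print(sanitize_hints(space, hints))
```

Because rejected hints are simply dropped, a wrong suggestion can at worst waste search effort; it cannot introduce a candidate outside the space the compiler defined.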

If this is right

Measured latency serves as the final selector among LLM-prioritized candidates.
Unsupported or unprofitable specializations trigger automatic fallback to safe execution.
The method applies directly to end-to-end autoregressive generation on the tested models without manual per-operator tuning.
Semantic decisions are delegated to the LLM while interface checks and empirical timing remain with the compiler.
The open-sourced implementation allows direct measurement of the reported speedups on the same models and workloads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same advisory-metadata pattern could be reused for other accelerator targets that already possess template-based code generators.
Keeping candidate spaces explicitly bounded limits the impact of any single LLM error on final correctness.
If empirical validation cost grows with model size, the technique may be most practical for models under a few billion parameters.
An iterative loop in which compiler timing results are fed back to the LLM for re-ranking could further improve selection without enlarging the initial candidate set.
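The feedback loop suggested in the last point could look roughly like the sketch below. It is purely illustrative of this editorial extension and is not described in the paper: `iterative_search`, `rerank`, `budget`, and `batch` are hypothetical names, and `rerank` stands in for an LLM call that sees the timings gathered so far. Any name the re-ranker returns that was not already in the remaining set is discarded, so the candidate set never grows.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def iterative_search(
    candidates: Sequence[str],
    measure: Callable[[str], float],
    rerank: Callable[[List[str], Dict[str, float]], List[str]],
    budget: int,
    batch: int = 2,
) -> Tuple[str, Dict[str, float]]:
    """Measure candidates in small batches, re-ranking the rest after each batch."""
    if not candidates or budget < 1 or batch < 1:
        raise ValueError("need at least one candidate and a positive budget and batch")
    remaining = list(dict.fromkeys(candidates))  # de-duplicate, keep order
    results: Dict[str, float] = {}
    while remaining and len(results) < budget:
        take = min(batch, budget - len(results))
        for name in remaining[:take]:
            results[name] = measure(name)
        allowed = remaining[take:]
        proposed = rerank(list(allowed), dict(results))
        # Keep the search bounded: drop anything outside the allowed set.
        remaining = [n for n in dict.fromkeys(proposed) if n in allowed]
    best = min(results, key=results.get)
    return best, results


if __name__ == "__main__":
    # Placeholder latencies, not measurements from the paper.
    latency = {"a": 5.0, "b": 4.0, "c": 1.0, "d": 3.0}
    best, seen = iterative_search(
        ["a", "b", "c", "d"],
        measure=lambda n: latency[n],
        rerank=lambda rest, timings: sorted(rest, key=lambda n: latency[n]),
        budget=3,
    )
    print(best, seen)
```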

Load-bearing premise

The LLM can reliably propose useful semantic labels, priorities, and risk annotations within the bounded candidate spaces supplied by the compiler, and empirical validation will always be feasible and sufficient to select profitable specializations without excessive overhead.

What would settle it

A workload in which LLM proposals consistently select candidates whose measured latency is no better than PyTorch eager despite the existence of faster hand-written CUDA alternatives.

Figures

Figures reproduced from arXiv: 2606.07665 by Junhui Hou, Xuanzhe Li, Zhiyu Zhu, Ziyan Weng.

**Figure 1.** The comparison between standard PyTorch and AgentCompile for the inference stack. Existing optimization paths address different layers of the inference stack but leave an important gap. PyTorch eager execution retains substantial Python dispatch, dynamic-cache management, and per-token launch overhead during autoregressive decoding. torch.compile reduces overhead through fusion and compiled graph frag…

**Figure 2.** Running example of compiler-guided LLM assistance on a representative region named r5. In Graph …

**Figure 3.** Input/output-length scaling comparison on Llama-3.2-3B-Instruct across PyTorch eager, …

Original abstract

Transformer inference increasingly depends on specialized compiler and runtime support, but real model graphs still require semantic decisions about which regions are worth specializing and which CUDA implementation families are plausible. We present AgentCompile, an LLM-guided CUDA inference compiler that uses LLM outputs only as advisory search metadata. Given compiler-derived region summaries and bounded candidate spaces, the LLM proposes semantic labels, candidate priorities, parameter hints, and risk annotations; the compiler materializes CUDA candidates through templates, checks interface and hardware constraints, validates candidates empirically, selects implementations by measured latency, and falls back when specialization is unsupported or unprofitable. In end-to-end autoregressive generation, AgentCompile averages 5.66x, 4.05x, and 4.26x speedup over PyTorch eager on Qwen3-1.7B, Qwen3-4B, and Llama-3.2-1B-Instruct, respectively, across five representative workloads. We will open-source the project.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AgentCompile limits the LLM to advisory metadata inside compiler bounds and selects CUDA kernels by measured latency, which is a reasonable safeguard, but the abstract supplies too little detail on workloads and variance to judge the reported 4.05x to 5.66x speedups.

The main thing here is that AgentCompile treats LLM output strictly as metadata (semantic labels, priorities, hints, and risk notes) within spaces the compiler already supplies. The compiler then turns those into template-based CUDA candidates, runs interface and hardware checks, measures the latency of each, and keeps only those that beat the baseline, falling back otherwise.

This separation is the clearest new piece. It avoids asking the LLM to produce correct or fast code and instead uses it for search guidance while letting direct measurement drive the final choice. The reported speedups on the three small models come from that measured step, not from LLM judgment.

The paper does a clean job stating the practical problem: model graphs still need decisions about which regions are worth specializing, and it offers a workflow that tries to automate those decisions without full manual tuning.

The soft spots sit in the results section of the abstract. There is no breakdown of the five workloads, no description of how regions were identified or how many candidates were tested per region, and no error bars or variance numbers. The models are all under 4B parameters, so it is unclear whether the gains hold on larger models or whether validation overhead stays low in practice. These gaps are real but not load-bearing for the core idea.

This is for systems and compiler researchers working on automated inference optimization for transformers. A reader already building template-driven CUDA generators could extract the advisory-metadata pattern and the validation loop.

Send it for peer review. The pipeline is internally consistent, the measurement guardrail is a genuine strength, and the work is coherent enough that referees can usefully press on the experimental details and generalizability.

Referee Report

1 major / 1 minor

Summary. AgentCompile is an LLM-guided CUDA inference compiler that restricts LLM use to advisory metadata (semantic labels, priorities, parameter hints, risk annotations) within compiler-supplied bounded candidate spaces. The compiler materializes CUDA candidates via templates, performs interface/hardware checks, empirically validates them, selects by measured latency, and falls back to non-specialized paths when necessary. The paper reports average end-to-end speedups of 5.66×, 4.05×, and 4.26× over PyTorch eager for Qwen3-1.7B, Qwen3-4B, and Llama-3.2-1B-Instruct across five workloads in autoregressive generation, and states it will open-source the project.

Significance. If substantiated, the work offers a pragmatic hybrid compiler design that safely incorporates LLMs by confining them to advisory roles and relying on direct measurement for final selection and fallback. This addresses reliability concerns in LLM-assisted optimization for transformer inference. The planned open-sourcing is a positive step for reproducibility in the PL and systems communities.

major comments (1)

[Abstract] The central claims of 5.66×, 4.05×, and 4.26× average speedups are presented without error bars, workload specifications, region-selection methodology, or counts of candidates tested per region. These details are load-bearing for evaluating whether the reported gains are statistically reliable and not artifacts of particular choices.

minor comments (1)

The abstract ends with a future-tense statement about open-sourcing without a timeline or link; this is a minor presentation detail.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that additional details are needed to substantiate the reported speedups and will revise accordingly.

Referee: [Abstract] The central claims of 5.66×, 4.05×, and 4.26× average speedups are presented without error bars, workload specifications, region-selection methodology, or counts of candidates tested per region. These details are load-bearing for evaluating whether the reported gains are statistically reliable and not artifacts of particular choices.

Authors: We agree that the abstract should be expanded to include these details so that statistical reliability can be properly evaluated. In the revised version we will add: (1) error bars or standard deviations from repeated measurements, (2) explicit names and descriptions of the five workloads, (3) a concise statement of the region-selection methodology, and (4) the typical number of candidates evaluated per region. These elements already appear in the experimental evaluation section; we will make the abstract self-contained on this point while respecting its length constraints. Revision: yes.
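As a concrete picture of item (1), a speedup with a spread could be computed from repeated paired runs roughly as below. This is a sketch under assumptions: the source does not say how the reported averages were computed, so the arithmetic mean and sample standard deviation used here are illustrative choices, `speedup_summary` is a hypothetical helper, and the numbers in the usage example are placeholders rather than results from the paper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def speedup_summary(
    baseline_ms: Sequence[float], optimized_ms: Sequence[float]
) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-run speedups, paired by index."""
    if len(baseline_ms) != len(optimized_ms):
        raise ValueError("baseline and optimized runs must be paired")
    if len(baseline_ms) < 2:
        raise ValueError("need at least two runs to estimate a spread")
    # Speedup of run i is baseline latency divided by optimized latency.
    ratios = [b / o for b, o in zip(baseline_ms, optimized_ms)]
    return mean(ratios), stdev(ratios)


if __name__ == "__main__":
    # Placeholder latencies in milliseconds, for illustration only.
    avg, spread = speedup_summary([100.0, 100.0, 100.0], [25.0, 20.0, 25.0])
    print(f"speedup = {avg:.2f}x ± {spread:.2f}")
```

Reporting the spread alongside the mean is what lets a reader judge whether a difference between, say, two models' speedups is larger than run-to-run noise.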

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents an empirical compiler pipeline in which LLM outputs serve only as advisory metadata within compiler-supplied bounded spaces; final implementation selection is performed by direct empirical latency measurement, interface/hardware constraint checks, and fallback to non-specialized paths. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the described derivation. The reported speedups are obtained from external benchmarks against PyTorch eager execution, rendering the central claims self-contained and independent of any construction that reduces to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no visible free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that LLM suggestions remain within compiler-checkable bounds.
