arxiv: 2511.16964 · v2 · pith:Z4VOBKGX · submitted 2025-11-21 · cs.MA · cs.AI · cs.DC

Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems

Kirill Nagaitsev, Luka Grbcic, Samuel Williams, Costin Iancu

Pith reviewed 2026-05-17 20:59 UTC · model grok-4.3

classification cs.MA · cs.AI · cs.DC

keywords PyTorch optimization · multi-agent systems · LLM code generation · GPU inference · KernelBench · performance tuning · automatic kernel optimization

The pith

Multi-agent LLM systems optimize PyTorch code for 2.88x faster inference than eager execution on H100 GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether teams of large language models can automatically rewrite and tune PyTorch programs to run faster on GPUs. It builds a comparison framework for different multi-agent setups and measures how choices like heavy exploitation of known tricks, error-fixing agents, and step granularity affect results. The strongest configuration beats both standard PyTorch eager mode and the built-in torch.compile tool by clear margins on a suite of realistic machine-learning tasks. If this approach holds, it could reduce the need for hand-written GPU kernels and let developers reach high performance without deep hardware expertise.

Core claim

Exploit-heavy strategies paired with dedicated error-fixing agents and finer-grained optimization steps produce the highest speedups; the best system delivers an average 2.88 times faster execution than PyTorch Eager and 1.85 times faster than torch.compile across the KernelBench suite on an H100 GPU.
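
The figure text excerpted on this page refers to geomean speedups. As a minimal sketch of how such a summary could be computed, the snippet below takes per-task runtimes and returns the geometric mean of the per-task speedups. The helper `geomean_speedup` and all timings are hypothetical illustrations, not taken from the paper or its repository.

```python
import math

def geomean_speedup(baseline_ms, optimized_ms):
    """Geometric mean of per-task speedups (baseline time / optimized time)."""
    if len(baseline_ms) != len(optimized_ms) or not baseline_ms:
        raise ValueError("need two non-empty timing lists of equal length")
    log_sum = sum(math.log(b / o) for b, o in zip(baseline_ms, optimized_ms))
    return math.exp(log_sum / len(baseline_ms))

# Made-up per-task runtimes in milliseconds (assumption, for illustration only).
eager_ms = [10.0, 8.0, 20.0]
compile_ms = [5.0, 8.0, 10.0]
agent_ms = [2.5, 4.0, 5.0]

vs_eager = geomean_speedup(eager_ms, agent_ms)
vs_compile = geomean_speedup(compile_ms, agent_ms)
compile_vs_eager = geomean_speedup(eager_ms, compile_ms)

print(f"vs eager: {vs_eager:.2f}x, vs compile: {vs_compile:.2f}x")
```

One property worth noting: geometric means of ratios compose, so on a shared task set `vs_eager / vs_compile` equals `compile_vs_eager`. If the headline figures are geometric means over the same tasks, they would imply torch.compile is roughly 2.88 / 1.85 ≈ 1.56 times faster than Eager; the abstract says only "average", so this is an inference rather than a reported number.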

What carries the argument

A logical comparison framework that ranks multi-agent PyTorch optimizers by strategy type (exploit versus explore) and optimization granularity, with error-fixing agents as a key supporting component.
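
To make the moving parts of such a framework concrete, here is a toy sketch of an optimization loop with mutation, an error-fixing step, and top-k selection. It is not the paper's implementation: the LLM calls are replaced by random perturbations, and every name, probability, and parameter (`mutate`, `try_fix`, `top_k`, the 30% breakage rate) is an assumption chosen for illustration.

```python
import random

def mutate(parent, rng):
    """Stand-in for an LLM optimization step: perturb runtime, sometimes break the code."""
    return {
        "runtime_ms": parent["runtime_ms"] * rng.uniform(0.7, 1.1),
        "correct": rng.random() > 0.3,  # assumed 30% chance the rewrite is broken
    }

def try_fix(candidate, rng, max_attempts=3):
    """Stand-in for an error-fixing agent: retry until correct or attempts run out."""
    attempts = 0
    while not candidate["correct"] and attempts < max_attempts:
        attempts += 1
        candidate = dict(candidate, correct=rng.random() > 0.5)
    return candidate

def optimize(initial_runtime_ms, rounds=5, children_per_parent=4, top_k=2, seed=0):
    rng = random.Random(seed)
    population = [{"runtime_ms": initial_runtime_ms, "correct": True}]
    for _ in range(rounds):
        children = [try_fix(mutate(p, rng), rng)
                    for p in population for _ in range(children_per_parent)]
        # Only correct candidates may survive. Parents stay eligible (elitism),
        # so the best runtime never regresses between rounds.
        pool = population + [c for c in children if c["correct"]]
        population = sorted(pool, key=lambda c: c["runtime_ms"])[:top_k]
    return population[0]

best = optimize(100.0)
print(f"best runtime: {best['runtime_ms']:.1f} ms")
```

In this toy setting, a small `top_k` concentrates the query budget on the current best candidates (more exploitation), while a larger `top_k` keeps more diverse parents alive (more exploration). How the paper parameterizes that trade-off is not spelled out on this page.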

If this is right

Exploit-focused strategies with error correction outperform explore-heavy or purely autonomous approaches.
Smaller, more granular optimization steps correlate with larger final speedups.
The generated kernels can match or exceed hand-tuned performance without manual intervention.
The same framework can rank future multi-agent designs for this task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Production inference pipelines could adopt these agents as a drop-in tuning layer before deployment.
The approach might extend to other frameworks like TensorFlow or JAX if similar agent roles are defined.
Improving the error-fixing agent alone could push speedups higher on harder tasks.

Load-bearing premise

LLM agents will reliably produce correct, bug-free optimized code that generalizes beyond the test suite to real workloads.

What would settle it

A KernelBench task where the multi-agent system outputs code that either runs slower than the eager baseline or produces wrong numerical results.
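
A sketch of what such a check might look like follows, using plain Python lists in place of tensors so that it runs without a GPU. The function name `is_counterexample` and the tolerance values are assumptions; a real harness would more likely compare tensors with something like `torch.allclose` and time kernels on the target device.

```python
import math

def is_counterexample(reference_out, candidate_out, eager_ms, candidate_ms,
                      rel_tol=1e-4, abs_tol=1e-6):
    """Return a reason string if the candidate contradicts the claim, else None.

    rel_tol and abs_tol are illustrative assumptions, not the paper's values.
    """
    if len(reference_out) != len(candidate_out):
        return "wrong output shape"
    for ref, got in zip(reference_out, candidate_out):
        if not math.isclose(ref, got, rel_tol=rel_tol, abs_tol=abs_tol):
            return "wrong numerical result"
    if candidate_ms > eager_ms:
        return "slower than eager baseline"
    return None

# Made-up outputs and timings for three hypothetical tasks.
ok = is_counterexample([1.0, 2.0], [1.0, 2.00001], eager_ms=10.0, candidate_ms=4.0)
wrong = is_counterexample([1.0], [1.1], eager_ms=10.0, candidate_ms=4.0)
slow = is_counterexample([1.0], [1.0], eager_ms=10.0, candidate_ms=12.0)
print(ok, wrong, slow)
```

Correctness is checked before speed on purpose: a fast kernel that returns wrong values should be reported as incorrect, not as a speedup.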

Figures

Figures reproduced from arXiv: 2511.16964 by Costin Iancu, Kirill Nagaitsev, Luka Grbcic, Samuel Williams.

**Figure 1.** Logical framework in which multi-agent systems operate, for the task of PyTorch inference optimization. Logical steps of the optimization process are shown above, and core options/parameters are shown below near the steps they relate to. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png]

**Figure 2.** PIKE-B implementation, illustrating parallel evaluation and error fixing, followed by top-k selection and mutation. Excerpt from the surrounding text: "… balances the trade-off between exploration and exploitation. By systematically addressing these configuration questions, we aim to enhance the adaptability and efficacy of our LLM-based approach in translating and optimizing PyTorch models. …"

**Figure 3.** Excerpt from the surrounding text: "Figure 3(b) illustrates the geomean speedup achieved by different PIKE configurations based on cost per task, as opposed to the number of LLM queries. This analysis highlights the tradeoffs between performance gains and cost, demonstrating how each configuration scales with budget constraints. Notably, the PIKE-B configuration with a cheap EFA (Gemini 2.5 Flash) provides the most significant speedup of 2.51 for …"

**Figure 4.** Excerpt from the surrounding text: "Figure 4 presents a comparative analysis of geomean speedup achieved by different PIKE configurations across the Level 5 benchmark suite, focusing on LLM queries per task (…" [PITH_FULL_IMAGE:figures/full_fig_p007_4.png]

**Figure 5.** Speedup relative to PyTorch Eager of our approaches (including extra ablations) and other approaches, using an H100. All of our approaches get a budget of exactly 300 LLM queries. Excerpt from the surrounding text: "However, the exploitation-tuned PIKE-O (mut,npar,1isl,EO,SL) achieves speedup of 2.47 with 300 queries per task budget, and 2.33 with $50 per task budget, reaching quite similar performance levels to PIKE-B. This is a notable imp…"

**Figure 6.** PIKE-B and PIKE-O Level 3-pike per-step analysis of (a) correctness attempt count and (b) LoC changed. Dashed lines indicate mean of means for each implementation. Panel titles: (a) Mean Error Fix Attempts, (b) Mean LoC Changed per Optimization Step; axis label: Tasks; legend: PIKE-B, PIKE-O (mut,npar,1isl). [PITH_FULL_IMAGE:figures/full_fig_p010_6.png]

**Figure 7.** PIKE-B and PIKE-O (mut,npar,1isl) Level 3-pike per-step analysis of (a) correctness attempt count and (b) LoC changed. Legend is shared between (a) and (b). Dashed lines indicate mean of means for each implementation. [PITH_FULL_IMAGE:figures/full_fig_p010_7.png]

Original abstract

Maximizing performance on available GPU hardware is an ongoing challenge for modern AI inference systems. Traditional approaches include writing custom GPU kernels and using specialized model compilers to tune high-level code for specific GPU targets. Recent work shows that LLM-based multi-agent systems can effectively perform such tuning, often outperforming existing compilers and eliminating the need for manual kernel development. However, the dynamics of multi-agent systems for this task remain unexplored. In this work, we present a logical framework for comparing multi-agent PyTorch optimization systems. Our evaluation shows that exploit-heavy strategies perform best when paired with error-fixing agents, and that performance correlates with the granularity of optimization steps. The best implementation achieves an average 2.88x speedup over PyTorch Eager (1.85x over torch.compile) on an H100 GPU across diverse tasks in KernelBench, a benchmark suite covering a range of machine learning architectures in PyTorch. Code is publicly available at: https://github.com/pike-project/pike

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gets 2.88x over Eager and 1.85x over torch.compile on KernelBench with an exploit-heavy multi-agent setup, but leaves the fraction of tasks that produce correct code unmeasured.

The main takeaway is that their best multi-agent LLM pipeline reaches 2.88 times the speed of plain PyTorch Eager and 1.85 times torch.compile on an H100 across KernelBench tasks. They also lay out a logical framework for comparing different multi-agent strategies and report that exploit-heavy approaches work best when paired with error-fixing agents, while finer optimization granularity tends to improve results. The code is released on GitHub, which lets others check the implementation directly.

What stands out is the targeted empirical comparison on a public benchmark that covers several machine learning architectures. This adds concrete observations about strategy choice and step size that build on earlier LLM multi-agent tuning papers without claiming a first-principles advance. The evaluation is straightforward and the numbers are given in the abstract, so a reader can see the claimed gains right away.

The soft spot is the missing data on reliability. The headline speedups only make sense if most tasks produce functionally correct kernels, yet there are no numbers on success rate per task, average number of LLM calls or retries, or variance across runs. Without error bars or a breakdown of how many attempts were filtered out, it is hard to know whether the averages reflect typical behavior or selected successes. Generalization beyond KernelBench to production workloads is also not tested.

This paper is for people working on automated kernel generation or LLM agents for performance tuning. A reader who wants practical ideas on agent strategies and granularity effects will get something useful from the comparisons. It deserves peer review because the results are concrete, the benchmark is public, and the code is available; referees can ask for the missing success-rate and statistical details.
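
The reliability numbers the note asks for could be tabulated along the following lines. This is a hedged sketch: the record layout (`correct`, `speedup`, `llm_calls`), the `summarize` helper, and the sample data are all hypothetical, and counting a failed task as a speedup of 1.0 assumes a deployment would fall back to the unoptimized baseline.

```python
import statistics

def summarize(results):
    """Summarize per-task results: dicts with 'correct', 'speedup', 'llm_calls'."""
    ok = [r for r in results if r["correct"]]
    if not results or not ok:
        raise ValueError("need at least one task and one correct result")
    speedups = [r["speedup"] for r in ok]
    # Assumption: a failed task falls back to the baseline, i.e. speedup 1.0.
    # This guards against averaging over successes only.
    with_fallback = [r["speedup"] if r["correct"] else 1.0 for r in results]
    return {
        "success_rate": len(ok) / len(results),
        "mean_speedup_successes_only": statistics.mean(speedups),
        "mean_speedup_with_fallback": statistics.mean(with_fallback),
        "stdev_speedup": statistics.stdev(speedups) if len(speedups) > 1 else 0.0,
        "mean_llm_calls": statistics.mean(r["llm_calls"] for r in results),
    }

# Made-up results for four hypothetical tasks.
tasks = [
    {"correct": True, "speedup": 3.0, "llm_calls": 120},
    {"correct": True, "speedup": 2.0, "llm_calls": 200},
    {"correct": False, "speedup": None, "llm_calls": 300},
    {"correct": True, "speedup": 4.0, "llm_calls": 80},
]
report = summarize(tasks)
for key, value in report.items():
    print(f"{key}: {value}")
```

The gap between the successes-only mean and the fallback mean (3.0 versus 2.5 in this made-up data) is exactly the kind of selection effect the note is worried about.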

Referee Report

2 major / 2 minor

Summary. The paper introduces a logical framework for comparing LLM-based multi-agent systems for optimizing PyTorch inference. It evaluates exploit-heavy strategies paired with error-fixing agents, reports that performance correlates with the granularity of optimization steps, and claims that the best implementation achieves an average 2.88× speedup over PyTorch Eager (1.85× over torch.compile) on an H100 GPU across diverse tasks in the KernelBench benchmark suite. Public code is released at https://github.com/pike-project/pike.

Significance. If the empirical claims hold after addressing the gaps below, the work would demonstrate that multi-agent LLM pipelines can automate GPU kernel tuning at a level competitive with or better than existing compilers, reducing reliance on manual kernel development. The public code release and use of a named public benchmark (KernelBench) are clear strengths that support reproducibility.

major comments (2)

[Abstract] The headline speedups (2.88× over Eager, 1.85× over torch.compile) are reported as averages without any accompanying data on (a) the fraction of KernelBench tasks for which the multi-agent pipeline produced functionally correct code, (b) the number of LLM calls or retries required per task, or (c) variance, error bars, or the number of runs averaged. These omissions are load-bearing because the reported means could reflect selective success rather than reliable optimization.
[Evaluation] No details are provided on how the torch.compile baseline was configured (e.g., mode, inductor settings), how functional equivalence was verified for generated kernels, or what statistical tests were used to establish that the observed speedups are significant across the diverse ML architectures in KernelBench.

minor comments (2)

[Framework] The logical framework for comparing multi-agent systems is introduced but would benefit from an explicit diagram or pseudocode showing agent roles, communication protocol, and decision points for exploit vs. explore strategies.
[Abstract] Ensure the public repository contains the exact scripts, prompts, and random seeds used to generate the reported KernelBench results so that independent verification is possible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and completeness.

Referee: [Abstract] The headline speedups (2.88× over Eager, 1.85× over torch.compile) are reported as averages without any accompanying data on (a) the fraction of KernelBench tasks for which the multi-agent pipeline produced functionally correct code, (b) the number of LLM calls or retries required per task, or (c) variance, error bars, or the number of runs averaged. These omissions are load-bearing because the reported means could reflect selective success rather than reliable optimization.

Authors: We agree that additional context on success rates, computational overhead, and variability would strengthen interpretation of the reported averages. The current manuscript focuses on the performance of successful optimizations but does not explicitly tabulate the fraction of KernelBench tasks yielding correct code, average LLM calls/retries, or run-to-run variance. In the revised version we will add a dedicated summary table in the Evaluation section reporting these metrics (including standard deviations where multiple runs were performed) so that readers can assess reliability directly.

revision: yes

Referee: [Evaluation] No details are provided on how the torch.compile baseline was configured (e.g., mode, inductor settings), how functional equivalence was verified for generated kernels, or what statistical tests were used to establish that the observed speedups are significant across the diverse ML architectures in KernelBench.

Authors: We will expand the Evaluation section to supply the requested details. The torch.compile baseline was run with the default mode and the inductor backend enabled; we will state this explicitly. Functional equivalence was checked by executing both the original PyTorch module and the generated kernel on identical test input tensors and confirming numerical agreement within a small tolerance (e.g., 1e-4 relative error). For statistical significance we will report results of a paired non-parametric test (Wilcoxon signed-rank) across the KernelBench tasks and include the associated p-values. These clarifications will be incorporated in the next revision.

revision: yes
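
For readers unfamiliar with the paired test named in the simulated rebuttal, the following is a minimal pure-Python sketch of a two-sided Wilcoxon signed-rank test using the normal approximation. The function `wilcoxon_signed_rank` and the runtimes are illustrative assumptions; in practice a library routine such as `scipy.stats.wilcoxon` would be the more usual choice, and nothing here reflects results from the paper.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.

    Zero differences are dropped and tied |differences| share average ranks.
    No tie or continuity correction is applied to the variance, so the
    p-value is approximate, especially for small samples.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, z, p_two_sided

# Made-up paired runtimes (ms) for eight hypothetical tasks.
baseline_ms = [10.0, 12.0, 9.0, 20.0, 15.0, 11.0, 30.0, 8.0]
agent_ms = [6.0, 7.0, 9.5, 11.0, 9.0, 10.0, 14.0, 5.0]
w_plus, z, p = wilcoxon_signed_rank(baseline_ms, agent_ms)
print(f"W+ = {w_plus}, z = {z:.2f}, p = {p:.3f}")
```

A paired test fits this setting because each task is measured under both systems, so per-task differences remove the large task-to-task variation in absolute runtime.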

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with released code

The paper reports experimental speedups from an LLM multi-agent PyTorch optimizer evaluated on the public KernelBench suite. No mathematical derivations, equations, fitted parameters, or self-citation chains for uniqueness theorems are present in the provided text. The central claims rest on measured performance against external baselines (PyTorch Eager and torch.compile) rather than any reduction of outputs to inputs by construction. This is the expected non-finding for a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central performance claims rest on standard assumptions about LLM agent reliability and benchmark representativeness rather than new axioms or invented entities; no free parameters or postulated objects are described in the abstract.

pith-pipeline@v0.9.0 · 5478 in / 1088 out tokens · 51549 ms · 2026-05-17T20:59:11.566462+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/BranchSelection · branch_selection · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "exploit-heavy strategies perform best when paired with error-fixing agents, and that performance correlates with the granularity of optimization steps"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
