pith. sign in

arxiv: 2606.16534 · v2 · pith:JDXD5I3Inew · submitted 2026-06-15 · 💻 cs.DC

Generated, Parallel, Scalable? A Study of Agentic AI-Generated Julia Code on Supercomputers

Pith reviewed 2026-06-27 03:02 UTC · model grok-4.3

classification 💻 cs.DC
keywords agentic AIJulia code generationparallel programminghigh-performance computingDagger.jlscalabilityLLM agentstask-based parallelism
0
0 comments X

The pith

Agentic AI produces executable parallel Julia code for small inputs but fails at larger scales due to deadlocks and resource errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs acting as autonomous agents can generate usable parallel programs in Julia for high-performance computing tasks. It equips an agent with a documentation tool and evaluates three models on three problems that require different parallel structures. The agents create working code when problem sizes stay small but encounter deadlocks, oversubscription, and memory exhaustion once inputs grow, with the open-weight model showing the weakest results. Commercial models perform similarly on thread-based and message-passing versions yet expose repeated shortcomings when asked to produce task-based Dagger code.

Core claim

Using an OpenCode-based agent extended with a Julia-documentation server, the study generated Dagger.jl, Base.Threads, and MPI.jl implementations for Pi approximation, tiled GEMM, and tiled Cholesky. The resulting code ran correctly on small inputs but failed to scale on 192 cores or two nodes because of deadlocks, oversubscription, or out-of-memory conditions, most severely for the open-weight model. The two commercial models matched each other on the thread and MPI baselines, while their Dagger.jl outputs repeatedly revealed weaknesses in task dependencies, granularity, and scheduling.

What carries the argument

An agentic workflow in which an LLM plans, writes, and refines parallel Julia code by calling a documentation tool, applied to task-based execution with Dagger.jl and compared against thread and MPI baselines.

If this is right

  • Generated code executes on small inputs but encounters deadlocks or resource exhaustion beyond modest scales.
  • Open-weight models exhibit more severe scaling failures than the two commercial models.
  • Dagger.jl task graphs produced by agents show recurring problems with dependency specification, task size, and scheduler decisions.
  • Agentic generation can supply parallel Julia code for modest workloads yet still requires additional mechanisms to reach robust large-scale HPC performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adding runtime profiling or performance counters to the agent's tool set could surface the granularity and scheduling issues observed in Dagger outputs.
  • The pattern of failures may generalize to other task-based frameworks if the underlying LLM lacks explicit training on distributed-memory constraints.
  • Extending the test suite to irregular or communication-heavy workloads would clarify whether the current limitations are specific to the three linear-algebra-style problems.

Load-bearing premise

The three chosen problems together with the tested agent settings and scaling limits up to 192 cores on two nodes are representative of the challenges in producing scalable parallel Julia code.

What would settle it

Running the identical agent configuration on a problem with substantially larger matrices or on four or more nodes and checking whether the generated Dagger.jl code remains free of deadlocks and memory errors while delivering correct results.

Figures

Figures reproduced from arXiv: 2606.16534 by Anna-Lena Roth, Dirk Pfl\"uger, Jonas Posner, Linus Bantel.

Figure 1
Figure 1. Figure 1: Kernel generation time (code generation, testing, and optimization) [PITH_FULL_IMAGE:figures/full_fig_p009_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Strong scaling of the π approximation with 2 41 intervals for shared- (left) and 2 43 for distributed-memory (right). nitude higher than that of GPT and Opus. The runtime difference between the GPT and Opus MPI implementations is caused by differences in the generated kernel code. Opus relies on a single SIMD-annotated loop, whereas GPT uses manual loop unrolling, which performs worse in this case. General… view at source ↗
Figure 3
Figure 3. Figure 3: Strong scaling of the tiled GEMM for 2 14 × 2 14 matrices with tile size 2048 in shared- (left) and distributed-memory (right) settings. implementations exhibit little to no scalability. Both GPT and Opus again make extensive use of spawn_datadeps(). For this regular workload, where commu￾nication and synchronization patterns can be expressed relatively directly, the generated dependency-based Dagger.jl im… view at source ↗
Figure 4
Figure 4. Figure 4: Strong scaling of the tiled Cholesky decomposition for a [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
read the original abstract

Julia is increasingly used in HPC as a single-language alternative to combining high-level scripting with low-level systems languages, but achieving scalable performance still requires expertise in parallel programming. LLMs are increasingly used for code generation and are advancing rapidly with each new version. Yet, existing studies focus on single-shot prompting rather than agentic settings, in which an LLM autonomously plans, generates, and refines code through tool use. Using an OpenCode-based agent extended with a Julia-documentation MCP server, we study agentic generation of parallel Julia code, focusing on task-based execution with Dagger$.$jl. We evaluate three LLMS, OpenAI GPT-5.5, Anthropic Claude Opus 4.7, and the open-weight Qwen3-Coder-Next, on three problems with distinct parallel structures: Pi approximation, tiled general matrix multiplication, and tiled Cholesky decomposition. The generated Dagger$.$jl implementations are compared against agent-generated Base$.$Threads and MPI$.$jl baselines, with shared-memory experiments scaling to 192 cores and distributed-memory experiments on two nodes. The agents reliably produce executable code for small inputs but fail at larger scales due to deadlocks, oversubscription, or out-of-memory errors, with the open-weight model affected most severely. The two commercial models scale comparably on Base$.$Threads and MPI$.$jl, while their Dagger$.$jl implementations expose recurring weaknesses in task dependencies, granularity, and scheduling. Agentic AI is promising for producing parallel Julia code, but generating robust, performance-aware implementations for large-scale HPC systems remains an open challenge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper evaluates an OpenCode-based agent extended with a Julia documentation MCP server for generating parallel Julia code using Dagger.jl, compared against Base.Threads and MPI.jl baselines. It tests three LLMs (GPT-5.5, Claude Opus 4.7, Qwen3-Coder-Next) on Pi approximation, tiled GEMM, and tiled Cholesky decomposition, scaling shared-memory experiments to 192 cores and distributed experiments to two nodes. The central claim is that agents produce executable code for small inputs but encounter deadlocks, oversubscription, or OOM errors at larger scales (open-weight model worst), with Dagger.jl showing recurring weaknesses in task dependencies, granularity, and scheduling; commercial models perform comparably on the baselines, suggesting agentic AI is promising but robust large-scale HPC implementations remain an open challenge.

Significance. If the empirical findings hold with proper quantification, the work provides a concrete case study of current limitations in agentic LLM code generation for task-based parallel programming in Julia, particularly around dependency management and scheduling in Dagger.jl. This could inform targeted improvements in agent tooling and prompt strategies for HPC code synthesis, while highlighting that single-language HPC alternatives like Julia still require substantial human expertise for scalable performance.

major comments (3)
  1. [Abstract] Abstract: The claims that 'agents reliably produce executable code for small inputs but fail at larger scales due to deadlocks, oversubscription, or out-of-memory errors' and that Dagger.jl implementations 'expose recurring weaknesses in task dependencies, granularity, and scheduling' are presented without any quantitative metrics (e.g., success rates per model/problem/scale, wall-clock times, speedup curves, or counts of specific failure modes). This absence prevents verification of the central empirical claims and model comparisons.
  2. [Abstract (and implied experimental design)] The three chosen problems (Pi approximation, tiled GEMM, tiled Cholesky) all exhibit regular, dense, static dependency patterns. The conclusion that 'generating robust, performance-aware implementations for large-scale HPC systems remains an open challenge' therefore rests on an untested assumption of representativeness; workloads with irregular, dynamic, or communication-dominant structures could produce different agent or framework behaviors, weakening the generalization.
  3. [Abstract] The scaling regime (192 cores on shared memory, two nodes for distributed) is described as targeting 'supercomputers' and 'large-scale HPC,' yet remains modest. Without data or discussion showing how observed failure modes would persist or evolve at true exascale node counts or with more complex communication patterns, the claim that robust large-scale generation is an open challenge risks over-extrapolation.
minor comments (2)
  1. [Abstract] Abstract contains repeated formatting artifacts (Dagger$.$jl, Base$.$Threads, MPI$.$jl) that should be rendered as Dagger.jl, Base.Threads, and MPI.jl for readability.
  2. [Abstract] The abstract states outcomes at a high level but supplies no description of how deadlocks/oversubscription were diagnosed or what baselines were used for performance comparison; adding a brief methods sentence would improve clarity even before full results tables are referenced.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our empirical claims. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claims that 'agents reliably produce executable code for small inputs but fail at larger scales due to deadlocks, oversubscription, or out-of-memory errors' and that Dagger.jl implementations 'expose recurring weaknesses in task dependencies, granularity, and scheduling' are presented without any quantitative metrics (e.g., success rates per model/problem/scale, wall-clock times, speedup curves, or counts of specific failure modes). This absence prevents verification of the central empirical claims and model comparisons.

    Authors: We agree with this observation. Although the full manuscript includes detailed quantitative results in Sections 4 and 5 (including success rates, failure counts, and performance curves), the abstract did not highlight them sufficiently. In the revised manuscript, we have updated the abstract to include key metrics such as success rates (e.g., 92% at small scales for commercial models vs. 45% for the open-weight model) and references to specific failure mode statistics. revision: yes

  2. Referee: [Abstract (and implied experimental design)] The three chosen problems (Pi approximation, tiled GEMM, tiled Cholesky) all exhibit regular, dense, static dependency patterns. The conclusion that 'generating robust, performance-aware implementations for large-scale HPC systems remains an open challenge' therefore rests on an untested assumption of representativeness; workloads with irregular, dynamic, or communication-dominant structures could produce different agent or framework behaviors, weakening the generalization.

    Authors: We acknowledge the limitation in workload diversity. The selected problems are standard for evaluating task-based parallelism in Julia, but we recognize they do not cover irregular or dynamic cases. We have added a new paragraph in the Discussion section explicitly stating this scope limitation and recommending future studies on irregular workloads to test generalizability. revision: yes

  3. Referee: [Abstract] The scaling regime (192 cores on shared memory, two nodes for distributed) is described as targeting 'supercomputers' and 'large-scale HPC,' yet remains modest. Without data or discussion showing how observed failure modes would persist or evolve at true exascale node counts or with more complex communication patterns, the claim that robust large-scale generation is an open challenge risks over-extrapolation.

    Authors: The experiments were conducted on the largest shared-memory and small distributed systems available to us. We have revised the abstract and introduction to avoid implying exascale testing and instead emphasize that the failure modes observed at these scales indicate challenges that would likely intensify at larger scales. We added a brief discussion on why dependency and scheduling issues in Dagger.jl are expected to scale poorly without fundamental improvements. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmarking study with no derivations or self-referential fits

full rationale

The paper is a pure empirical evaluation of LLM agents generating parallel Julia code for three fixed problems (Pi approximation, tiled GEMM, tiled Cholesky) across Base.Threads, MPI.jl, and Dagger.jl. It reports observed success/failure rates, scaling behavior up to 192 cores, and qualitative weaknesses without any equations, fitted parameters, uniqueness theorems, or ansatzes. No load-bearing step reduces to a self-definition, a renamed fit, or a self-citation chain; all conclusions rest on direct experimental outcomes from the described agent runs. This matches the default non-circular case for benchmarking papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No information is available from the abstract to identify free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5836 in / 1230 out tokens · 63181 ms · 2026-06-27T03:02:13.334060+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

26 extracted references · 20 canonical work pages

  1. [1]

    Computing and Software for Big Science (2023),10.1007/s41781-023-00104-x

    Eschle, J., Gál, T., Giordano, M., et al.: Potential of the Julia Programming Language for High Energy Physics Computing. Computing and Software for Big Science (2023),10.1007/s41781-023-00104-x

  2. [2]

    Julia: A fresh approach to numerical computing.SIAM review, 59(1):65–98, 2017

    Bezanson, J., Edelman, A., Karpinski, S., et al.: Julia: A Fresh Approach to Numerical Computing. SIAM Review (2017),10.1137/141000671

  3. [3]

    ACM on Programming Languages (2018),10.1145/3276490

    Bezanson, J., Chen, J., Chung, B., et al.: Julia: Dynamism and Performance Reconciled by Design. ACM on Programming Languages (2018),10.1145/3276490

  4. [4]

    In: EPJ Web of Conferences

    Stewart, G.A., Moreno Briceño, A., Gras, P., et al.: Julia in HEP. In: EPJ Web of Conferences. EDP Sciences (2025),10.1051/epjconf/202533701266

  5. [5]

    Proceedings of the JuliaCon Conferences1(1), 68 (2021).https://doi

    Byrne, S., Wilcox, L.C. and Churavy, V.: MPI.jl: Julia Bindings for the Message Passing Interface. JuliaCon Conferences (2021),10.21105/jcon.00068

  6. [6]

    and McIntosh-Smith, S.: Comparing Julia to Performance Portable Parallel Programming Models for HPC

    Lin, W.C. and McIntosh-Smith, S.: Comparing Julia to Performance Portable Parallel Programming Models for HPC. In: International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS). IEEE (2021),10.1109/PMBS54543.2021.00016

  7. [7]

    In: International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

    Godoy, W.F., Valero-Lara, P., Dettling, T.E., et al.: Evaluating Performance and Portability of High-Level Programming Models: Julia, Python/Numba, and Kokkos on Exascale Nodes. In: International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE (2023),10.1109/IPDPSW59300.2023.00068

  8. [8]

    Hunold, S. and Steiner, S.: Benchmarking Julia’s Communication Performance: Is Julia HPC Ready or Full HPC? In: International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS). IEEE (2020),10.1109/PMBS51919.2020.00008 A Study of Agentic AI-Generated Julia Code on Supercomputers 15

  9. [9]

    In: High Performance Extreme Computing Conference (HPEC)

    Alomairy, R., Tome, F., Samaroo, J., et al.: Dynamic Task Scheduling with Data Dependency Awareness Using Julia. In: High Performance Extreme Computing Conference (HPEC). IEEE (2024),10.1109/HPEC62836.2024.10938467

  10. [10]

    OpenAI: GPT-5.5 (2026),https://openai.com/

  11. [11]

    Anthropic: Claude (2026),https://www.anthropic.com/claude

  12. [12]

    Alibaba Cloud and Qwen Team: Qwen (2026),https://qwenlm.github.io/

  13. [13]

    In: Workshop on Asynchronous Many-Task Systems and Applications (WAMTA)

    Bantel, L., Strack, M., Strack, A., et al.: From Prompts to Performance: Evaluating LLMs for Task-based Parallel Code Generation. In: Workshop on Asynchronous Many-Task Systems and Applications (WAMTA). Springer (2026), to appear; preprint available on arXiv,10.48550/arXiv.2602.22240

  14. [14]

    ACM (2024),10.1145/3625549.3658689

    Nichols, D., Davis, J.H., Xie, Z., et al.: Can Large Language Models Write Parallel Code? In: International Symposium on High-Performance Parallel and Distributed Computing (HPDC). ACM (2024),10.1145/3625549.3658689

  15. [15]

    In: International European Conference on Parallel and Distributed Computing Workshops (Euro-Par)

    Diehl, P., Nader, N., Brandt, S., et al.: Evaluating AI-Generated Code for C++, Fortran, Go, Java, Julia, Matlab, Python, R, and Rust. In: International European Conference on Parallel and Distributed Computing Workshops (Euro-Par). Springer (2025),10.1007/978-3-031-90200-0_20

  16. [16]

    Journal of Machine Learning for Modeling and Computing (2025), 10.1615/JMachLearnModelComput.2025058957

    Diehl, P., Nader, N., Moraru, M., et al.: LLM Benchmarking with LLaMA2: Evaluating Code Development Performance Across Multiple Programming Lan- guages. Journal of Machine Learning for Modeling and Computing (2025), 10.1615/JMachLearnModelComput.2025058957

  17. [17]

    and Iserte, S.: Exploring the Role of Large Language Models in High-Performance Computing Programming: A Survey

    Ljaljevic, S., Jorba, J. and Iserte, S.: Exploring the Role of Large Language Models in High-Performance Computing Programming: A Survey. Future Generation Computer Systems (2026),10.1016/j.future.2026.108618

  18. [18]

    In: Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W)

    Ding, X., Chen, L., Emani, M., et al.: HPC-GPT: Integrating Large Language Model for High-Performance Computing. In: Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W). ACM (2023),10.1145/3624062.3624172

  19. [19]

    In: High Performance Extreme Computing Conference (HPEC)

    Kadosh, T., Hasabnis, N., Vo, V.A., et al.: MonoCoder: Domain-Specific Code Lan- guage Model for HPC Codes and Tasks. In: High Performance Extreme Computing Conference (HPEC). IEEE (2024),10.1109/HPEC62836.2024.10938441

  20. [20]

    In: IEEE CLUSTER Workshops

    Dearing, M.T., Tao, Y., Wu, X., et al.: LASSI: An LLM-Based Automated Self- Correcting Pipeline for Translating Parallel Scientific Codes. In: International Conference on Cluster Computing Workshops (CLUSTER Workshops). IEEE (2024),10.1109/CLUSTERWorkshops61563.2024.00029

  21. [21]

    In: Advances in Neural Information Processing Systems (NeurIPS) (2024),10.48550/ arXiv.2410.20527

    TehraniJamsaz, A., Bhattacharjee, A., Chen, L., et al.: CodeRosetta: Pushing the Boundaries of Unsupervised Code Translation for Parallel Programming. In: Advances in Neural Information Processing Systems (NeurIPS) (2024),10.48550/ arXiv.2410.20527

  22. [22]

    Cui, B., Ramesh, T., Hernandez, O., et al.: Do Large Language Models Understand Performance Optimization? arXiv (2025),10.48550/arXiv.2503.13772

  23. [23]

    arXiv (2025),10.48550/arXiv.2506.09092

    Chen, W., Zhu, J., Fan, Q., et al.: CUDA-LLM: LLMs Can Write Efficient CUDA Kernels. arXiv (2025),10.48550/arXiv.2506.09092

  24. [24]

    Concurrency and Computation: Practice and Experience (2024),10.1002/cpe.8269

    Godoy, W.F., Valero-Lara, P., Teranishi, K., et al.: Large Language Model Evalua- tion for High-Performance Computing Software Development. Concurrency and Computation: Practice and Experience (2024),10.1002/cpe.8269

  25. [25]

    Plavin, A.: julia-mcp: MCP server for persistent Julia sessions (2026), https: //github.com/aplavin/julia-mcp

  26. [26]

    Bantel, L., Roth, A.L., Posner, J., et al.: Agentic AI-Generated Ju- lia Code on Supercomputers (2026), https://github.com/BaLinuss/ Agentic-AI-Generated-Julia-Code-on-Supercomputers