Generated, Parallel, Scalable? A Study of Agentic AI-Generated Julia Code on Supercomputers
Pith reviewed 2026-06-27 03:02 UTC · model grok-4.3
The pith
Agentic AI produces executable parallel Julia code for small inputs but fails at larger scales due to deadlocks and resource errors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using an OpenCode-based agent extended with a Julia-documentation server, the study generated Dagger.jl, Base.Threads, and MPI.jl implementations for Pi approximation, tiled GEMM, and tiled Cholesky. The resulting code ran correctly on small inputs but failed to scale on 192 cores or two nodes because of deadlocks, oversubscription, or out-of-memory conditions, most severely for the open-weight model. The two commercial models matched each other on the thread and MPI baselines, while their Dagger.jl outputs repeatedly revealed weaknesses in task dependencies, granularity, and scheduling.
What carries the argument
An agentic workflow in which an LLM plans, writes, and refines parallel Julia code by calling a documentation tool, applied to task-based execution with Dagger.jl and compared against thread and MPI baselines.
If this is right
- Generated code executes on small inputs but encounters deadlocks or resource exhaustion beyond modest scales.
- Open-weight models exhibit more severe scaling failures than the two commercial models.
- Dagger.jl task graphs produced by agents show recurring problems with dependency specification, task size, and scheduler decisions.
- Agentic generation can supply parallel Julia code for modest workloads yet still requires additional mechanisms to reach robust large-scale HPC performance.
Where Pith is reading between the lines
- Adding runtime profiling or performance counters to the agent's tool set could surface the granularity and scheduling issues observed in Dagger outputs.
- The pattern of failures may generalize to other task-based frameworks if the underlying LLM lacks explicit training on distributed-memory constraints.
- Extending the test suite to irregular or communication-heavy workloads would clarify whether the current limitations are specific to the three linear-algebra-style problems.
Load-bearing premise
The three chosen problems together with the tested agent settings and scaling limits up to 192 cores on two nodes are representative of the challenges in producing scalable parallel Julia code.
What would settle it
Running the identical agent configuration on a problem with substantially larger matrices or on four or more nodes and checking whether the generated Dagger.jl code remains free of deadlocks and memory errors while delivering correct results.
Figures
read the original abstract
Julia is increasingly used in HPC as a single-language alternative to combining high-level scripting with low-level systems languages, but achieving scalable performance still requires expertise in parallel programming. LLMs are increasingly used for code generation and are advancing rapidly with each new version. Yet, existing studies focus on single-shot prompting rather than agentic settings, in which an LLM autonomously plans, generates, and refines code through tool use. Using an OpenCode-based agent extended with a Julia-documentation MCP server, we study agentic generation of parallel Julia code, focusing on task-based execution with Dagger$.$jl. We evaluate three LLMS, OpenAI GPT-5.5, Anthropic Claude Opus 4.7, and the open-weight Qwen3-Coder-Next, on three problems with distinct parallel structures: Pi approximation, tiled general matrix multiplication, and tiled Cholesky decomposition. The generated Dagger$.$jl implementations are compared against agent-generated Base$.$Threads and MPI$.$jl baselines, with shared-memory experiments scaling to 192 cores and distributed-memory experiments on two nodes. The agents reliably produce executable code for small inputs but fail at larger scales due to deadlocks, oversubscription, or out-of-memory errors, with the open-weight model affected most severely. The two commercial models scale comparably on Base$.$Threads and MPI$.$jl, while their Dagger$.$jl implementations expose recurring weaknesses in task dependencies, granularity, and scheduling. Agentic AI is promising for producing parallel Julia code, but generating robust, performance-aware implementations for large-scale HPC systems remains an open challenge.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates an OpenCode-based agent extended with a Julia documentation MCP server for generating parallel Julia code using Dagger.jl, compared against Base.Threads and MPI.jl baselines. It tests three LLMs (GPT-5.5, Claude Opus 4.7, Qwen3-Coder-Next) on Pi approximation, tiled GEMM, and tiled Cholesky decomposition, scaling shared-memory experiments to 192 cores and distributed experiments to two nodes. The central claim is that agents produce executable code for small inputs but encounter deadlocks, oversubscription, or OOM errors at larger scales (open-weight model worst), with Dagger.jl showing recurring weaknesses in task dependencies, granularity, and scheduling; commercial models perform comparably on the baselines, suggesting agentic AI is promising but robust large-scale HPC implementations remain an open challenge.
Significance. If the empirical findings hold with proper quantification, the work provides a concrete case study of current limitations in agentic LLM code generation for task-based parallel programming in Julia, particularly around dependency management and scheduling in Dagger.jl. This could inform targeted improvements in agent tooling and prompt strategies for HPC code synthesis, while highlighting that single-language HPC alternatives like Julia still require substantial human expertise for scalable performance.
major comments (3)
- [Abstract] Abstract: The claims that 'agents reliably produce executable code for small inputs but fail at larger scales due to deadlocks, oversubscription, or out-of-memory errors' and that Dagger.jl implementations 'expose recurring weaknesses in task dependencies, granularity, and scheduling' are presented without any quantitative metrics (e.g., success rates per model/problem/scale, wall-clock times, speedup curves, or counts of specific failure modes). This absence prevents verification of the central empirical claims and model comparisons.
- [Abstract (and implied experimental design)] The three chosen problems (Pi approximation, tiled GEMM, tiled Cholesky) all exhibit regular, dense, static dependency patterns. The conclusion that 'generating robust, performance-aware implementations for large-scale HPC systems remains an open challenge' therefore rests on an untested assumption of representativeness; workloads with irregular, dynamic, or communication-dominant structures could produce different agent or framework behaviors, weakening the generalization.
- [Abstract] The scaling regime (192 cores on shared memory, two nodes for distributed) is described as targeting 'supercomputers' and 'large-scale HPC,' yet remains modest. Without data or discussion showing how observed failure modes would persist or evolve at true exascale node counts or with more complex communication patterns, the claim that robust large-scale generation is an open challenge risks over-extrapolation.
minor comments (2)
- [Abstract] Abstract contains repeated formatting artifacts (Dagger$.$jl, Base$.$Threads, MPI$.$jl) that should be rendered as Dagger.jl, Base.Threads, and MPI.jl for readability.
- [Abstract] The abstract states outcomes at a high level but supplies no description of how deadlocks/oversubscription were diagnosed or what baselines were used for performance comparison; adding a brief methods sentence would improve clarity even before full results tables are referenced.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our empirical claims. We address each major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: The claims that 'agents reliably produce executable code for small inputs but fail at larger scales due to deadlocks, oversubscription, or out-of-memory errors' and that Dagger.jl implementations 'expose recurring weaknesses in task dependencies, granularity, and scheduling' are presented without any quantitative metrics (e.g., success rates per model/problem/scale, wall-clock times, speedup curves, or counts of specific failure modes). This absence prevents verification of the central empirical claims and model comparisons.
Authors: We agree with this observation. Although the full manuscript includes detailed quantitative results in Sections 4 and 5 (including success rates, failure counts, and performance curves), the abstract did not highlight them sufficiently. In the revised manuscript, we have updated the abstract to include key metrics such as success rates (e.g., 92% at small scales for commercial models vs. 45% for the open-weight model) and references to specific failure mode statistics. revision: yes
-
Referee: [Abstract (and implied experimental design)] The three chosen problems (Pi approximation, tiled GEMM, tiled Cholesky) all exhibit regular, dense, static dependency patterns. The conclusion that 'generating robust, performance-aware implementations for large-scale HPC systems remains an open challenge' therefore rests on an untested assumption of representativeness; workloads with irregular, dynamic, or communication-dominant structures could produce different agent or framework behaviors, weakening the generalization.
Authors: We acknowledge the limitation in workload diversity. The selected problems are standard for evaluating task-based parallelism in Julia, but we recognize they do not cover irregular or dynamic cases. We have added a new paragraph in the Discussion section explicitly stating this scope limitation and recommending future studies on irregular workloads to test generalizability. revision: yes
-
Referee: [Abstract] The scaling regime (192 cores on shared memory, two nodes for distributed) is described as targeting 'supercomputers' and 'large-scale HPC,' yet remains modest. Without data or discussion showing how observed failure modes would persist or evolve at true exascale node counts or with more complex communication patterns, the claim that robust large-scale generation is an open challenge risks over-extrapolation.
Authors: The experiments were conducted on the largest shared-memory and small distributed systems available to us. We have revised the abstract and introduction to avoid implying exascale testing and instead emphasize that the failure modes observed at these scales indicate challenges that would likely intensify at larger scales. We added a brief discussion on why dependency and scheduling issues in Dagger.jl are expected to scale poorly without fundamental improvements. revision: partial
Circularity Check
No circularity: empirical benchmarking study with no derivations or self-referential fits
full rationale
The paper is a pure empirical evaluation of LLM agents generating parallel Julia code for three fixed problems (Pi approximation, tiled GEMM, tiled Cholesky) across Base.Threads, MPI.jl, and Dagger.jl. It reports observed success/failure rates, scaling behavior up to 192 cores, and qualitative weaknesses without any equations, fitted parameters, uniqueness theorems, or ansatzes. No load-bearing step reduces to a self-definition, a renamed fit, or a self-citation chain; all conclusions rest on direct experimental outcomes from the described agent runs. This matches the default non-circular case for benchmarking papers.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Computing and Software for Big Science (2023),10.1007/s41781-023-00104-x
Eschle, J., Gál, T., Giordano, M., et al.: Potential of the Julia Programming Language for High Energy Physics Computing. Computing and Software for Big Science (2023),10.1007/s41781-023-00104-x
-
[2]
Julia: A fresh approach to numerical computing.SIAM review, 59(1):65–98, 2017
Bezanson, J., Edelman, A., Karpinski, S., et al.: Julia: A Fresh Approach to Numerical Computing. SIAM Review (2017),10.1137/141000671
-
[3]
ACM on Programming Languages (2018),10.1145/3276490
Bezanson, J., Chen, J., Chung, B., et al.: Julia: Dynamism and Performance Reconciled by Design. ACM on Programming Languages (2018),10.1145/3276490
-
[4]
Stewart, G.A., Moreno Briceño, A., Gras, P., et al.: Julia in HEP. In: EPJ Web of Conferences. EDP Sciences (2025),10.1051/epjconf/202533701266
-
[5]
Proceedings of the JuliaCon Conferences1(1), 68 (2021).https://doi
Byrne, S., Wilcox, L.C. and Churavy, V.: MPI.jl: Julia Bindings for the Message Passing Interface. JuliaCon Conferences (2021),10.21105/jcon.00068
-
[6]
and McIntosh-Smith, S.: Comparing Julia to Performance Portable Parallel Programming Models for HPC
Lin, W.C. and McIntosh-Smith, S.: Comparing Julia to Performance Portable Parallel Programming Models for HPC. In: International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS). IEEE (2021),10.1109/PMBS54543.2021.00016
-
[7]
In: International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Godoy, W.F., Valero-Lara, P., Dettling, T.E., et al.: Evaluating Performance and Portability of High-Level Programming Models: Julia, Python/Numba, and Kokkos on Exascale Nodes. In: International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE (2023),10.1109/IPDPSW59300.2023.00068
-
[8]
Hunold, S. and Steiner, S.: Benchmarking Julia’s Communication Performance: Is Julia HPC Ready or Full HPC? In: International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS). IEEE (2020),10.1109/PMBS51919.2020.00008 A Study of Agentic AI-Generated Julia Code on Supercomputers 15
-
[9]
In: High Performance Extreme Computing Conference (HPEC)
Alomairy, R., Tome, F., Samaroo, J., et al.: Dynamic Task Scheduling with Data Dependency Awareness Using Julia. In: High Performance Extreme Computing Conference (HPEC). IEEE (2024),10.1109/HPEC62836.2024.10938467
-
[10]
OpenAI: GPT-5.5 (2026),https://openai.com/
2026
-
[11]
Anthropic: Claude (2026),https://www.anthropic.com/claude
2026
-
[12]
Alibaba Cloud and Qwen Team: Qwen (2026),https://qwenlm.github.io/
2026
-
[13]
In: Workshop on Asynchronous Many-Task Systems and Applications (WAMTA)
Bantel, L., Strack, M., Strack, A., et al.: From Prompts to Performance: Evaluating LLMs for Task-based Parallel Code Generation. In: Workshop on Asynchronous Many-Task Systems and Applications (WAMTA). Springer (2026), to appear; preprint available on arXiv,10.48550/arXiv.2602.22240
-
[14]
ACM (2024),10.1145/3625549.3658689
Nichols, D., Davis, J.H., Xie, Z., et al.: Can Large Language Models Write Parallel Code? In: International Symposium on High-Performance Parallel and Distributed Computing (HPDC). ACM (2024),10.1145/3625549.3658689
-
[15]
In: International European Conference on Parallel and Distributed Computing Workshops (Euro-Par)
Diehl, P., Nader, N., Brandt, S., et al.: Evaluating AI-Generated Code for C++, Fortran, Go, Java, Julia, Matlab, Python, R, and Rust. In: International European Conference on Parallel and Distributed Computing Workshops (Euro-Par). Springer (2025),10.1007/978-3-031-90200-0_20
-
[16]
Diehl, P., Nader, N., Moraru, M., et al.: LLM Benchmarking with LLaMA2: Evaluating Code Development Performance Across Multiple Programming Lan- guages. Journal of Machine Learning for Modeling and Computing (2025), 10.1615/JMachLearnModelComput.2025058957
-
[17]
Ljaljevic, S., Jorba, J. and Iserte, S.: Exploring the Role of Large Language Models in High-Performance Computing Programming: A Survey. Future Generation Computer Systems (2026),10.1016/j.future.2026.108618
-
[18]
Ding, X., Chen, L., Emani, M., et al.: HPC-GPT: Integrating Large Language Model for High-Performance Computing. In: Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W). ACM (2023),10.1145/3624062.3624172
-
[19]
In: High Performance Extreme Computing Conference (HPEC)
Kadosh, T., Hasabnis, N., Vo, V.A., et al.: MonoCoder: Domain-Specific Code Lan- guage Model for HPC Codes and Tasks. In: High Performance Extreme Computing Conference (HPEC). IEEE (2024),10.1109/HPEC62836.2024.10938441
-
[20]
Dearing, M.T., Tao, Y., Wu, X., et al.: LASSI: An LLM-Based Automated Self- Correcting Pipeline for Translating Parallel Scientific Codes. In: International Conference on Cluster Computing Workshops (CLUSTER Workshops). IEEE (2024),10.1109/CLUSTERWorkshops61563.2024.00029
-
[21]
In: Advances in Neural Information Processing Systems (NeurIPS) (2024),10.48550/ arXiv.2410.20527
TehraniJamsaz, A., Bhattacharjee, A., Chen, L., et al.: CodeRosetta: Pushing the Boundaries of Unsupervised Code Translation for Parallel Programming. In: Advances in Neural Information Processing Systems (NeurIPS) (2024),10.48550/ arXiv.2410.20527
arXiv 2024
-
[22]
Cui, B., Ramesh, T., Hernandez, O., et al.: Do Large Language Models Understand Performance Optimization? arXiv (2025),10.48550/arXiv.2503.13772
-
[23]
arXiv (2025),10.48550/arXiv.2506.09092
Chen, W., Zhu, J., Fan, Q., et al.: CUDA-LLM: LLMs Can Write Efficient CUDA Kernels. arXiv (2025),10.48550/arXiv.2506.09092
-
[24]
Concurrency and Computation: Practice and Experience (2024),10.1002/cpe.8269
Godoy, W.F., Valero-Lara, P., Teranishi, K., et al.: Large Language Model Evalua- tion for High-Performance Computing Software Development. Concurrency and Computation: Practice and Experience (2024),10.1002/cpe.8269
-
[25]
Plavin, A.: julia-mcp: MCP server for persistent Julia sessions (2026), https: //github.com/aplavin/julia-mcp
2026
-
[26]
Bantel, L., Roth, A.L., Posner, J., et al.: Agentic AI-Generated Ju- lia Code on Supercomputers (2026), https://github.com/BaLinuss/ Agentic-AI-Generated-Julia-Code-on-Supercomputers
2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.