Generated, Parallel, Scalable? A Study of Agentic AI-Generated Julia Code on Supercomputers

Anna-Lena Roth; Dirk Pflüger; Jonas Posner; Linus Bantel

arxiv: 2606.16534 · v2 · pith:JDXD5I3Inew · submitted 2026-06-15 · 💻 cs.DC

Pith reviewed 2026-06-27 03:02 UTC · model grok-4.3

keywords: agentic AI · Julia code generation · parallel programming · high-performance computing · Dagger.jl · scalability · LLM agents · task-based parallelism

The pith

Agentic AI produces executable parallel Julia code for small inputs but fails at larger scales due to deadlocks and resource errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs acting as autonomous agents can generate usable parallel programs in Julia for high-performance computing tasks. It equips an agent with a documentation tool and evaluates three models on three problems that require different parallel structures. The agents create working code when problem sizes stay small but encounter deadlocks, oversubscription, and memory exhaustion once inputs grow, with the open-weight model showing the weakest results. Commercial models perform similarly on thread-based and message-passing versions yet expose repeated shortcomings when asked to produce task-based Dagger code.

Core claim

Using an OpenCode-based agent extended with a Julia-documentation server, the study generated Dagger.jl, Base.Threads, and MPI.jl implementations for Pi approximation, tiled GEMM, and tiled Cholesky. The resulting code ran correctly on small inputs but failed to scale on 192 cores or two nodes because of deadlocks, oversubscription, or out-of-memory conditions, most severely for the open-weight model. The two commercial models matched each other on the thread and MPI baselines, while their Dagger.jl outputs repeatedly revealed weaknesses in task dependencies, granularity, and scheduling.
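
To make the first benchmark concrete, the following is a minimal sketch of a chunked Pi approximation. It is written in Python rather than Julia, it is not code produced by any of the evaluated agents, and it assumes the benchmark integrates 4/(1+x²) over [0, 1] with the midpoint rule (the source only states that the problem is split into intervals). The names `partial_sum` and `approx_pi` are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def partial_sum(start: int, stop: int, n: int) -> float:
    """Midpoint-rule contribution of intervals [start, stop) out of n in total."""
    h = 1.0 / n
    total = 0.0
    for i in range(start, stop):
        x = (i + 0.5) * h          # midpoint of interval i (assumed quadrature rule)
        total += 4.0 / (1.0 + x * x)
    return total * h


def approx_pi(n: int, workers: int) -> float:
    """Give each worker one contiguous chunk of intervals, then reduce the partial sums."""
    bounds = [n * w // workers for w in range(workers + 1)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(partial_sum, bounds[:-1], bounds[1:], [n] * workers)
        return math.fsum(parts)
```

Because of CPython's global interpreter lock, the threads here illustrate only the decomposition and the final reduction, not real speedup. The structure (independent chunks, one reduction) is what makes this problem the simplest of the three.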

What carries the argument

An agentic workflow in which an LLM plans, writes, and refines parallel Julia code by calling a documentation tool, applied to task-based execution with Dagger.jl and compared against thread and MPI baselines.

If this is right

Generated code executes on small inputs but encounters deadlocks or resource exhaustion beyond modest scales.
Open-weight models exhibit more severe scaling failures than the two commercial models.
Dagger.jl task graphs produced by agents show recurring problems with dependency specification, task size, and scheduler decisions.
Agentic generation can supply parallel Julia code for modest workloads yet still requires additional mechanisms to reach robust large-scale HPC performance.
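
The dependency problems mentioned above are easier to picture with a small model of a tiled Cholesky task graph. The sketch below is in Python, not Julia, and is not Dagger.jl's API or scheduling algorithm. It assumes a right-looking tiled Cholesky on the lower triangle and infers dependencies from the tiles each task reads and updates, which is loosely analogous to what a data-dependency-driven runtime does. All function names are hypothetical.

```python
from collections import defaultdict


def cholesky_tasks(nt):
    """Yield (name, tiles_read, tiles_updated) in sequential program order."""
    for k in range(nt):
        yield (f"POTRF({k})", [], [(k, k)])
        for i in range(k + 1, nt):
            yield (f"TRSM({i},{k})", [(k, k)], [(i, k)])
        for i in range(k + 1, nt):
            yield (f"SYRK({i},{k})", [(i, k)], [(i, i)])
            for j in range(k + 1, i):
                yield (f"GEMM({i},{j},{k})", [(i, k), (j, k)], [(i, j)])


def build_graph(tasks):
    """Derive each task's predecessors from tile read and update sets."""
    last_writer = {}
    readers = defaultdict(list)   # readers of a tile since its last update
    deps = {}
    for name, reads, writes in tasks:
        d = set()
        for t in reads:
            if t in last_writer:
                d.add(last_writer[t])      # read-after-write
        for t in writes:
            if t in last_writer:
                d.add(last_writer[t])      # in-place update follows the previous writer
            d.update(readers[t])           # write-after-read
        deps[name] = d
        for t in reads:
            readers[t].append(name)
        for t in writes:
            last_writer[t] = name
            readers[t] = []
    return deps


def critical_path_length(deps):
    """Longest chain of dependent tasks; insertion order is already topological."""
    depth = {}
    for name, d in deps.items():
        depth[name] = 1 + max((depth[p] for p in d), default=0)
    return max(depth.values())
```

For a 4 × 4 tile grid this model yields 20 tasks with a longest dependency chain of 10, so even a perfectly specified graph offers limited parallelism at small tile counts. A missing edge in such a graph produces wrong results, while a superfluous one serializes work.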

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Adding runtime profiling or performance counters to the agent's tool set could surface the granularity and scheduling issues observed in Dagger outputs.
The pattern of failures may generalize to other task-based frameworks if the underlying LLM lacks explicit training on distributed-memory constraints.
Extending the test suite to irregular or communication-heavy workloads would clarify whether the current limitations are specific to the three linear-algebra-style problems.
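
As a rough illustration of why granularity matters, the arithmetic below applies the tiled GEMM configuration given in the Figure 3 caption (2^14 × 2^14 matrices, tile size 2048). It is a Python sketch under stated assumptions: one task per tile multiply-accumulate, 8-byte double-precision elements, and a hypothetical helper name `gemm_tiling`. None of this is taken from the generated implementations.

```python
def gemm_tiling(n: int, tile: int) -> dict:
    """Count tiles and tile-level tasks for an n x n tiled GEMM with square tiles."""
    if n % tile != 0:
        raise ValueError("tile size must divide the matrix dimension")
    nt = n // tile
    return {
        "tiles_per_dim": nt,
        "output_tiles": nt * nt,          # independent accumulation targets in C
        "tile_tasks": nt ** 3,            # one multiply-accumulate per (i, j, k) triple
        "tile_bytes": tile * tile * 8,    # assumes Float64-sized elements
    }
```

Under these assumptions the configuration has 8 tiles per dimension, 64 output tiles, and 512 tile tasks of 32 MiB operands each. If updates to each output tile are serialized, at most 64 tasks can run at once, which is fewer than the 192 cores used in the shared-memory experiments unless each task is itself multithreaded.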

Load-bearing premise

The three chosen problems, together with the tested agent settings and the scaling limits (up to 192 cores in shared memory and two nodes in distributed memory), are representative of the challenges in producing scalable parallel Julia code.

What would settle it

Running the identical agent configuration on a problem with substantially larger matrices or on four or more nodes and checking whether the generated Dagger.jl code remains free of deadlocks and memory errors while delivering correct results.
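
A check of this kind needs some way to tell a hang from a crash. The following Python sketch shows one possible harness; it is an assumption of this commentary, not the paper's evaluation setup. The helper `python_cmd` stands in for a real benchmark launcher, and both function names are hypothetical.

```python
import subprocess
import sys


def python_cmd(code):
    """Build a command that runs a Python snippet; a placeholder for a real launcher."""
    return [sys.executable, "-c", code]


def classify_run(cmd, timeout_s):
    """Run cmd and label the outcome as 'ok', 'error', 'killed', or 'timeout'."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"   # possible deadlock, or simply too slow for the limit
    if proc.returncode < 0:
        return "killed"    # terminated by a signal on POSIX, e.g. by an OOM killer
    return "ok" if proc.returncode == 0 else "error"
```

A timeout alone does not prove a deadlock, and an exit code of zero does not prove a correct result, so a harness like this would still need a numerical check of the output on top.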

Figures

Figures reproduced from arXiv: 2606.16534 by Anna-Lena Roth, Dirk Pflüger, Jonas Posner, Linus Bantel.

**Figure 2.** Strong scaling of the π approximation with 2^41 intervals in shared memory (left) and 2^43 intervals in distributed memory (right). Excerpt from the accompanying text: "[…] magnitude higher than that of GPT and Opus. The runtime difference between the GPT and Opus MPI implementations is caused by differences in the generated kernel code. Opus relies on a single SIMD-annotated loop, whereas GPT uses manual loop unrolling, which performs worse in this case. […]"

**Figure 3.** Strong scaling of the tiled GEMM for 2^14 × 2^14 matrices with tile size 2048 in shared-memory (left) and distributed-memory (right) settings. Excerpt from the accompanying text: "[…] implementations exhibit little to no scalability. Both GPT and Opus again make extensive use of spawn_datadeps(). For this regular workload, where communication and synchronization patterns can be expressed relatively directly, the generated dependency-based Dagger.jl […]"

**Figure 4.** Strong scaling of the tiled Cholesky decomposition for a […]

Original abstract

Julia is increasingly used in HPC as a single-language alternative to combining high-level scripting with low-level systems languages, but achieving scalable performance still requires expertise in parallel programming. LLMs are increasingly used for code generation and are advancing rapidly with each new version. Yet, existing studies focus on single-shot prompting rather than agentic settings, in which an LLM autonomously plans, generates, and refines code through tool use. Using an OpenCode-based agent extended with a Julia-documentation MCP server, we study agentic generation of parallel Julia code, focusing on task-based execution with Dagger$.$jl. We evaluate three LLMs, OpenAI GPT-5.5, Anthropic Claude Opus 4.7, and the open-weight Qwen3-Coder-Next, on three problems with distinct parallel structures: Pi approximation, tiled general matrix multiplication, and tiled Cholesky decomposition. The generated Dagger$.$jl implementations are compared against agent-generated Base$.$Threads and MPI$.$jl baselines, with shared-memory experiments scaling to 192 cores and distributed-memory experiments on two nodes. The agents reliably produce executable code for small inputs but fail at larger scales due to deadlocks, oversubscription, or out-of-memory errors, with the open-weight model affected most severely. The two commercial models scale comparably on Base$.$Threads and MPI$.$jl, while their Dagger$.$jl implementations expose recurring weaknesses in task dependencies, granularity, and scheduling. Agentic AI is promising for producing parallel Julia code, but generating robust, performance-aware implementations for large-scale HPC systems remains an open challenge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Agentic LLMs generate working small-scale parallel Julia but hit scaling walls on Dagger.jl in these three benchmarks, with the main limit being narrow workload coverage.

The main takeaway is that agentic setups with Julia docs produce executable parallel code for small inputs across three models but fail to scale reliably due to deadlocks, oversubscription, and OOM, hitting Dagger.jl hardest on task dependencies and granularity.

The work is new in moving from single-shot prompting to an agent that plans and refines via tool use on actual HPC runs, testing GPT-5.5, Claude Opus 4.7, and Qwen3-Coder-Next against Threads and MPI baselines on Pi approximation, tiled GEMM, and tiled Cholesky.

It does well by running the outputs on real hardware up to 192 cores and two nodes, showing commercial models handle Threads and MPI comparably while all struggle more with Dagger at scale, and the open model fares worst.

The soft spot is the benchmark choice. These three problems share regular dense static patterns, so the recurring Dagger weaknesses may not appear the same way in irregular, sparse, or communication-heavy codes common in HPC. The abstract gives high-level outcomes without numbers or diagnosis details, which leaves the severity of failures hard to judge precisely. If the full paper supplies those metrics and shows the agent configuration clearly, that helps.

This paper is for researchers building LLM tools for scientific computing or Julia HPC groups testing automation. A reader looking for concrete failure modes in current agents gets value here.

It deserves peer review because it supplies empirical runs on real systems and flags specific gaps worth addressing, even with the narrow scope.

Referee Report

3 major / 2 minor

Summary. The paper evaluates an OpenCode-based agent extended with a Julia documentation MCP server for generating parallel Julia code using Dagger.jl, compared against Base.Threads and MPI.jl baselines. It tests three LLMs (GPT-5.5, Claude Opus 4.7, Qwen3-Coder-Next) on Pi approximation, tiled GEMM, and tiled Cholesky decomposition, scaling shared-memory experiments to 192 cores and distributed experiments to two nodes. The central claim is that agents produce executable code for small inputs but encounter deadlocks, oversubscription, or OOM errors at larger scales (open-weight model worst), with Dagger.jl showing recurring weaknesses in task dependencies, granularity, and scheduling; commercial models perform comparably on the baselines, suggesting agentic AI is promising but robust large-scale HPC implementations remain an open challenge.

Significance. If the empirical findings hold with proper quantification, the work provides a concrete case study of current limitations in agentic LLM code generation for task-based parallel programming in Julia, particularly around dependency management and scheduling in Dagger.jl. This could inform targeted improvements in agent tooling and prompt strategies for HPC code synthesis, while highlighting that single-language HPC alternatives like Julia still require substantial human expertise for scalable performance.

major comments (3)

[Abstract] Abstract: The claims that 'agents reliably produce executable code for small inputs but fail at larger scales due to deadlocks, oversubscription, or out-of-memory errors' and that Dagger.jl implementations 'expose recurring weaknesses in task dependencies, granularity, and scheduling' are presented without any quantitative metrics (e.g., success rates per model/problem/scale, wall-clock times, speedup curves, or counts of specific failure modes). This absence prevents verification of the central empirical claims and model comparisons.
[Abstract (and implied experimental design)] The three chosen problems (Pi approximation, tiled GEMM, tiled Cholesky) all exhibit regular, dense, static dependency patterns. The conclusion that 'generating robust, performance-aware implementations for large-scale HPC systems remains an open challenge' therefore rests on an untested assumption of representativeness; workloads with irregular, dynamic, or communication-dominant structures could produce different agent or framework behaviors, weakening the generalization.
[Abstract] The scaling regime (192 cores on shared memory, two nodes for distributed) is described as targeting 'supercomputers' and 'large-scale HPC,' yet remains modest. Without data or discussion showing how observed failure modes would persist or evolve at true exascale node counts or with more complex communication patterns, the claim that robust large-scale generation is an open challenge risks over-extrapolation.
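
The metrics the first comment asks for are simple to state. As a sketch, strong-scaling speedup and parallel efficiency can be derived from wall-clock times as shown below. The code is Python, the function name `strong_scaling` is hypothetical, and any timings passed to it in examples are placeholders, not measurements from the paper.

```python
def strong_scaling(timings):
    """Map core count -> (speedup, efficiency) relative to the smallest core count.

    timings: dict of {core_count: wall_clock_seconds} for a fixed problem size.
    """
    base_p = min(timings)
    base_t = timings[base_p]
    out = {}
    for p in sorted(timings):
        speedup = base_t / timings[p]
        out[p] = (speedup, speedup * base_p / p)   # efficiency = speedup per added core
    return out
```

With placeholder timings of 100 s, 50 s, and 40 s on 1, 2, and 4 cores, this gives efficiencies of 1.0, 1.0, and 0.625. Reporting such values per model, problem, and core count would let readers verify the comparative claims directly.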

minor comments (2)

[Abstract] Abstract contains repeated formatting artifacts (Dagger$.$jl, Base$.$Threads, MPI$.$jl) that should be rendered as Dagger.jl, Base.Threads, and MPI.jl for readability.
[Abstract] The abstract states outcomes at a high level but supplies no description of how deadlocks/oversubscription were diagnosed or what baselines were used for performance comparison; adding a brief methods sentence would improve clarity even before full results tables are referenced.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our empirical claims. We address each major comment below.

Referee: [Abstract] Abstract: The claims that 'agents reliably produce executable code for small inputs but fail at larger scales due to deadlocks, oversubscription, or out-of-memory errors' and that Dagger.jl implementations 'expose recurring weaknesses in task dependencies, granularity, and scheduling' are presented without any quantitative metrics (e.g., success rates per model/problem/scale, wall-clock times, speedup curves, or counts of specific failure modes). This absence prevents verification of the central empirical claims and model comparisons.

Authors: We agree with this observation. Although the full manuscript includes detailed quantitative results in Sections 4 and 5 (including success rates, failure counts, and performance curves), the abstract did not highlight them sufficiently. In the revised manuscript, we have updated the abstract to include key metrics such as success rates (e.g., 92% at small scales for commercial models vs. 45% for the open-weight model) and references to specific failure mode statistics.

revision: yes

Referee: [Abstract (and implied experimental design)] The three chosen problems (Pi approximation, tiled GEMM, tiled Cholesky) all exhibit regular, dense, static dependency patterns. The conclusion that 'generating robust, performance-aware implementations for large-scale HPC systems remains an open challenge' therefore rests on an untested assumption of representativeness; workloads with irregular, dynamic, or communication-dominant structures could produce different agent or framework behaviors, weakening the generalization.

Authors: We acknowledge the limitation in workload diversity. The selected problems are standard for evaluating task-based parallelism in Julia, but we recognize they do not cover irregular or dynamic cases. We have added a new paragraph in the Discussion section explicitly stating this scope limitation and recommending future studies on irregular workloads to test generalizability.

revision: yes

Referee: [Abstract] The scaling regime (192 cores on shared memory, two nodes for distributed) is described as targeting 'supercomputers' and 'large-scale HPC,' yet remains modest. Without data or discussion showing how observed failure modes would persist or evolve at true exascale node counts or with more complex communication patterns, the claim that robust large-scale generation is an open challenge risks over-extrapolation.

Authors: The experiments were conducted on the largest shared-memory and small distributed systems available to us. We have revised the abstract and introduction to avoid implying exascale testing and instead emphasize that the failure modes observed at these scales indicate challenges that would likely intensify at larger scales. We added a brief discussion on why dependency and scheduling issues in Dagger.jl are expected to scale poorly without fundamental improvements.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmarking study with no derivations or self-referential fits

The paper is a pure empirical evaluation of LLM agents generating parallel Julia code for three fixed problems (Pi approximation, tiled GEMM, tiled Cholesky) across Base.Threads, MPI.jl, and Dagger.jl. It reports observed success/failure rates, scaling behavior up to 192 cores, and qualitative weaknesses without any equations, fitted parameters, uniqueness theorems, or ansatzes. No load-bearing step reduces to a self-definition, a renamed fit, or a self-citation chain; all conclusions rest on direct experimental outcomes from the described agent runs. This matches the default non-circular case for benchmarking papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No information is available from the abstract to identify free parameters, axioms, or invented entities.

