pith. machine review for the scientific record.

arxiv: 2604.22571 · v1 · submitted 2026-04-24 · ⚛️ physics.comp-ph

Recognition: unknown

LARA: Validation-Driven Agentic Supercomputer Workflows for Atomistic Modeling

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 09:01 UTC · model grok-4.3

classification ⚛️ physics.comp-ph
keywords atomistic modeling · agentic workflows · validation-driven generation · high-performance computing · density functional theory · scientific workflows · HPC automation · LLM agents

The pith

Validation-driven agentic systems produce reliable workflows for atomistic modeling on supercomputers by catching errors early.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that AI-generated workflows for atomistic simulations frequently contain syntactic errors, incorrect API calls, or physically invalid setups that cause failures on high-performance computers. It introduces LARA-HPC as a framework built around a controlled execution layer, dry-run validation that checks code without consuming resources, and a multi-phase pipeline that retrieves information and refines outputs iteratively. When tested on end-to-end density functional theory workflows, this structure corrects inconsistencies that standard generation approaches miss. A sympathetic reader would see this as evidence that embedding validation throughout the process can make automated scientific computing practical and reproducible. The work argues this represents a necessary move away from purely generative methods toward ones that prioritize verification at every stage.

Core claim

LARA-HPC combines a controlled execution layer that mediates all interactions with HPC resources, simulation-native dry-run validation for cost-free execution-level checks, and a multi-phase agentic pipeline that uses retrieval-augmented generation plus iterative refinement to generate and correct atomistic simulation workflows. The claim is demonstrated by application to density functional theory calculations, where both syntactic and physical inconsistencies are resolved.

What carries the argument

The multi-phase agentic pipeline with simulation-native dry-run validation, which performs execution-level verification without full resource costs and supports iterative correction of generated workflows.
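The loop that carries this claim can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the callables `generate`, `dry_run`, and `refine` are hypothetical stand-ins for the RAG generation phase, the simulation-native dry-run, and the agentic correction step.

```python
# Illustrative sketch, not the authors' code: the generate -> dry-run ->
# refine loop that a validation-driven pipeline like LARA-HPC describes.
# The callables generate, dry_run, and refine are hypothetical stand-ins.

def generate_and_validate(request, generate, dry_run, refine, max_iters=5):
    """Draft a workflow, then iteratively repair it using cost-free
    dry-run diagnostics until validation passes or iterations run out."""
    workflow = generate(request)                  # RAG-assisted first draft
    for _ in range(max_iters):
        ok, diagnostics = dry_run(workflow)       # execution-level check, no HPC cost
        if ok:
            return workflow                       # validated: safe to submit
        workflow = refine(workflow, diagnostics)  # feed errors back to the agent
    raise RuntimeError("validation did not converge; human review needed")
```

The key design point is that the expensive resource (a full HPC run) is never touched until the cheap check passes, which is what makes iterative correction affordable.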

If this is right

  • End-to-end atomistic simulation workflows can be generated and run reliably on HPC systems without manual debugging.
  • Syntactic errors and physical inconsistencies in generated code can be caught and fixed iteratively before any full simulation runs.
  • AI-assisted scientific computing can shift from generation-first to validation-first designs.
  • Domain-specific agentic systems can support a co-piloted research ecosystem on high-performance computers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dry-run and refinement structure could be applied to other simulation types such as molecular dynamics to automate routine calculations across computational physics.
  • By handling error correction automatically, the approach could reduce the expertise needed to launch complex atomistic studies and let researchers focus on interpreting results.
  • Over repeated use, accumulated validation data from such systems might reveal common failure patterns in physical modeling that could inform better initial generation strategies.

Load-bearing premise

Dry-run capabilities and the multi-phase agentic pipeline can reliably detect and correct physical inconsistencies and invalid configurations without requiring full costly executions or human intervention.

What would settle it

A test case in which the framework receives a workflow with a known uncorrectable physical inconsistency, such as a non-physical interatomic distance, and either fails to flag it during dry-runs or cannot produce a valid corrected version without external input.
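For concreteness, the kind of geometric sanity check such a test case would exercise might look like the following. This is a hypothetical sketch; the function names and the 0.5 Å cutoff are illustrative assumptions, not taken from the paper.

```python
# Hypothetical example of a physical sanity check a dry-run layer could
# run before submission: flag geometries whose closest atom pair falls
# below a hard threshold. The 0.5 Å cutoff is an illustrative assumption.
import itertools
import math

def min_interatomic_distance(positions):
    """Smallest pairwise distance (in Å) between atomic positions."""
    return min(math.dist(a, b) for a, b in itertools.combinations(positions, 2))

def has_nonphysical_geometry(positions, threshold=0.5):
    """True if any two atoms sit closer than `threshold` Å; such a
    structure cannot be repaired by editing solver parameters alone."""
    return min_interatomic_distance(positions) < threshold
```

A framework that merely regenerates input files would loop forever on a structure this check flags, which is exactly the failure mode the settling test probes.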

Figures

Figures reproduced from arXiv: 2604.22571 by Dorian Rolland, Giuseppe Fisicaro, Louis Beal, Luigi Genovese, William Dawson, and Yoann Curé.

Figure 1. ReAct (Reasoning and Acting) loop [21] at the foun…
Figure 2. LARA-HPC architecture. A user request is trans…
read the original abstract

Large language models (LLMs) and agentic systems have recently demonstrated potential for automating scientific workflows, including atomistic simulations. However, their deployment in high-performance computing (HPC) environments remains limited by the lack of mechanisms ensuring correctness, reproducibility, and safe interaction with computational resources. Generated workflows suffer from inconsistencies, incorrect API usage, or invalid physical configurations - leading to failed or unreliable simulations. In this work, we introduce LARA-HPC, a validation-driven agentic framework to enable reliable workflow generation for atomistic modeling on HPC systems. Our approach is based on three key components: (i) a controlled execution layer that mediates all interactions with HPC resources; (ii) simulation-native validation through dry-run capabilities, enabling execution-level verification without incurring resource cost; and (iii) a multi-phase agentic pipeline combining retrieval-augmented generation and iterative refinement. We demonstrate the effectiveness of this approach performing an end-to-end atomistic simulation workflow on HPC by applying LARA-HPC to Density Functional Theory simulations. The results show that validation-driven generation significantly improves robustness and enables iterative correction of both syntactic and physical inconsistencies. More broadly, this work advocates for a shift from generation-first to validation-first paradigms in Artificial Intelligence (AI) assisted scientific computing. We argue that the future task of the computational physics community is to develop domain specific agentic systems based on structured tooling to realize an HPC enabled co-piloted research ecosystem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents LARA-HPC, a validation-driven agentic framework for generating reliable atomistic simulation workflows on HPC systems. It consists of a controlled execution layer, simulation-native dry-run validation for execution-level checks without full resource cost, and a multi-phase pipeline using retrieval-augmented generation plus iterative refinement. Applied to DFT simulations, the work claims that this validation-first approach significantly improves robustness by iteratively correcting syntactic and physical inconsistencies, advocating a broader shift from generation-first to validation-first paradigms in AI-assisted scientific computing.

Significance. If the central claims hold with quantitative support, the work could meaningfully advance reliable LLM/agent use in computational physics by reducing failed HPC jobs and enabling safer co-piloted research. Strengths include the engineering focus on HPC mediation and dry-run tooling, which directly targets reproducibility and safety issues common in agentic scientific workflows.

major comments (3)
  1. [Abstract and Results/Demonstration] The claim that 'validation-driven generation significantly improves robustness' and enables correction of 'physical inconsistencies' is presented without any quantitative metrics (e.g., success rates, failure reduction percentages, number of iterations required, or baseline comparisons to non-validation agentic pipelines). This absence makes it impossible to assess the strength of the evidence for the central claim.
  2. [Methods (dry-run and multi-phase pipeline)] The assertion that dry-runs provide 'execution-level verification' sufficient to detect and correct physical inconsistencies (e.g., SCF non-convergence from bad initial guesses, incorrect functionals yielding unphysical densities, or k-point artifacts) rests on an unproven assumption. Static input parsing and resource checks cannot surface these deeper issues, which typically require actual execution; the manuscript provides no concrete examples or ablation showing how limited dry-run signals suffice without full DFT runs.
  3. [Demonstration/Results] The end-to-end DFT example lacks details on the specific physical inconsistencies encountered, how the RAG/refinement phases identified them via dry-runs, and whether corrections were achieved without human intervention or costly full executions. This leaves the 'iterative correction of physical inconsistencies' claim unsupported by traceable evidence.
minor comments (2)
  1. [Methods] The manuscript would benefit from clearer notation distinguishing syntactic/API errors from physical/DFT-specific errors throughout the pipeline description.
  2. [Discussion/Conclusion] Add explicit discussion of limitations, such as cases where dry-runs are insufficient and full execution is still required.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We agree that the central claims would be substantially strengthened by quantitative metrics, clearer delineation of dry-run capabilities, and traceable details from the demonstration. We will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and Results/Demonstration] The claim that 'validation-driven generation significantly improves robustness' and enables correction of 'physical inconsistencies' is presented without any quantitative metrics (e.g., success rates, failure reduction percentages, number of iterations required, or baseline comparisons to non-validation agentic pipelines). This absence makes it impossible to assess the strength of the evidence for the central claim.

    Authors: We acknowledge that the current manuscript relies on a qualitative end-to-end demonstration rather than explicit quantitative metrics such as success rates, iteration counts, or baseline comparisons. This limits the ability to evaluate the strength of the robustness claim. In the revised manuscript we will add a dedicated quantitative evaluation subsection (including success rates over repeated trials, average refinement iterations, failure reduction relative to a generation-only baseline, and specific counts of corrected inconsistencies) and will update the abstract to reference these results. revision: yes

  2. Referee: [Methods (dry-run and multi-phase pipeline)] The assertion that dry-runs provide 'execution-level verification' sufficient to detect and correct physical inconsistencies (e.g., SCF non-convergence from bad initial guesses, incorrect functionals yielding unphysical densities, or k-point artifacts) rests on an unproven assumption. Static input parsing and resource checks cannot surface these deeper issues, which typically require actual execution; the manuscript provides no concrete examples or ablation showing how limited dry-run signals suffice without full DFT runs.

    Authors: We agree that static dry-run checks (input syntax, resource allocation, and basic structural validation) cannot directly detect runtime physical phenomena such as SCF non-convergence or unphysical densities. The manuscript description may have overstated the reach of dry-runs for these deeper issues. The multi-phase pipeline uses dry-run error signals to trigger RAG-based refinement, where the agent proposes corrections drawing on retrieved domain knowledge; deeper physical problems are intended to be caught via subsequent agent reasoning or limited execution feedback. We will revise the Methods section to explicitly separate the scope of dry-run checks from the iterative refinement mechanism, add concrete examples from the DFT workflow, and include a brief limitations discussion. revision: yes

  3. Referee: [Demonstration/Results] The end-to-end DFT example lacks details on the specific physical inconsistencies encountered, how the RAG/refinement phases identified them via dry-runs, and whether corrections were achieved without human intervention or costly full executions. This leaves the 'iterative correction of physical inconsistencies' claim unsupported by traceable evidence.

    Authors: We accept that the current demonstration is presented at too high a level and does not provide a traceable step-by-step account of the inconsistencies, detection signals, or automation status. In the revision we will expand the Demonstration section with a detailed trace of the workflow generation process, enumerating each syntactic and physical inconsistency encountered, the exact dry-run or agent signals that surfaced them, the RAG/refinement actions taken, confirmation that corrections occurred without human intervention, and the resource costs avoided. revision: yes
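The separation between what dry-runs can see and what requires real execution can be made concrete as a cost-tiered table. A minimal sketch under assumed tiers; the error taxonomy and tier assignments are illustrative, not drawn from the manuscript.

```python
# Illustrative tiered-validation sketch (assumed structure, not the
# authors' implementation): static checks and dry-runs are cheap but
# shallow, while runtime issues such as SCF non-convergence only become
# visible during an actual (if short) execution.
from enum import Enum

class Tier(Enum):
    STATIC = 1   # syntax / API / schema checks: free
    DRY_RUN = 2  # simulation-native dry-run: near-free
    PROBE = 3    # short real execution: small but nonzero HPC cost

# Cheapest tier at which each error class becomes visible (illustrative).
CHEAPEST_TIER = {
    "syntax_error": Tier.STATIC,
    "api_misuse": Tier.STATIC,
    "invalid_input_schema": Tier.DRY_RUN,
    "resource_misallocation": Tier.DRY_RUN,
    "scf_nonconvergence": Tier.PROBE,  # only observable at runtime
    "unphysical_density": Tier.PROBE,
}

def needs_execution(error_kind):
    """True if dry-run signals alone cannot surface this error class."""
    return CHEAPEST_TIER[error_kind] is Tier.PROBE
```

A revised Methods section organized along these lines would let readers see at a glance which claims rest on dry-runs and which require execution feedback.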

Circularity Check

0 steps flagged

No circularity: engineering framework and demonstration are self-contained

full rationale

The paper describes an agentic workflow architecture (controlled execution layer, dry-run validation, multi-phase RAG/refinement pipeline) and reports results from its application to DFT simulations. No mathematical derivation, fitted parameters, or first-principles predictions exist. Claims rest on the proposed components and observed improvements in the demonstration, with no self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central result to its own inputs. The skeptic concern addresses empirical adequacy of dry-runs for physical errors, which is a correctness question outside circularity analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Relies on assumptions regarding the utility of LLM-generated workflows and the availability of dry-run modes in simulation software; introduces the LARA-HPC system as a new entity without independent evidence beyond the paper's claims.

axioms (2)
  • domain assumption LLMs can produce workflows that benefit from external validation layers
    Invoked as the motivation for the controlled execution and refinement pipeline
  • domain assumption Dry-run capabilities exist that verify execution without full resource cost
    Central to the simulation-native validation component
invented entities (1)
  • LARA-HPC framework (no independent evidence)
    purpose: Mediate agentic workflow generation with validation for HPC atomistic modeling
    New integrated system proposed in the paper

pith-pipeline@v0.9.0 · 9104 in / 1387 out tokens · 101899 ms · 2026-05-08T09:01:08.757342+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

51 extracted references · 14 canonical work pages · 7 internal anchors

  1. [1]

    materializing a batch script from the selected template,

  2. [2]

    staging input files and Python sources to the remote resource,

  3. [3]

    submitting the job to the scheduler,

  4. [4]

    monitoring completion,

  5. [5]

    Example Server

    retrieving output files, return values, and execution logs. A major strength of this approach is that it does not require any daemon or privileged service on the remote machine. The remote side only needs the standard tools already available to HPC users, such as a shell environment, scheduler commands, and secure file transfer. This makes the approach ...

  6. [6]

    Understanding: extraction of scientific intent and mapping to domain-specific modules

  7. [7]

    Generation: construction of candidate workflows using templates and retrieval

  8. [8]

    Validation: multi-level verification including syntax, API, physical constraints, and dry-run execution

  9. [9]

    The AI Scientist

    Review: higher-level critique and optimization of the generated workflow. This structured decomposition allows separating concerns between semantic interpretation, code generation, correctness verification, and optimization. The decomposition also reflects distinct failure modes. Misunderstanding the user request leads to an inappropriate workflow fam...

  10. [10]

    Evaluating Large Language Models Trained on Code

    M. Chen et al., Evaluating large language models trained on code, arXiv:2107.03374 (2021)

  11. [11]

    Towards an AI co-scientist

    J. Gottweis et al., Towards an AI co-scientist, arXiv:2502.18864 (2025)

  12. [12]

    Zimmermann et al.

    Y. Zimmermann et al., 32 examples of LLM applications in materials science and chemistry: towards automation, assistants, agents, and accelerated scientific discovery, Mach. Learn. Sci. Technol. 6, 030701 (2025)

  13. [13]

    Zimmermann et al.

    Y. Zimmermann et al., Reflections from the 2024 large language model (LLM) hackathon for applications in materials science and chemistry, arXiv:2411.15221 (2024)

  14. [14]

    Alampara et al.

    N. Alampara et al., General-purpose models for the chemical sciences: LLMs and beyond, Chem. Rev. 126, 2484 (2026)

  15. [15]

    Mandal et al.

    I. Mandal et al., Evaluating large language model agents for automation of atomic force microscopy, Nat. Commun. 16, 64105 (2025)

  16. [16]

    A. M. Bran et al., Augmenting large language models with chemistry tools, Nat. Mach. Intell. 6, 525 (2024)

  17. [17]

    Vriza et al.

    A. Vriza et al., Multi-agentic AI framework for end-to-end atomistic simulations, Digit. Discov. 5, 440 (2026)

  18. [18]

    Lu et al.

    C. Lu et al., Towards end-to-end automation of AI research, Nature 651, 914 (2026)

  19. [19]

    Democratizing AI scientists using tooluniverse

    S. Gao et al., Democratizing AI scientists using tooluniverse, arXiv:2509.23426 (2025)

  20. [20]

    Qu et al.

    Y. Qu et al., CRISPR-GPT for agentic automation of gene-editing experiments, Nat. Biomed. Eng. 10, 245 (2026)

  21. [21]

    Zou et al.

    Y. Zou et al., El Agente: An autonomous agent for quantum chemistry, Matter 8 (2025)

  22. [22]

    T. D. Pham, A. Tanikanti, and M. Keçeli, ChemGraph as an agentic framework for computational chemistry workflows, Commun. Chem. 9, 33 (2026)

  23. [23]

    Campbell et al.

    Q. Campbell, S. Cox, J. Medina, B. Watterson, and A. D. White, MDCrow: automating molecular dynamics workflows with large language models, Mach. Learn.: Sci. Technol. 7, 025037 (2026)

  24. [24]

    Z. Wang, H. Huang, H. Zhao, C. Xu, S. Zhu, J. Janssen, and V. Viswanathan, DREAMS: Density functional theory based research engine for agentic materials simulation, arXiv:2507.14267 (2025)

  25. [25]

    Deelman et al.

    E. Deelman, D. Gannon, M. Shields, and I. Taylor, Workflows and e-science: An overview of workflow system features and capabilities, Future Gener. Comput. Syst. 25, 528 (2009)

  26. [26]

    S. P. Huber et al., AiiDA 1.0, a scalable computational infrastructure for automated reproducible workflows and data provenance, Sci. Data 7, 300 (2020)

  27. [27]

    R. M. Martin, Electronic Structure: Basic Theory and Practical Methods (Cambridge University Press, 2020)

  28. [28]

    Brázdová and Bowler

    V. Brázdová and D. R. Bowler, Atomistic Computer Simulations: A Practical Guide (Wiley, Chichester, UK, 2013)

  29. [29]

    Lewis et al.

    P. Lewis et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, in Adv. Neural Inf. Process. Syst., Vol. 33 (2020) pp. 9459–9474

  30. [30]

    Yao et al.

    S. Yao et al., ReAct: Synergizing reasoning and acting in language models, in The Eleventh Int. Conf. on Learn. Representations (2023)

  31. [31]

    L. E. Ratcliff, W. Dawson, G. Fisicaro, et al., Flexibilities of wavelets as a computational basis set for large-scale electronic structure calculations, J. Chem. Phys. 152, 194110 (2020)

  32. [32]

    Dawson et al.

    W. Dawson, L. Beal, L. E. Ratcliff, et al., Exploratory data science on supercomputers for quantum mechanical calculations, Electron. Struct. 6, 027003 (2024)

  33. [33]

    Blount et al.

    A. Blount, A. Gulli, S. Saboo, M. Zimmermann, and V. Vuskovic, Introduction to agents, https://www.kaggle.com/whitepaper-introduction-to-agents (2025), Kaggle/Google whitepaper. Accessed: April 11, 2026

  34. [34]

    Wang et al.

    L. Wang et al., A survey on large language model based autonomous agents, Front. Comput. Sci. 18, 186345 (2024)

  35. [35]

    Gan et al.

    Y. Gan et al., Navigating the risks: A survey of security, privacy, and ethics threats in LLM-based agents, arXiv:2411.09523 (2024)

  36. [36]

    OpenAI, A practical guide to building agents, OpenAI Documentation, https://platform.openai.com/docs/guides/agents (2024), accessed: April 11, 2026

  37. [37]

    MemGPT: Towards LLMs as Operating Systems

    C. Packer et al., MemGPT: Towards LLMs as operating systems, arXiv:2310.08560 (2023)

  38. [38]

    W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, and Y. Zhang, A-Mem: Agentic memory for LLM agents, arXiv:2502.12110 (2025)

  39. [39]

    Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

    A. Singh, A. Ehtesham, S. Kumar, and T. T. Khoei, Agentic retrieval-augmented generation: A survey on agentic RAG, arXiv:2501.09136 (2025)

  40. [40]

    Madaan et al.

    A. Madaan, N. Tandon, et al., Self-Refine: Iterative refinement with self-feedback, in Adv. Neural Inf. Process. Syst., Vol. 36 (2023) pp. 46534–46594

  41. [41]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    G. Comanici et al., Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, arXiv:2507.06261 (2025)

  42. [42]

    Fisicaro et al.

    G. Fisicaro et al., Wet environment effects for ethanol and water adsorption on anatase TiO2 (101) surfaces, J. Phys. Chem. C 124, 2406 (2020)

  43. [43]

    P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, ACM Comput. Surv. 55 (2023)

  44. [44]

    Weng et al.

    Y. Weng et al., DeepScientist: Advancing frontier-pushing scientific findings progressively, in Fourteenth Int. Conf. Learn. Representations (2026)

  45. [45]

    Analemma Intelligence, FARS: Fully Automated Research System, https://analemma.ai/blog/introducing-fars/ (2026), accessed: 2026-04-09

  46. [46]

    Lyu et al.

    Y. Lyu et al., EvoScientist: Towards multi-agent evolving AI scientists for end-to-end scientific discovery, arXiv:2603.08127 (2026)

  47. [47]

    Li et al.

    Y. Li et al., R&D-Agent-Quant: a multi-agent framework for data-centric factors and model joint optimization, arXiv:2505.15155 (2025)

  48. [48]

    Schmidgall et al.

    S. Schmidgall et al., Agent Laboratory: Using LLM agents as research assistants, Findings Assoc. Comput. Linguistics: EMNLP 2025, 5977 (2025)

  49. [49]

    Villaescusa-Navarro et al.

    F. Villaescusa-Navarro et al., The Denario project: Deep knowledge AI agents for scientific discovery, arXiv:2510.26887 (2025)

  50. [50]

    Kwa et al.

    T. Kwa et al., Measuring AI ability to complete long tasks, METR Blog, https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/ (2025), accessed: April 11, 2026

  51. [51]

    R. Raj, H. Wang, and T. Krishna, A CPU-centric perspective on agentic AI, arXiv:2511.00739 (2025)