SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation

Audrey Wang; Brian Liu; Jixuan Chen; Lianhui Qin; Matthew Ho

arxiv: 2606.09774 · v1 · pith:XPJVDKOAnew · submitted 2026-06-08 · 💻 cs.AI · cs.CL

SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation

Matthew Ho , Brian Liu , Jixuan Chen , Audrey Wang , Lianhui Qin This is my paper

Pith reviewed 2026-06-27 16:12 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords coding agentsscientific simulatorsinterface groundingself-evolutionGEOS simulatoragent adaptationsimulation setup

0 comments

The pith

A lightweight adapter supplies the executable contract that lets general coding agents set up complex scientific simulators like GEOS in minutes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to make off-the-shelf coding agents work with specialized scientific simulators that require custom input languages. It argues that the main missing piece is knowledge of the simulator's vocabulary, constraints, and rules rather than general planning skills. SIGA provides this through targeted components and shows large speedups on real tasks. A sympathetic reader would care because domain scientists currently spend hours or days learning these interfaces, and automation could free them for higher-level work.

Core claim

SIGA is a Simulator-Interface Grounding Adapter that supplies the simulator's executable contract via retrieval, procedural memory, in-trajectory validation, and validation-enforced termination. When applied to GEOS, it produces complete decks in about five minutes with TreeSim above 0.90, matching a human expert who took three hours. Self-evolution by rewriting adapter contents from prior trajectories yields further gains on held-out sets, and the approach transfers to OpenFOAM and LAMMPS with mechanism shifts depending on the interface.

What carries the argument

The Simulator-Interface Grounding Adapter, which supplies the missing executable contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination.

If this is right

Produces complete GEOS decks in five minutes with performance matching extended human effort.
Raises TreeSim from 0.720 to 0.789 on harder held-out sets.
Self-evolution by rewriting adapter contents achieves highest held-out performance.
Validation matters most for structural completeness while memory and retrieval matter for domain correctness in other simulators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar grounding layers could be applied to other domain-specific languages beyond scientific simulators.
The self-evolution mechanism suggests that agents might iteratively refine their own interfaces over multiple tasks.
Reducing setup time could allow more rapid iteration in simulation-based research workflows.

Load-bearing premise

The main obstacle for coding agents on these simulators is missing knowledge of the executable contract rather than fundamental limits in their ability to plan or repair code.

What would settle it

A direct test would be to provide the full contract manually to the bare agent and measure if performance reaches the same level as with SIGA, or if planning failures persist.

Figures

Figures reproduced from arXiv: 2606.09774 by Audrey Wang, Brian Liu, Jixuan Chen, Lianhui Qin, Matthew Ho.

**Figure 2.** Figure 2: The SIGA method. A natural-language simulation brief feeds into the base coding agent (a frozen harness H0 wrapping a frozen model π), which runs its generic context→act→observe loop to author a configuration deck. The SIGA adapter grounds this loop at three interfaces, without modifying the loop itself: always-on procedural memory (M) injected into the system context; retrieval (R) and an XML validator (X… view at source ↗

read the original abstract

Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-tool interface grounding: what minimal simulator-specific adaptations are needed for an off-the-shelf coding agent to operate real scientific software? Our intuition is that coding agents already know how to navigate files, edit code, run commands, and repair outputs, but they lack the simulator's executable contract: its vocabulary, structural constraints, validation rules, and termination conditions. We introduce SIGA, a Simulator-Interface Grounding Adapter that supplies this contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination. We primarily evaluate SIGA on GEOS, an open-source multiphysics simulator used in subsurface science. SIGA produces a complete GEOS deck in about five minutes with TreeSim above 0.90, matching an extended-budget human expert who took about three hours, a roughly 36x wall-clock speedup. On a harder held-out set, grounding raises TreeSim from 0.720 to 0.789, a roughly 10% relative gain over the bare agent, and can reduce the across-seed standard deviation by 16x. Self-evolution further improves SIGA by rewriting adapter contents from prior trajectories, yielding the highest held-out GEOS mean and matching or outperforming the strongest hand-designed configuration. Transfers to OpenFOAM and LAMMPS show that the dominant mechanism shifts by interface: validation matters most when structural completeness is the bottleneck, while memory and retrieval matter most when domain correctness is the bottleneck. These results suggest that lightweight, self-improvable grounding layers can turn general coding agents into practical operators of scientific software.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SIGA shows a practical adapter with retrieval, memory, validation and self-evolution can cut GEOS setup from hours to minutes and lift TreeSim scores, but the design leaves open whether the interface contract is truly the main limit.

read the letter

The paper's core result is that a thin grounding layer lets an off-the-shelf coding agent produce usable GEOS input decks in roughly five minutes instead of three hours, with TreeSim above 0.90 on the main set and a lift from 0.72 to 0.79 on held-out cases. The same adapter transfers to OpenFOAM and LAMMPS, and self-evolution by rewriting from past runs beats the best hand-tuned version.

What is new is the explicit framing of simulator setup as an interface-grounding problem and the particular mix of retrieval, procedural memory, in-trajectory validation, and validation-enforced termination. The mechanism-shift finding across the three simulators is useful: validation helps most when structure is the bottleneck, while memory and retrieval matter more for domain correctness.

The numbers are concrete and the speed-up claim is large enough to notice in repeated use. The paper also avoids the usual agent-paper trap of only reporting success on toy tasks.

The main soft spot is the missing ablation. The experiments compare a bare agent to the full adapter but do not test whether extra planning or repair capacity in the base agent would close most of the gap once the contract is supplied. Without that, it is still possible that the adapter is mostly papering over deeper agent weaknesses. The abstract also gives no run counts, statistical tests, or exact definition of TreeSim, so the 16x variance reduction and the 36x wall-clock claim are hard to judge for robustness. Self-evolution could introduce dependence on the same success metric used for evaluation.

This work is aimed at people building agents that must talk to real scientific code rather than toy environments. Readers who care about tool-use grounding will find the transfer results and component analysis worth their time. It is solid enough on its own terms to deserve referee attention, even if the central assumption about the dominant barrier needs tighter testing.

Referee Report

2 major / 1 minor

Summary. The manuscript presents SIGA, a Simulator-Interface Grounding Adapter that equips off-the-shelf coding agents with simulator-specific knowledge (vocabulary, constraints, validation rules, termination conditions) via retrieval, procedural memory, in-trajectory validation, and validation-enforced termination, plus a self-evolution component that rewrites the adapter from trajectories. On GEOS, it achieves ~36x speedup over human experts with TreeSim >0.90, and on held-out sets improves TreeSim from 0.720 to 0.789 with reduced variance; similar benefits and mechanism differences are shown for OpenFOAM and LAMMPS.

Significance. If the empirical results are robust, the work provides evidence that lightweight, self-improving grounding adapters can bridge general coding agents to specialized scientific simulators, offering substantial time savings for domain scientists. The observation that different components (validation vs. memory/retrieval) dominate depending on the interface is a useful insight, and the self-evolution capability adds to the practical appeal. The approach is lightweight and does not require retraining the base agent.

major comments (2)

[Evaluation section] The experiments contrast only the bare agent against the full SIGA-equipped agent but do not include ablations that independently enhance the base agent's planning or code-repair loop while keeping the interface fixed. This leaves open whether the reported gains (e.g., TreeSim lift from 0.720 to 0.789) are limited by core agent capabilities once the contract is supplied, which is central to the claim that the executable contract is the dominant barrier.
[Experimental results] The reported results (e.g., 5-minute GEOS deck generation, TreeSim scores of 0.90 and 0.789, 16x std reduction) lack accompanying information on the number of independent runs, statistical significance tests, or the precise definition and computation of TreeSim, undermining confidence in the quantitative claims.

minor comments (1)

[Abstract] The abstract mentions 'TreeSim' without defining it or referencing its definition in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and robustness of our claims. We address each major point below, indicating planned revisions where appropriate.

read point-by-point responses

Referee: [Evaluation section] The experiments contrast only the bare agent against the full SIGA-equipped agent but do not include ablations that independently enhance the base agent's planning or code-repair loop while keeping the interface fixed. This leaves open whether the reported gains (e.g., TreeSim lift from 0.720 to 0.789) are limited by core agent capabilities once the contract is supplied, which is central to the claim that the executable contract is the dominant barrier.

Authors: We agree this is a valid concern for isolating the contribution of the interface contract. Our baseline uses the unmodified off-the-shelf agent (with its native planning and repair capabilities), and the large observed gaps support the contract as the primary barrier. However, we did not perform separate ablations that augment only the agent's internal planning or repair loop while freezing the interface. In revision we will add an explicit limitations paragraph acknowledging this gap and clarifying that future work could explore such controls; we do not claim the current results fully rule out further gains from agent-side improvements alone. revision: partial
Referee: [Experimental results] The reported results (e.g., 5-minute GEOS deck generation, TreeSim scores of 0.90 and 0.789, 16x std reduction) lack accompanying information on the number of independent runs, statistical significance tests, or the precise definition and computation of TreeSim, undermining confidence in the quantitative claims.

Authors: We will correct this omission. All quantitative results were obtained from 10 independent runs per condition using distinct random seeds. In the revised Experimental Setup section we will (1) state the run count and seed protocol, (2) provide the exact definition and computation of TreeSim (tree-edit distance normalized by subtree size between generated and reference decks), and (3) report paired statistical tests (e.g., Wilcoxon signed-rank) with p-values for the key comparisons. These additions will appear in both the main text and supplementary material. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system with no derivation chain or fitted predictions

full rationale

The paper describes an empirical adapter system (SIGA) evaluated on held-out simulator tasks using TreeSim. No equations, parameters fitted to subsets then re-predicted, or self-citations appear in the provided text. Self-evolution rewrites adapter contents from trajectories but is not shown to reduce to the evaluation metric by construction. Claims rest on direct comparisons (bare agent vs. grounded, human baseline) rather than any load-bearing self-definition or imported uniqueness theorem. This is a standard non-circular empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the unstated assumption that simulator interfaces are the primary missing piece for coding agents and that the four listed mechanisms are sufficient to supply it.

pith-pipeline@v0.9.1-grok · 5856 in / 1325 out tokens · 16489 ms · 2026-06-27T16:12:23.027092+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 3 canonical work pages

[1]

Alber, B

S. Alber, B. Chen, E. Sun, A. Isakova, A. J. Wilk, and J. Zou. Cellvoyager: Ai compbio agent generates new insights by autonomously analyzing biological data. Nature Methods, pages 1--11, 2026

2026
[2]

Y. W. Bekele. Geosim.ai: Ai assistants for numerical simulations in geomechanics, 2025. URL https://arxiv.org/abs/2501.14186

arXiv 2025
[3]

D. A. Boiko, R. MacKnight, B. Kline, and G. Gomes. Autonomous chemical research with large language models. Nature, 624: 0 570--578, 2023. doi:10.1038/s41586-023-06792-0

work page doi:10.1038/s41586-023-06792-0 2023
[4]

A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, and P. Schwaller. Chemcrow: Augmenting large-language models with chemistry tools, 2023. URL https://arxiv.org/abs/2304.05376

Pith/arXiv arXiv 2023
[5]

X. Chen, M. Lin, N. Schärli, and D. Zhou. Teaching large language models to self-debug, 2023. URL https://arxiv.org/abs/2304.05128

Pith/arXiv arXiv 2023
[6]

Y. Chen, X. Zhu, H. Zhou, and Z. Ren. Metaopenfoam: an llm-based multi-agent framework for cfd, 2024 a . URL https://arxiv.org/abs/2407.21320

arXiv 2024
[7]

Z. Chen, S. Chen, Y. Ning, Q. Zhang, B. Wang, B. Yu, Y. Li, Z. Liao, C. Wei, Z. Lu, V. Dey, M. Xue, F. N. Baker, B. Burns, D. Adu-Ampratwum, X. Huang, X. Ning, S. Gao, Y. Su, and H. Sun. Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery, 2024 b . URL https://arxiv.org/abs/2410.05080

arXiv 2024
[8]

Cursor Research , A. Chan, A. Shalaby, A. Wettig, A. Sanger, A. Zhai, A. Ajay, A. Nair, C. Snell, C. Lu, C. Shen, E. Jia, F. Cassano, H. Liu, H. Chen, H. Wildermuth, J. Jackson, J. Li, J. Katz, J. Yao, J. Hejna, J. Warner, J. Vering, K. Frans, L. Danilek, L. Wright, L. Cen, L. Melas-Kyriazi, M. Truell, M. de Jong, N. Jain, N. Schmidt, N. Wang, N. Muennigh...

arXiv 2026
[9]

C. Deng, T. Zhang, Z. He, Y. Xu, Q. Chen, Y. Shi, L. Fu, W. Zhang, X. Wang, C. Zhou, Z. Lin, and J. He. K2: A foundation language model for geoscience knowledge understanding and utilization, 2023. URL https://arxiv.org/abs/2306.05064

arXiv 2023
[10]

GEOS : A multiphysics simulation framework for subsurface applications, 2024

GEOS Development Team . GEOS : A multiphysics simulation framework for subsurface applications, 2024. URL https://github.com/GEOS-DEV/GEOS

2024
[11]

Guilbert, C

S. Guilbert, C. Masschelein, J. Goumaz, B. Naida, and P. Schwaller. Dynamate: An autonomous agent for protein-ligand molecular dynamics simulations, 2025. URL https://arxiv.org/abs/2512.10034

arXiv 2025
[12]

Holbrook, J

E. Holbrook, J. C. Verduzco, and A. Strachan. Evaluating llm-generated code for domain-specific languages: molecular dynamics with lammps, 2026. URL https://arxiv.org/abs/2603.20630

Pith/arXiv arXiv 2026
[13]

Huang, J

Y. Huang, J. Luo, Y. Yu, Y. Zhang, F. Lei, Y. Wei, S. He, L. Huang, X. Liu, J. Zhao, and K. Liu. Da-code: Agent data science code generation benchmark for large language models, 2024. URL https://arxiv.org/abs/2410.07331

arXiv 2024
[14]

Y. Lee, R. Nair, Q. Zhang, K. Lee, O. Khattab, and C. Finn. Meta-harness: End-to-end optimization of model harnesses, 2026. URL https://arxiv.org/abs/2603.28052

Pith/arXiv arXiv 2026
[15]

Lewis, E

P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. tau Yih, T. Rocktäschel, S. Riedel, and D. Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks, 2021. URL https://arxiv.org/abs/2005.11401

Pith/arXiv arXiv 2021
[16]

Z. Li, H. Zhang, S. Han, S. Liu, J. Xie, Y. Zhang, Y. Choi, J. Zou, and P. Lu. In-the-flow agentic system optimization for effective planning and tool use. In International Conference on Learning Representations (ICLR), 2026

2026
[17]

K.-A. Lie, O. Møyner, E. Svee, and J. Torben. Agentic scientific simulation: Execution-grounded model construction and reconstruction, 2026. URL https://arxiv.org/abs/2603.00214

arXiv 2026
[18]

J. Lin, S. Liu, C. Pan, L. Lin, S. Dou, Z. Xi, X. Huang, H. Yan, Z. Han, T. Gui, and Y.-G. Jiang. Agentic harness engineering: Observability-driven automatic evolution of coding-agent harnesses, 2026. URL https://arxiv.org/abs/2604.25850

Pith/arXiv arXiv 2026
[19]

Z. Lin, C. Deng, L. Zhou, T. Zhang, Y. Xu, Y. Xu, Z. He, Y. Shi, B. Dai, Y. Song, B. Zeng, Q. Chen, Y. Miao, B. Xue, S. Wang, L. Fu, W. Zhang, J. He, Y. Zhu, X. Wang, and C. Zhou. Geogalactica: A scientific large language model in geoscience, 2024. URL https://arxiv.org/abs/2401.00434

arXiv 2024
[20]

C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha. The ai scientist: Towards fully automated open-ended scientific discovery, 2024. URL https://arxiv.org/abs/2408.06292

Pith/arXiv arXiv 2024
[21]

Narayanan, J

S. Narayanan, J. D. Braza, R.-R. Griffiths, M. Ponnapati, A. Bou, J. Laurent, O. Kabeli, G. Wellawatte, S. Cox, S. G. Rodriques, and A. D. White. Aviary: training language agents on challenging scientific tasks, 2024. URL https://arxiv.org/abs/2412.21154

arXiv 2024
[22]

Ni and M

B. Ni and M. J. Buehler. Mechagents: Large language model multi-agent collaborations can solve mechanics problems, generate new data, and integrate knowledge, 2023. URL https://arxiv.org/abs/2311.08166

arXiv 2023
[23]

X. Ning, K. Tieu, D. Fu, T. Wei, Z. Li, Y. Bei, J. Zou, M. Ai, Z. Liu, T.-W. Li, L. Chen, Y. Zhao, K. Yang, B. Li, C. Qian, G. Li, X. Lin, Z. Zeng, R. Qiu, S. Chen, Y. Sun, X. Yang, R. Wang, R. Pan, C. Yang, D. Zhang, L. Fang, Z. Cui, Y. Cao, P. Chen, D. Sun, R. Chen, M. Srinivasan, N. Mathur, Y. Xia, H. Li, H. Yan, P. Lu, L. Zhang, T. Zhang, H. Tong, and...

Pith/arXiv arXiv 2026
[24]

Pandey, R

S. Pandey, R. Xu, W. Wang, and X. Chu. Openfoamgpt: A retrieval-augmented large language model (llm) agent for openfoam-based computational fluid dynamics. Physics of Fluids, 37 0 (3), Mar. 2025. ISSN 1089-7666. doi:10.1063/5.0257555. URL http://dx.doi.org/10.1063/5.0257555

work page doi:10.1063/5.0257555 2025
[25]

D. Park, H. Moon, and S. Ryu. A self-correcting multi-agent LLM framework for language-based physics simulation and explanation. npj Artificial Intelligence, 2 0 (1): 0 10, 2026. doi:10.1038/s44387-025-00057-z

work page doi:10.1038/s44387-025-00057-z 2026
[26]

Y. Ren, S. Yu, K. Chen, and J. Ma. Seismology modeling agent: A smart assistant for geophysical researchers, 2025. URL https://arxiv.org/abs/2512.14429

arXiv 2025
[27]

Z. Shi, H. A, Y. Shao, D. Huang, H. An, C. Xin, H. Shen, Z. Wang, Y. Na, G. Huang, and X. Jing. Mdagent2: Large language model for code generation and knowledge q&a in molecular dynamics, 2026. URL https://arxiv.org/abs/2601.02075

arXiv 2026
[28]

X. Tang, W. Xu, Y. Wang, Z. Guo, D. Shao, J. Chen, C. Zhang, Z. Wang, L. Zhang, G. Wan, W. Zhang, L. Bai, Z. Yin, P. Torr, H. Wang, and D. Jin. Eigen-1: Adaptive multi-agent refinement with monitor-based rag for scientific reasoning, 2025. URL https://arxiv.org/abs/2509.21193

arXiv 2025
[29]

X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y. Shao, N. Muennighoff, Y. Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G. Neubig. Openhands: An open platform for ai software developers as generalist agents, 2025. URL https://arxiv.org/abs/2407.16741

Pith/arXiv arXiv 2025
[30]

Yamada, R

Y. Yamada, R. T. Lange, C. Lu, S. Hu, C. Lu, J. Foerster, J. Clune, and D. Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search, 2025. URL https://arxiv.org/abs/2504.08066

Pith/arXiv arXiv 2025
[31]

J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. Swe-agent: Agent-computer interfaces enable automated software engineering, 2024 a . URL https://arxiv.org/abs/2405.15793

Pith/arXiv arXiv 2024
[32]

L. Yang, Z. Yu, T. Zhang, S. Cao, M. Xu, W. Zhang, J. E. Gonzalez, and B. Cui. Buffer of thoughts: Thought-augmented reasoning with large language models, 2024 b . URL https://arxiv.org/abs/2406.04271

arXiv 2024
[33]

Y. Yang, Z. Gong, W. Huang, Q. Yang, Z. Zhou, Z. Huang, Y. Li, X. Gao, Q. Dai, B. Liu, K. Qiu, Y. Yang, D. Chen, X. Yang, and C. Luo. Skillopt: Executive strategy for self-evolving agent skills, 2026. URL https://arxiv.org/abs/2605.23904

Pith/arXiv arXiv 2026
[34]

L. Yue, N. Somasekharan, T. Zhang, Y. Cao, Z. Chen, S. Di, and S. Pan. Foam-agent: Towards automated intelligent cfd workflows, 2025 a . URL https://arxiv.org/abs/2505.04997

arXiv 2025
[35]

L. Yue, N. Somasekharan, T. Zhang, Y. Cao, and S. Pan. Foam-agent 2.0: An end-to-end composable multi-agent framework for automating cfd simulation in openfoam, 2025 b . URL https://arxiv.org/abs/2509.18178

arXiv 2025
[36]

Zhang and H

T. Zhang and H. Sun. Scinav: A general agent framework for scientific coding tasks, 2026. URL https://arxiv.org/abs/2603.20256

arXiv 2026
[37]

Zhang, Z

T. Zhang, Z. Liu, Y. Xin, and Y. Jiao. Mooseagent: A llm based multi-agent framework for automating moose simulation, 2025. URL https://arxiv.org/abs/2504.08621

arXiv 2025
[38]

A. Zhao, A. Chandrasekhar, and A. B. Farimani. Polyjarvis: Llm agent for autonomous polymer md simulations, 2026. URL https://arxiv.org/abs/2604.02537

Pith/arXiv arXiv 2026

[1] [1]

Alber, B

S. Alber, B. Chen, E. Sun, A. Isakova, A. J. Wilk, and J. Zou. Cellvoyager: Ai compbio agent generates new insights by autonomously analyzing biological data. Nature Methods, pages 1--11, 2026

2026

[2] [2]

Y. W. Bekele. Geosim.ai: Ai assistants for numerical simulations in geomechanics, 2025. URL https://arxiv.org/abs/2501.14186

arXiv 2025

[3] [3]

D. A. Boiko, R. MacKnight, B. Kline, and G. Gomes. Autonomous chemical research with large language models. Nature, 624: 0 570--578, 2023. doi:10.1038/s41586-023-06792-0

work page doi:10.1038/s41586-023-06792-0 2023

[4] [4]

A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, and P. Schwaller. Chemcrow: Augmenting large-language models with chemistry tools, 2023. URL https://arxiv.org/abs/2304.05376

Pith/arXiv arXiv 2023

[5] [5]

X. Chen, M. Lin, N. Schärli, and D. Zhou. Teaching large language models to self-debug, 2023. URL https://arxiv.org/abs/2304.05128

Pith/arXiv arXiv 2023

[6] [6]

Y. Chen, X. Zhu, H. Zhou, and Z. Ren. Metaopenfoam: an llm-based multi-agent framework for cfd, 2024 a . URL https://arxiv.org/abs/2407.21320

arXiv 2024

[7] [7]

Z. Chen, S. Chen, Y. Ning, Q. Zhang, B. Wang, B. Yu, Y. Li, Z. Liao, C. Wei, Z. Lu, V. Dey, M. Xue, F. N. Baker, B. Burns, D. Adu-Ampratwum, X. Huang, X. Ning, S. Gao, Y. Su, and H. Sun. Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery, 2024 b . URL https://arxiv.org/abs/2410.05080

arXiv 2024

[8] [8]

Cursor Research , A. Chan, A. Shalaby, A. Wettig, A. Sanger, A. Zhai, A. Ajay, A. Nair, C. Snell, C. Lu, C. Shen, E. Jia, F. Cassano, H. Liu, H. Chen, H. Wildermuth, J. Jackson, J. Li, J. Katz, J. Yao, J. Hejna, J. Warner, J. Vering, K. Frans, L. Danilek, L. Wright, L. Cen, L. Melas-Kyriazi, M. Truell, M. de Jong, N. Jain, N. Schmidt, N. Wang, N. Muennigh...

arXiv 2026

[9] [9]

C. Deng, T. Zhang, Z. He, Y. Xu, Q. Chen, Y. Shi, L. Fu, W. Zhang, X. Wang, C. Zhou, Z. Lin, and J. He. K2: A foundation language model for geoscience knowledge understanding and utilization, 2023. URL https://arxiv.org/abs/2306.05064

arXiv 2023

[10] [10]

GEOS : A multiphysics simulation framework for subsurface applications, 2024

GEOS Development Team . GEOS : A multiphysics simulation framework for subsurface applications, 2024. URL https://github.com/GEOS-DEV/GEOS

2024

[11] [11]

Guilbert, C

S. Guilbert, C. Masschelein, J. Goumaz, B. Naida, and P. Schwaller. Dynamate: An autonomous agent for protein-ligand molecular dynamics simulations, 2025. URL https://arxiv.org/abs/2512.10034

arXiv 2025

[12] [12]

Holbrook, J

E. Holbrook, J. C. Verduzco, and A. Strachan. Evaluating llm-generated code for domain-specific languages: molecular dynamics with lammps, 2026. URL https://arxiv.org/abs/2603.20630

Pith/arXiv arXiv 2026

[13] [13]

Huang, J

Y. Huang, J. Luo, Y. Yu, Y. Zhang, F. Lei, Y. Wei, S. He, L. Huang, X. Liu, J. Zhao, and K. Liu. Da-code: Agent data science code generation benchmark for large language models, 2024. URL https://arxiv.org/abs/2410.07331

arXiv 2024

[14] [14]

Y. Lee, R. Nair, Q. Zhang, K. Lee, O. Khattab, and C. Finn. Meta-harness: End-to-end optimization of model harnesses, 2026. URL https://arxiv.org/abs/2603.28052

Pith/arXiv arXiv 2026

[15] [15]

Lewis, E

P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. tau Yih, T. Rocktäschel, S. Riedel, and D. Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks, 2021. URL https://arxiv.org/abs/2005.11401

Pith/arXiv arXiv 2021

[16] [16]

Z. Li, H. Zhang, S. Han, S. Liu, J. Xie, Y. Zhang, Y. Choi, J. Zou, and P. Lu. In-the-flow agentic system optimization for effective planning and tool use. In International Conference on Learning Representations (ICLR), 2026

2026

[17] [17]

K.-A. Lie, O. Møyner, E. Svee, and J. Torben. Agentic scientific simulation: Execution-grounded model construction and reconstruction, 2026. URL https://arxiv.org/abs/2603.00214

arXiv 2026

[18] [18]

J. Lin, S. Liu, C. Pan, L. Lin, S. Dou, Z. Xi, X. Huang, H. Yan, Z. Han, T. Gui, and Y.-G. Jiang. Agentic harness engineering: Observability-driven automatic evolution of coding-agent harnesses, 2026. URL https://arxiv.org/abs/2604.25850

Pith/arXiv arXiv 2026

[19] [19]

Z. Lin, C. Deng, L. Zhou, T. Zhang, Y. Xu, Y. Xu, Z. He, Y. Shi, B. Dai, Y. Song, B. Zeng, Q. Chen, Y. Miao, B. Xue, S. Wang, L. Fu, W. Zhang, J. He, Y. Zhu, X. Wang, and C. Zhou. Geogalactica: A scientific large language model in geoscience, 2024. URL https://arxiv.org/abs/2401.00434

arXiv 2024

[20] [20]

C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha. The ai scientist: Towards fully automated open-ended scientific discovery, 2024. URL https://arxiv.org/abs/2408.06292

Pith/arXiv arXiv 2024

[21] [21]

Narayanan, J

S. Narayanan, J. D. Braza, R.-R. Griffiths, M. Ponnapati, A. Bou, J. Laurent, O. Kabeli, G. Wellawatte, S. Cox, S. G. Rodriques, and A. D. White. Aviary: training language agents on challenging scientific tasks, 2024. URL https://arxiv.org/abs/2412.21154

arXiv 2024

[22] [22]

Ni and M

B. Ni and M. J. Buehler. Mechagents: Large language model multi-agent collaborations can solve mechanics problems, generate new data, and integrate knowledge, 2023. URL https://arxiv.org/abs/2311.08166

arXiv 2023

[23] [23]

X. Ning, K. Tieu, D. Fu, T. Wei, Z. Li, Y. Bei, J. Zou, M. Ai, Z. Liu, T.-W. Li, L. Chen, Y. Zhao, K. Yang, B. Li, C. Qian, G. Li, X. Lin, Z. Zeng, R. Qiu, S. Chen, Y. Sun, X. Yang, R. Wang, R. Pan, C. Yang, D. Zhang, L. Fang, Z. Cui, Y. Cao, P. Chen, D. Sun, R. Chen, M. Srinivasan, N. Mathur, Y. Xia, H. Li, H. Yan, P. Lu, L. Zhang, T. Zhang, H. Tong, and...

Pith/arXiv arXiv 2026

[24] [24]

Pandey, R

S. Pandey, R. Xu, W. Wang, and X. Chu. Openfoamgpt: A retrieval-augmented large language model (llm) agent for openfoam-based computational fluid dynamics. Physics of Fluids, 37 0 (3), Mar. 2025. ISSN 1089-7666. doi:10.1063/5.0257555. URL http://dx.doi.org/10.1063/5.0257555

work page doi:10.1063/5.0257555 2025

[25] [25]

D. Park, H. Moon, and S. Ryu. A self-correcting multi-agent LLM framework for language-based physics simulation and explanation. npj Artificial Intelligence, 2 0 (1): 0 10, 2026. doi:10.1038/s44387-025-00057-z

work page doi:10.1038/s44387-025-00057-z 2026

[26] [26]

Y. Ren, S. Yu, K. Chen, and J. Ma. Seismology modeling agent: A smart assistant for geophysical researchers, 2025. URL https://arxiv.org/abs/2512.14429

arXiv 2025

[27] [27]

Z. Shi, H. A, Y. Shao, D. Huang, H. An, C. Xin, H. Shen, Z. Wang, Y. Na, G. Huang, and X. Jing. Mdagent2: Large language model for code generation and knowledge q&a in molecular dynamics, 2026. URL https://arxiv.org/abs/2601.02075

arXiv 2026

[28] [28]

X. Tang, W. Xu, Y. Wang, Z. Guo, D. Shao, J. Chen, C. Zhang, Z. Wang, L. Zhang, G. Wan, W. Zhang, L. Bai, Z. Yin, P. Torr, H. Wang, and D. Jin. Eigen-1: Adaptive multi-agent refinement with monitor-based rag for scientific reasoning, 2025. URL https://arxiv.org/abs/2509.21193

arXiv 2025

[29] [29]

X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y. Shao, N. Muennighoff, Y. Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G. Neubig. Openhands: An open platform for ai software developers as generalist agents, 2025. URL https://arxiv.org/abs/2407.16741

Pith/arXiv arXiv 2025

[30] [30]

Yamada, R

Y. Yamada, R. T. Lange, C. Lu, S. Hu, C. Lu, J. Foerster, J. Clune, and D. Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search, 2025. URL https://arxiv.org/abs/2504.08066

Pith/arXiv arXiv 2025

[31] [31]

J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. Swe-agent: Agent-computer interfaces enable automated software engineering, 2024 a . URL https://arxiv.org/abs/2405.15793

Pith/arXiv arXiv 2024

[32] [32]

L. Yang, Z. Yu, T. Zhang, S. Cao, M. Xu, W. Zhang, J. E. Gonzalez, and B. Cui. Buffer of thoughts: Thought-augmented reasoning with large language models, 2024 b . URL https://arxiv.org/abs/2406.04271

arXiv 2024

[33] [33]

Y. Yang, Z. Gong, W. Huang, Q. Yang, Z. Zhou, Z. Huang, Y. Li, X. Gao, Q. Dai, B. Liu, K. Qiu, Y. Yang, D. Chen, X. Yang, and C. Luo. Skillopt: Executive strategy for self-evolving agent skills, 2026. URL https://arxiv.org/abs/2605.23904

Pith/arXiv arXiv 2026

[34] [34]

L. Yue, N. Somasekharan, T. Zhang, Y. Cao, Z. Chen, S. Di, and S. Pan. Foam-agent: Towards automated intelligent cfd workflows, 2025 a . URL https://arxiv.org/abs/2505.04997

arXiv 2025

[35] [35]

L. Yue, N. Somasekharan, T. Zhang, Y. Cao, and S. Pan. Foam-agent 2.0: An end-to-end composable multi-agent framework for automating cfd simulation in openfoam, 2025 b . URL https://arxiv.org/abs/2509.18178

arXiv 2025

[36] [36]

Zhang and H

T. Zhang and H. Sun. Scinav: A general agent framework for scientific coding tasks, 2026. URL https://arxiv.org/abs/2603.20256

arXiv 2026

[37] [37]

Zhang, Z

T. Zhang, Z. Liu, Y. Xin, and Y. Jiao. Mooseagent: A llm based multi-agent framework for automating moose simulation, 2025. URL https://arxiv.org/abs/2504.08621

arXiv 2025

[38] [38]

A. Zhao, A. Chandrasekhar, and A. B. Farimani. Polyjarvis: Llm agent for autonomous polymer md simulations, 2026. URL https://arxiv.org/abs/2604.02537

Pith/arXiv arXiv 2026