SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation
Pith reviewed 2026-06-27 16:12 UTC · model grok-4.3
The pith
A lightweight adapter supplies the executable contract that lets general coding agents set up complex scientific simulators like GEOS in minutes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SIGA is a Simulator-Interface Grounding Adapter that supplies the simulator's executable contract via retrieval, procedural memory, in-trajectory validation, and validation-enforced termination. Applied to GEOS, it produces a complete deck in about five minutes with TreeSim above 0.90, matching an extended-budget human expert who took about three hours. Self-evolution, which rewrites adapter contents from prior trajectories, yields the highest held-out GEOS mean. The approach transfers to OpenFOAM and LAMMPS, where the dominant mechanism depends on the interface.
What carries the argument
The Simulator-Interface Grounding Adapter, which supplies the missing executable contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination.
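The four mechanisms can be read as a single control loop. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: `Adapter`, `run_agent`, `toy_validate`, and `toy_propose` are hypothetical names, the section names are placeholders, and the "agent" is a stub that simply patches whatever the validator reports.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Placeholder section names, chosen for illustration only.
REQUIRED = ("Mesh", "Solvers", "Events")


@dataclass
class Adapter:
    """Hypothetical bundle of the contract components."""
    docs: Dict[str, str]                   # retrieval corpus: keyword -> snippet
    memory: List[str]                      # procedural memory: reusable notes
    validate: Callable[[str], List[str]]   # returns error messages, empty if valid

    def retrieve(self, task: str) -> List[str]:
        # Naive keyword lookup standing in for a real retriever.
        return [text for key, text in self.docs.items() if key in task.lower()]


def run_agent(task: str, adapter: Adapter, propose,
              max_steps: int = 5) -> Tuple[str, bool, int]:
    """Iterate until the validator accepts the deck or the budget is spent."""
    context = adapter.retrieve(task) + adapter.memory
    deck, errors = "", []
    for step in range(1, max_steps + 1):
        deck = propose(task, context, deck, errors)  # stand-in for the coding agent
        errors = adapter.validate(deck)              # in-trajectory validation
        if not errors:                               # validation-enforced termination
            return deck, True, step
    return deck, False, max_steps


def toy_validate(deck: str) -> List[str]:
    # Report every required section that does not appear in the deck.
    return [f"missing <{name}>" for name in REQUIRED if f"<{name}" not in deck]


def toy_propose(task, context, deck, errors):
    if not deck:
        return "<Mesh/>"  # first draft is deliberately incomplete
    # Repair step: add an empty element for each section the validator named.
    missing = [e.split("<")[1].rstrip(">") for e in errors]
    return deck + "".join(f"<{name}/>" for name in missing)


adapter = Adapter(docs={"mesh": "Mesh notes"},
                  memory=["Validate before finishing"],
                  validate=toy_validate)
deck, ok, steps = run_agent("set up a mesh", adapter, toy_propose)
```

The design point the sketch tries to capture is that the loop can only exit successfully through the validator, so an incomplete deck cannot be declared finished by the agent alone.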
If this is right
- Produces a complete GEOS deck in about five minutes with TreeSim above 0.90, matching an extended-budget human expert who took about three hours.
- On a harder held-out set, raises TreeSim from 0.720 (bare agent) to 0.789, a roughly 10% relative gain.
- Self-evolution, which rewrites adapter contents from prior trajectories, yields the highest held-out GEOS mean.
- Validation matters most when structural completeness is the bottleneck; memory and retrieval matter most when domain correctness is the bottleneck, as the OpenFOAM and LAMMPS transfers show.
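Self-evolution is described only as rewriting adapter contents from prior trajectories. As a rough illustration of that idea, the sketch below uses a rule-based stand-in: it promotes validator errors that recur across trajectories into procedural-memory notes. The function name `evolve_memory`, the `min_count` threshold, and the error strings are all assumptions made for this example; the paper's actual rewriting procedure is not specified here.

```python
from collections import Counter
from typing import List


def evolve_memory(memory: List[str], trajectories: List[List[str]],
                  min_count: int = 2) -> List[str]:
    """Append a note for each error seen in at least min_count trajectories."""
    counts = Counter()
    for errors in trajectories:
        counts.update(set(errors))  # count each error once per trajectory
    notes = [f"Avoid: {err}" for err, n in sorted(counts.items()) if n >= min_count]
    # Keep existing notes and skip duplicates so repeated runs stay stable.
    return memory + [note for note in notes if note not in memory]


# Hypothetical validator errors collected from three prior trajectories.
past = [["missing Events", "bad units"], ["missing Events"], []]
updated = evolve_memory(["Start from a template"], past)
```

Only the error that recurs in two trajectories is promoted; the one-off error is left out, which is one simple way to avoid overfitting memory to a single run.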
Where Pith is reading between the lines
- Similar grounding layers could be applied to other domain-specific languages beyond scientific simulators.
- The self-evolution mechanism suggests that agents might iteratively refine their own interfaces over multiple tasks.
- Reducing setup time could allow more rapid iteration in simulation-based research workflows.
Load-bearing premise
The main obstacle for coding agents on these simulators is missing knowledge of the executable contract rather than fundamental limits in their ability to plan or repair code.
What would settle it
A direct test would be to hand the full contract to the bare agent manually and measure whether performance reaches the level achieved with SIGA or whether planning failures persist.
Original abstract
Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-tool interface grounding: what minimal simulator-specific adaptations are needed for an off-the-shelf coding agent to operate real scientific software? Our intuition is that coding agents already know how to navigate files, edit code, run commands, and repair outputs, but they lack the simulator's executable contract: its vocabulary, structural constraints, validation rules, and termination conditions. We introduce SIGA, a Simulator-Interface Grounding Adapter that supplies this contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination. We primarily evaluate SIGA on GEOS, an open-source multiphysics simulator used in subsurface science. SIGA produces a complete GEOS deck in about five minutes with TreeSim above 0.90, matching an extended-budget human expert who took about three hours, a roughly 36x wall-clock speedup. On a harder held-out set, grounding raises TreeSim from 0.720 to 0.789, a roughly 10% relative gain over the bare agent, and can reduce the across-seed standard deviation by 16x. Self-evolution further improves SIGA by rewriting adapter contents from prior trajectories, yielding the highest held-out GEOS mean and matching or outperforming the strongest hand-designed configuration. Transfers to OpenFOAM and LAMMPS show that the dominant mechanism shifts by interface: validation matters most when structural completeness is the bottleneck, while memory and retrieval matter most when domain correctness is the bottleneck. These results suggest that lightweight, self-improvable grounding layers can turn general coding agents into practical operators of scientific software.
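The headline ratios in the abstract follow from its own figures. A quick check, assuming "about three hours" and "about five minutes" are taken at face value (variable names are illustrative):

```python
# Wall-clock speedup: human expert time divided by agent time.
human_minutes = 3 * 60
agent_minutes = 5
speedup = human_minutes / agent_minutes

# Relative TreeSim gain of the grounded agent over the bare agent.
bare_treesim = 0.720
grounded_treesim = 0.789
relative_gain = (grounded_treesim - bare_treesim) / bare_treesim

print(f"speedup: {speedup:.0f}x")
print(f"relative gain: {relative_gain:.1%}")
```

The speedup comes out at 36x and the relative gain at just under 10%, consistent with the "roughly" qualifiers in the abstract. The 16x reduction in across-seed standard deviation cannot be rechecked the same way, because the underlying standard deviations are not given in the abstract.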
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents SIGA, a Simulator-Interface Grounding Adapter that equips off-the-shelf coding agents with simulator-specific knowledge (vocabulary, constraints, validation rules, termination conditions) via retrieval, procedural memory, in-trajectory validation, and validation-enforced termination, plus a self-evolution component that rewrites the adapter from trajectories. On GEOS, it achieves a roughly 36x wall-clock speedup over a human expert with TreeSim above 0.90, and on a harder held-out set it improves TreeSim from 0.720 to 0.789 with reduced variance; similar benefits and mechanism differences are shown for OpenFOAM and LAMMPS.
Significance. If the empirical results are robust, the work provides evidence that lightweight, self-improving grounding adapters can bridge general coding agents to specialized scientific simulators, offering substantial time savings for domain scientists. The observation that different components (validation vs. memory/retrieval) dominate depending on the interface is a useful insight, and the self-evolution capability adds to the practical appeal. The approach is lightweight and does not require retraining the base agent.
major comments (2)
- [Evaluation section] The experiments contrast only the bare agent against the full SIGA-equipped agent but do not include ablations that independently enhance the base agent's planning or code-repair loop while keeping the interface fixed. This leaves open whether the reported gains (e.g., TreeSim lift from 0.720 to 0.789) are limited by core agent capabilities once the contract is supplied, which is central to the claim that the executable contract is the dominant barrier.
- [Experimental results] The reported results (e.g., 5-minute GEOS deck generation, TreeSim scores of 0.90 and 0.789, 16x std reduction) lack accompanying information on the number of independent runs, statistical significance tests, or the precise definition and computation of TreeSim, undermining confidence in the quantitative claims.
minor comments (1)
- [Abstract] The abstract mentions 'TreeSim' without defining it or referencing its definition in the main text.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the scope and robustness of our claims. We address each major point below, indicating planned revisions where appropriate.
- Referee: [Evaluation section] The experiments contrast only the bare agent against the full SIGA-equipped agent but do not include ablations that independently enhance the base agent's planning or code-repair loop while keeping the interface fixed. This leaves open whether the reported gains (e.g., TreeSim lift from 0.720 to 0.789) are limited by core agent capabilities once the contract is supplied, which is central to the claim that the executable contract is the dominant barrier.
  Authors: We agree this is a valid concern for isolating the contribution of the interface contract. Our baseline uses the unmodified off-the-shelf agent (with its native planning and repair capabilities), and the large observed gaps support the contract as the primary barrier. However, we did not perform separate ablations that augment only the agent's internal planning or repair loop while freezing the interface. In revision we will add an explicit limitations paragraph acknowledging this gap and clarifying that future work could explore such controls; we do not claim the current results fully rule out further gains from agent-side improvements alone.
  Revision: partial
- Referee: [Experimental results] The reported results (e.g., 5-minute GEOS deck generation, TreeSim scores of 0.90 and 0.789, 16x std reduction) lack accompanying information on the number of independent runs, statistical significance tests, or the precise definition and computation of TreeSim, undermining confidence in the quantitative claims.
  Authors: We will correct this omission. All quantitative results were obtained from 10 independent runs per condition using distinct random seeds. In the revised Experimental Setup section we will (1) state the run count and seed protocol, (2) provide the exact definition and computation of TreeSim (tree-edit distance normalized by subtree size between generated and reference decks), and (3) report paired statistical tests (e.g., Wilcoxon signed-rank) with p-values for the key comparisons. These additions will appear in both the main text and supplementary material.
  Revision: yes
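The exact TreeSim computation is not given in the text above, so it cannot be reproduced here. To show what a structural similarity score between a generated and a reference deck can look like, the sketch below swaps in a much simpler proxy: the overlap between multisets of root-to-node tag paths, normalized by the larger tree. This is not tree-edit distance and not the paper's metric; `tag_paths`, `path_similarity`, and the example decks are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text: str) -> Counter:
    """Multiset of root-to-node tag paths, e.g. 'Problem/Solvers'."""
    paths = Counter()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        paths[path] += 1
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


def path_similarity(generated: str, reference: str) -> float:
    """Shared paths divided by the node count of the larger tree (range 0 to 1)."""
    a, b = tag_paths(generated), tag_paths(reference)
    shared = sum((a & b).values())  # multiset intersection
    return shared / max(sum(a.values()), sum(b.values()))


# Illustrative decks only; the element names are placeholders.
reference_deck = "<Problem><Mesh/><Solvers/><Events/></Problem>"
generated_deck = "<Problem><Mesh/><Solvers/></Problem>"
score = path_similarity(generated_deck, reference_deck)  # 3 shared paths of 4
```

A proxy like this rewards structural completeness but ignores attribute values, so it would miss the domain-correctness errors that a fuller metric would need to capture.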
Circularity Check
No circularity: empirical system with no derivation chain or fitted predictions
The paper describes an empirical adapter system (SIGA) evaluated on held-out simulator tasks using TreeSim. No equations, parameters fitted to subsets then re-predicted, or self-citations appear in the provided text. Self-evolution rewrites adapter contents from trajectories but is not shown to reduce to the evaluation metric by construction. Claims rest on direct comparisons (bare agent vs. grounded, human baseline) rather than any load-bearing self-definition or imported uniqueness theorem. This is a standard non-circular empirical result.