
arxiv: 2606.09774 · v1 · pith:XPJVDKOAnew · submitted 2026-06-08 · 💻 cs.AI · cs.CL

SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation

Pith reviewed 2026-06-27 16:12 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords coding agents · scientific simulators · interface grounding · self-evolution · GEOS simulator · agent adaptation · simulation setup

The pith

A lightweight adapter supplies the executable contract that lets general coding agents set up complex scientific simulators like GEOS in minutes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to make off-the-shelf coding agents work with specialized scientific simulators that require custom input languages. It argues that the main missing piece is knowledge of the simulator's vocabulary, constraints, and rules rather than general planning skills. SIGA provides this through targeted components and shows large speedups on real tasks. A sympathetic reader would care because domain scientists currently spend hours or days learning these interfaces, and automation could free them for higher-level work.

Core claim

SIGA is a Simulator-Interface Grounding Adapter that supplies the simulator's executable contract via retrieval, procedural memory, in-trajectory validation, and validation-enforced termination. When applied to GEOS, it produces complete decks in about five minutes with TreeSim above 0.90, matching a human expert who took three hours. Self-evolution by rewriting adapter contents from prior trajectories yields further gains on held-out sets, and the approach transfers to OpenFOAM and LAMMPS with mechanism shifts depending on the interface.

What carries the argument

The Simulator-Interface Grounding Adapter, which supplies the missing executable contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination.
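As a rough illustration of what in-trajectory validation with validation-enforced termination could look like, the sketch below wraps a stand-in "agent" in a loop that only stops once a validator accepts the deck. Everything here is an assumption made for illustration: the `REQUIRED_SECTIONS` tuple, the `validate_deck` and `run_with_enforced_termination` helpers, and the toy agent are hypothetical and are not taken from the paper or from any simulator's actual schema.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional

# Hypothetical required top-level sections; a real simulator schema would define these.
REQUIRED_SECTIONS = ("Mesh", "Solvers", "Events")


def validate_deck(xml_text: str) -> List[str]:
    """Return a list of problems; an empty list means the deck passes."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    present = {child.tag for child in root}
    return [f"missing <{name}> section" for name in REQUIRED_SECTIONS if name not in present]


def run_with_enforced_termination(
    propose: Callable[[List[str]], str],
    max_steps: int = 5,
) -> Optional[str]:
    """Call `propose` with the latest validator feedback until the deck validates.

    `propose` stands in for the coding agent: it receives the current list of
    validation errors and returns a new deck. The loop never accepts an invalid
    deck and returns None only when the step budget runs out.
    """
    errors: List[str] = []
    for _ in range(max_steps):
        deck = propose(errors)
        errors = validate_deck(deck)
        if not errors:
            return deck
    return None


def toy_agent(errors: List[str]) -> str:
    """Toy stand-in for an agent: adds whichever sections the validator asked for."""
    sections = ["Mesh"]
    for name in REQUIRED_SECTIONS:
        if any(name in e for e in errors) and name not in sections:
            sections.append(name)
    body = "".join(f"<{s}/>" for s in sections)
    return f"<Problem>{body}</Problem>"


if __name__ == "__main__":
    print(run_with_enforced_termination(toy_agent))
```

The design point the sketch tries to capture is that termination is decided by the validator rather than by the agent's own judgment, so an incomplete deck cannot be returned as finished.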

If this is right

  • SIGA produces a complete GEOS deck in about five minutes with TreeSim above 0.90, matching a human expert who took about three hours.
  • On a harder held-out set, grounding raises TreeSim from 0.720 to 0.789 over the bare agent.
  • Self-evolution, which rewrites adapter contents from prior trajectories, achieves the highest held-out GEOS mean.
  • The dominant mechanism depends on the interface: validation matters most when structural completeness is the bottleneck, while memory and retrieval matter most when domain correctness is the bottleneck.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar grounding layers could be applied to other domain-specific languages beyond scientific simulators.
  • The self-evolution mechanism suggests that agents might iteratively refine their own interfaces over multiple tasks.
  • Reducing setup time could allow more rapid iteration in simulation-based research workflows.

Load-bearing premise

The main obstacle for coding agents on these simulators is missing knowledge of the executable contract rather than fundamental limits in their ability to plan or repair code.

What would settle it

A direct test would provide the full contract manually to the bare agent and measure whether performance reaches the level achieved with SIGA or whether planning failures persist.

Figures

Figures reproduced from arXiv: 2606.09774 by Audrey Wang, Brian Liu, Jixuan Chen, Lianhui Qin, Matthew Ho.

Figure 1: Illustrative example of advanced tooling usage bottleneck for the geophysics domain.
Figure 2: The SIGA method. A natural-language simulation brief feeds into the base coding agent (a frozen harness H0 wrapping a frozen model π), which runs its generic context→act→observe loop to author a configuration deck. The SIGA adapter grounds this loop at three interfaces, without modifying the loop itself: always-on procedural memory (M) injected into the system context; retrieval (R) and an XML validator (X…
Original abstract

Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-tool interface grounding: what minimal simulator-specific adaptations are needed for an off-the-shelf coding agent to operate real scientific software? Our intuition is that coding agents already know how to navigate files, edit code, run commands, and repair outputs, but they lack the simulator's executable contract: its vocabulary, structural constraints, validation rules, and termination conditions. We introduce SIGA, a Simulator-Interface Grounding Adapter that supplies this contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination. We primarily evaluate SIGA on GEOS, an open-source multiphysics simulator used in subsurface science. SIGA produces a complete GEOS deck in about five minutes with TreeSim above 0.90, matching an extended-budget human expert who took about three hours, a roughly 36x wall-clock speedup. On a harder held-out set, grounding raises TreeSim from 0.720 to 0.789, a roughly 10% relative gain over the bare agent, and can reduce the across-seed standard deviation by 16x. Self-evolution further improves SIGA by rewriting adapter contents from prior trajectories, yielding the highest held-out GEOS mean and matching or outperforming the strongest hand-designed configuration. Transfers to OpenFOAM and LAMMPS show that the dominant mechanism shifts by interface: validation matters most when structural completeness is the bottleneck, while memory and retrieval matter most when domain correctness is the bottleneck. These results suggest that lightweight, self-improvable grounding layers can turn general coding agents into practical operators of scientific software.
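The two headline ratios in the abstract can be checked with simple arithmetic. The sketch below assumes "about three hours" means 180 minutes and "about five minutes" means 5 minutes; the function names are hypothetical helpers, not anything from the paper.

```python
def speedup(baseline_minutes: float, new_minutes: float) -> float:
    """Wall-clock speedup of the new workflow over the baseline."""
    return baseline_minutes / new_minutes


def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


if __name__ == "__main__":
    # Assumed timings: 3 hours (180 min) for the expert, 5 min for SIGA.
    print(speedup(180, 5))                               # 36.0
    # TreeSim on the harder held-out set: bare agent 0.720, grounded 0.789.
    print(round(relative_gain(0.720, 0.789) * 100, 1))   # 9.6
```

Under these assumptions the numbers are consistent with the abstract's "roughly 36x" and "roughly 10%" figures; the 16x reduction in standard deviation cannot be checked the same way because the underlying per-seed values are not given here.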

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents SIGA, a Simulator-Interface Grounding Adapter that equips off-the-shelf coding agents with simulator-specific knowledge (vocabulary, constraints, validation rules, termination conditions) via retrieval, procedural memory, in-trajectory validation, and validation-enforced termination, plus a self-evolution component that rewrites the adapter from trajectories. On GEOS, it achieves ~36x speedup over human experts with TreeSim >0.90, and on held-out sets improves TreeSim from 0.720 to 0.789 with reduced variance; similar benefits and mechanism differences are shown for OpenFOAM and LAMMPS.

Significance. If the empirical results are robust, the work provides evidence that lightweight, self-improving grounding adapters can bridge general coding agents to specialized scientific simulators, offering substantial time savings for domain scientists. The observation that different components (validation vs. memory/retrieval) dominate depending on the interface is a useful insight, and the self-evolution capability adds to the practical appeal. The approach is lightweight and does not require retraining the base agent.

major comments (2)
  1. [Evaluation section] The experiments contrast only the bare agent against the full SIGA-equipped agent but do not include ablations that independently enhance the base agent's planning or code-repair loop while keeping the interface fixed. This leaves open whether the reported gains (e.g., TreeSim lift from 0.720 to 0.789) are limited by core agent capabilities once the contract is supplied, which is central to the claim that the executable contract is the dominant barrier.
  2. [Experimental results] The reported results (e.g., 5-minute GEOS deck generation, TreeSim scores of 0.90 and 0.789, 16x std reduction) lack accompanying information on the number of independent runs, statistical significance tests, or the precise definition and computation of TreeSim, undermining confidence in the quantitative claims.
minor comments (1)
  1. [Abstract] The abstract mentions 'TreeSim' without defining it or referencing its definition in the main text.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and robustness of our claims. We address each major point below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Evaluation section] The experiments contrast only the bare agent against the full SIGA-equipped agent but do not include ablations that independently enhance the base agent's planning or code-repair loop while keeping the interface fixed. This leaves open whether the reported gains (e.g., TreeSim lift from 0.720 to 0.789) are limited by core agent capabilities once the contract is supplied, which is central to the claim that the executable contract is the dominant barrier.

    Authors: We agree this is a valid concern for isolating the contribution of the interface contract. Our baseline uses the unmodified off-the-shelf agent (with its native planning and repair capabilities), and the large observed gaps support the contract as the primary barrier. However, we did not perform separate ablations that augment only the agent's internal planning or repair loop while freezing the interface. In revision we will add an explicit limitations paragraph acknowledging this gap and clarifying that future work could explore such controls; we do not claim the current results fully rule out further gains from agent-side improvements alone. revision: partial

  2. Referee: [Experimental results] The reported results (e.g., 5-minute GEOS deck generation, TreeSim scores of 0.90 and 0.789, 16x std reduction) lack accompanying information on the number of independent runs, statistical significance tests, or the precise definition and computation of TreeSim, undermining confidence in the quantitative claims.

    Authors: We will correct this omission. All quantitative results were obtained from 10 independent runs per condition using distinct random seeds. In the revised Experimental Setup section we will (1) state the run count and seed protocol, (2) provide the exact definition and computation of TreeSim (tree-edit distance normalized by subtree size between generated and reference decks), and (3) report paired statistical tests (e.g., Wilcoxon signed-rank) with p-values for the key comparisons. These additions will appear in both the main text and supplementary material. revision: yes
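To make the parenthetical description of TreeSim in the simulated response above more concrete, here is a minimal sketch of one way a tree-edit-based similarity between two XML decks could be computed. It is not the paper's metric: it swaps in a simpler Selkow-style top-down edit distance instead of a general tree-edit distance, and both the node label (tag plus attributes) and the normalization by the combined size of the two trees are assumptions chosen for illustration. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def size(node: ET.Element) -> int:
    """Number of elements in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node)


def label(node: ET.Element) -> tuple:
    """Node label: tag plus sorted attributes (an assumption about what counts)."""
    return (node.tag, tuple(sorted(node.attrib.items())))


def top_down_distance(a: ET.Element, b: ET.Element) -> int:
    """Selkow-style top-down edit distance between two ordered trees.

    Relabeling a node costs 1; inserting or deleting a child costs the size
    of that child's whole subtree. Child lists are aligned by dynamic programming.
    """
    cost = 0 if label(a) == label(b) else 1
    xs, ys = list(a), list(b)
    # dp[i][j] = cost of turning the first i children of a into the first j of b.
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i, x in enumerate(xs, 1):
        dp[i][0] = dp[i - 1][0] + size(x)
    for j, y in enumerate(ys, 1):
        dp[0][j] = dp[0][j - 1] + size(y)
    for i, x in enumerate(xs, 1):
        for j, y in enumerate(ys, 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(x),                      # delete subtree x
                dp[i][j - 1] + size(y),                      # insert subtree y
                dp[i - 1][j - 1] + top_down_distance(x, y),  # match x with y
            )
    return cost + dp[len(xs)][len(ys)]


def tree_similarity(generated_xml: str, reference_xml: str) -> float:
    """Similarity in (0, 1]: 1 minus the distance over the combined tree size."""
    g, r = ET.fromstring(generated_xml), ET.fromstring(reference_xml)
    return 1.0 - top_down_distance(g, r) / (size(g) + size(r))


if __name__ == "__main__":
    reference = '<Problem><Mesh/><Solvers><Flow name="f"/></Solvers><Events/></Problem>'
    generated = '<Problem><Mesh/><Solvers><Flow name="g"/></Solvers></Problem>'
    print(tree_similarity(generated, reference))
```

In this toy pair the generated deck has one wrong attribute and one missing section, so the distance is 2 over a combined size of 9 elements. Identical decks score exactly 1.0.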

Circularity Check

0 steps flagged

No circularity: empirical system with no derivation chain or fitted predictions

Full rationale

The paper describes an empirical adapter system (SIGA) evaluated on held-out simulator tasks using TreeSim. No equations, parameters fitted to subsets then re-predicted, or self-citations appear in the provided text. Self-evolution rewrites adapter contents from trajectories but is not shown to reduce to the evaluation metric by construction. Claims rest on direct comparisons (bare agent vs. grounded, human baseline) rather than any load-bearing self-definition or imported uniqueness theorem. This is a standard non-circular empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the unstated assumption that simulator interfaces are the primary missing piece for coding agents and that the four listed mechanisms are sufficient to supply it.

pith-pipeline@v0.9.1-grok · 5856 in / 1325 out tokens · 16489 ms · 2026-06-27T16:12:23.027092+00:00 · methodology

