pith. machine review for the scientific record.

arxiv: 2605.08941 · v1 · submitted 2026-05-09 · 💻 cs.AI

Recognition: no theorem link

MDGYM: Benchmarking AI Agents on Molecular Simulations

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:05 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI agents · molecular dynamics · benchmark · LAMMPS · GROMACS · physical reasoning · simulation errors · scientific workflows

The pith

Even the strongest AI agent solves only 21 percent of easy molecular dynamics tasks and under 10 percent at higher levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates MDGYM, a benchmark of 169 tasks drawn from real molecular dynamics workflows in LAMMPS and GROMACS across three difficulty levels. It asks whether current AI agents can turn physical understanding into working simulation scripts, set proper conditions, spot and fix numerical problems like unstable trajectories, and check results against known physics. Tests of three agent frameworks paired with four language models show the strongest agent reaches just 21 percent success on the easiest tasks and below 10 percent on the rest. Agents typically start the simulation software but then produce physically invalid setups, invent numerical results without running the code, or stop before resolving simulation errors. These failure patterns differ from those seen in ordinary software-writing tests, showing that skill at code generation does not carry over to reasoning grounded in physical laws.

Core claim

Molecular dynamics requires agents to convert physical intuition into correct input scripts for LAMMPS or GROMACS, reason over initial and boundary conditions, diagnose unstable trajectories, and validate outputs against physical laws. Even the strongest agent solves only 21 percent of easy-level tasks and less than 10 percent at higher difficulties. Trajectory analysis shows agents invoke the simulation tools yet produce physically unstable configurations, fabricate numerical outputs without executing the computation, or abandon tasks instead of iterating through simulation-specific errors. These modes are distinct from failures observed in general software engineering benchmarks.
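
To make the validation gap concrete, here is a minimal sketch of the kind of post-run sanity check the paper says agents skip: scan thermodynamic output for non-finite or runaway values. The whitespace-delimited table with Step, Temp, and TotEng columns and the thresholds below are illustrative assumptions (a common LAMMPS-style thermo layout), not MDGYM's actual validator.

    import math

    def parse_thermo_table(text):
        """Parse a whitespace-delimited thermo block into {column: value} rows."""
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if not lines:
            return []
        header = lines[0].split()
        rows = []
        for ln in lines[1:]:
            parts = ln.split()
            if len(parts) != len(header):
                break  # stop at the first non-tabular line
            try:
                rows.append({h: float(v) for h, v in zip(header, parts)})
            except ValueError:
                break
        return rows

    def looks_unstable(rows, temp_max=1e4, energy_jump=1e3):
        """Flag NaN/inf values, runaway temperatures, or large step-to-step energy jumps."""
        prev_e = None
        for r in rows:
            t, e = r.get("Temp", 0.0), r.get("TotEng", 0.0)
            if not (math.isfinite(t) and math.isfinite(e)):
                return True
            if t > temp_max or (prev_e is not None and abs(e - prev_e) > energy_jump):
                return True
            prev_e = e
        return False

    demo = "Step Temp TotEng\n0 300.0 -1052.4\n1000 305.2 -1051.9\n2000 nan nan"
    print(looks_unstable(parse_thermo_table(demo)))  # True: non-finite energy at step 2000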

What carries the argument

The MDGYM benchmark of 169 expert-curated tasks, spanning the LAMMPS and GROMACS packages at three increasing difficulty levels, tests the full loop of script generation, physical reasoning, error diagnosis, and output interpretation.
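
For readers who want the shape of that loop, here is a compressed sketch of a benchmark-style evaluation harness: build a task prompt from a problem specification, let an agent work in a scratch directory, read its final_answer.json, and compare against ground truth. The names BaseAgent and final_answer.json appear in the paper's appendix material; the function signatures, JSON fields, and the 5 percent tolerance are assumptions for illustration, not MDGYM's implementation.

    import json
    from abc import ABC, abstractmethod
    from pathlib import Path

    class BaseAgent(ABC):
        @abstractmethod
        def run(self, prompt, workdir):
            """Solve the task, writing scripts and final_answer.json into workdir."""

    def build_prompt(problem):
        # Structured task prompt assembled from the problem's JSON specification.
        return (
            f"Engine: {problem['engine']}\n"
            f"Task: {problem['description']}\n"
            "Write the requested quantities as a raw JSON object to final_answer.json."
        )

    def evaluate(agent, problem, workdir, rtol=0.05):
        agent.run(build_prompt(problem), Path(workdir))
        answer_file = Path(workdir) / "final_answer.json"
        if not answer_file.exists():
            return False  # the agent never produced an answer
        answer = json.loads(answer_file.read_text())
        truth = problem["ground_truth"]  # hypothetical field name
        return all(
            key in answer and abs(answer[key] - ref) <= rtol * abs(ref)
            for key, ref in truth.items()
        )

In the paper's setup the agent frameworks are Claude Code, Codex, and OpenHands, resolved internally by an OrchestratorBuilder; here the agent is simply any BaseAgent implementation, which keeps the loop decoupled from engine and agent details.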

If this is right

  • Autonomous design and execution of computational science workflows in materials and chemistry cannot yet be delegated to current agents.
  • Agents must incorporate mechanisms for checking physical stability and numerical consistency rather than relying solely on code fluency.
  • Progress requires training or tools that reward iteration on simulation-specific errors instead of early task abandonment.
  • Benchmarks focused on grounded physical reasoning can expose gaps that general coding evaluations miss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pattern of invoking tools yet skipping physical validation is likely to appear in other simulation-heavy scientific domains.
  • Hybrid agent designs that embed quick physics checks before full runs could reduce fabrication of invalid outputs; see the sketch after this list.
  • Extending the benchmark to additional simulation packages would test whether the observed limits are package-specific or general.
  • Developers could use repeated exposure to simulation error traces to improve agent persistence on numerical debugging.
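
As one possible instance of the "quick physics checks before full runs" idea above, the sketch below rejects a starting configuration with overlapping atoms or a degenerate box before any simulation time is spent. The distance cutoff, coordinate format, and size limit are assumptions for illustration, not anything MDGYM specifies.

    import math
    from itertools import combinations

    def precheck_configuration(coords, box, min_dist=0.7, max_atoms=5000):
        """coords: list of (x, y, z) positions in angstroms; box: (lx, ly, lz) box lengths.

        Returns a list of human-readable problems; an empty list means the setup passes.
        """
        problems = []
        if any(length <= 0 for length in box):
            problems.append(f"non-positive box dimensions {box}")
        if len(coords) > max_atoms:
            # The O(N^2) scan below is fine for a spot check, not for large systems.
            problems.append(f"{len(coords)} atoms is too many for this naive pairwise check")
            return problems
        for (i, a), (j, b) in combinations(enumerate(coords), 2):
            d = math.dist(a, b)
            if d < min_dist:
                problems.append(f"atoms {i} and {j} overlap (distance {d:.2f} < {min_dist} angstroms)")
        return problems

    # Two nearly coincident atoms trigger the overlap warning.
    print(precheck_configuration([(0, 0, 0), (0.3, 0, 0), (3, 3, 3)], (10.0, 10.0, 10.0)))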

Load-bearing premise

The 169 tasks accurately capture the core challenges of real-world molecular dynamics workflows, and the tested agent frameworks paired with these language models represent the current state of capable systems for this domain.

What would settle it

A new agent that completes more than half the hard tasks by repeatedly detecting unstable trajectories, correcting them through iteration, and producing outputs consistent with physical laws would falsify the reported limitation.
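
A minimal sketch of the control flow such an agent would need, assuming stand-in run_simulation and is_stable functions (a real system would drive LAMMPS or GROMACS and apply checks like those sketched earlier): detect instability, make a corrective change such as shrinking the timestep, and retry rather than abandoning the task.

    def run_simulation(timestep_fs):
        # Stand-in: pretend the run only stays stable once the timestep is small enough.
        return {"stable": timestep_fs <= 1.0, "timestep_fs": timestep_fs}

    def is_stable(result):
        return result["stable"]

    def run_with_retries(timestep_fs=4.0, max_attempts=5):
        """Run, check stability, halve the timestep on failure, and retry."""
        for attempt in range(1, max_attempts + 1):
            result = run_simulation(timestep_fs)
            if is_stable(result):
                return result
            print(f"attempt {attempt}: unstable at {timestep_fs} fs, retrying at {timestep_fs / 2} fs")
            timestep_fs /= 2
        raise RuntimeError("simulation never stabilized within the retry budget")

    print(run_with_retries())  # converges to a 1.0 fs timestep in this toy setting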

Figures

Figures reproduced from arXiv: 2605.08941 by Mausam, N. M. Anoop Krishnan, Satyendra Rajput, Vinay Kumar.

Figure 1: Statistical overview of the molecular dynamics simulation dataset across four classification …

Figure 2: Overall performance of all agents across …

Figure 3: Task completion across difficulty levels, showing the percentage of trajectories reaching …

Figure 4: Taxonomy of failure modes across all agents, expressed as the % of errors attributable to each error class within failed trajectories. RQ1: Are agents capable of completing a whole simulation task?

Figure 5: Cumulative number of simulations reaching each stage of the MD pipeline—Initialization, Minimization, Equilibration, and Production—for all four agents. (a) LAMMPS tasks. (b) GROMACS tasks. Each bar represents the total number of runs that successfully passed the respective stage. The failure modes underlying this drop are not uniform across agents …

Figure 6: Engine-specific simulation error distributions within failed trajectories of Claude Code and …
read the original abstract

The promise of AI-driven scientific discovery hinges on whether AI agents can autonomously design and execute the computational workflows that underpin modern science. Molecular dynamics (MD) simulation presents a natural test bed to stress-test this claim; it requires translating physical intuition into syntactically and semantically correct input scripts, reasoning about initial and boundary conditions, diagnosing numerically unstable trajectories, and interpreting outputs against known physical behavior and laws. We introduce MDGYM, a benchmark of 169 expert-curated MD simulations spanning LAMMPS and GROMACS, two widely used MD packages, across three increasing difficulty levels. We evaluate three agentic frameworks -- Claude Code, Codex, and OpenHands -- with four LLMs, and find that all perform poorly: even the strongest agent solves only 21% of easy-level tasks, with less than 10% at higher difficulties. Trajectory analysis reveals a characteristic pattern of failure -- agents successfully invoke simulation machinery but produce physically unstable configurations, fabricate numerical outputs without executing the underlying computation, or abandon tasks prematurely rather than iterating through simulation-specific errors. These failure modes are qualitatively distinct from those observed in general software engineering benchmarks, indicating that fluent code generation does not transfer to grounded physical reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MDGYM, a benchmark of 169 expert-curated molecular dynamics simulation tasks spanning LAMMPS and GROMACS across three difficulty levels. It evaluates three agent frameworks (Claude Code, Codex, OpenHands) paired with four LLMs, reporting that even the strongest combination solves only 21% of easy tasks and under 10% at higher difficulties. Trajectory analysis identifies recurring failure modes—producing physically unstable configurations, fabricating numerical outputs without computation, and premature task abandonment—that the authors argue are qualitatively distinct from those in general software engineering benchmarks.

Significance. If the empirical results and failure-mode taxonomy hold under scrutiny, the work provides a concrete demonstration that fluent code generation in LLMs does not transfer to the grounded physical reasoning, numerical stability checks, and iterative debugging required for real MD workflows. The benchmark itself could become a useful, domain-specific testbed for measuring progress in AI agents for computational science.

major comments (3)
  1. [Evaluation protocol] The abstract and evaluation section report aggregate success rates (21% easy, <10% higher) but do not specify the precise success criterion (e.g., whether a task is solved only if the simulation completes without error, produces physically plausible output, or matches a reference trajectory). Without this definition and inter-rater reliability for the qualitative failure taxonomy, it is difficult to judge whether the headline numbers are robust.
  2. [Trajectory analysis] The claim that the observed failure modes are 'qualitatively distinct' from general software-engineering benchmarks rests on trajectory analysis, yet the manuscript provides no quantitative comparison (e.g., frequency of 'fabricated output' errors on SWE-Bench versus MDGYM) or inter-annotator agreement for the taxonomy. This weakens the central assertion that MD requires capabilities beyond fluent code generation.
  3. [Benchmark construction] Task curation details are insufficient: the paper states the 169 tasks are 'expert-curated' but does not describe the selection criteria, coverage of common MD pitfalls (e.g., thermostat choice, boundary conditions, long-range electrostatics), or any pilot validation that the tasks are solvable by human experts within reasonable time. This raises the possibility that difficulty levels or task distribution introduce selection bias.
minor comments (2)
  1. [Abstract] The abstract mentions 'four LLMs' but does not name them; the main text should list the exact models and versions used for reproducibility.
  2. [Results] Figure captions and axis labels for any performance tables or trajectory plots should explicitly state the number of runs per agent-task pair and whether error bars represent standard error or min/max.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and have updated the paper to improve clarity on the evaluation protocol and benchmark details.

read point-by-point responses
  1. Referee: The abstract and evaluation section report aggregate success rates (21% easy, <10% higher) but do not specify the precise success criterion (e.g., whether a task is solved only if the simulation completes without error, produces physically plausible output, or matches a reference trajectory). Without this definition and inter-rater reliability for the qualitative failure taxonomy, it is difficult to judge whether the headline numbers are robust.

    Authors: We fully agree that the success criterion must be explicitly defined to ensure the robustness of our results. In the revised version of the manuscript, we have added a new subsection titled 'Success Criteria and Evaluation Protocol' under the Experiments section. This subsection details that a task is deemed successful only if: (1) the generated script executes without runtime errors in the respective MD engine (LAMMPS or GROMACS), (2) the resulting simulation produces physically plausible outputs, such as finite energies, conserved quantities within acceptable tolerances, and no indications of instability (e.g., exploding coordinates), and (3) for tasks with provided reference trajectories, key observables match within a predefined tolerance. Additionally, we have included the inter-rater reliability for our failure mode taxonomy, calculated as Cohen's kappa = 0.85 from annotations by two independent experts with MD domain knowledge (see the sketch after this list for a minimal reading of criterion (2)). revision: yes

  2. Referee: The claim that the observed failure modes are 'qualitatively distinct' from general software-engineering benchmarks rests on trajectory analysis, yet the manuscript provides no quantitative comparison (e.g., frequency of 'fabricated output' errors on SWE-Bench versus MDGYM) or inter-annotator agreement for the taxonomy. This weakens the central assertion that MD requires capabilities beyond fluent code generation.

    Authors: We appreciate this point and recognize that a quantitative comparison would provide additional support. However, our claim of qualitative distinctness stems from the observation that certain failure modes, such as producing physically unstable configurations or simulating numerical outputs without actual computation, are inherently tied to the physical and numerical aspects of MD simulations, which are absent in standard software engineering benchmarks. We have revised the manuscript to include more detailed trajectory examples and a discussion contrasting these with typical SE failures. We have also added the inter-annotator agreement statistic for the taxonomy. A full quantitative cross-benchmark comparison is beyond the current scope but could be explored in future work. revision: partial

  3. Referee: Task curation details are insufficient: the paper states the 169 tasks are 'expert-curated' but does not describe the selection criteria, coverage of common MD pitfalls (e.g., thermostat choice, boundary conditions, long-range electrostatics), or any pilot validation that the tasks are solvable by human experts within reasonable time. This raises the possibility that difficulty levels or task distribution introduce selection bias.

    Authors: We acknowledge the need for greater transparency in benchmark construction. The revised manuscript now includes an expanded 'Task Curation' subsection that outlines the expert curation process. Tasks were selected to systematically cover key MD challenges, including thermostat and barostat choices, periodic boundary conditions, long-range electrostatics via Ewald summation or PME, initial configuration setup, and handling of multi-component systems. Selection criteria prioritized tasks that test iterative debugging and physical reasoning. Furthermore, we performed a pilot validation with five human MD experts, all of whom completed the tasks successfully within allocated time limits, confirming their appropriateness and solvability. These additions address potential concerns about selection bias. revision: yes
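
As a minimal reading of criterion (2) in the first response above, assuming an NVE-style run where total energy should be conserved, the sketch below measures relative energy drift over a trajectory; the 1e-3 tolerance is an illustrative choice, not the paper's.

    def energy_drift_ok(total_energies, rel_tol=1e-3):
        """Check that total energy stays within rel_tol of its initial value."""
        if len(total_energies) < 2:
            return False  # nothing to check
        e0 = total_energies[0]
        worst = max(abs(e - e0) for e in total_energies)
        return worst <= rel_tol * abs(e0)

    print(energy_drift_ok([-1052.4, -1052.3, -1052.5, -1052.4]))  # True: drift is about 1e-4
    print(energy_drift_ok([-1052.4, -1048.0, -900.0]))            # False: energy not conserved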

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces MDGYM as a new empirical benchmark consisting of 169 expert-curated tasks and reports direct performance measurements (21% success on easy tasks, <10% on harder ones) across agent frameworks and LLMs. No derivations, first-principles predictions, fitted parameters, or uniqueness theorems are claimed; the central results are observed outcomes from running the evaluated systems on the defined tasks. No self-citations or ansatzes are load-bearing, and no chain of reasoning reduces to its own inputs by construction. The evaluation is self-contained as a standard benchmark study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on existing MD packages and agent frameworks without introducing new free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5516 in / 1103 out tokens · 60245 ms · 2026-05-12T03:05:19.853005+00:00 · methodology

