pith. sign in

arxiv: 2605.26087 · v1 · pith:GRDJU6JGnew · submitted 2026-05-25 · 📊 stat.ML · cs.LG

DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

Pith reviewed 2026-06-29 20:30 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords LLM benchmarkingscientific discoveryphysics simulationagent reasoninghypothesis refinementN-body simulationout-of-distribution physicsexplanation quality
0
0 comments X

The pith

LLM agents discover laws in simulated non-standard physics worlds in only half of cases, with predictive accuracy failing to ensure explanation quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DiscoverPhysics, an interactive benchmark that requires LLM agents to uncover the motion laws of 22 simulated worlds whose physics deviates from standard models through choices like screened gravity, fractional powers, multi-species couplings, and hidden particles. Agents design sequences of experiments in an N-body simulator, inspect raw trajectory data, refine hypotheses over multiple rounds, and finally submit both a natural-language account and a Python implementation of the inferred law. Evaluation combines held-out trajectory prediction error with an LLM judge applying an expert rubric for conceptual grasp. Top frontier models succeed on roughly half the worlds and falter where latent structures must be actively uncovered rather than observed directly. The work shows that strong predictive performance on trajectories does not translate into high-quality explanations and that genuine conceptual understanding hinges on iterative experimental refinement.

Core claim

DiscoverPhysics asks LLM agents to discover the laws of motion in 22 on-demand simulated worlds whose physics includes screened and fractional-power gravity, multi-species couplings, hidden dark-matter-like particles, non-coordinate-free rules, and time-varying interactions. Each agent proposes experiments, receives raw trajectory data from an N-body simulator, iterates on hypotheses, and submits a natural-language explanation plus a Python code implementation. Submissions are scored on held-out trajectory MSE and on an LLM judge's assessment against an expert-written rubric measuring conceptual understanding. Across eleven frontier models the strongest agents pass only half the worlds and f

What carries the argument

An interactive benchmark built on an N-body simulator that generates worlds with deliberately non-standard physics on demand, forcing agents to design informative experiments, observe trajectory histories, and submit both explanations and code.

If this is right

  • Good predictive accuracy on held-out trajectories does not guarantee high explanation quality.
  • Conceptual understanding requires iterative hypothesis refinement through well-chosen experiments.
  • Commercial models outperform open-source models in designing informative experiments and extracting conclusions from data.
  • Agents fail most often on worlds whose latent structure must be uncovered rather than observed directly.
  • The benchmark measures long-horizon reasoning over an experimental history rather than single-shot recall.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Current LLMs may depend more on surface pattern matching than on generating and testing new physical hypotheses when facing truly novel regimes.
  • Adding real experimental noise or sensor limitations to the simulator could reveal whether agents can handle incomplete or noisy observations.
  • Training loops that reward both accurate prediction and explicit hypothesis updates might close the gap between the two evaluation axes.
  • The same setup could be adapted to chemistry or biology simulators to test cross-domain scientific discovery.

Load-bearing premise

The LLM judge using the expert rubric gives a valid measure of conceptual understanding and the simulator correctly realizes the intended non-standard physics without implementation errors that could mislead agents.

What would settle it

A model that scores high on the explanation rubric for worlds with hidden particles yet produces large trajectory MSE on held-out particles, or systematic disagreement between the LLM judge and independent human raters on explanation quality.

Figures

Figures reproduced from arXiv: 2605.26087 by Andrew Gordon Wilson, Carolina Cuesta-L\'azaro, Lindsay M. Smith, Matt L. Wiemann, Pavel Izmailov, Peter Melchior, Siddharth Mishra-Sharma.

Figure 1
Figure 1. Figure 1: Schematic of the DISCOVERPHYSICS benchmarking pipeline. Our benchmark does not store data products, instead data is generated on demand by a N-body simulator from the equations specified by the world definition. We then generate an initial set of particle trajectories and feed this, as well as the agent prompt about task and experiment instructions, to an LLM. The LLM agent is tasked with discovering the p… view at source ↗
Figure 2
Figure 2. Figure 2: Performance metrics of the suite of 11 tested LLM models. From left to right, we show a [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Performance metrics of claude-opus-4-7 and gpt-5.5 as a function of allowed exper￾imental rounds. We show (from left to right) the expected pass rate at k = 3, explanation score averaged over all worlds, and normalized MSE averaged over all worlds. We also show the results of random experimentation in the cross-hatched colorbars for both models. Guided vs random experimentation. In [PITH_FULL_IMAGE:figure… view at source ↗
Figure 4
Figure 4. Figure 4: Noise ablations over the Yukawa (top) and Ether (bottom) worlds for [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Top: Mean explanation score per model (rows) and world (columns). The cell values show the mean explanation score between 0 → 1, averaged over all 5 seeds for each model. Bottom: Distribution of explanation scores by world, pooled across all models and seeds. Each violin shows the score distribution for one world; the horizontal bar marks the median. Worlds are sorted left to right by from easiest to harde… view at source ↗
Figure 6
Figure 6. Figure 6: A summary of the experimentation process for an example run of Claude Opus 4.7 on the [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: A summary of the experimentation process for an example run of Claude Opus 4.7 on [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Visualisation of [PITH_FULL_IMAGE:figures/full_fig_p033_8.png] view at source ↗
read the original abstract

Frontier LLMs now perform strongly across a wide range of physics evaluations, but it is hard to disentangle genuine reasoning from recall of established science. We introduce DiscoverPhysics, an interactive benchmark that asks a LLM agent to discover the laws of motion of a simulated world whose physics deliberately deviates from our own. We construct 22 worlds governed by, among others, screened and fractional-power gravity, multi-species couplings, hidden dark-matter-like particles, non-coordinate-free physics, and time-varying interactions. Each world is generated on demand by an N-body simulator, for which the agent proposes several rounds of experiments, observes raw trajectory data, and ultimately submits both a natural-language explanation of the world's physics and a Python implementation of the inferred law. Because solving a world requires the agent to design informative experiments and revise its hypotheses, the benchmark probes long-horizon reasoning over an experimental history. We evaluate submissions along two complementary axes: trajectory MSE on held-out particles and an LLM-judged explanation score following an expert-written rubric assessing conceptual understanding of each world. Across eleven frontier models, we find that the strongest agents pass only half of the worlds and consistently fail on those where latent structure must be uncovered. Open-source models lag substantially behind commercial models, both in their ability to design informative experiments and in extracting conclusions from the data. We further find that good predictive accuracy does not guarantee high explanation quality and that conceptual understanding depends on hypothesis refinement through well-chosen experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces DiscoverPhysics, a benchmark of 22 on-demand N-body simulated worlds whose physics deviates from standard laws (screened/fractional gravity, multi-species couplings, hidden particles, time-varying interactions). LLM agents design experiments, observe trajectories, and submit both a natural-language explanation and a Python implementation of the inferred law; submissions are scored on held-out trajectory MSE and an LLM judge's rubric-based explanation score. Across eleven frontier models the strongest agents pass only half the worlds, fail systematically on latent-structure cases, and show that predictive accuracy does not guarantee high explanation quality.

Significance. If the evaluation protocol is shown to be reliable, the benchmark would supply a concrete, falsifiable test of long-horizon scientific reasoning that separates recall from genuine discovery; the reported dissociation between MSE and explanation score, together with the performance gap between commercial and open-source models, would be a useful signal for future agent development.

major comments (3)
  1. [Evaluation] The central claims about explanation quality and conceptual understanding rest on the LLM judge's rubric scores, yet the manuscript reports neither human validation, inter-rater agreement statistics, nor an ablation demonstrating that the judge distinguishes genuine grasp from fluent but shallow descriptions (see abstract and evaluation description).
  2. [World Construction] The N-body simulator is asserted to implement the intended non-standard laws correctly, but no unit tests against known analytic cases, implementation details, or verification that the generated trajectories match the target physics are provided; this directly affects the validity of both MSE and explanation metrics.
  3. [Results] The headline result that 'strongest agents pass only half of the worlds' and the claim that failures concentrate on latent-structure worlds are stated without per-model, per-world success rates or statistical significance tests, making it impossible to assess the robustness of the pattern.
minor comments (1)
  1. [Abstract] The abstract states that 'good predictive accuracy does not guarantee high explanation quality' but does not define the quantitative threshold or correlation measure used to support this dissociation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify important areas where additional evidence and documentation will improve the manuscript's rigor. We address each major comment below and will incorporate the suggested changes in the revision.

read point-by-point responses
  1. Referee: [Evaluation] The central claims about explanation quality and conceptual understanding rest on the LLM judge's rubric scores, yet the manuscript reports neither human validation, inter-rater agreement statistics, nor an ablation demonstrating that the judge distinguishes genuine grasp from fluent but shallow descriptions (see abstract and evaluation description).

    Authors: We agree that validation of the LLM judge is required to substantiate claims about explanation quality. In the revised manuscript we will add a human validation study on a subset of explanations scored by domain experts, report inter-rater agreement statistics (e.g., Cohen's kappa), and include an ablation comparing the judge on genuine versus fluent-but-shallow descriptions to demonstrate its discriminative capability. revision: yes

  2. Referee: [World Construction] The N-body simulator is asserted to implement the intended non-standard laws correctly, but no unit tests against known analytic cases, implementation details, or verification that the generated trajectories match the target physics are provided; this directly affects the validity of both MSE and explanation metrics.

    Authors: We acknowledge that explicit verification of the simulator is necessary for metric validity. The revised manuscript will provide implementation details of the N-body simulator, unit tests against analytic cases for standard physics, and direct verification that generated trajectories match the target non-standard laws via comparison to numerical integration of the intended equations. revision: yes

  3. Referee: [Results] The headline result that 'strongest agents pass only half of the worlds' and the claim that failures concentrate on latent-structure worlds are stated without per-model, per-world success rates or statistical significance tests, making it impossible to assess the robustness of the pattern.

    Authors: We agree that granular data and statistical tests are needed to evaluate robustness. The revised manuscript will include tables reporting per-model and per-world success rates for both MSE and explanation metrics, together with statistical significance tests (e.g., appropriate chi-squared or t-tests) for the observed patterns, including failure concentration on latent-structure worlds. revision: yes

Circularity Check

0 steps flagged

No significant circularity; benchmark and scoring rely on external components

full rationale

The paper constructs a new benchmark using an independent N-body simulator to generate worlds and an expert-written rubric scored by an LLM judge. Evaluation results (pass rates, MSE vs. explanation scores) are direct empirical measurements on held-out trajectories and rubric outputs. No equations, parameters, or central claims reduce by construction to fitted inputs, self-citations, or self-definitions. The methodology is self-contained against the external simulator and rubric, with no load-bearing self-referential steps matching the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The work is a benchmark construction rather than a derivation; it postulates 22 new simulated worlds and relies on the correctness of the simulator and the validity of the LLM judge.

axioms (2)
  • domain assumption The N-body simulator generates trajectories that faithfully reflect the intended non-standard physics rules.
    Central to agents being able to discover the laws from data.
  • domain assumption The LLM judge with expert rubric accurately measures conceptual understanding independent of surface-level prediction accuracy.
    Underpins the claim that prediction and explanation quality can diverge.
invented entities (1)
  • 22 simulated worlds with screened/fractional gravity, multi-species couplings, hidden particles, and time-varying interactions no independent evidence
    purpose: To create test cases requiring latent structure discovery
    These worlds are constructed specifically for the benchmark and have no independent existence outside it.

pith-pipeline@v0.9.1-grok · 5822 in / 1346 out tokens · 35609 ms · 2026-06-29T20:30:13.888422+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

24 extracted references · 16 canonical work pages · 7 internal anchors

  1. [1]

    Hogg and Soledad Villar

    David W. Hogg and Soledad Villar. Is machine learning good or bad for the natural sciences? arXiv e-prints, art. arXiv:2405.18095, May 2024. doi: 10.48550/arXiv.2405.18095

  2. [2]

    A Probabilistic Framework for LLM-Based Model Discovery

    Stefan Wahl, Raphaela Schenk, Ali Farnoud, Jakob H Macke, and Daniel Gedon. A probabilistic framework for llm-based model discovery.arXiv preprint arXiv:2602.18266, 2026

  3. [3]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131, 2025

  4. [4]

    The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

  5. [5]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof Q&A benchmark.arXiv preprint arXiv:2311.12022, 2023

  6. [6]

    On the Measure of Intelligence

    François Chollet. On the measure of intelligence.arXiv preprint arXiv:1911.01547, 2019

  7. [7]

    Parshin Shojaee, Kazem Meidani, Shashank Gupta, Amir Barati Farimani, and Chandan K. Reddy. LLM-SR: Scientific equation discovery via programming with large language models. InInternational Conference on Learning Representations, 2025

  8. [8]

    Parshin Shojaee, Kazem Meidani, Shashank Gupta, Amir Barati Farimani, and Chandan K. Reddy. LLM-SRBench: A new benchmark for scientific equation discovery with large language models. InInternational Conference on Machine Learning, 2025

  9. [9]

    Think like a scientist: Physics-guided llm agent for equation discovery.arXiv preprint arXiv:2602.12259, 2026

    Jianke Yang, Ohm Venkatachalam, Mohammad Kianezhad, Sharvaree Vadgama, and Rose Yu. Think like a scientist: Physics-guided llm agent for equation discovery.arXiv preprint arXiv:2602.12259, 2026

  10. [10]

    Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents.Advances in Neural Information Processing Systems, 37:10088–10116, 2024

    Peter Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, and Peter Clark. Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents.Advances in Neural Information Processing Systems, 37:10088–10116, 2024

  11. [11]

    DiscoveryBench: Towards data-driven discovery with large language models.arXiv preprint arXiv:2407.01725, 2024

    Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhi- jeetsingh Meena, Aryan Prakhar, Tirth V ora, Tushar Khot, Ashish Sabharwal, and Peter Clark. DiscoveryBench: Towards data-driven discovery with large language models.arXiv preprint arXiv:2407.01725, 2024

  12. [12]

    Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, and Huan Sun

    Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, and Huan Sun. ScienceAgent- Bench: Toward rigorous assessment of language agents for data-driven scientific discovery. I...

  13. [13]

    BioDiscoveryAgent: An AI agent for designing genetic perturbation experiments.arXiv preprint arXiv:2405.17631, 2024

    Yusuf Roohani, Andrew Lee, Qian Huang, Jian V ora, Zachary Steinhart, Kexin Huang, Alexan- der Marson, Percy Liang, and Jure Leskovec. BioDiscoveryAgent: An AI agent for designing genetic perturbation experiments.arXiv preprint arXiv:2405.17631, 2024

  14. [14]

    Laurent, Benjamin Tenmann, Siddharth Narayanan, Geemi P

    Ludovico Mitchener, Jon M. Laurent, Benjamin Tenmann, Siddharth Narayanan, Geemi P. Wellawatte, Andrew D. White, Lorenzo Sani, and Samuel G. Rodrigues. BixBench: A comprehensive benchmark for LLM-based agents in computational biology.arXiv preprint arXiv:2503.00096, 2025. 11

  15. [15]

    Auto-Discovery-Bench: Diagnosing Structured State Tracking in Oracle-Guided Discovery

    Tingting Chen, Srinivas Yu, Yu Su, and Lei Li. Auto-Bench: An automated benchmark for scientific discovery in LLMs.arXiv preprint arXiv:2502.15224, 2025

  16. [16]

    Martiño Ríos-García, Nawaf Alampara, Chandan Gupta, Indrajeet Mandal, Sajid Mannan, Ali Asghar Aghajani, N. M. Anoop Krishnan, and Kevin Maik Jablonka. AI scientists produce results without reasoning scientifically.arXiv preprint arXiv:2604.18805, 2026

  17. [17]

    Newton- bench: Benchmarking generalizable scientific law discovery in llm agents.arXiv preprint arXiv:2510.07172, 2025

    Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, et al. Newton- bench: Benchmarking generalizable scientific law discovery in llm agents.arXiv preprint arXiv:2510.07172, 2025

  18. [18]

    Physgym: Benchmarking llms in interactive physics discovery with controlled priors.arXiv preprint arXiv:2507.15550, 2025

    Yimeng Chen, Piotr Pie ¸kos, Mateusz Ostaszewski, Firas Laakom, and Jürgen Schmidhuber. Physgym: Benchmarking llms in interactive physics discovery with controlled priors.arXiv preprint arXiv:2507.15550, 2025

  19. [19]

    Gravity-bench-v1: A benchmark on gravitational physics discovery for agents.arXiv preprint arXiv:2501.18411, 2025

    Nolan Koblischke, Hyunseok Jang, Kristen Menou, and Mohamad Ali-Dib. Gravity-bench-v1: A benchmark on gravitational physics discovery for agents.arXiv preprint arXiv:2501.18411, 2025

  20. [20]

    Construction of higher order symplectic integrators.Physics Letters A, 150(5): 262–268, 1990

    Haruo Yoshida. Construction of higher order symplectic integrators.Physics Letters A, 150(5): 262–268, 1990. ISSN 0375-9601. doi: https://doi.org/10.1016/0375-9601(90)90092-3. URL https://www.sciencedirect.com/science/article/pii/0375960190900923

  21. [21]

    discovered

    James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Yash Katariya, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman- Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URLhttp://github.com/jax-ml/jax. 12 A Example experimentation process In Figures 6 and 7 we show...

  22. [22]

    Huge observation noise (σ∼several units)

  23. [23]

    "" import numpy as np from scipy.integrate import solve_ivp k = params.get(

    The system itself is chaotic Let me focus on velocities, which may be less noisy. Actually for the probe at (20,0) , v= 0 , at t= 0.5 in run 1: v≈(−0.06,−1.16) . Run 2: v≈ (1.20,0.01). Different! That’s chaotic behavior, but withv= 0 initial and almost no time, the velocity shouldn’t change much from systematic forces in 0.5s unless force is large. Means ...

  24. [24]

    "" import numpy as np from scipy.integrate import solve_ivp k = params.get(

    The field obeys an inverse-square law, with a strength proportional to the ratio of the probe’s scalar property p2 to the source’s property p1. Its motion is additionally damped by a linear drag proportional to its velocity. Explanation score: 0.20 (raw 2.0/10) Judge reasoning: The student describes only a single inverse-square (1/r2) force law with no me...