pith. machine review for the scientific record.

arxiv: 2605.10820 · v1 · submitted 2026-05-11 · 💻 cs.AI · cs.LG

MaD Physics: Evaluating information seeking under constraints in physical environments

Pith reviewed 2026-05-12 03:48 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords MaD Physics, scientific discovery, AI agents, measurement constraints, physical laws, benchmark, model inference, information seeking

The pith

The MaD Physics benchmark evaluates how well agents infer physical laws from budget-constrained measurements, using altered laws to limit recall of known physics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MaD Physics to assess how well AI agents can seek information and discover physical laws when limited by measurement budgets in terms of quality and quantity. Current benchmarks either test static knowledge or unconstrained design, missing the real constraints of scientific work. MaD Physics uses three environments with modified physical laws so agents must gather data and infer the rules rather than recall them. It then requires agents to predict future states after exhausting their measurement allowance. Benchmarking leading models reveals weaknesses in structured data collection and exploration.

Core claim

MaD Physics is a benchmark consisting of three environments based on distinct physical laws with alterations to reduce knowledge contamination. In each trial, an agent makes measurements until its budget is exhausted, then infers the underlying law to predict the system's future state. This setup directly evaluates the capabilities of inferring models from data and planning under constraints.
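
The trial protocol described above can be sketched as a short loop: measure until the budget runs out, fit a model, then predict. This is a minimal illustration and not the benchmark's implementation. The class `BudgetedTrial`, the power-law stand-in for an altered law, the helper names, and the per-quality costs and noise levels are all hypothetical.

```python
import math
import random

# Hypothetical per-measurement costs and noise levels (assumptions, not from the paper).
COST = {"low": 2.0, "medium": 5.0, "high": 10.0}
NOISE_STD = {"low": 0.5, "medium": 0.1, "high": 0.01}


class BudgetedTrial:
    """Toy environment: a hidden law x(t) = A * t**P observed with Gaussian noise."""

    def __init__(self, budget, seed=0):
        self.budget = budget
        self.spent = 0.0
        self._rng = random.Random(seed)
        self._A, self._P = 3.0, 1.5  # stand-in for an "altered" law

    @property
    def remaining(self):
        return self.budget - self.spent

    def true_state(self, t):
        return self._A * t ** self._P

    def measure(self, t, quality):
        """Charge the cost of one measurement and return a noisy reading."""
        cost = COST[quality]
        if cost > self.remaining:
            raise RuntimeError("measurement budget exhausted")
        self.spent += cost
        return self.true_state(t) + self._rng.gauss(0.0, NOISE_STD[quality])


def fit_power_law(data):
    """Least-squares fit of log x = log A + P * log t; returns (A, P).

    Needs at least two measurements at distinct positive times.
    """
    xs = [math.log(t) for t, _ in data]
    ys = [math.log(x) for _, x in data]
    n = len(data)
    mx, my = sum(xs) / n, sum(ys) / n
    p = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum((a - mx) ** 2 for a in xs)
    return math.exp(my - p * mx), p


def run_trial(budget=30.0, quality="high", t_future=10.0):
    trial = BudgetedTrial(budget)
    data, t = [], 1.0
    # Phase 1: measure until the next measurement is unaffordable.
    while trial.remaining >= COST[quality]:
        data.append((t, trial.measure(t, quality)))
        t += 1.0
    # Phase 2: infer the law from the collected data, then predict a future state.
    a_hat, p_hat = fit_power_law(data)
    prediction = a_hat * t_future ** p_hat
    return trial, data, (a_hat, p_hat), prediction
```

In this toy setting the agent's policy is fixed (always the same quality, unit time steps); the benchmark's point is precisely that a real agent has to choose these actions itself.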

What carries the argument

The MaD Physics benchmark with its measurement budget protocol and altered physical law environments.

If this is right

  • MaD Physics allows systematic testing of how agents trade measurement quality against measurement quantity.
  • It identifies specific shortcomings of current models in structured exploration and data collection.
  • The benchmark can be used to evaluate multimodality and in-context learning in scientific tasks.
  • Improvements in agents on this benchmark would indicate better handling of resource-constrained discovery.
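
The quality/quantity trade-off named in the first bullet can be made concrete with a toy calculation. The sketch below assumes independent Gaussian noise and repeated measurements of a single fixed quantity, so the standard error of the mean is σ/√n. The cost and noise tables are hypothetical and are not taken from the paper.

```python
import math


def standard_error_under_budget(budget, cost, noise_std):
    """Standard error of the mean when the whole budget buys one quality level.

    Returns None if not even one measurement is affordable.
    """
    n = int(budget // cost)
    if n == 0:
        return None
    return noise_std / math.sqrt(n)


def compare_qualities(budget, costs, noise):
    """Map each quality level to (measurement count, standard error)."""
    out = {}
    for q in costs:
        n = int(budget // costs[q])
        out[q] = (n, standard_error_under_budget(budget, costs[q], noise[q]))
    return out


# Hypothetical numbers chosen only for illustration.
costs = {"low": 2.0, "high": 10.0}
steep_noise = {"low": 0.5, "high": 0.01}  # precision improves sharply with quality
flat_noise = {"low": 0.5, "high": 0.4}    # precision improves only slightly

steep = compare_qualities(30.0, costs, steep_noise)
flat = compare_qualities(30.0, costs, flat_noise)
```

Under the steep noise table a few high-quality measurements give the smaller standard error, while under the flat table many cheap measurements do. Which regime applies in a given environment is something the agent would have to work out from its own observations.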

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such benchmarks could guide development of AI systems that assist in actual lab experiments with cost limits.
  • Connections might exist to fields like active learning or experimental design optimization.
  • Testing on more models or real-world proxies would validate its usefulness beyond the current evaluations.

Load-bearing premise

That using altered physical laws effectively prevents agents from relying on pre-existing knowledge and that the benchmark environments accurately represent the challenges of scientific discovery under physical constraints.

What would settle it

Evidence that models excelling at MaD Physics still fail to make discoveries in actual constrained physical experiments, or that the alterations are insufficient to block knowledge recall, would count against the benchmark's validity.

Original abstract

Scientific discovery is fundamentally a resource-constrained process that requires navigating complex trade-offs between the quality and quantity of measurements due to physical and cost constraints. Measurements drive the scientific process by revealing novel phenomena to improve our understanding. Existing benchmarks for evaluating agents for scientific discovery focus on either static knowledge-based reasoning or unconstrained experimental design tasks, and do not capture the ability to make measurements and plan under constraints. To bridge this gap, we propose Measuring and Discovering Physics (MaD Physics), a benchmark to evaluate the ability of agents to make informative measurements and conclusions subject to constraints on the quality and quantity of measurements. The benchmark consists of three environments, each based on a distinct physical law. To mitigate contamination from existing knowledge, MaD Physics includes altered physical laws. In each trial, the agent makes measurements of the system until it exhausts an allotted budget and then the agent has to infer the underlying physical law to make predictions about the state of the system in the future. MaD Physics evaluates two fundamental capabilities of scientific agents: inferring models from data and planning under constraints. We also demonstrate how MaD Physics can be used to evaluate other capabilities such as multimodality and in-context learning. We benchmark agents on MaD Physics using four Gemini models (2.5 Flash Lite, 2.5 Flash, 2.5 Pro, and 3 Flash), identifying shortcomings in their structured exploration and data collection capabilities and highlighting directions to improve their scientific reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MaD Physics, a benchmark with three environments based on distinct (altered) physical laws. Agents must make a budgeted sequence of measurements, infer the underlying law from the resulting data, and use the inferred model to predict future system states. The work evaluates four Gemini models (2.5 Flash Lite, 2.5 Flash, 2.5 Pro, 3 Flash) on this task, reports shortcomings in structured exploration and data collection, and claims the benchmark isolates two core capabilities: model inference from data and planning under measurement constraints.

Significance. A well-validated benchmark that enforces explicit measurement budgets and uses altered laws to reduce contamination would address a genuine gap between static knowledge benchmarks and unconstrained experimental-design tasks. The proposal to evaluate multimodality and in-context learning within the same framework is a constructive extension. However, the absence of any quantitative validation that the altered laws actually block prior-knowledge leakage, combined with missing implementation details and environment verification, prevents the current manuscript from establishing that the measured performance reflects the claimed capabilities rather than retrieval plus adaptation.

major comments (2)
  1. [Abstract and §3 (Benchmark Design)] The central claim that agents must 'infer the underlying physical law purely from budgeted measurements' rests on the assertion that altered laws mitigate contamination from pre-training. No ablation, performance gap between altered vs. standard laws, or knowledge-free baseline is reported. Without such evidence the inference task cannot be shown to isolate data-driven model discovery rather than pattern retrieval plus adaptation.
  2. [§4 (Experiments) and §5 (Results)] The manuscript provides only high-level Gemini performance summaries and does not include environment validation (e.g., ground-truth measurement distributions, budget calibration, or checks that the prediction task requires the inferred law). Full implementation details are unavailable, so it is impossible to verify that the three environments actually enforce the claimed constraints or isolate the two target capabilities.
minor comments (2)
  1. [Abstract and §5] The abstract states that MaD Physics 'can be used to evaluate other capabilities such as multimodality and in-context learning,' yet the results section does not present separate metrics or controls for these capabilities; the reported Gemini numbers appear to conflate all factors.
  2. [§3] Notation for measurement budgets, quality/quantity trade-offs, and the exact form of the prediction task is introduced without a compact mathematical definition or pseudocode, making it difficult to replicate the exact interaction protocol.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for identifying key areas where additional evidence and detail are needed to substantiate the benchmark's claims. We address each major comment below and will incorporate the suggested revisions.

Point-by-point responses
  1. Referee: [Abstract and §3 (Benchmark Design)] The central claim that agents must 'infer the underlying physical law purely from budgeted measurements' rests on the assertion that altered laws mitigate contamination from pre-training. No ablation, performance gap between altered vs. standard laws, or knowledge-free baseline is reported. Without such evidence the inference task cannot be shown to isolate data-driven model discovery rather than pattern retrieval plus adaptation.

    Authors: We acknowledge that the manuscript currently lacks an ablation study, a direct performance comparison between altered and standard physical laws, or a knowledge-free baseline. This means the claim that altered laws successfully mitigate contamination is not yet quantitatively supported. In the revised manuscript we will add an ablation experiment reporting agent performance on both the altered laws and their standard-physics counterparts, together with a baseline that receives no measurement data. These additions will provide the missing evidence that the benchmark isolates data-driven model inference. revision: yes

  2. Referee: [§4 (Experiments) and §5 (Results)] The manuscript provides only high-level Gemini performance summaries and does not include environment validation (e.g., ground-truth measurement distributions, budget calibration, or checks that the prediction task requires the inferred law). Full implementation details are unavailable, so it is impossible to verify that the three environments actually enforce the claimed constraints or isolate the two target capabilities.

    Authors: We agree that the current sections provide insufficient validation and implementation detail. The revised manuscript will expand §4 and §5 to include ground-truth measurement distributions, explicit budget-calibration procedures, and verification that accurate future-state prediction requires the inferred law rather than other cues. We will also release the full benchmark code and environment specifications to allow independent verification of the constraints and the isolation of the two target capabilities. revision: yes
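
The ablation promised in the first response could be summarized along the following lines. This is only a sketch of one possible analysis: the condition names, the `summarize_ablation` helper, and the choice of a per-trial prediction error as the metric are assumptions, not details from the paper or the rebuttal.

```python
from statistics import mean


def summarize_ablation(errors):
    """Summarize per-trial prediction errors across ablation conditions.

    errors: dict mapping a condition name to a list of per-trial errors.
    Expected (hypothetical) condition names:
      "altered"          agent measures an altered-law environment
      "standard"         agent measures the standard-physics counterpart
      "no_measurements"  agent predicts without any measurement data
    """
    means = {name: mean(vals) for name, vals in errors.items()}
    return {
        "mean_error": means,
        # Positive: the standard-law condition is easier than the altered one.
        "contamination_gap": means["altered"] - means["standard"],
        # Positive: measurements reduce error relative to predicting blind.
        "value_of_data": means["no_measurements"] - means["altered"],
    }
```

A positive `contamination_gap` would be consistent with agents drawing on prior knowledge of the standard law, although it could also arise if the altered laws are simply harder to fit, so such a comparison would need matched difficulty to be conclusive.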

Circularity Check

0 steps flagged

No circularity: benchmark proposal with independent design choices

full rationale

The paper proposes MaD Physics, a new benchmark consisting of three environments based on altered physical laws, to evaluate agents on model inference from budgeted measurements and planning under constraints. No derivations, equations, or first-principles results are presented that reduce to fitted parameters, self-citations, or inputs by construction. The altered laws are an explicit design choice to address contamination, not a self-defined or renamed result. Agent benchmarking with Gemini models constitutes external evaluation rather than any internal prediction or uniqueness theorem. The work is self-contained as an evaluation framework without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The paper contributes a new benchmark rather than deriving results from first principles or fitting parameters; it relies on standard assumptions about agent capabilities and the effectiveness of altered laws for testing.

axioms (2)
  • [domain assumption] Altered physical laws can be used to test inference without prior knowledge contamination
    Invoked to design environments that mitigate contamination from model training data.
  • [domain assumption] Agents can infer underlying laws from limited measurements and make future predictions
    Core premise of the inference and prediction phase after budget exhaustion.
invented entities (1)
  • MaD Physics benchmark environments [no independent evidence]
    purpose: To evaluate constrained scientific discovery capabilities in AI agents
    Newly proposed testbed with three distinct physical-law-based setups.

pith-pipeline@v0.9.0 · 5586 in / 1325 out tokens · 69989 ms · 2026-05-12T03:48:13.989107+00:00 · methodology


