Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software
Pith reviewed 2026-06-29 06:59 UTC · model grok-4.3
The pith
In building a physics module with an AI coding agent, the design of physicist supervision—not the model's capabilities—determined whether the output was trustworthy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The agent treated symptom reduction as root-cause resolution and could not re-evaluate its choice of code branch even when prompted, committing a calibrated but unphysical correction that passed all tests yet predicted incorrect values elsewhere; only an injected physics concept triggered redesign. Three supervision practices proved critical for catching what oracle tests missed.
What carries the argument
Classification of 15 supervision events by intervention level, with the mechanism being the agent's treatment of symptom reduction as root-cause resolution and its inability to distinguish predictive adequacy from explanatory correctness.
If this is right
- Agents that propose architectural alternatives rather than optimizing within a given structure would reduce reliance on human intervention.
- Agents must distinguish predictive adequacy from explanatory correctness to generate trustworthy scientific software.
- Testing at diverse parameter points, shared changelogs, and explicit rules against unphysical patches catch errors missed by oracle tests.
- The observed limitations are not obviously addressed by scaling model capabilities alone.
Where Pith is reading between the lines
- The pattern of stuck exploration may appear in AI coding agents applied to other scientific domains beyond this cosmology module.
- Mechanisms for injecting domain concepts could be tested as triggers for architectural redesign in future agents.
- Benchmarks focused on code architecture reasoning and physical correctness could evaluate the capabilities the paper identifies as missing.
Load-bearing premise
The three unresolvable events reflect a general limitation of current AI coding agents rather than properties specific to this task, model version, or physics module.
What would settle it
An AI coding agent autonomously redesigning its code architecture or detecting an unphysical but test-passing correction when given a similar implementation task without human intervention.
Figures
read the original abstract
Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervising an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX. We documented and classified 15 supervision events by intervention level. The agent resolved ten autonomously by iterating against oracle tests. Two more by the physicist's domain knowledge. The three it could not -- all evaded oracle detection -- share a common property: the agent treated symptom reduction as root-cause resolution. It spent 33 of the 57 sessions adjusting coefficients within a code architecture that could not represent the target physics, and could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider; only an injected physics concept (anisotropic BAO damping) triggered the redesign. Separately, the agent committed a calibrated correction that passed all oracle tests but corresponded to no quantity in the theory, predicting wrong values at any other cosmology. The fudge factor was caught and replaced within the same session. Three supervision practices proved critical for catching what oracle tests missed: testing at diverse parameter points beyond the fiducial calibration; shared changelogs that surfaced stalled exploration across sessions; and an explicit rule against unphysical numerical patches. In this case, supervision design, not model capability, determined whether the agent's output was trustworthy. Closing the gap would require agents that propose architectural alternatives rather than optimize within a given structure, and distinguish predictive adequacy from explanatory correctness -- capabilities not exhibited here, not obviously addressed by scaling alone. [Abridged.]
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an N=1 case study of a physicist supervising an AI coding agent (Claude Sonnet/Opus via Claude Code) across 12 work days and 57 sessions to implement CLAX-PT, a differentiable one-loop perturbation theory module in JAX. The authors classify 15 supervision events: the agent resolved 10 autonomously against oracle tests, 2 required physicist domain knowledge, and 3 could not be resolved because the agent treated symptom reduction as root-cause resolution (including 33 sessions spent tuning coefficients inside an unsuitable CLASS-PT branch and a calibrated correction with no physical counterpart that passed oracles but failed at other cosmologies). Three supervision practices (multi-point testing, shared changelogs, explicit rule against unphysical patches) are identified as critical for catching oracle-evading errors. The central conclusion is that, in this case, supervision design rather than model capability determined output trustworthiness, and that closing the gap requires agents capable of proposing architectural alternatives and distinguishing predictive adequacy from explanatory correctness—capabilities not shown here and not obviously solved by scaling.
Significance. If the observations hold, the work supplies a rare quantified, session-level record of human-AI interaction in scientific software development. The concrete examples—the 33-session branch rigidity, the fudge-factor incident caught only by diverse-parameter testing, and the explicit list of three effective supervision practices—provide actionable data for designing better oversight protocols. The study also documents a clear instance in which an agent optimized within a given structure rather than questioning it, which is a useful empirical anchor for discussions of agent limitations in physics-informed coding tasks. The detailed logging of events and the emphasis on reproducibility of the supervision process are strengths.
major comments (2)
- [Results section (event classification and session counts)] Results section (description of the three unresolvable events and the 33-session episode): The claim that the agent 'could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider' is load-bearing for the argument that supervision design was decisive. The manuscript does not supply the exact prompts issued during those attempts or the agent's verbatim responses, which leaves open the possibility that the rigidity reflects prompt formulation or interaction history rather than an intrinsic architectural limit of the model.
- [Discussion and conclusions] Discussion section (final paragraph on required agent capabilities): The statement that the identified gaps 'are not obviously addressed by scaling alone' rests on a single cosmology module (one-loop PT with anisotropic BAO damping). While the internal observations are consistent, the manuscript should frame this as a correctness-risk concern and propose a concrete test—e.g., replication of the same protocol on at least one additional scientific coding task with a different code skeleton—to assess whether the branch-choice rigidity and symptom-vs-root-cause confusion are task-specific or general.
minor comments (3)
- [Methods / Results] The classification of the 15 events into three categories is central but relies on post-hoc qualitative judgment; a supplementary table listing each event, its trigger, resolution method, and classification criterion would improve transparency and allow readers to assess the scheme.
- The manuscript refers to both 'CLAX-PT' and 'CLASS-PT'; a brief clarification of the naming convention (e.g., whether CLAX-PT is the JAX port and CLASS-PT the original branch) would prevent reader confusion.
- [Results (fudge-factor incident)] The abstract states that the fudge factor 'predicted wrong values at any other cosmology,' but the main text does not indicate whether this was verified by explicit evaluation at additional parameter points or inferred from the functional form; adding that detail would strengthen the example.
Simulated Author's Rebuttal
We thank the referee for the constructive review and positive assessment of the work's value as a quantified case study. We agree with the minor revision recommendation and will incorporate changes to address the two major comments, as detailed below.
read point-by-point responses
-
Referee: [Results section (event classification and session counts)] Results section (description of the three unresolvable events and the 33-session episode): The claim that the agent 'could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider' is load-bearing for the argument that supervision design was decisive. The manuscript does not supply the exact prompts issued during those attempts or the agent's verbatim responses, which leaves open the possibility that the rigidity reflects prompt formulation or interaction history rather than an intrinsic architectural limit of the model.
Authors: We agree that verbatim prompts and responses would strengthen transparency and allow readers to evaluate whether the rigidity is prompt-dependent. In the revised manuscript we will add an appendix containing the exact prompts used in the relevant sessions (including the explicit requests to reconsider the CLASS-PT branch) together with the agent's responses. This addition directly addresses the concern while preserving the session-level record already present in the main text. revision: yes
-
Referee: [Discussion and conclusions] Discussion section (final paragraph on required agent capabilities): The statement that the identified gaps 'are not obviously addressed by scaling alone' rests on a single cosmology module (one-loop PT with anisotropic BAO damping). While the internal observations are consistent, the manuscript should frame this as a correctness-risk concern and propose a concrete test—e.g., replication of the same protocol on at least one additional scientific coding task with a different code skeleton—to assess whether the branch-choice rigidity and symptom-vs-root-cause confusion are task-specific or general.
Authors: We accept the recommendation to frame the conclusion more cautiously. In revision we will rephrase the final paragraph to present the gaps as a correctness-risk observation drawn from this specific N=1 study rather than a general claim. We will also add an explicit proposal for a concrete test: replication of the identical supervision protocol on at least one additional scientific coding task (e.g., implementation of a differentiable halo-model module or an N-body interface) with a different code skeleton, to evaluate whether the observed limitations are task-specific or recur across domains. revision: yes
Circularity Check
Observational case study exhibits no circular derivation
full rationale
This paper is an N=1 observational case study documenting supervision events during AI-assisted code development. It contains no mathematical derivations, equations, fitted parameters, predictions of quantities, or first-principles results that could reduce to their own inputs by construction. Conclusions rest on classified events and qualitative observations rather than any self-referential logical chain, self-citation load-bearing premise, or renamed empirical pattern. The absence of any derivation structure makes all enumerated circularity patterns inapplicable.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The 15 supervision events can be exhaustively and unambiguously classified into autonomous resolution, domain-knowledge assistance, and unresolvable categories.
Forward citations
Cited by 1 Pith paper
-
Search Discipline for Long-Horizon Research Agents
Aggregate metrics in research agents can invert rankings when validity is disaggregated, demonstrated on an ecosystem model task, motivating an external audit protocol over agent self-decision.
Reference graph
Works this paper leans on
-
[1]
doi: 10.1088/1475-7516/2012/07/
-
[2]
doi: 10.1007/JHEP09(2012)
-
[3]
doi: 10.1103/PhysRevD.90.023518. Chudaykin, A., Ivanov, M. M., Philcox, O. H. E., and Si- monovi´c, M. Nonlinear perturbation theory extension of the boltzmann code CLASS.Phys. Rev. D, 103:023507,
-
[4]
https:// github.com/Michalychforever/CLASS-PT
doi: 10.1103/PhysRevD.103.023507. https:// github.com/Michalychforever/CLASS-PT. D’Amico, G., Gleyzes, J., Kokron, N., Markovic, K., Sen- atore, L., Zhang, P., Beutler, F., and Gil-Mar´ın, H. The Cosmological Analysis of the SDSS/BOSS data from the Effective Field Theory of Large-Scale Structure.JCAP, 05:005,
-
[5]
doi: 10.1088/1475-7516/2020/05/005. Fang, X., Blazek, J. A., McEwen, J. E., and Hirata, C. M. FAST-PT II: an algorithm to calculate convolution inte- grals of general tensor quantities in cosmological pertur- bation theory.JCAP, 2017(02):030,
-
[6]
doi: 10.1088/1475-7516/ 2020/05/042. 8 Physics Is All You Need? A Case Study Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold.Nature, 596:583–589,
-
[7]
Senatore, L
doi: 10.1038/ s41586-023-06735-9. Senatore, L. and Zaldarriaga, M. The IR-resummed effective field theory of large scale structures.JCAP, 2015(02): 013,
2015
-
[8]
Simonovi´c, M., Baldauf, T., Zaldarriaga, M., Carrasco, J
doi: 10.1088/1475-7516/2015/02/013. Simonovi´c, M., Baldauf, T., Zaldarriaga, M., Carrasco, J. J., and Kollmeier, J. A. Cosmological perturbation theory using the FFTLog: formalism and connection to QFT loop integrals.JCAP, 04:030,
-
[9]
doi: 10.1088/1475-7516/2018/04/030. Villaescusa-Navarro, F. et al. The Denario project: Deep knowledge AI agents for scientific discovery.arXiv preprint arXiv:2510.26887,
-
[10]
A Vlasov-Poisson approach to the large-scale structure, with applications to the EFT.JCAP, 2015(09):014,
Vlah, Z., White, M., and Aviles, A. A Vlasov-Poisson approach to the large-scale structure, with applications to the EFT.JCAP, 2015(09):014,
2015
-
[11]
doi: 10.1088/ 1475-7516/2015/09/014. A. Issue-Level Classification This appendix lists the 15 supervision events documented during theCLAX-PTv0.1.0 development window. Each row is one event; the count itself involves a judgment call (a defensible range of 13–15 depending on whether a test- metric correction and the architectural redesign are counted separ...
2015
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.