
arxiv: 2605.30353 · v1 · pith:OPXKSQQ5new · submitted 2026-05-28 · 💻 cs.AI · astro-ph.CO · cs.HC · cs.SE

Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software

Pith reviewed 2026-06-29 06:59 UTC · model grok-4.3

classification: 💻 cs.AI · astro-ph.CO · cs.HC · cs.SE
keywords: AI coding agents, physicist supervision, scientific software, JAX, perturbation theory, case study, trustworthy AI, one-loop calculations

The pith

In building a physics module with an AI coding agent, the design of physicist supervision—not the model's capabilities—determined whether the output was trustworthy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper reports a quantified case study of one physicist supervising an AI agent over 12 days and 57 sessions to implement a differentiable perturbation theory module in JAX. The agent autonomously resolved most issues but failed on three that required recognizing root causes rather than symptoms, including persisting with an inadequate code structure for 33 sessions. Specific supervision practices, such as testing across parameter ranges and maintaining changelogs, were essential for catching unphysical solutions that passed automated tests. The work argues that current agents optimize within fixed architectures and equate test-passing with correctness, limitations not resolved by scaling.

Core claim

The agent treated symptom reduction as root-cause resolution and could not re-evaluate its choice of code branch even when prompted, committing a calibrated but unphysical correction that passed all tests yet predicted incorrect values elsewhere; only an injected physics concept triggered redesign. Three supervision practices proved critical for catching what oracle tests missed.

What carries the argument

Classification of 15 supervision events by intervention level, with the mechanism being the agent's treatment of symptom reduction as root-cause resolution and its inability to distinguish predictive adequacy from explanatory correctness.

If this is right

  • Agents that propose architectural alternatives rather than optimizing within a given structure would reduce reliance on human intervention.
  • Agents must distinguish predictive adequacy from explanatory correctness to generate trustworthy scientific software.
  • Testing at diverse parameter points, shared changelogs, and explicit rules against unphysical patches catch errors missed by oracle tests.
  • The observed limitations are not obviously addressed by scaling model capabilities alone.
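
The changelog practice in the list above can be made mechanical. The following is a minimal sketch and not the paper's tooling: it assumes each changelog entry records the worst-case relative error for one session, and it flags a stall when a window of sessions fails to improve on the earlier best error by a chosen factor. The function name `find_stall`, the parameters `window` and `min_improvement`, and both error series are hypothetical and invented for illustration.

```python
from typing import List, Optional


def find_stall(errors: List[float], window: int = 5,
               min_improvement: float = 0.5) -> Optional[int]:
    """Return the index of the first session that closes a stalled window.

    A window of `window` consecutive sessions counts as stalled when its
    best error is still larger than `min_improvement` times the best error
    recorded before the window began. Returns None if no window stalls.
    """
    for end in range(window, len(errors)):
        start = end - window + 1
        best_before = min(errors[:start])
        best_in_window = min(errors[start:end + 1])
        if best_in_window > min_improvement * best_before:
            return end
    return None


# Hypothetical per-session worst-case relative errors.
converging = [1.0, 0.4, 0.15, 0.06, 0.02, 0.008, 0.003, 0.001]
stalled = [0.9, 0.4, 0.3, 0.25, 0.28, 0.24, 0.26]

converging_flag = find_stall(converging, window=3)  # None: keeps improving
stalled_flag = find_stall(stalled, window=3)        # 4: plateau detected
```

A check of this kind only says that progress has stopped. Deciding whether the cause is a bad coefficient or an unsuitable architecture still falls to the supervisor, which is the distinction the paper draws.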

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The pattern of stuck exploration may appear in AI coding agents applied to other scientific domains beyond this cosmology module.
  • Mechanisms for injecting domain concepts could be tested as triggers for architectural redesign in future agents.
  • Benchmarks focused on code architecture reasoning and physical correctness could evaluate the capabilities the paper identifies as missing.

Load-bearing premise

The three unresolvable events reflect a general limitation of current AI coding agents rather than properties specific to this task, model version, or physics module.

What would settle it

An AI coding agent autonomously redesigning its code architecture or detecting an unphysical but test-passing correction when given a similar implementation task without human intervention.

Figures

Figures reproduced from arXiv: 2605.30353 by Nhat-Minh Nguyen.

Figure 1: Issue taxonomy for the CLAX-PT v0.1.0 development. Of 15 documented supervision events, 10 were resolved autonomously by the agent iterating against oracle tests; 2 were accelerated by the physicist’s domain knowledge (unit-magnitude and dimensional discrepancies invisible to shape-based comparisons); 3 required essential human judgment (an architectural redesign, a calibration-patch rejection, and identif…
Figure 2: Accuracy convergence over 57 agent sessions. Blue: real-space matter power spectrum (autonomous, converges in ∼10 sessions). Red: worst redshift-space multipole (stuck at 8–86% for ∼33 sessions). Vertical gold lines mark human interventions. The agent’s error plateaued because the code architecture was structurally incompatible with anisotropic BAO damping: no coefficient adjustment within that architecture…
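
The structural point in the Figure 2 caption can be illustrated with a toy model. This is a sketch under loudly stated assumptions, not the CLASS-PT or CLAX-PT code: the target is a damping factor whose exponent depends on the line-of-sight cosine mu, one "architecture" has a single free coefficient with no mu-dependence, and the other has a functional form that can carry the mu-dependence. The constants `X` and `A`, the grid, and all function names are hypothetical.

```python
import math

# Toy setup: every value here is hypothetical and chosen for illustration.
X = 1.0                                 # stands in for k^2 * Sigma^2
A = 2.0                                 # strength of the mu-dependence
MU_GRID = [i / 10 for i in range(11)]   # mu from 0 to 1


def target(mu):
    """Anisotropic damping: the exponent depends on mu."""
    return math.exp(-X * (1.0 + A * mu * mu))


def isotropic_model(mu, c):
    """Architecture with one free coefficient and no mu-dependence."""
    return c * math.exp(-X)


def anisotropic_model(mu, a):
    """Architecture whose functional form can carry the mu-dependence."""
    return math.exp(-X * (1.0 + a * mu * mu))


def worst_relative_error(model, param):
    """Largest relative deviation from the target over the mu grid."""
    return max(abs(model(mu, param) - target(mu)) / target(mu)
               for mu in MU_GRID)


# Least-squares optimum for c: the model is c times a constant, so the
# best fit sets c * exp(-X) equal to the mean of the target over the grid.
best_c = sum(target(mu) for mu in MU_GRID) / (len(MU_GRID) * math.exp(-X))

iso_error = worst_relative_error(isotropic_model, best_c)
aniso_error = worst_relative_error(anisotropic_model, A)
```

In this toy, `iso_error` stays large even at the least-squares optimum, because a mu-independent form cannot follow a target that varies with mu. `aniso_error` vanishes once the functional form matches. Tuning `c` corresponds to adjusting coefficients within a fixed structure, and switching models corresponds to the redesign.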
Original abstract

Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervising an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX. We documented and classified 15 supervision events by intervention level. The agent resolved ten autonomously by iterating against oracle tests. Two more were accelerated by the physicist's domain knowledge. The three it could not -- all evaded oracle detection -- share a common property: the agent treated symptom reduction as root-cause resolution. It spent 33 of the 57 sessions adjusting coefficients within a code architecture that could not represent the target physics, and could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider; only an injected physics concept (anisotropic BAO damping) triggered the redesign. Separately, the agent committed a calibrated correction that passed all oracle tests but corresponded to no quantity in the theory, predicting wrong values at any other cosmology. The fudge factor was caught and replaced within the same session. Three supervision practices proved critical for catching what oracle tests missed: testing at diverse parameter points beyond the fiducial calibration; shared changelogs that surfaced stalled exploration across sessions; and an explicit rule against unphysical numerical patches. In this case, supervision design, not model capability, determined whether the agent's output was trustworthy. Closing the gap would require agents that propose architectural alternatives rather than optimize within a given structure, and distinguish predictive adequacy from explanatory correctness -- capabilities not exhibited here, not obviously addressed by scaling alone. [Abridged.]

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript presents an N=1 case study of a physicist supervising an AI coding agent (Claude Sonnet/Opus via Claude Code) across 12 work days and 57 sessions to implement CLAX-PT, a differentiable one-loop perturbation theory module in JAX. The author classifies 15 supervision events: the agent resolved 10 autonomously against oracle tests, 2 were accelerated by the physicist's domain knowledge, and 3 required essential human judgment because the agent treated symptom reduction as root-cause resolution (including 33 sessions spent tuning coefficients inside an unsuitable CLASS-PT branch and a calibrated correction with no physical counterpart that passed the oracle tests but predicted wrong values at other cosmologies). Three supervision practices (multi-point testing, shared changelogs, and an explicit rule against unphysical patches) are identified as critical for catching oracle-evading errors. The central conclusion is that, in this case, supervision design rather than model capability determined output trustworthiness, and that closing the gap requires agents capable of proposing architectural alternatives and distinguishing predictive adequacy from explanatory correctness, capabilities not shown here and not obviously solved by scaling.

Significance. If the observations hold, the work supplies a rare quantified, session-level record of human-AI interaction in scientific software development. The concrete examples—the 33-session branch rigidity, the fudge-factor incident caught only by diverse-parameter testing, and the explicit list of three effective supervision practices—provide actionable data for designing better oversight protocols. The study also documents a clear instance in which an agent optimized within a given structure rather than questioning it, which is a useful empirical anchor for discussions of agent limitations in physics-informed coding tasks. The detailed logging of events and the emphasis on reproducibility of the supervision process are strengths.

major comments (2)
  1. [Results section (event classification and session counts)] Results section (description of the three unresolvable events and the 33-session episode): The claim that the agent 'could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider' is load-bearing for the argument that supervision design was decisive. The manuscript does not supply the exact prompts issued during those attempts or the agent's verbatim responses, which leaves open the possibility that the rigidity reflects prompt formulation or interaction history rather than an intrinsic architectural limit of the model.
  2. [Discussion and conclusions] Discussion section (final paragraph on required agent capabilities): The statement that the identified gaps 'are not obviously addressed by scaling alone' rests on a single cosmology module (one-loop PT with anisotropic BAO damping). While the internal observations are consistent, the manuscript should frame this as a correctness-risk concern and propose a concrete test—e.g., replication of the same protocol on at least one additional scientific coding task with a different code skeleton—to assess whether the branch-choice rigidity and symptom-vs-root-cause confusion are task-specific or general.
minor comments (3)
  1. [Methods / Results] The classification of the 15 events into three categories is central but relies on post-hoc qualitative judgment; a supplementary table listing each event, its trigger, resolution method, and classification criterion would improve transparency and allow readers to assess the scheme.
  2. The manuscript refers to both 'CLAX-PT' and 'CLASS-PT'; a brief clarification of the naming convention (e.g., whether CLAX-PT is the JAX port and CLASS-PT the original branch) would prevent reader confusion.
  3. [Results (fudge-factor incident)] The abstract states that the fudge factor 'predicted wrong values at any other cosmology,' but the main text does not indicate whether this was verified by explicit evaluation at additional parameter points or inferred from the functional form; adding that detail would strengthen the example.
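
The distinction raised in the last minor comment, explicit evaluation at additional parameter points versus inference from the functional form, can be made concrete with a toy example. The sketch below is hypothetical and not taken from the paper: `reference` stands in for an oracle with the correct parameter dependence, `unpatched` for an implementation with the wrong scaling, and `FUDGE` for a constant calibrated at a single fiducial point. All names and numbers are invented for illustration.

```python
FIDUCIAL = 0.3   # hypothetical fiducial value of a single parameter


def reference(omega):
    """Stand-in for the oracle: the correct dependence on the parameter."""
    return omega ** 2


def unpatched(omega):
    """Stand-in for an implementation with the wrong parameter scaling."""
    return omega


# Calibrated patch: a constant tuned so the output matches at FIDUCIAL only.
FUDGE = reference(FIDUCIAL) / unpatched(FIDUCIAL)


def patched(omega):
    """The patched implementation: wrong scaling times a tuned constant."""
    return FUDGE * unpatched(omega)


def relative_error(omega):
    return abs(patched(omega) - reference(omega)) / reference(omega)


def passes(points, tol=1e-3):
    """Oracle-style test: every listed parameter point must agree within tol."""
    return all(relative_error(p) < tol for p in points)


single_point = passes([FIDUCIAL])              # fiducial only
multi_point = passes([0.25, FIDUCIAL, 0.35])   # fiducial plus neighbours
```

Here `single_point` is `True` and `multi_point` is `False`: the patch is exact where it was calibrated and off by about 20% at `omega = 0.25`. Explicit evaluation of this kind gives a measured error, whereas an argument from the functional form shows only that some mismatch must exist.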

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and positive assessment of the work's value as a quantified case study. We agree with the minor revision recommendation and will incorporate changes to address the two major comments, as detailed below.

  1. Referee: [Results section (event classification and session counts)] Results section (description of the three unresolvable events and the 33-session episode): The claim that the agent 'could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider' is load-bearing for the argument that supervision design was decisive. The manuscript does not supply the exact prompts issued during those attempts or the agent's verbatim responses, which leaves open the possibility that the rigidity reflects prompt formulation or interaction history rather than an intrinsic architectural limit of the model.

    Authors: We agree that verbatim prompts and responses would strengthen transparency and allow readers to evaluate whether the rigidity is prompt-dependent. In the revised manuscript we will add an appendix containing the exact prompts used in the relevant sessions (including the explicit requests to reconsider the CLASS-PT branch) together with the agent's responses. This addition directly addresses the concern while preserving the session-level record already present in the main text. revision: yes

  2. Referee: [Discussion and conclusions] Discussion section (final paragraph on required agent capabilities): The statement that the identified gaps 'are not obviously addressed by scaling alone' rests on a single cosmology module (one-loop PT with anisotropic BAO damping). While the internal observations are consistent, the manuscript should frame this as a correctness-risk concern and propose a concrete test—e.g., replication of the same protocol on at least one additional scientific coding task with a different code skeleton—to assess whether the branch-choice rigidity and symptom-vs-root-cause confusion are task-specific or general.

    Authors: We accept the recommendation to frame the conclusion more cautiously. In revision we will rephrase the final paragraph to present the gaps as a correctness-risk observation drawn from this specific N=1 study rather than a general claim. We will also add an explicit proposal for a concrete test: replication of the identical supervision protocol on at least one additional scientific coding task (e.g., implementation of a differentiable halo-model module or an N-body interface) with a different code skeleton, to evaluate whether the observed limitations are task-specific or recur across domains. revision: yes

Circularity Check

0 steps flagged

Observational case study exhibits no circular derivation


This paper is an N=1 observational case study documenting supervision events during AI-assisted code development. It contains no mathematical derivations, equations, fitted parameters, predictions of quantities, or first-principles results that could reduce to their own inputs by construction. Conclusions rest on classified events and qualitative observations rather than any self-referential logical chain, self-citation load-bearing premise, or renamed empirical pattern. The absence of any derivation structure makes all enumerated circularity patterns inapplicable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is a qualitative case study with no mathematical model, fitted parameters, or new physical entities. The only background assumptions are that the 15 events can be reliably classified by intervention level and that the observed agent behaviors are informative about current AI coding capabilities.

axioms (1)
  • Domain assumption: The 15 supervision events can be exhaustively and unambiguously classified into autonomous resolution, domain-knowledge assistance, and unresolvable categories.
    This classification underpins the counts of 10 + 2 + 3 events reported in the abstract.

pith-pipeline@v0.9.1-grok · 5846 in / 1426 out tokens · 33120 ms · 2026-06-29T06:59:46.346733+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Search Discipline for Long-Horizon Research Agents

    cs.AI · 2026-06 · unverdicted · novelty 4.0

    Aggregate metrics in research agents can invert rankings when validity is disaggregated, demonstrated on an ecosystem model task, motivating an external audit protocol over agent self-decision.

A. Issue-Level Classification. This appendix lists the 15 supervision events documented during the CLAX-PT v0.1.0 development window. Each row is one event; the count itself involves a judgment call (a defensible range of 13–15 depending on whether a test-metric correction and the architectural redesign are counted separ...