Competing with AI Scientists: Agent-Driven Approach to Astrophysics Research

Andy Nilipour; Biwei Dai; Boris Bolliet; Celia Lecat; Erwan Allys; Licong Xu; Po-Wen Chang; Sebastien Pierre; Thomas Borrett; Wahid Bhimji

arxiv: 2604.09621 · v1 · submitted 2026-03-18 · 💻 cs.AI

Pith reviewed 2026-05-15 09:07 UTC · model grok-4.3

classification 💻 cs.AI

keywords agent-driven research · multi-agent systems · parameter inference · cosmological parameters · weak lensing · AI in scientific discovery

The pith

A multi-agent AI system with human guidance won first place in a cosmological parameter inference challenge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes an agent-driven method for building parameter inference pipelines using a multi-agent system called Cmbagent. Specialized agents collaborate to generate ideas, write code, run experiments, and refine the pipeline for scientific data analysis. In the FAIR Universe Weak Lensing Uncertainty Challenge, the fully autonomous version fell short of expert results, but adding human intervention produced a winning entry. The final pipeline relies on parameter-efficient convolutional neural networks, likelihood calibration on a known grid, and regularization techniques. This indicates that semi-autonomous agentic workflows can match or exceed expert performance in time-constrained inference tasks.

Core claim

The integration of human intervention enabled the agent-driven workflow to achieve a first-place result in the FAIR Universe Weak Lensing Uncertainty Challenge, demonstrating that semi-autonomous agentic systems can compete with and in some cases surpass expert solutions for constructing cosmological parameter inference pipelines.

What carries the argument

Cmbagent, a multi-agent system in which specialized agents collaborate to generate research ideas, write and execute code, evaluate results, and iteratively refine the overall inference pipeline.

If this is right

Semi-autonomous agentic systems can achieve top performance in competitive scientific challenges under time constraints.
The approach provides a scalable framework to rapidly explore and construct pipelines for inference problems.
The winning pipeline combines parameter-efficient convolutional neural networks, likelihood calibration over a known parameter grid, and multiple regularization techniques.
Agent-driven workflows can handle realistic observational uncertainties in cosmological data analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid human-AI loops may prove more reliable than pure autonomy for complex scientific tasks, pointing toward collaborative rather than replacement models.
If the method scales beyond this one challenge, it could shorten development time for inference tools in other data-heavy fields like particle physics or genomics.
The result raises the question of how much human steering is optimal, suggesting experiments that systematically vary the level of intervention.

Load-bearing premise

Success on this single competition problem with human guidance will generalize to other inference tasks without comparable human steering.

What would settle it

Run the fully autonomous version of Cmbagent on a new, unrelated parameter inference challenge and check whether it reaches or exceeds the top expert submissions without any human input.

Original abstract

We present an agent-driven approach to the construction of parameter inference pipelines for scientific data analysis. Our method leverages a multi-agent system, Cmbagent (the analysis system of the AI scientist Denario), in which specialized agents collaborate to generate research ideas, write and execute code, evaluate results, and iteratively refine the overall pipeline. As a case study, we apply this approach to the FAIR Universe Weak Lensing Uncertainty Challenge, a competition under time constraints focused on robust cosmological parameter inference with realistic observational uncertainties. While the fully autonomous exploration initially did not reach expert-level performance, the integration of human intervention enabled our agent-driven workflow to achieve a first-place result in the challenge. This demonstrates that semi-autonomous agentic systems can compete with, and in some cases surpass, expert solutions. We describe our workflow in detail, including both the autonomous and semi-autonomous exploration by Cmbagent. Our final inference pipeline utilizes parameter-efficient convolutional neural networks, likelihood calibration over a known parameter grid, and multiple regularization techniques. Our results suggest that agent-driven research workflows can provide a scalable framework to rapidly explore and construct pipelines for inference problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Multi-agent system reaches first place in a cosmology challenge only after human intervention, but the paper gives no breakdown of what the agents actually contributed on their own.

The main point is that their Cmbagent multi-agent setup, once humans got involved, produced the winning pipeline for the FAIR Universe Weak Lensing Uncertainty Challenge. Without that human step the autonomous version fell short of expert level. They describe the workflow in some detail: agents handle idea generation, code writing, execution, and iteration, and the final pipeline combines parameter-efficient CNNs, grid-based likelihood calibration, and standard regularization.

That specific application to end-to-end construction of a competition-grade inference pipeline in cosmology is new relative to the cited work. The description of both the fully autonomous and the semi-autonomous runs is clear enough to follow what they tried. The result is tied to an external competition outcome rather than an internal fit, which avoids obvious circularity.

The soft spot is the missing attribution. The abstract states human intervention was required to reach first place, yet supplies no numbers on how much code, prompts, or design choices came from the agents versus the humans. The techniques themselves are conventional, so it is hard to tell whether the agents drove the decisive elements or simply executed a standard approach under guidance. There are also no quantitative baseline comparisons or error analysis shown in the abstract.

This is useful reading for people working on agent frameworks for scientific data analysis who want a concrete example of how the pieces can fit together in practice. It is not yet strong enough to stand as evidence that the agent method itself beats experts. I would send it to peer review if the authors add the human-agent breakdown and direct comparisons to non-agent baselines; otherwise it risks being seen as an interesting demo without the supporting measurements.

Referee Report

3 major / 1 minor

Summary. The manuscript presents Cmbagent, a multi-agent system for automating the construction of parameter inference pipelines in scientific data analysis. As a case study on the FAIR Universe Weak Lensing Uncertainty Challenge, it reports that fully autonomous operation fell short of expert performance, but human intervention enabled the workflow to achieve first place. The final pipeline uses parameter-efficient CNNs, grid-based likelihood calibration, and regularization. The authors conclude that semi-autonomous agentic systems can compete with expert solutions.

Significance. If the central attribution holds, the work would demonstrate a viable hybrid human-AI framework for rapidly developing robust cosmological inference pipelines under time pressure, offering a scalable template for other data-analysis tasks. The competition outcome provides an external benchmark, but the absence of quantitative breakdowns limits the strength of the claim that the agent framework itself drove the result.

major comments (3)

[Abstract] Abstract: The claim that the agent-driven workflow achieved first place supplies no quantitative performance metric, competition scoring details, or comparisons to other entries or expert baselines, leaving the central result unsupported by presented evidence.
[Workflow description] Workflow and results sections: No breakdown quantifies the extent or nature of human interventions (e.g., fraction of prompts, code, or design choices supplied by humans versus agents). The final pipeline consists of conventional components (parameter-efficient CNNs, grid calibration, regularization) routinely used by experts, so the manuscript does not isolate the Cmbagent framework as the source of the win.
[Results] Results: The manuscript provides no error analysis, ablation studies, or validation procedure for the final pipeline on the challenge data, preventing assessment of robustness or whether success generalizes beyond this single instance.

minor comments (1)

[Abstract] The abstract refers to 'parameter-efficient convolutional neural networks' without specifying the efficiency metrics, architecture details, or comparison to standard CNNs.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped clarify the presentation of our results. We address each major comment below and have revised the manuscript to strengthen the evidence and analysis where possible.

Referee: [Abstract] Abstract: The claim that the agent-driven workflow achieved first place supplies no quantitative performance metric, competition scoring details, or comparisons to other entries or expert baselines, leaving the central result unsupported by presented evidence.

Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised version we have added the final competition score, the precise scoring metric employed by the challenge organizers, and direct numerical comparisons to the top expert baselines and other participating entries. These additions now anchor the first-place claim with concrete evidence.
revision: yes

Referee: [Workflow description] Workflow and results sections: No breakdown quantifies the extent or nature of human interventions (e.g., fraction of prompts, code, or design choices supplied by humans versus agents). The final pipeline consists of conventional components (parameter-efficient CNNs, grid calibration, regularization) routinely used by experts, so the manuscript does not isolate the Cmbagent framework as the source of the win.

Authors: We have expanded the workflow section to include a quantitative breakdown of human interventions, specifying the number of human-supplied prompts, code edits, and high-level design decisions versus those generated autonomously by the agents. While the individual components are established techniques, the Cmbagent multi-agent workflow enabled their rapid identification, integration, and iterative calibration under the challenge's strict time limits; we have added explicit discussion of the autonomous exploration paths that converged on this combination, thereby clarifying the framework's contribution to the outcome.
revision: partial

Referee: [Results] Results: The manuscript provides no error analysis, ablation studies, or validation procedure for the final pipeline on the challenge data, preventing assessment of robustness or whether success generalizes beyond this single instance.

Authors: We acknowledge the value of these analyses. The revised results section now includes (i) error analysis on the held-out challenge data, (ii) ablation studies that systematically remove regularization, grid calibration, and the parameter-efficient CNN architecture to quantify their individual contributions, and (iii) a detailed description of the internal validation procedure used during pipeline development. These additions directly address concerns about robustness.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; result anchored in external competition ranking

The paper's central claim rests on achieving first place in the FAIR Universe Weak Lensing Uncertainty Challenge, an external benchmark independent of the paper's internal definitions or fits. The workflow description, including autonomous and semi-autonomous exploration by Cmbagent, does not involve any self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the result to its own inputs. The final pipeline elements are presented as outcomes of the agent process but are evaluated against the competition metric, making the success falsifiable externally rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that specialized agents can reliably generate, execute, and refine code for scientific inference without introducing systematic errors that would invalidate the competition result.

axioms (1)

domain assumption: Multi-agent collaboration on code generation and evaluation produces pipelines competitive with expert human solutions when supplemented by human intervention.
Invoked to explain the transition from autonomous to semi-autonomous performance.

pith-pipeline@v0.9.0 · 5526 in / 1189 out tokens · 41780 ms · 2026-05-15T09:07:59.698558+00:00 · methodology

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages

[1]

doi: 10.48550/arXiv.2601.14235. Tingjia Miao, Jiawen Dai, Jingkun Liu, Jinxin Tan, Muhua Zhang, Wenkai Jin, Yuwen Du, Tian Jin, Xianghe Pang, Zexi Liu, Tu Guo, Zhengliang Zhang, Yunjie Huang, Shuo Chen, Rui Ye, Yuzhi Zhang, Linfeng Zhang, Kun Chen, Wei Wang, Weinan E, and Siheng Chen. Physmaster: Building an autonomous ai physicist for theoretical and com...

work page doi:10.48550/arxiv.2601.14235 2025
[2]

Train an ensemble of CNN models, each on a different train/validation split.
[3]

For each ensemble member $m$, compute validation prediction–truth pairs $\{(\hat{\theta}^{(m)}_i, \theta_i)\}_{i \in I^{(m)}_{\mathrm{val}}}$.
[4]

For each cosmology grid point $\theta_g$, group the validation predictions with ground truth $\theta_i = \theta_g$ and estimate the mean and covariance, $\mu_g, \Sigma_g$.
[5]

Define the empirical Gaussian likelihood $p(\hat{\theta} \mid \theta_g) \approx \mathcal{N}(\hat{\theta}; \mu_g, \Sigma_g)$, using a Hartlap-corrected inverse covariance.
[6]

Apply $D_4$ test-time augmentation and average the resulting predictions before likelihood evaluation.
[7]

Smooth $(\mu_g, \Sigma_g)$ across nearby grid points and regularize the covariance estimates via shrinkage to obtain calibrated moments.
[8]

Determine a global temperature $\tau$ from the validation residuals and rescale the covariance matrices accordingly.
[9]

Compute unsupervised NLL-based weights for the ensemble members and form the weighted prediction $\hat{\theta}_{\mathrm{ens}} = \sum_m w^{(\mathrm{ens})}_m \hat{\theta}^{(m)}$.
[10]

target degrees of freedom

Evaluate the calibrated likelihood over all grid points, $\tilde{w}_g \propto p(\hat{\theta}_{\mathrm{ens}} \mid \theta_g)$, normalize to obtain $w_g$, and compute $\hat{\theta}_{\mathrm{post}} = \sum_g w_g \theta_g$ together with the marginal posterior uncertainties. In this appendix, we give the full details of our inference. Our agentic workflow suggested this as one of many alternative approaches to an MCMC pipeline, and we chose this ...

work page 2006
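
The calibration and grid-evaluation steps listed above (per-grid-point moments, a Hartlap-corrected Gaussian likelihood, and a weighted posterior mean over the grid) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function names `calibrate_grid` and `grid_posterior`, the array layouts, and the flat prior over grid points are hypothetical choices made for the example, and the smoothing, shrinkage, temperature, and ensemble-weighting steps are left out. The Hartlap factor used here is the standard $(n - p - 2)/(n - 1)$ rescaling of the inverse sample covariance for $n$ samples in $p$ dimensions.

```python
import numpy as np


def calibrate_grid(preds, truths, grid):
    """Estimate per-grid-point mean and Hartlap-corrected inverse covariance.

    preds:  (N, p) validation predictions (hypothetical layout)
    truths: (N, p) ground-truth parameters, each row equal to a grid row
    grid:   (G, p) known parameter grid
    """
    n_params = grid.shape[1]
    means, precisions, logdets = [], [], []
    for theta_g in grid:
        # Select the validation predictions whose truth is this grid point.
        selected = np.all(np.isclose(truths, theta_g), axis=1)
        x = preds[selected]
        n = x.shape[0]
        if n <= n_params + 2:
            raise ValueError("need more than p + 2 samples per grid point")
        cov = np.atleast_2d(np.cov(x, rowvar=False))
        # Hartlap factor debiases the inverse of a sample covariance.
        hartlap = (n - n_params - 2) / (n - 1)
        means.append(x.mean(axis=0))
        precisions.append(hartlap * np.linalg.inv(cov))
        logdets.append(np.linalg.slogdet(cov)[1])
    return np.array(means), np.array(precisions), np.array(logdets)


def grid_posterior(theta_hat, grid, means, precisions, logdets):
    """Weights w_g, posterior mean, and marginal std over the grid.

    Assumes a flat prior over grid points (an assumption of this sketch).
    """
    resid = theta_hat - means  # (G, p)
    chi2 = np.einsum("gi,gij,gj->g", resid, precisions, resid)
    log_w = -0.5 * (chi2 + logdets)
    log_w -= log_w.max()  # subtract the max for numerical stability
    w = np.exp(log_w)
    w /= w.sum()
    post_mean = w @ grid
    post_std = np.sqrt(w @ (grid - post_mean) ** 2)
    return w, post_mean, post_std
```

In use, `calibrate_grid` would be run once on held-out validation predictions, and `grid_posterior` once per test prediction. Working in log space before normalizing keeps the weights finite when a prediction lies many standard deviations from most grid points.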

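
The $D_4$ test-time augmentation step can also be illustrated with a short NumPy sketch. $D_4$ is the symmetry group of the square, with eight elements: four rotations by multiples of 90 degrees, each with or without a mirror flip. The sketch assumes a square 2D map and a hypothetical `model` callable that returns a parameter estimate for a single map; the function names and the interface are assumptions made for illustration and do not come from the paper.

```python
import numpy as np


def d4_transforms(image):
    """Yield the 8 views of a square 2D map under the dihedral group D4."""
    for k in range(4):
        rotated = np.rot90(image, k)  # rotate by k * 90 degrees
        yield rotated
        yield np.fliplr(rotated)      # same rotation followed by a mirror flip


def predict_with_tta(model, image):
    """Average a model's predictions over all 8 D4 views of the input.

    `model` is a hypothetical callable: 2D array -> scalar or 1D array.
    """
    preds = np.stack([np.asarray(model(view)) for view in d4_transforms(image)])
    return preds.mean(axis=0)
```

Averaging over the eight views makes the combined estimator exactly invariant to these symmetries even when the underlying network is only approximately so, which is the usual motivation for applying it before the likelihood evaluation.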