URSA: The Universal Research and Scientific Agent

arXiv: 2506.22653 · v2 · submitted 2025-06-27 · 💻 cs.AI

Michael Grosskopf, Nathan Debardeleben, Russell Bent, Rahul Somasundaram, Isaac Michaud, Arthur Lui, Alexius Wadell, Warren D. Graham, Golo A Wimmer, Sachin Shivakumar, Joan Vendrell Gallart, Harsha Nagarajan, Earl Lawrence

Pith reviewed 2026-05-19 07:10 UTC · model grok-4.3

classification 💻 cs.AI

keywords scientific agents · large language models · AI for science · physics simulations · modular agents · agentic AI · research acceleration

The pith

URSA combines modular LLM agents with physics simulation tools to address scientific problems of varying complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces URSA as an ecosystem of AI agents built on large language models to accelerate research. LLMs now handle reasoning, planning, coding, and other tasks that overlap with daily scientific work. The system uses modular agents and tools, including direct links to advanced physics simulation codes, that users can combine as needed. If the approach works, it could remove common bottlenecks and speed progress across research areas.

Core claim

URSA consists of a set of modular agents and tools, including coupling to advanced physics simulation codes, that can be combined to address scientific problems of varied complexity and impact. This work highlights the architecture of URSA, as well as examples that highlight the potential of the system.

What carries the argument

Modular agents and tools in an agentic AI setup, including direct coupling to advanced physics simulation codes, that users assemble to tackle research tasks.
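The idea of assembling modular agents and tools can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Agent` class, the `compose` helper, and the `run_simulation` stand-in are hypothetical names, not URSA's API, and the LLM decision step is replaced by a fixed loop over registered tools.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical tool type: a plain Python function from text to text.
Tool = Callable[[str], str]


def run_simulation(spec: str) -> str:
    """Stand-in for a wrapped physics code; a real wrapper would launch the solver."""
    return f"simulated({spec})"


@dataclass
class Agent:
    """A named step that may call registered tools (illustrative, not URSA's class)."""
    name: str
    tools: Dict[str, Tool] = field(default_factory=dict)

    def __call__(self, task: str) -> str:
        # A real agent would ask an LLM which tool to call; here every
        # registered tool is applied in registration order.
        result = task
        for tool in self.tools.values():
            result = tool(result)
        return f"{self.name}:{result}"


def compose(agents: List[Agent]) -> Callable[[str], str]:
    """Chain agents so that each one receives the previous agent's output."""
    def pipeline(task: str) -> str:
        for agent in agents:
            task = agent(task)
        return task
    return pipeline


planner = Agent("plan")
executor = Agent("execute", tools={"simulate": run_simulation})
workflow = compose([planner, executor])
print(workflow("optimize capsule"))
```

Keeping tools as ordinary functions means a new capability only needs a Python wrapper and a registry entry, which is the property the Figure 2 caption attributes to the Execution Agent.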

If this is right

Researchers gain the ability to assemble custom agent combinations for problems of different scales.
Direct integration with physics codes extends agent capabilities beyond text generation into quantitative modeling.
Scientific bottlenecks tied to routine reasoning and coding tasks can be reduced.
The same modular structure supports both narrow and broad-impact research questions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar agent ecosystems could be built for domains outside physics by swapping in other simulation or data tools.
Success would raise the question of how to measure and credit AI contributions in published research.
Routine coupling of agents to live experimental data streams could become a natural next step.

Load-bearing premise

Large language models already carry out complex reasoning, planning, writing, coding, and research tasks that overlap significantly with the skills human scientists use day-to-day.

What would settle it

A test case in which URSA fails to produce a correct or useful result on a problem that requires both LLM reasoning and coupled simulation output.

Figures

Figures reproduced from arXiv: 2506.22653 by Alexius Wadell, Arthur Lui, Earl Lawrence, Golo A Wimmer, Harsha Nagarajan, Isaac Michaud, Joan Vendrell Gallart, Michael Grosskopf, Nathan Debardeleben, Rahul Somasundaram, Russell Bent, Sachin Shivakumar, Warren D. Graham.

**Figure 2.** Graphical workflow for the Execution Agent. The URSA execution agent carries out code and tool-using tasks to perform the steps necessary to solve a given problem. The agent is passed a general problem prompt or a particular step as part of a larger plan. These actions are carried out by calling Python functions as tools, so a Python wrapper must be used to add additional tools. Allowing the a…

**Figure 3.** Hypothesizer Agent. The goal of the URSA hypothesizer agent is to use web search and a vigorous debate to hypothesize a solution to a user prompt. The differences between the hypothesizer agent and the planning/research agents are an internal iteration for solving the problem and the structure of the output. The hypothesizer consists of three internal subagents: the hypothesis generator, the critic, and the…

**Figure 4.** ArXiv Agent. The e-print repository ArXiv provides an open access store of research prints [3, 4]. The goal of the URSA ArXiv agent is to use the ArXiv search API to find papers relevant to a given problem and then use an LLM to process the text and images in the paper to summarize the cutting-edge research related to the motivating problem. Similar to the other URSA agents, the input to this agent is a…

**Figure 5.** Convergence plot of the optimization of the six-hump camel function as generated by the URSA written and evaluated Bayesian optimization script. Optimize the six-hump camel function. Start by evaluating that function at 10 locations. Then utilize Bayesian optimization to build a surrogate model and sequentially select points until the function is optimized. Carry out the optimization and report the resul…

**Figure 6.** Prediction of log neutron yield in an ICF target from Helios simulation using a Gaussian…

**Figure 7.** The sequence of evaluations of 1D Helios by the URSA Execution Agent driven by o1.

**Figure 8.** Comparison of URSA to Bayesian optimization for designing a direct-drive ICF design.

**Figure 9.** Design optimization summary with plausible fake data, presented as real results by the…

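The workflow described in the Figure 5 prompt (10 initial evaluations, then a surrogate model that sequentially selects new points) can be sketched in plain Python. This is not the script URSA generated. It is a minimal illustration under stated assumptions: a Gaussian-process surrogate with a fixed unit-length-scale squared-exponential kernel, expected improvement as the acquisition rule, random candidate sampling over the conventional domain [-3, 3] × [-2, 2], and hypothetical names such as `bayes_opt`.

```python
import math
import random


def camel(x, y):
    """Six-hump camel function; its global minimum is about -1.0316."""
    return (4 - 2.1 * x**2 + x**4 / 3) * x**2 + x * y + (-4 + 4 * y**2) * y**2


def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two 2-D points."""
    d2 = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return math.exp(-0.5 * d2 / length**2)


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def forward_sub(L, b):
    """Solve L z = b for lower-triangular L."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z


def backward_sub(L, z):
    """Solve L^T a = z for lower-triangular L."""
    n = len(z)
    a = [0.0] * n
    for i in reversed(range(n)):
        a[i] = (z[i] - sum(L[k][i] * a[k] for k in range(i + 1, n))) / L[i][i]
    return a


def expected_improvement(mu, sigma, best):
    """Expected improvement over the incumbent `best` when minimizing."""
    if sigma < 1e-12:
        return 0.0
    u = (best - mu) / sigma
    pdf = math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(u / math.sqrt(2)))
    return (best - mu) * cdf + sigma * pdf


def bayes_opt(n_init=10, n_iter=15, n_cand=200, seed=0):
    """Return (best point, best value, best-so-far history)."""
    rng = random.Random(seed)

    def sample():
        return (rng.uniform(-3, 3), rng.uniform(-2, 2))

    X = [sample() for _ in range(n_init)]
    y = [camel(*p) for p in X]
    history = [min(y)]
    for _ in range(n_iter):
        # Standardize observations so a unit-variance GP prior is reasonable.
        mean = sum(y) / len(y)
        std = math.sqrt(sum((v - mean) ** 2 for v in y) / len(y)) or 1.0
        t = [(v - mean) / std for v in y]
        # Kernel matrix with a small jitter on the diagonal for stability.
        K = [[rbf(a, b) + (1e-4 if i == j else 0.0) for j, b in enumerate(X)]
             for i, a in enumerate(X)]
        L = cholesky(K)
        alpha = backward_sub(L, forward_sub(L, t))
        best_ei, best_x = -1.0, None
        for _ in range(n_cand):
            c = sample()
            k = [rbf(c, p) for p in X]
            mu = sum(ki * ai for ki, ai in zip(k, alpha))
            v = forward_sub(L, k)
            var = max(1.0 - sum(vi * vi for vi in v), 0.0)
            ei = expected_improvement(mu, math.sqrt(var), min(t))
            if ei > best_ei:
                best_ei, best_x = ei, c
        X.append(best_x)
        y.append(camel(*best_x))
        history.append(min(y))
    i = y.index(min(y))
    return X[i], y[i], history


point, value, history = bayes_opt()
print(point, value)
```

The `history` list holds the best value found so far after each step, which is the quantity a convergence plot like Figure 5 would display. A production script would more likely fit kernel hyperparameters and use a library surrogate rather than this hand-rolled one.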

Original abstract

Large language models (LLMs) have moved far beyond their initial form as simple chatbots, now carrying out complex reasoning, planning, writing, coding, and research tasks. These skills overlap significantly with those that human scientists use day-to-day to solve complex problems that drive the cutting edge of research. Using LLMs in "agentic" AI has the potential to revolutionize modern science and remove bottlenecks to progress. In this work, we present URSA, a scientific agent ecosystem for accelerating research tasks. URSA consists of a set of modular agents and tools, including coupling to advanced physics simulation codes, that can be combined to address scientific problems of varied complexity and impact. This work highlights the architecture of URSA, as well as examples that highlight the potential of the system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces URSA, a scientific agent ecosystem for accelerating research tasks. URSA consists of a set of modular agents and tools, including coupling to advanced physics simulation codes, that can be combined to address scientific problems of varied complexity and impact. The work highlights the architecture of URSA as well as examples that highlight the potential of the system.

Significance. If the claims hold and are supported by evidence, URSA could represent a meaningful contribution to agentic AI applications in science by offering a modular framework that integrates LLMs with domain-specific simulation tools. This approach aligns with ongoing efforts to automate aspects of scientific workflows. However, the current manuscript provides no empirical validation, case studies, or performance metrics, so its significance cannot be determined from the available text.

major comments (1)

Abstract: The central claim that the modular agents and tools (including physics simulation couplings) can be combined to address scientific problems of varied complexity and impact lacks any supporting data, validation results, error analysis, implementation details, or even the promised examples. The manuscript supplies only a high-level architectural description, leaving the claim as an unevaluated assertion.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and constructive feedback on our manuscript introducing URSA. We address the major comment below.

Referee: Abstract: The central claim that the modular agents and tools (including physics simulation couplings) can be combined to address scientific problems of varied complexity and impact lacks any supporting data, validation results, error analysis, implementation details, or even the promised examples. The manuscript supplies only a high-level architectural description, leaving the claim as an unevaluated assertion.

Authors: The abstract is intentionally concise and high-level, as is standard. The full manuscript expands on the architecture with implementation details and includes concrete examples demonstrating how modular agents and tools (including physics simulation couplings) are combined for scientific problems of varying complexity. These examples illustrate the system's potential without claiming exhaustive benchmarks. We agree that the abstract could better preview the examples and will revise it accordingly. Comprehensive empirical validation, error analysis, and performance metrics are beyond the scope of this initial framework paper but are planned for follow-up work. revision: partial

Circularity Check

0 steps flagged

No significant circularity: purely descriptive architecture with no derivations

The available text consists solely of an abstract describing the URSA agent ecosystem as a set of modular agents and tools for scientific problems. No equations, derivations, predictions, fitted parameters, or load-bearing claims derived from prior results appear. The central statements are high-level architectural descriptions and mentions of examples, with no chain that reduces by construction to the paper's own inputs or self-citations. This is a self-contained system overview rather than a derived result, so no circular steps exist.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that current LLMs already possess scientist-like reasoning skills; the paper introduces the URSA system itself as the primary new entity without additional free parameters or external evidence in the abstract.

axioms (1)

domain assumption Large language models have moved far beyond their initial form as simple chatbots, now carrying out complex reasoning, planning, writing, coding, and research tasks.
This premise is stated directly in the opening of the abstract and underpins the decision to build an agentic scientific system.

invented entities (1)

URSA no independent evidence
purpose: A scientific agent ecosystem consisting of modular agents and tools for accelerating research tasks.
The system is introduced in this work; the abstract provides no mention of prior independent validation or external benchmarks.

pith-pipeline@v0.9.0 · 5678 in / 1385 out tokens · 50925 ms · 2026-05-19T07:10:59.116183+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: URSA consists of a set of modular agents and tools, including coupling to advanced physics simulation codes, that can be combined to address scientific problems of varied complexity and impact.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We show that URSA outperforms standard methods (Bayesian optimization) for a design optimization task utilizing radiation hydrodynamics simulation.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 4 internal anchors

[1] LangChain community. LangGraph, 2025.

[2] Alireza Ghafarollahi and Markus J Buehler. Sciagents: Automating scientific discovery through multi-agent intelligent graph reasoning. arXiv preprint arXiv:2409.05556, 2024.

[3] Paul Ginsparg. First steps towards electronic research communication. Computers in Physics, 8(4):390–396, 1994.

[4] Paul Ginsparg. ArXiv at 20. Nature, 476(7359):145–147, 2011.

[5] Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an AI co-scientist. arXiv preprint arXiv:2502.18864, 2025.

[6] Robert B Gramacy. Surrogates: Gaussian process modeling, design, and optimization for the applied sciences. Chapman and Hall/CRC, 2020.

[7] Mourad Gridach, Jay Nanavati, Khaldoun Zine El Abidine, Lenon Mendes, and Christina Mack. Agentic AI for scientific discovery: A survey of progress, challenges, and future directions. In Proceedings of the International Conference on Learning Representations (ICLR), 2025. arXiv:2503.08979.

[8] Tennison Liu, Nicolás Astorga, Nabeel Seedat, and Mihaela van der Schaar. Large language models to enhance Bayesian optimization. arXiv preprint arXiv:2402.03921, 2024.

[9] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024.

[10] JJ MacFarlane, IE Golovkin, and PR Woodruff. Helios-CR – a 1-D radiation-magnetohydrodynamics code with inline atomic kinetics modeling. Journal of Quantitative Spectroscopy and Radiative Transfer, 99(1-3):381–397, 2006.

[11] Marcin Molga and Czesław Smutnicki. Test functions for optimization needs. Test functions for optimization needs, 101(48):32, 2005.

[12] DS Montgomery, William Scott Daughton, Brian James Albright, Andrei N Simakov, Douglas Carl Wilson, Evan S Dodd, RC Kirkpatrick, Robert Gregory Watt, Mark A Gunderson, Eric Nicholas Loomis, et al. Design considerations for indirectly driven double shell capsules. Physics of Plasmas, 25(9), 2018.

[13] Siddharth Narayanan, James D Braza, Ryan-Rhys Griffiths, Manu Ponnapati, Albert Bou, Jon Laurent, Ori Kabeli, Geemi Wellawatte, Sam Cox, Samuel G Rodriques, et al. Aviary: training language agents on challenging scientific tasks. arXiv preprint arXiv:2412.21154, 2024.

[14] OpenAI. Deep research system card. OpenAI System Cards, 2025.

[15] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[16] Shuo Ren, Pu Jian, Zhenjiang Ren, Chunlin Leng, Can Xie, and Jiajun Zhang. Towards scientific intelligence: A survey of LLM-based scientific agents. arXiv preprint arXiv:2503.24047, 2025.

[17] Leonard Richardson. Beautiful Soup documentation, 2007.

[18] Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, and Emad Barsoum. Agent Laboratory: Using LLM agents as research assistants. arXiv preprint arXiv:2501.04227, 2025.

[19] Johannes Schneider. Generative to agentic AI: Survey, conceptualization, and challenges. arXiv preprint arXiv:2504.18875, 2025.

[20] Nomita Nirmal Vazirani, Michael John Grosskopf, David James Stark, Paul Andrew Bradley, Brian Michael Haines, E Loomis, Scott L England, and Wayne A Scales. Coupling 1D xRAGE simulations with machine learning for graded inner shell design optimization in double shell capsules. Physics of Plasmas, 28(12), 2021.

[21] Hongchen Wang, Kangming Li, Scott Ramsay, Yao Fehlis, Edward Kim, and Jason Hattrick-Simpers. Evaluating the performance and robustness of LLMs in materials science Q&A and property predictions. arXiv preprint arXiv:2409.14572, 2024.

[22] Yu Wang, Shiwan Zhao, Zhihu Wang, Heyuan Huang, Ming Fan, Yubo Zhang, Zhixing Wang, Haijun Wang, and Ting Liu. Strategic chain-of-thought: Guiding accurate reasoning in LLMs through strategy elicitation. arXiv preprint arXiv:2409.03271, 2024.

[23] Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint arXiv:2504.08066, 2025.
[24] Rui Zhou, Vir Sikand, and Sudhit Rao. AI agents for deep scientific research. UIUC Spring 2025 CS598 LLM Agent Workshop, Submitted.

A Code Blocks for the ArXiv, Hypothesizer, and Research Agents

Code Block 3 ArXiv Agent

    function arxiv_agent(String query, String context)
        paper_pdfs = arxiv_api_call(query, max_papers)
        summaries = []

        for pdf in p...
[25]

A descriptive name for the step

work page
[26]

A detailed description of what needs to be done

work page
[27]

Whether the step requires generating and executing code

work page
[28]

Expected outputs of the step

work page
[29]

[APPROVED]

How to evaluate whether the step was successful Consider a diverse range of appropriate steps such as: • Data gathering or generation • Data preprocessing and cleaning • Analysis and modeling • Hypothesis testing • Visualization • Evaluation and validation Only allocate the steps that are needed to solve the problem. Reflection Prompt You are acting as...

work page
[30]

Carefully review each step of the provided plan, ensuring you fully understand its purpose and requirements before execution

work page
[31]

• Writing and executing computer code when solving computational tasks

Use the appropriate tools available to execute each step effectively, including: • Performing internet searches to gather additional necessary information. • Writing and executing computer code when solving computational tasks. Do not generate any placeholder or synthetic data! Only real data! • Executing safe and relevant system commands as required, aft...

work page
[32]

• Any code written, commands executed, or searches performed

Clearly document each action you take, including: • The tools or methods you used. • Any code written, commands executed, or searches performed. • Outcomes, results, or errors encountered during execution

work page
[33]

Your goal is to execute the provided plan accurately, safely, and transparently, maintaining accountability at each step

Immediately highlight and clearly communicate any steps that appear unclear, unsafe, or impractical before proceeding. Your goal is to execute the provided plan accurately, safely, and transparently, maintaining accountability at each step. Safety Prompt Assume commands to run python and Julia are safe because the files are from a trusted source. Answer o...

work page
[34]

Identify the level of strictness that is required for answering the user’s query

work page
[35]

Clearly list any unsupported assumptions or claims lacking proper citation

work page
[36]

Identify any missing information or critical details that should have been included

work page
[37]

[APPROVED]

Suggest specific actions or additional searches the researcher should undertake if the provided information is incomplete or insufficient. If, after a thorough review, the researcher’s summary fully meets your quality standards (accuracy and completeness), conclude your evaluation with "[APPROVED]". Your primary goal is to ensure rigor, accuracy, and reli...

work page
[38]

**Text-Based Insights**: Summarize the main contributions and findings from the written text

work page
[39]

do research, install and run reputable physics models, or build data-driven forward models from open online data

**Image-Based Insights**: Describe what the extracted image/plot interpretations add or illustrate. If the image data supports or contradicts the text, mention that. Here is the paper content: {paper} ArXiv Paper Summarizer Prompt (Skip Images) You are a scientific assistant helping summarize research papers. The paper below consists of the main written c...

work page
[40]

placeholder

Overall Conclusions & Recommendations • Step-5 established a solid alloy/weld process with minimal defects and favorable microstructure. • Step-6 confirmed excellent low-temperature properties in both parent and weld regions, with minor further optimization recommended (e.g., fine-tuning weld filler or heat treatment for improved fatigue resistance). • F...

work page
[41]

Sotani, K

Neutron star mass-radius constraints using the high-frequency QPOs of GRB 200415A by H. Sotani, K. D. Kokkotas, N. Stergioulas Link: https://arxiv.org/abs/2303.03150v2 Summary: Text–Based Insights • The four high–frequency QPOs detected in the 2020 giant flare GRB 200415A (836, 1444, 2132 and 4250 Hz; quoted 1–σ error of ≃ 10%) can be reproduced by the ℓ ...

work page arXiv 2020
[42]

experimental errors in K0 and L (dominant)

work page
[43]

identification of the observed peaks with a specific set of overtones

work page
[44]

neglect of magnetic corrections (valid only for B ≲1015 G, Appendix A)

work page
[45]

omission of relativistic metric perturbations (Cowling approximation)

work page
[46]

poorly known superfluid fraction in the cylindrical–pasta region

work page
[47]

double–parallelogram

≲ 10% statistical errors in the measured QPO frequencies. • Even with these uncertainties, the deduced radius band (roughly R = 12.5 ± 0.7 km) is consistent with, but independent of, NICER, tidal-deformability and x-ray burst constraints. Image–Based Insights • Fig. 1 (not shown here). Demonstrates that for a fixed mass and radius the n = 1 overtone varie...

work page
[48]

smooth” continuation and a “maximally–stiff

Neutron star radii, deformabilities, and moments of inertia from experimental and ab initio theory constraints on the 208Pb neutron skin thickness by Yeunhwan Lim, Jeremy W. Holt Link: https://arxiv.org/abs/2204.09000v2 Summary: Text-Based Insights • A global Bayesian analysis was performed that combines (i) chiral EFT predictions for homogeneous matter u...

work page arXiv
[49]

all–experiments

Constraints on the Nuclear Symmetry Energy from Experiments, Theory and Observations by James M. Lattimer Link: https://arxiv.org/abs/2308.08001v1 Summary: Text-Based Insights • A near–linear correlation exists between the slope of the symmetry energy L and the radius of a 1.4 M⊙ neutron star, R1.4, originating from the fact that the pressure of β–equilib...

work page arXiv
[50]

Guidelines: • The answer NA means that the abstract and introduction do not include the claims made in the paper

Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: [Yes] Justification: The abstract and introduction highlight the motivation for the Agentic AI approach we are presenting and the body of the work then details those claims directly. Guidelines: • The answer NA means th...

work page
[51]

Limitations

Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: Yes, the discussion highlights limitations and failure modes of URSA and an appendix is dedicated to highlighting specific examples of negative outcomes. Guidelines: • The answer NA means that the paper has no limitation while...

work page
[52]

Guidelines: • The answer NA means that the paper does not include theoretical results

Theory assumptions and proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [NA] Justification: This work is not motivated by or claiming any theoretical results. Guidelines: • The answer NA means that the paper does not include theoretical results. • All the theorems,...

work page
[53]

When the agentic code is open sourced, it will include the exact examples used to generate the results in this work

Experimental result reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: The paper includes directl...

work page
[54]

Guidelines: • The answer NA means that paper does not include experiments requiring code

Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We are working to open source the code and would expect to before the camera-ready date for a manuscript, ho...

work page
[55]

Guidelines: • The answer NA means that the paper does not include experiments

Experimental setting/details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [NA] Justification: While the code to generate the results will be open sourced, we do not have specific results that are being as...

work page
[56]

Guidelines: • The answer NA means that the paper does not include experiments

Experiment statistical significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [NA] Justification: This paper does not have any experiments for which this is relevant or appropriate. Guidelines: • The answer NA means that the pa...

work page
[57]

Experiments compute resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [No] Justification: The main measure of resource relevant here is API cost for using the OpenAI models. While the costs a...

work page
[58]

Guidelines: • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics

Code of ethics Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines? Answer: [Yes] Justification: The authors have reviewed the code of Ethics and confirm that the paper and research conform to the code of Ethics. Guidelines: • The answer NA means that the ...

work page
[59]

Guidelines: • The answer NA means that there is no societal impact of the work performed


[1] LangChain community. LangGraph, 2025.

[2] Alireza Ghafarollahi and Markus J Buehler. Sciagents: Automating scientific discovery through multi-agent intelligent graph reasoning. arXiv preprint arXiv:2409.05556, 2024.

[3] Paul Ginsparg. First steps towards electronic research communication. Computers in physics, 8(4):390–396, 1994.

[4] Paul Ginsparg. Arxiv at 20. Nature, 476(7359):145–147, 2011.

[5] Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an AI co-scientist. arXiv preprint arXiv:2502.18864, 2025.

[6] Robert B Gramacy. Surrogates: Gaussian process modeling, design, and optimization for the applied sciences. Chapman and Hall/CRC, 2020.

[7] Mourad Gridach, Jay Nanavati, Khaldoun Zine El Abidine, Lenon Mendes, and Christina Mack. Agentic ai for scientific discovery: A survey of progress, challenges, and future directions. In Proceedings of the International Conference on Learning Representations (ICLR), 2025. arXiv:2503.08979.

[8] Tennison Liu, Nicolás Astorga, Nabeel Seedat, and Mihaela van der Schaar. Large language models to enhance bayesian optimization. arXiv preprint arXiv:2402.03921, 2024.

[9] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024.

[10] JJ MacFarlane, IE Golovkin, and PR Woodruff. Helios-cr–a 1-d radiation-magnetohydrodynamics code with inline atomic kinetics modeling. Journal of Quantitative Spectroscopy and Radiative Transfer, 99(1-3):381–397, 2006.

[11] Marcin Molga and Czesław Smutnicki. Test functions for optimization needs. Test functions for optimization needs, 101(48):32, 2005.

[12] DS Montgomery, William Scott Daughton, Brian James Albright, Andrei N Simakov, Douglas Carl Wilson, Evan S Dodd, RC Kirkpatrick, Robert Gregory Watt, Mark A Gunderson, Eric Nicholas Loomis, et al. Design considerations for indirectly driven double shell capsules. Physics of Plasmas, 25(9), 2018.

[13] Siddharth Narayanan, James D Braza, Ryan-Rhys Griffiths, Manu Ponnapati, Albert Bou, Jon Laurent, Ori Kabeli, Geemi Wellawatte, Sam Cox, Samuel G Rodriques, et al. Aviary: training language agents on challenging scientific tasks. arXiv preprint arXiv:2412.21154, 2024.

[14] OpenAI. Deep research system card. OpenAI System Cards, 2025.

[15] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[16] Shuo Ren, Pu Jian, Zhenjiang Ren, Chunlin Leng, Can Xie, and Jiajun Zhang. Towards scientific intelligence: A survey of llm-based scientific agents. arXiv preprint arXiv:2503.24047, 2025.

[17] Leonard Richardson. Beautiful soup documentation, 2007.

[18] Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants. arXiv preprint arXiv:2501.04227, 2025.

[19] Johannes Schneider. Generative to agentic ai: Survey, conceptualization, and challenges. arXiv preprint arXiv:2504.18875, 2025.

[20] Nomita Nirmal Vazirani, Michael John Grosskopf, David James Stark, Paul Andrew Bradley, Brian Michael Haines, E Loomis, Scott L England, and Wayne A Scales. Coupling 1d xrage simulations with machine learning for graded inner shell design optimization in double shell capsules. Physics of Plasmas, 28(12), 2021.

[21] Hongchen Wang, Kangming Li, Scott Ramsay, Yao Fehlis, Edward Kim, and Jason Hattrick-Simpers. Evaluating the performance and robustness of llms in materials science q&a and property predictions. arXiv preprint arXiv:2409.14572, 2024.

[22] Yu Wang, Shiwan Zhao, Zhihu Wang, Heyuan Huang, Ming Fan, Yubo Zhang, Zhixing Wang, Haijun Wang, and Ting Liu. Strategic chain-of-thought: Guiding accurate reasoning in llms through strategy elicitation. arXiv preprint arXiv:2409.03271, 2024.

[23] Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint arXiv:2504.08066, 2025.

[24] Rui Zhou, Vir Sikand, and Sudhit Rao. Ai agents for deep scientific research. UIUC Spring 2025 CS598 LLM Agent Workshop, Submitted.

A Code Blocks for the ArXiv, Hypothesizer, and Research Agents

Code Block 3 ArXiv Agent

    function arxiv_agent(String query, String context)
        paper_pdfs = arxiv_api_call(query,max_papers)
        summaries = []

        for pdf in p...
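The visible part of Code Block 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the pseudocode is cut off inside the loop, so the loop body here (summarizing each paper against the context) is a guess at its intent, not the authors' code. The parameters `fetch_papers` and `summarize`, and the helpers `fake_fetch` and `fake_summarize`, are hypothetical stand-ins for `arxiv_api_call` and an LLM call so that the sketch runs offline.

```python
from typing import Callable, List


def arxiv_agent(
    query: str,
    context: str,
    fetch_papers: Callable[[str, int], List[str]],
    summarize: Callable[[str, str], str],
    max_papers: int = 3,
) -> List[str]:
    """Fetch up to max_papers paper texts for a query and summarize each one.

    fetch_papers stands in for arxiv_api_call in the pseudocode (assumption);
    summarize stands in for an LLM summarization call (assumption).
    """
    paper_texts = fetch_papers(query, max_papers)
    summaries = []
    for text in paper_texts:
        # Assumed loop body: summarize each paper with the user's context.
        summaries.append(summarize(text, context))
    return summaries


# Offline stand-ins so the sketch runs without network or model access.
def fake_fetch(query: str, max_papers: int) -> List[str]:
    corpus = [f"paper {i} about {query}" for i in range(5)]
    return corpus[:max_papers]


def fake_summarize(text: str, context: str) -> str:
    return f"[{context}] {text.upper()}"


demo_summaries = arxiv_agent(
    "neutron stars", "radius", fake_fetch, fake_summarize, max_papers=2
)
```

Passing the fetch and summarize steps in as callables keeps the agent logic testable without an API key; a real deployment would swap in actual search and model calls.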

[25] A descriptive name for the step

[26] A detailed description of what needs to be done

[27] Whether the step requires generating and executing code

[28] Expected outputs of the step

[29] How to evaluate whether the step was successful

Consider a diverse range of appropriate steps such as:
• Data gathering or generation
• Data preprocessing and cleaning
• Analysis and modeling
• Hypothesis testing
• Visualization
• Evaluation and validation

Only allocate the steps that are needed to solve the problem.

Reflection Prompt

You are acting as...
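The five plan-step fields requested by the planning prompt (name, description, whether code is needed, expected outputs, and success criteria) map naturally onto a small record type. The sketch below is a hypothetical illustration; the class name `PlanStep`, its field names, and the helper `steps_requiring_code` are assumptions made here and are not taken from the URSA code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlanStep:
    """One step of a plan, mirroring the five fields the prompt asks for."""

    name: str                    # a descriptive name for the step
    description: str             # what needs to be done
    requires_code: bool          # whether code must be generated and executed
    expected_outputs: List[str]  # expected outputs of the step
    success_criteria: str        # how to evaluate whether the step succeeded


def steps_requiring_code(plan: List[PlanStep]) -> List[str]:
    """Return the names of the steps that need code generation and execution."""
    return [step.name for step in plan if step.requires_code]


example_plan = [
    PlanStep(
        name="Gather data",
        description="Collect open measurements relevant to the question.",
        requires_code=False,
        expected_outputs=["raw dataset"],
        success_criteria="Dataset is non-empty and its source is documented.",
    ),
    PlanStep(
        name="Fit model",
        description="Fit a regression model to the gathered data.",
        requires_code=True,
        expected_outputs=["fitted parameters", "residual plot"],
        success_criteria="Fit converges and residuals show no clear trend.",
    ),
]

code_steps = steps_requiring_code(example_plan)
```

A structured record like this makes it straightforward to route only the code-bearing steps to an execution agent.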

[30] Carefully review each step of the provided plan, ensuring you fully understand its purpose and requirements before execution

[31] Use the appropriate tools available to execute each step effectively, including:
• Performing internet searches to gather additional necessary information.
• Writing and executing computer code when solving computational tasks. Do not generate any placeholder or synthetic data! Only real data!
• Executing safe and relevant system commands as required, aft...

[32] Clearly document each action you take, including:
• The tools or methods you used.
• Any code written, commands executed, or searches performed.
• Outcomes, results, or errors encountered during execution

[33] Immediately highlight and clearly communicate any steps that appear unclear, unsafe, or impractical before proceeding.

Your goal is to execute the provided plan accurately, safely, and transparently, maintaining accountability at each step.

Safety Prompt

Assume commands to run python and Julia are safe because the files are from a trusted source. Answer o...

[34] Identify the level of strictness that is required for answering the user’s query

[35] Clearly list any unsupported assumptions or claims lacking proper citation

[36] Identify any missing information or critical details that should have been included

[37] Suggest specific actions or additional searches the researcher should undertake if the provided information is incomplete or insufficient.

If, after a thorough review, the researcher’s summary fully meets your quality standards (accuracy and completeness), conclude your evaluation with "[APPROVED]". Your primary goal is to ensure rigor, accuracy, and reli...
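The reviewer prompt above asks the critic to conclude with "[APPROVED]" once a summary meets its standards. One way such a marker could drive a revise-and-review loop is sketched below. The source only states the marker convention; the loop, the round limit, and the names `is_approved`, `refine_until_approved`, `demo_review`, and `demo_revise` are assumptions for illustration.

```python
from typing import Callable, Tuple

APPROVAL_MARKER = "[APPROVED]"


def is_approved(review: str) -> bool:
    """True when the review concludes with the approval marker."""
    return review.rstrip().endswith(APPROVAL_MARKER)


def refine_until_approved(
    draft: str,
    review_fn: Callable[[str], str],
    revise_fn: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, bool, int]:
    """Alternate review and revision until approval or the round limit.

    Returns (final draft, approved flag, number of review rounds used).
    """
    for round_index in range(1, max_rounds + 1):
        review = review_fn(draft)
        if is_approved(review):
            return draft, True, round_index
        draft = revise_fn(draft, review)
    return draft, False, max_rounds


# Toy reviewer and reviser standing in for LLM calls (hypothetical).
def demo_review(draft: str) -> str:
    if "citation" in draft:
        return "Looks complete. [APPROVED]"
    return "Add a citation for the main claim."


def demo_revise(draft: str, review: str) -> str:
    return draft + " (citation added)"


result = refine_until_approved("Summary", demo_review, demo_revise)
```

Checking only the end of the review avoids treating a critique that merely mentions the marker mid-text as an approval; the round limit guards against a loop that never converges.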

[38] **Text-Based Insights**: Summarize the main contributions and findings from the written text

[39] do research, install and run reputable physics models, or build data-driven forward models from open online data

**Image-Based Insights**: Describe what the extracted image/plot interpretations add or illustrate. If the image data supports or contradicts the text, mention that.

Here is the paper content: {paper}

ArXiv Paper Summarizer Prompt (Skip Images)

You are a scientific assistant helping summarize research papers. The paper below consists of the main written c...

[40] Overall Conclusions & Recommendations
• Step-5 established a solid alloy/weld process with minimal defects and favorable microstructure.
• Step-6 confirmed excellent low-temperature properties in both parent and weld regions, with minor further optimization recommended (e.g., fine-tuning weld filler or heat treatment for improved fatigue resistance).
• F...

[41] Neutron star mass-radius constraints using the high-frequency QPOs of GRB 200415A
by H. Sotani, K. D. Kokkotas, N. Stergioulas
Link: https://arxiv.org/abs/2303.03150v2
Summary:
Text–Based Insights
• The four high–frequency QPOs detected in the 2020 giant flare GRB 200415A (836, 1444, 2132 and 4250 Hz; quoted 1–σ error of ≃ 10%) can be reproduced by the ℓ ...

[42] experimental errors in K0 and L (dominant)

[43] identification of the observed peaks with a specific set of overtones

[44] neglect of magnetic corrections (valid only for B ≲ 10^15 G, Appendix A)

[45] omission of relativistic metric perturbations (Cowling approximation)

[46] poorly known superfluid fraction in the cylindrical–pasta region

[47] ≲ 10% statistical errors in the measured QPO frequencies.
• Even with these uncertainties, the deduced radius band (roughly R = 12.5 ± 0.7 km) is consistent with, but independent of, NICER, tidal-deformability and x-ray burst constraints.

Image–Based Insights
• Fig. 1 (not shown here). Demonstrates that for a fixed mass and radius the n = 1 overtone varie...

[48] Neutron star radii, deformabilities, and moments of inertia from experimental and ab initio theory constraints on the 208Pb neutron skin thickness
by Yeunhwan Lim, Jeremy W. Holt
Link: https://arxiv.org/abs/2204.09000v2
Summary:
Text-Based Insights
• A global Bayesian analysis was performed that combines (i) chiral EFT predictions for homogeneous matter u...

[49] Constraints on the Nuclear Symmetry Energy from Experiments, Theory and Observations
by James M. Lattimer
Link: https://arxiv.org/abs/2308.08001v1
Summary:
Text-Based Insights
• A near–linear correlation exists between the slope of the symmetry energy L and the radius of a 1.4 M⊙ neutron star, R1.4, originating from the fact that the pressure of β–equilib...

[50] Guidelines: • The answer NA means that the abstract and introduction do not include the claims made in the paper

Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
Answer: [Yes]
Justification: The abstract and introduction highlight the motivation for the Agentic AI approach we are presenting and the body of the work then details those claims directly.
Guidelines: • The answer NA means th...

[51] Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: Yes, the discussion highlights limitations and failure modes of URSA and an appendix is dedicated to highlighting specific examples of negative outcomes.
Guidelines: • The answer NA means that the paper has no limitation while...

[52] Theory assumptions and proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [NA]
Justification: This work is not motivated by or claiming any theoretical results.
Guidelines: • The answer NA means that the paper does not include theoretical results. • All the theorems,...

[53] When the agentic code is open sourced, it will include the exact examples used to generate the results in this work

Experimental result reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: The paper includes directl...

[54] Guidelines: • The answer NA means that paper does not include experiments requiring code

Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [No]
Justification: We are working to open source the code and would expect to before the camera-ready date for a manuscript, ho...

[55] Guidelines: • The answer NA means that the paper does not include experiments

Experimental setting/details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]
Justification: While the code to generate the results will be open sourced, we do not have specific results that are being as...

[56] Experiment statistical significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
Justification: This paper does not have any experiments for which this is relevant or appropriate.
Guidelines: • The answer NA means that the paper does not include experiments

[57] Experiments compute resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [No]
Justification: The main measure of resource relevant here is API cost for using the OpenAI models. While the costs a...

[58] Guidelines: • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics

Code of ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: The authors have reviewed the Code of Ethics and confirm that the paper and research conform to the Code of Ethics.
Guidelines: • The answer NA means that the ...

[59] Guidelines: • The answer NA means that there is no societal impact of the work performed

Broader impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: We conclude with a brief discussion on this, discussing the broader impacts and potential for flexible scientific agents like URSA.
Guidelines: • The answer NA means that there is no so...

[60] In the appendix we give mention to sandboxing and the importance for computer and data safety

Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [Yes]
Justification: Building and defining safeguards is critical to release of agentic AI work and part of the pro...

[61] Guidelines: • The answer NA means that the paper does not use existing assets

Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: Models and packages used are thoroughly cited and credited.
Guidelines: • The answer NA means th...

[62] New assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: The code assets are documented and documentation will be provided along with code when it is open sourced. However, due to institutional restrictions the code is not openly available at this time. ...

[63] Guidelines: • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects

Crowdsourcing and research with human subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: This work does not involve crowdsourcing or research ...

[64] Guidelines: • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

[65] Answer: [Yes]
Justification: The paper describes in detail the ways in which LLMs are integrated into the agentic workflow

Declaration of LLM usage
Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does not impact the core methodology, scientific rigorousness, or originality of the research, decla...