pith. sign in

arxiv: 2606.12834 · v1 · pith:D6SNHKCSnew · submitted 2026-06-11 · 💻 cs.AI

Fantastic Scientific Agents and How to Build Them: AgentBuild for Rietveld Refinement

Pith reviewed 2026-06-27 07:10 UTC · model grok-4.3

classification 💻 cs.AI
keywords scientific agentsLLM agentsagent constructionRietveld refinementGSAS-IIrubric evaluationworkflow automation
0
0 comments X

The pith

AgentBuild compiles scientific LLM agents from scientist-authored contracts of rubrics, curricula, and knowledge bases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current ways of creating LLM agents for science, such as fine-tuning or prompt engineering, hide the scientist's judgment inside opaque processes. Instead it treats agent construction itself as a workflow stage and introduces AgentBuild, which takes a single contract the scientist writes and uses it to produce a working agent. The contract contains a version-controlled rubric, a graded curriculum of tasks, and external knowledge sources. A rubric-driven judge then controls a meta-optimizer that edits the agent code only inside boundaries set by the contract. The approach is shown on Rietveld refinement of X-ray diffraction data, where the same rubric scores both the quality of fits and the scope of the agent's actions.

Core claim

AgentBuild builds a scientific agent directly from a contract the scientist authors; the contract consists of a version-controlled rubric, a difficulty-graded curriculum, and a curated external knowledge base, and a rubric-driven judge gates a meta-optimizer coding agent that performs edits only inside declared boundaries, demonstrated by a blank-harness run on GSAS-II that advances through an LLZO signal-to-noise ladder to the four-hour scan case.

What carries the argument

The contract, a version-controlled rubric plus difficulty-graded curriculum plus curated knowledge base, which a rubric-driven judge uses to gate edits by a meta-optimizer coding agent.

If this is right

  • The identical rubric can score both the credibility of the Rietveld fit and the scope of the agent's trajectory, so frontier cases become contract failures rather than pattern-fitting failures.
  • Re-running AgentBuild with an updated base model becomes a re-tune step rather than a full rebuild.
  • The scientist's authored contract remains the durable, version-controlled asset across changes in underlying models.
  • Workflow-scope limits surface explicitly as rubric scores instead of remaining hidden inside model behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contract format could be written for other data-analysis tasks that currently rely on ad-hoc LLM prompting.
  • Version control of the contract would allow teams to track changes in scientific standards separately from changes in model weights.
  • If the rubric is made public, different groups could compare agents built from identical contracts on different base models.

Load-bearing premise

A written rubric can capture enough of the scientist's judgment to let the judge reliably decide when an edited agent is acceptable.

What would settle it

Re-run AgentBuild on the same contract but with a newer base model and check whether the new agent reaches the four-hour LLZO scan case with the same or higher rubric score without any change to the contract text.

Figures

Figures reproduced from arXiv: 2606.12834 by Craig A. Bridges, Marshall T. McDonnell, Rafael Ferreira da Silva, Woong Shin.

Figure 2
Figure 2. Figure 2: The two-knowledge-base seam. The producer external knowledge [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The AgentBuild stage contract. The six declared inputs (curriculum, [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The Rietveld build topology for one iteration. The Claude Sonnet [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Run-derived SNR construction trajectory. Iteration 7 is the last complete four-case P-strict milestone and triggers escalation to the 4 hour LLZO scan. [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Numeric Rietveld pattern evidence for the construction milestone and the 4 hour frontier. The top row is the LLZO 30 minute construction milestone, [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
read the original abstract

As scientific workflows shift from deterministic executables to LLM-based agents, the development practices on offer, such as fine-tuning, reinforcement learning, and prompt-and-go, bury the scientist's judgment. We propose treating agent construction as a workflow stage and introduce AgentBuild, which builds a scientific agent from a contract the scientist authors. The contract is a version-controlled rubric, a difficulty-graded curriculum, and a curated external knowledge base. A rubric-driven judge gates a meta-optimizer coding agent that edits the agent within a declared boundary, so the build compiles the agent, not the scientist's judgment. We instantiate this for Rietveld refinement of X-ray diffraction data through GSAS-II behind MCP and A2A, where a blank-harness construction run progresses through a lithium lanthanum zirconium oxide (LLZO) signal-to-noise ladder, reaches the 4 hour scan as a frontier case, and exposes the workflow-scope limits that remain. The same rubric that rewards credible fits also scores trajectory scope, making the frontier a contract failure rather than a pattern-fitting failure. As base models evolve, re-running AgentBuild is a re-tune, not a rebuild, and the scientist's authored contract remains the durable asset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes treating scientific agent construction as an explicit workflow stage via AgentBuild, which generates an agent from a scientist-authored contract (version-controlled rubric + difficulty-graded curriculum + curated external knowledge base). A rubric-driven external judge gates a meta-optimizer coding agent that edits the target agent only inside declared boundaries, so that the build process itself does not embed ongoing human judgment. The method is instantiated for Rietveld refinement of LLZO X-ray data inside GSAS-II (via MCP and A2A), with the claim that a blank-harness run traverses a signal-to-noise ladder, reaches the 4-hour scan as a contract frontier rather than a fitting failure, and remains re-runnable when base models change.

Significance. If the workflow can be shown to function as described, the separation of contract authoring from automated build, the re-use of the same rubric for both fit quality and trajectory scope, and the explicit versioning of the contract as the durable asset would constitute a concrete advance over prompt-and-go or fine-tuning practices. The re-runnability claim when base models evolve is particularly valuable for long-term scientific tooling.

major comments (1)
  1. [Abstract] Abstract: the central empirical claim that a construction run 'reaches the 4 hour scan as a frontier case' and 'exposes the workflow-scope limits' is stated without any quantitative metrics, success rates, error bars, baseline comparisons, or trajectory statistics. This absence makes it impossible to evaluate whether the rubric-driven judge actually gates the meta-optimizer as intended or whether the frontier identification is reproducible.
minor comments (2)
  1. [Abstract] The acronyms MCP and A2A are used without expansion or citation on first appearance.
  2. The manuscript would benefit from an explicit diagram or pseudocode showing the exact gating loop between judge, meta-optimizer, and agent under construction.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback and the recommendation for major revision. We address the single major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central empirical claim that a construction run 'reaches the 4 hour scan as a frontier case' and 'exposes the workflow-scope limits' is stated without any quantitative metrics, success rates, error bars, baseline comparisons, or trajectory statistics. This absence makes it impossible to evaluate whether the rubric-driven judge actually gates the meta-optimizer as intended or whether the frontier identification is reproducible.

    Authors: We agree that the abstract presents the outcome of the blank-harness construction run in primarily qualitative language. The full manuscript describes the signal-to-noise ladder progression and the 4-hour scan as the point at which the rubric-driven judge halts further improvement, but the abstract does not include the supporting quantitative details (e.g., per-level success rates, judge score thresholds, number of meta-optimizer iterations, or trajectory length statistics). In the revised version we will expand the abstract to report these metrics explicitly, drawing from the experimental results already present in the body of the paper. This change will make the reproducibility of the frontier identification clearer without altering the underlying claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in the workflow proposal

full rationale

The paper introduces AgentBuild as a methodological workflow for constructing LLM-based scientific agents from an externally authored contract (rubric, curriculum, and knowledge base), with a rubric-driven judge gating a meta-optimizer coding agent. No mathematical derivations, first-principles predictions, or equations are presented that reduce to inputs by construction. The Rietveld/GSAS-II instantiation is described as an empirical demonstration of the workflow progressing through a signal-to-noise ladder and identifying frontier cases as contract failures, without any fitted parameters renamed as predictions or self-definitional steps. The central claim of separating scientist judgment via the contract and judge is independent and does not rely on self-citation chains or ansatzes smuggled from prior work. The approach is self-contained against external benchmarks as a proposed construction process rather than a closed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or new entities; the approach relies on existing LLM technologies and the proposed contract mechanism.

pith-pipeline@v0.9.1-grok · 5754 in / 1108 out tokens · 41742 ms · 2026-06-27T07:10:58.623337+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 10 canonical work pages · 9 internal anchors

  1. [1]

    A ter- minology for scientific workflow systems,

    F. Suter, T. Coleman, ˙I. Altinta¸ s, R. M. Badia, B. Balis, K. Chard, I. Colonnelli, E. Deelman, P. Di Tommaso, T. Fahringeret al., “A ter- minology for scientific workflow systems,”Future Generation Computer Systems, vol. 174, p. 107974, 2026

  2. [2]

    Do large language models speak scientific workflows?

    O. Yildiz and T. Peterka, “Do large language models speak scientific workflows?” 2024. [Online]. Available: https://arxiv.org/abs/2412.10606

  3. [3]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . K. Li, Y . Wu, and D. Guo, “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,” 2024, introduces Group Relative Policy Optimization (GRPO). [Online]. Available: https://arxiv.org/abs/2402.03300

  4. [4]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y . Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, and A. Liu, “DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning,”Nature, vol. 645, pp. 633–638, 2025, author list truncated; see arXiv for full DeepSeek-AI roste...

  5. [5]

    GSAS-II: the genesis of a modern open-source all purpose crystallography software package,

    B. H. Toby and R. B. V on Dreele, “GSAS-II: the genesis of a modern open-source all purpose crystallography software package,”Journal of Applied Crystallography, vol. 46, no. 2, pp. 544–549, 2013

  6. [6]

    Model context protocol specification,

    Anthropic, “Model context protocol specification,” https: //modelcontextprotocol.io/specification, 2024, open protocol for LLM-tool integration; specification version 2025-11-25 (revision dated 2025-11-25 in the canonical schema repository at https://github.com/modelcontextprotocol/specification). Originally announced by Anthropic November 2024

  7. [7]

    Agent2agent (A2A) pro- tocol specification,

    A2A Project, Google, and Linux Foundation, “Agent2agent (A2A) pro- tocol specification,” https://a2a-protocol.org/latest/specification/, 2025, open protocol for inter-agent communication; specification version 1.0.0 (announced by Google at Cloud Next on 2025-04-09; contributed to the Linux Foundation in June 2025; v0.3 released 2025-07-31). Canonical sche...

  8. [8]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    L. Zheng, W.-L. Chiang, Y . Sheng, S. Zhuang, Z. Wu, Y . Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, “Judging LLM-as-a-judge with MT-bench and chatbot arena,” inAdvances in Neural Information Processing Systems 36 (NeurIPS 2023), Datasets and Benchmarks Track, 2023. [Online]. Available: https://arxiv.org/abs/2306.05685

  9. [9]

    Applying the FAIR principles to computational workflows,

    S. R. Wilkinson, M. Aloqalaa, K. Belhajjame, M. R. Crusoe, B. de Paula Kinoshita, L. Gadelha, D. Garijo, O. J. R. Gustafsson, N. Juty, S. Kanwal, F. Z. Khan, J. Köster, K. Peters-von Gehlen, L. Pouchard, R. K. Rannow, S. Soiland-Reyes, N. Soranzo, S. Sufi, Z. Sun, B. Vilne, M. A. Wouters, D. Yuen, and C. Goble, “Applying the FAIR principles to computation...

  10. [10]

    PROV-AGENT: Unified provenance for tracking AI agent interactions in agentic workflows,

    R. Souza, A. Gueroudji, S. DeWitt, D. Rosendo, T. Ghosal, R. Ross, P. Balaprakash, and R. Ferreira da Silva, “PROV-AGENT: Unified provenance for tracking AI agent interactions in agentic workflows,” inProceedings of the 21st IEEE International Conference on e-Science (eScience 2025), Chicago, IL, USA, 2025

  11. [11]

    The FAIR guiding principles for scientific data management and stewardship,

    M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourneet al., “The FAIR guiding principles for scientific data management and stewardship,”Scientific Data, vol. 3, p. 160018, 2016

  12. [12]

    Introducing the FAIR principles for research software,

    M. Barker, N. P. Chue Hong, D. S. Katz, A.-L. Lamprecht, C. Martinez- Ortiz, F. Psomopoulos, J. Harrow, L. J. Castro, M. Gruenpeter, P. A. Martinez, and T. Honeyman, “Introducing the FAIR principles for research software,”Scientific Data, vol. 9, no. 1, p. 622, 2022

  13. [13]

    FAIR computational workflows,

    C. Goble, S. Cohen-Boulakia, S. Soiland-Reyes, D. Garijo, Y . Gil, M. R. Crusoe, K. Peters, and D. Schober, “FAIR computational workflows,” Data Intelligence, vol. 2, no. 1-2, pp. 108–121, 2020

  14. [14]

    ReAct: Synergizing reasoning and acting in language models,

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “ReAct: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR 2023),

  15. [15]

    ReAct: Synergizing Reasoning and Acting in Language Models

    [Online]. Available: https://arxiv.org/abs/2210.03629

  16. [16]

    Reflexion: Language agents with verbal reinforcement learning,

    N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” inAdvances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023. [Online]. Available: https://arxiv.org/abs/2303. 11366

  17. [17]

    Pegasus, a workflow management system for science automation,

    E. Deelman, K. Vahi, G. Juve, M. Rynge, S. Callaghan, P. J. Maechling, R. Mayani, W. Chen, R. Ferreira da Silva, M. Livny, and K. Wenger, “Pegasus, a workflow management system for science automation,” Future Generation Computer Systems, vol. 46, pp. 17–35, 2015

  18. [18]

    Snakemake—a scalable bioinformatics workflow engine,

    J. Köster and S. Rahmann, “Snakemake—a scalable bioinformatics workflow engine,”Bioinformatics, vol. 28, no. 19, pp. 2520–2522, 2012

  19. [19]

    Nextflow enables reproducible computational work- flows,

    P. Di Tommaso, M. Chatzou, E. W. Floden, P. P. Barja, E. Palumbo, and C. Notredame, “Nextflow enables reproducible computational work- flows,”Nature Biotechnology, vol. 35, no. 4, pp. 316–319, 2017

  20. [20]

    Methods included: Standardizing computational reuse and portability with the Common Workflow Language,

    M. R. Crusoe, S. Abeln, A. Iosup, P. Amstutz, J. Chilton, N. Tijani ´c, H. Ménager, S. Soiland-Reyes, B. Gavrilovi ´c, and C. Goble, “Methods included: Standardizing computational reuse and portability with the Common Workflow Language,”Communications of the ACM, vol. 65, no. 6, pp. 54–63, 2022

  21. [21]

    Parsl: Pervasive parallel programming in Python,

    Y . Babuji, A. Woodard, Z. Li, D. S. Katz, B. Clifford, R. Kumar, L. Lacinski, R. Chard, J. M. Wozniak, I. Foster, M. Wilde, and K. Chard, “Parsl: Pervasive parallel programming in Python,” inProceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing (HPDC ’19). Phoenix, AZ, USA: Association for Computing Machin...

  22. [22]

    funcX: A federated function serving fabric for science,

    R. Chard, Y . Babuji, Z. Li, T. Skluzacek, A. Woodard, B. Blaiszik, I. Foster, and K. Chard, “funcX: A federated function serving fabric for science,” inProceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing (HPDC ’20). Stockholm, Sweden: Association for Computing Machinery, 2020, pp. 65–76

  23. [23]

    The Galaxy platform for accessible, reproducible and collaborative biomedical anal- yses: 2018 update,

    E. Afgan, D. Baker, B. Batut, M. van den Beek, D. Bouvier, M. ˇCech, J. Chilton, D. Clements, N. Coraor, B. A. Grüning, A. Guerler, J. Hillman-Jackson, S. Hiltemann, V . Jalili, H. Rasche, N. Soranzo, J. Goecks, J. Taylor, A. Nekrutenko, and D. Blankenberg, “The Galaxy platform for accessible, reproducible and collaborative biomedical anal- yses: 2018 upd...

  24. [24]

    Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent com- putational research in the life sciences,

    J. Goecks, A. Nekrutenko, and J. Taylor, “Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent com- putational research in the life sciences,”Genome Biology, vol. 11, no. 8, p. R86, 2010

  25. [25]

    DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, H. Miller, M. Zaharia, and C. Potts, “DSPy: Compiling declarative language model calls into self-improving pipelines,” inInternational Conference on Learning Representations (ICLR 2024), 2024. [Online]. Available: https://arxiv.org/abs...

  26. [26]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    L. A. Agrawal, S. Tan, D. Soylu, N. Ziems, R. Khare, K. Opsahl-Ong, A. Singhvi, H. Shandilya, M. J. Ryan, M. Jiang, C. Potts, K. Sen, A. G. Dimakis, I. Stoica, D. Klein, M. Zaharia, and O. Khattab, “GEPA: Reflective prompt evolution can outperform reinforcement learning,” inInternational Conference on Learning Representations (ICLR 2026, Oral), 2025. [Onl...

  27. [27]

    Mathematical discoveries from program search with large language models,

    B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, E. Dupont, F. J. R. Ruiz, J. S. Ellenberg, P. Wang, O. Fawzi, P. Kohli, and A. Fawzi, “Mathematical discoveries from program search with large language models,”Nature, vol. 625, pp. 468–475, 2024

  28. [28]

    Eureka: Human-Level Reward Design via Coding Large Language Models

    Y . J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y . Zhu, L. Fan, and A. Anandkumar, “Eureka: Human-level reward design via coding large language models,” inInternational Conference on Learning Representations (ICLR), 2024. [Online]. Available: https://arxiv.org/abs/2310.12931

  29. [29]

    A profile refinement method for nuclear and magnetic structures,

    H. M. Rietveld, “A profile refinement method for nuclear and magnetic structures,”Journal of Applied Crystallography, vol. 2, no. 2, pp. 65–71, 1969

  30. [30]

    Anthropic, “Claude,” https://www.anthropic.com, 2026, accessed: 2026

  31. [31]

    Autonomous electrochemistry platform with real-time normality testing of voltammetry measurements using ML,

    A. Al-Najjar, N. S. V . Rao, C. A. Bridges, S. Dai, and A. Walters, “Autonomous electrochemistry platform with real-time normality testing of voltammetry measurements using ML,” inProceedings of the 2024 IEEE 20th International Conference on e-Science (e-Science). IEEE, 2024, pp. 1–10, oRNL ACL system paper; Pyro RPC framework and ML normality testing

  32. [32]

    An autonomous chemistry lab for accelerated materials discov- ery and innovation,

    Oak Ridge National Laboratory, “An autonomous chemistry lab for accelerated materials discov- ery and innovation,” https://www.ornl.gov/project/ autonomous-chemistry-lab-accelerated-materials-discovery-and-innovation, 2022, accessed: 2026-06-08

  33. [33]

    Cross-facility orchestration of electrochemistry experiments and computations,

    A. Al-Najjar, N. S. V . Rao, C. A. Bridges, and S. Dai, “Cross-facility orchestration of electrochemistry experiments and computations,” in Workshops of the International Conference on High Performance Com- puting, Network, Storage, and Analysis (SC-W 2023). New York, NY , USA: ACM, Nov. 2023, pp. 2118–2125

  34. [34]

    The INTERSECT open federated architecture for the laboratory of the future,

    C. Engelmann, O. Kuchar, S. Boehm, M. J. Brim, T. Naughton, S. Somnath, S. Atchley, J. Lange, B. Mintz, and E. Arenholz, “The INTERSECT open federated architecture for the laboratory of the future,” inCommunications in Computer and Information Science (CCIS): Accelerating Science and Engineering Discoveries Through Integrated Research Infrastructure for E...

  35. [35]

    LLM agents for interactive workflow provenance: Reference architecture and evaluation methodology,

    R. Souza, T. Poteet, B. Etz, D. Rosendo, A. Gueroudji, W. Shin, P. Bal- aprakash, and R. Ferreira da Silva, “LLM agents for interactive workflow provenance: Reference architecture and evaluation methodology,” in Proceedings of the SC ’25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis (WORKS25), S...

  36. [36]

    ChemOS 2.0: An orchestration architecture for chemical self-driving laboratories,

    M. Sim, M. G. Vakili, F. Strieth-Kalthoff, H. Hao, R. J. Hickman, S. Miret, S. Pablo-García, and A. Aspuru-Guzik, “ChemOS 2.0: An orchestration architecture for chemical self-driving laboratories,”Matter, vol. 7, no. 9, pp. 2959–2977, 2024

  37. [37]

    MADSci: A modular python-based framework to enable autonomous science,

    R. D. Lewis, T. S. Ginsburg, D. Ozgulbas, C. Stone, A. Stroka, A. Cleary, I. T. Foster, and N. Paulson, “MADSci: A modular python-based framework to enable autonomous science,”Journal of Open Source Software, vol. 11, no. 119, p. 9416, 2026, argonne MADSci framework: https://github.com/AD-SDL/MADSci

  38. [38]

    Employing artificial intelligence to steer exascale workflows with Colmena,

    L. Ward, J. G. Pauloski, V . Hayot-Sasson, Y . Babuji, A. Brace, R. Chard, K. Chard, R. Thakur, and I. Foster, “Employing artificial intelligence to steer exascale workflows with Colmena,”The International Journal of High Performance Computing Applications, vol. 39, no. 1, pp. 52–64, 2025

  39. [39]

    Towards a modular architecture for science factories,

    R. Vescovi, T. Ginsburg, K. Hippe, D. Ozgulbas, C. Stone, A. Stroka, R. Butler, B. Blaiszik, T. Brettin, K. Chard, M. Hereld, A. Ramanathan, R. Stevens, A. Vriza, J. Xu, Q. Zhang, and I. Foster, “Towards a modular architecture for science factories,”Digital Discovery, vol. 2, no. 6, pp. 1980–1998, 2023

  40. [40]

    Augmenting large language models with chemistry tools,

    A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, and P. Schwaller, “Augmenting large language models with chemistry tools,” Nature Machine Intelligence, vol. 6, no. 5, pp. 525–535, 2024, chem- Crow LLM-with-tools chemistry agent

  41. [41]

    Autonomous chemical research with large language models,

    D. A. Boiko, R. MacKnight, B. Kline, and G. Gomes, “Autonomous chemical research with large language models,”Nature, vol. 624, no. 7992, pp. 570–578, 2023, coscientist GPT-4 driven autonomous chem- istry agent

  42. [42]

    An autonomous laboratory for the accelerated synthesis of inorganic materials,

    N. J. Szymanski, B. Rendy, Y . Fei, R. E. Kumar, T. He, D. Milsted, M. J. McDermott, M. Gallant, E. D. Cubuk, A. Merchant, H. Kim, A. Jain, C. J. Bartel, K. Persson, Y . Zeng, and G. Ceder, “An autonomous laboratory for the accelerated synthesis of inorganic materials,”Nature, vol. 624, no. 7990, pp. 86–91, 2023, a-Lab autonomous materials synthesis platform

  43. [43]

    Towards an AI co-scientist

    J. Gottweis, W.-H. Weng, A. Daryin, T. Tu, A. Palepu, P. Sirkovic, A. Myaskovsky, F. Weissenberger, K. Rong, R. Tanno, K. Saab, D. Popovici, J. Blum, F. Zhang, K. Chou, A. Hassidim, B. Gokturk, A. Vahdat, P. Kohli, Y . Matias, A. Carroll, K. Kulkarni, N. Tomasev, Y . Guan, V . Dhillon, E. D. Vaishnav, B. Lee, T. R. D. Costa, J. R. Penadés, G. Peltz, Y . X...

  44. [44]

    The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha, “The AI scientist: Towards fully automated open-ended scientific discovery,” 2024, sakana AI technical report. [Online]. Available: https://arxiv.org/abs/2408.06292

  45. [45]

    Automated crystal structure analysis based on blackbox opti- misation,

    Y . Ozaki, Y . Suzuki, T. Hawai, K. Saito, M. Onishi, and K. Ono, “Automated crystal structure analysis based on blackbox opti- misation,”npj Computational Materials, vol. 6, no. 1, p. 75, 2020, bBO-Rietveld implementation and seed data: https://github.com/ quantumbeam/BBO-Rietveld

  46. [46]

    Optuna: A next- generation hyperparameter optimization framework,

    T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A next- generation hyperparameter optimization framework,” inProceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 2623–2631

  47. [47]

    Adaptively driven X-ray diffraction guided by machine learning for autonomous phase identification,

    N. J. Szymanski, C. J. Bartel, Y . Zeng, M. Diallo, H. Kim, and G. Ceder, “Adaptively driven X-ray diffraction guided by machine learning for autonomous phase identification,”npj Computational Materials, vol. 9, no. 1, p. 31, 2023, adaptive-XRD release with Aeris UAI driver: https: //github.com/njszym/Adaptive-XRD

  48. [48]

    Crystallography companion agent for high-throughput materials discovery,

    P. M. Maffettone, L. Banko, P. Cui, Y . Lysogorskiy, M. A. Little, D. Olds, A. Ludwig, and A. I. Cooper, “Crystallography companion agent for high-throughput materials discovery,”Nature Computational Science, vol. 1, no. 4, pp. 290–297, 2021

  49. [49]

    Dara: Automated multiple-hypothesis phase identification and refinement from powder X-ray diffraction,

    Y . Fei, M. J. McDermott, C. L. Rom, S. Wang, and G. Ceder, “Dara: Automated multiple-hypothesis phase identification and refinement from powder X-ray diffraction,”Chemistry of Materials, vol. 38, no. 3, pp. 1364–1376, 2026

  50. [50]

    ChatGPT,

    OpenAI, “ChatGPT,” https://www.openai.com, 2026, accessed: 2026