Cross-domain benchmarks reveal when coordinated AI agents improve scientific inference from partial evidence

Fiona Y. Wong; Markus J. Buehler

arxiv: 2605.22300 · v1 · pith:JY7ZLVROnew · submitted 2026-05-21 · 💻 cs.AI · cs.LG · cs.MA

Pith reviewed 2026-05-22 05:16 UTC · model grok-4.3

keywords: coordinated AI agents · scientific inference · partial evidence · cross-domain benchmark · multi-agent systems · evaluation protocols · paradigm shifts · exoplanet vetting

The pith

Coordinated AI agents add value to scientific inference from partial evidence only when supported by explicit comparators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper sets out to determine under what conditions coordinated AI agents provide benefits over simpler single-agent or baseline approaches when scientists must work with incomplete evidence drawn from multiple instruments, databases, and fields. It tests this through a benchmark covering four quite different problems: converting molecular structures to musical forms, tracking shifts in scientific paradigms over time, forecasting the spread of diseases carried by vectors, and validating candidates for exoplanets detected by transit methods. Each task includes clear comparison points and fixed scoring rules so that any claimed improvement can be checked directly. The findings indicate that coordination is useful mainly when it combines complementary data channels to raise accuracy, or when it supplies better traceability, but it offers little extra when one data source already captures most of the signal or when the task is mainly about creating new representations.

Core claim

The central finding is that a cross-domain benchmark with four tasks and explicit baselines identifies three distinct regimes for the value of coordinated AI agents in scientific inference from partial evidence. Cross-channel composites raise performance when each discipline supplies only part of the picture, reaching AUROC values of 0.944 for climate-vector disease emergence and 0.955 for exoplanet vetting. When one signal is dominant, coordination improves the quality of interpretation and provenance rather than raw accuracy, while in representational tasks like molecular sonification the benefit is in the new form of expression. Overall, the benchmark credits coordination only when the corresponding performance, provenance, or representation claim is supported by explicit comparators.

What carries the argument

Cross-domain benchmark consisting of four tasks, each equipped with a frozen evaluation panel, predefined scoring protocols, explicit baselines, ablations, and null controls.
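As a rough illustration of what scoring against explicit baselines and a null control could look like, the sketch below compares a two-channel composite with each single channel and with a shuffled-label null. Everything here is an assumption made for illustration: the data are synthetic, the names `auroc`, `channel_a`, `channel_b`, and `composite` are hypothetical, and the simple averaging rule is not taken from the paper.

```python
import numpy as np

def auroc(labels, scores):
    """Rank-based AUROC (Mann-Whitney U), using average ranks for tied scores."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    for value in np.unique(scores):  # replace ranks of tied scores by their mean
        tied = scores == value
        ranks[tied] = ranks[tied].mean()
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    u_stat = ranks[labels].sum() - n_pos * (n_pos + 1) / 2
    return float(u_stat / (n_pos * n_neg))

# Synthetic stand-in for a frozen panel (assumption: 400 cases, binary labels).
rng = np.random.default_rng(0)
n_cases = 400
labels = rng.integers(0, 2, size=n_cases)

# Each hypothetical channel sees the label through independent noise,
# so neither channel alone records the full phenomenon.
channel_a = labels + rng.normal(0.0, 1.5, size=n_cases)
channel_b = labels + rng.normal(0.0, 1.5, size=n_cases)
composite = (channel_a + channel_b) / 2.0  # illustrative fusion rule only

results = {
    "channel_a": auroc(labels, channel_a),
    "channel_b": auroc(labels, channel_b),
    "composite": auroc(labels, composite),
    # Null control: same scores, labels shuffled to break any real association.
    "null_shuffled": auroc(rng.permutation(labels), composite),
}

for name, value in results.items():
    print(f"{name}: AUROC = {value:.3f}")
```

Under this setup, a composite claim would be credited only if its AUROC clearly exceeds both single-channel baselines and the shuffled-label null sits near 0.5; the exact thresholds are left open here because the source does not state them.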

Load-bearing premise

The four selected tasks and their frozen evaluation panels with predefined scoring protocols are sufficiently representative of broader scientific inference from partial evidence across disciplines.

What would settle it

Observing that in a new domain with partial evidence, a simple non-coordinated workflow consistently matches or exceeds the coordinated agent's performance on the primary metric without any supporting comparator.

Figures

Figures reproduced from arXiv: 2605.22300 by Fiona Y. Wong, Markus J. Buehler.

**Figure 1.** Summary evidence map. The benchmark design and the regime map summarize where coordination changes the supported inference, where it mainly adds provenance and auditability, and where its contribution is representational rather than predictive. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png]

**Figure 2.** Climate-Vector Emergence: Dakar sentinel site 24-year climate record (2000–2023). [PITH_FULL_IMAGE:figures/full_fig_p005_2.png]

**Figure 3.** Climate-Vector Emergence: ENSO–container ecology mechanistic pathway. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png]

**Figure 4.** Cosmic Filter: per-channel lead time to positive [PITH_FULL_IMAGE:figures/full_fig_p006_4.png]

**Figure 5.** Cosmic Filter: per-candidate signal presence ma [PITH_FULL_IMAGE:figures/full_fig_p006_5.png]

**Figure 6.** Computational Kuhn: citation growth curves across 16 paradigm shifts. [PITH_FULL_IMAGE:figures/full_fig_p007_6.png]

**Figure 7.** Computational Kuhn: composite early-warning signal with Kuhn phase annotations. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png]

**Figure 8.** Sound of Molecules: era-match heatmap. Pairwise similarity scores between 16 drug compounds spanning six pharmacological classes and 6 composers spanning the Baroque through Modern eras. Scores are derived from RDKit descriptor vectors projected onto harmonic-feature embeddings; color scale runs from low (dark) to high (pale blue). Block structure is evaluated against shuffled-label and random-mapping c…

**Figure 9.** Sound of Molecules: physicochemical space colored by era assignment. [PITH_FULL_IMAGE:figures/full_fig_p009_9.png]

**Figure 10.** Artifact-mediated benchmark infrastructure. [PITH_FULL_IMAGE:figures/full_fig_p010_10.png]

Original abstract

Scientific evidence often spans instruments, databases, and disciplines, so no single source records the full phenomenon. This makes it difficult to determine when coordinated AI agents add value over simpler scientific workflows. We evaluate this question with a cross-domain benchmark spanning four scientific tasks: mapping molecular structure into musical representations, detecting historical paradigm shifts in science, identifying vector-borne disease emergence, and vetting transiting-exoplanet candidates. Each case uses a frozen evaluation panel, predefined scoring protocols, explicit baselines, ablations or null controls, and stated limitations. The results define three operating regimes. When different disciplines each capture only part of the phenomenon, cross-channel composites improve over single-channel baselines: climate-vector emergence reaches AUROC 0.944 and exoplanet vetting reaches AUROC 0.955. However, the exoplanet workflow is effectively tied with a strong combined-summary baseline, showing that decomposition does not always improve top-line performance. When one signal dominates, as in paradigm-shift detection, coordination mainly improves interpretation and traceability. For molecular sonification, the gain is representational rather than predictive. ScienceClaw x Infinite provides the auditable artifact and provenance layer for this evaluation. The benchmark therefore assigns value to coordination only when the corresponding performance, provenance, or representation claim is supported by explicit comparators.
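The abstract's remark that the exoplanet workflow is "effectively tied" with a strong baseline suggests some paired comparison on the same frozen panel. One common way to make such a statement checkable is a paired bootstrap over cases, sketched below. This is an assumption about how a tie could be assessed, not the paper's stated procedure; the function names, the 95% percentile interval, and the tie rule in `verdict` are all hypothetical.

```python
import numpy as np

def auroc(labels, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def bootstrap_auroc_gap(labels, workflow, baseline, n_boot=500, seed=0):
    """95% percentile interval for AUROC(workflow) - AUROC(baseline) on paired resamples."""
    labels = np.asarray(labels)
    workflow = np.asarray(workflow, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    if len(np.unique(labels)) < 2:
        raise ValueError("need both classes to compute AUROC")
    rng = np.random.default_rng(seed)
    n_cases = len(labels)
    gaps = []
    while len(gaps) < n_boot:
        idx = rng.integers(0, n_cases, size=n_cases)  # resample cases with replacement
        if len(np.unique(labels[idx])) < 2:
            continue  # skip resamples that lost one class
        gaps.append(auroc(labels[idx], workflow[idx]) - auroc(labels[idx], baseline[idx]))
    low, high = np.percentile(gaps, [2.5, 97.5])
    return float(low), float(high)

def verdict(low, high):
    """Hypothetical reading of the interval: a tie is any interval that contains zero."""
    if low > 0:
        return "workflow better"
    if high < 0:
        return "baseline better"
    return "effectively tied"

# Synthetic usage: two scorers that see the same labels through different noise.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=200)
workflow_scores = labels + rng.normal(0.0, 0.8, size=200)
baseline_scores = labels + rng.normal(0.0, 0.8, size=200)
low, high = bootstrap_auroc_gap(labels, workflow_scores, baseline_scores)
print(f"AUROC gap interval: [{low:.3f}, {high:.3f}] -> {verdict(low, high)}")
```

Resampling cases jointly keeps the comparison paired, so shared difficulty across the two scorers cancels in the difference rather than inflating the interval.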

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Benchmark usefully identifies when coordination helps but tasks may limit generalizability.

This paper's main contribution is a benchmark that shows coordinated AI agents add value over baselines in some partial-evidence scientific tasks but not others, based on four cross-domain examples. It does well in setting up transparent comparisons. The four tasks—molecular sonification, paradigm-shift detection, vector-borne disease emergence, and exoplanet vetting—each come with frozen evaluation panels, predefined scoring, baselines, ablations, and controls. The reported AUROCs of 0.944 and 0.955 in the fusion cases, along with notes on when it ties baselines or helps representation instead, give concrete data points. The provenance layer helps with auditability. The soft spot is representativeness. The abstract gives no selection criteria or coverage argument for these tasks, so the three operating regimes might not apply to other scientific inference problems. If the tasks favor coordination unusually, the takeaway that value is assigned only with explicit comparators could be narrower than claimed. This paper is for people working on multi-agent systems for science who need empirical maps rather than general assertions. Readers focused on practical deployment will get the most from the comparators and regime distinctions. The evaluation design looks solid and avoids circularity. I would recommend it for peer review to get input on task selection and to verify the full methods and results.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces a cross-domain benchmark spanning four scientific tasks—molecular sonification, paradigm-shift detection, vector-borne disease emergence, and exoplanet vetting—to determine when coordinated AI agents improve inference from partial evidence. Each task employs frozen evaluation panels, predefined scoring, explicit baselines/ablations/null controls, and stated limitations. Results delineate three regimes: performance gains from cross-channel composites when signals are partial (AUROC 0.944 for vector-borne disease, 0.955 for exoplanet vetting, though the latter ties a strong summary baseline); improved interpretation/traceability when one signal dominates (paradigm shifts); and representational rather than predictive gains (molecular sonification). ScienceClaw x Infinite supplies the auditable provenance layer. The benchmark assigns value to coordination only when backed by explicit comparators.

Significance. If the regimes and comparator-backed assignment of value hold, the work supplies a practical, falsifiable framework for assessing multi-agent coordination in scientific workflows that span instruments and disciplines. Strengths include the use of external baselines rather than internal parameter fitting, explicit controls, and an auditable artifact for reproducibility. This could guide deployment of coordinated agents by distinguishing performance, interpretability, and representation benefits.

major comments (1)

[Abstract and task selection / methods sections] The central claim that the benchmark 'reveals when' coordination adds value and generalizes across 'scientific inference from partial evidence' rests on the representativeness of the four tasks. No selection criteria, coverage argument, diversity justification, or discussion of why these tasks (molecular sonification, paradigm-shift detection, vector-borne disease emergence, exoplanet vetting) are typical rather than atypical in amenability to cross-channel fusion or frozen panels appears in the abstract or task-description sections. Without this, the three regimes risk being task-specific rather than broadly informative.

minor comments (2)

[Abstract] Abstract reports AUROC values but does not summarize the key limitations or exclusion rules mentioned in the full evaluation; adding a concise limitations clause would improve standalone readability.
[Results] Notation for the three regimes is introduced narratively; a small summary table or explicit enumeration in the results section would aid cross-reference with the baseline comparisons.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We agree that the manuscript would benefit from greater transparency on task selection and will revise accordingly to strengthen the generalizability claims.

Referee: [Abstract and task selection / methods sections] The central claim that the benchmark 'reveals when' coordination adds value and generalizes across 'scientific inference from partial evidence' rests on the representativeness of the four tasks. No selection criteria, coverage argument, diversity justification, or discussion of why these tasks (molecular sonification, paradigm-shift detection, vector-borne disease emergence, exoplanet vetting) are typical rather than atypical in amenability to cross-channel fusion or frozen panels appears in the abstract or task-description sections. Without this, the three regimes risk being task-specific rather than broadly informative.

Authors: We acknowledge that the current manuscript does not include explicit selection criteria or a diversity justification for the four tasks in the abstract or task-description sections. In the revised version we will add a dedicated subsection (likely in Methods or as an expanded paragraph in the Introduction) that articulates the rationale: the tasks were deliberately chosen to cover four distinct modalities of partial evidence (acoustic representations of molecular structure, historical textual records, multi-source epidemiological surveillance data, and photometric time-series signals) across four different scientific domains. This selection enables direct comparison of the three identified regimes under controlled conditions of signal incompleteness and source dominance while employing frozen panels and explicit baselines in each case. We do not claim the tasks constitute a statistically representative sample of all scientific inference problems; rather, they function as illustrative, cross-domain test cases that allow falsifiable evaluation of when coordination improves performance, interpretability, or representation. We will also expand the Limitations section to discuss the scope of generalization and the need for future benchmarks with additional tasks.

Revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with external comparators shows no derivation circularity

The paper presents results from a cross-domain benchmark on four tasks, each using frozen evaluation panels, predefined scoring protocols, explicit baselines, ablations, and null controls. The three operating regimes and the claim that coordination adds value only when backed by comparators are direct outputs of these empirical comparisons rather than any mathematical derivation, fitted parameter, or self-referential definition. No equations, uniqueness theorems, or ansatzes are invoked that reduce to the paper's own inputs or prior self-citations in a load-bearing way. The evaluation is therefore self-contained against the stated external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of the four tasks and the fairness of the stated baselines and scoring protocols; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: The four chosen scientific tasks adequately sample the space of inference problems involving partial evidence across instruments and disciplines.
This premise allows the authors to generalize from the observed regimes to broader claims about when coordination adds value.

pith-pipeline@v0.9.0 · 5766 in / 1287 out tokens · 47057 ms · 2026-05-22T05:16:51.298536+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "The results define three operating regimes... The benchmark therefore assigns value to coordination only when the corresponding performance, provenance, or representation claim is supported by explicit comparators."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "cross-domain benchmark spanning four scientific tasks: mapping molecular structure into musical representations, detecting historical paradigm shifts..."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages

[1] Lu, C. et al. Towards end-to-end automation of AI research. Nature 651, 914–919 (2026). URL https://doi.org/10.1038/s41586-026-10265-5

[2] Ghafarollahi, A. & Buehler, M. J. SciAgents: Automating scientific discovery through bioinspired multi-agent intelligent graph reasoning. Advanced Materials 37, 2413523 (2025). URL https://advanced.onlinelibrary.wiley.com/doi/abs/10.1002/adma.202413523

[3] Gottweis, J. et al. Accelerating scientific discovery with co-scientist. Nature (2026). URL https://doi.org/10.1038/s41586-026-10644-y

[4] Aygün, E. et al. An AI system to help scientists write expert-level empirical software. Nature (2026). URL https://doi.org/10.1038/s41586-026-10658-6

[5] Ghareeb, A. E. et al. A multi-agent system for automating scientific discovery. Nature (2026). URL https://doi.org/10.1038/s41586-026-10652-y

[6] Buehler, M. J. Accelerating scientific discovery with generative knowledge extraction, graph-based representation, and multimodal intelligent graph reasoning. Machine Learning: Science and Technology 5 (2024). URL https://api.semanticscholar.org/CorpusID:268531443

[7] Stewart, I. & Buehler, M. J. Molecular analysis and design using generative artificial intelligence via multi-agent modeling. Molecular Systems Design & Engineering 10, 314–337 (2025). URL http://dx.doi.org/10.1039/D4ME00174E

[8] Wang, F. Y., Lee, D. S., Kaplan, D. L. & Buehler, M. J. Swarms of large language model agents for protein sequence design with experimental validation (2025). URL https://arxiv.org/abs/2511.22311

[9] Ghafarollahi, A. & Buehler, M. J. Sparks: Multi-agent artificial intelligence model discovers protein design principles (2025). URL https://arxiv.org/abs/2504.19017

[10] Ghafarollahi, A. & Buehler, M. J. ProtAgents: protein discovery via large language model multi-agent collaborations combining physics and machine learning. Digital Discovery 3, 1389–1409 (2024).

[11] Stewart, I. A., Hage, T. P., Hsu, Y.-C. & Buehler, M. J. GraphAgents: Knowledge graph-guided agentic AI for cross-domain materials design (2026). URL https://arxiv.org/abs/2602.07491
[12] Wang, H. et al. Scientific discovery in the age of artificial intelligence. Nature 620, 47–60 (2023). URL https://doi.org/10.1038/s41586-023-06221-2

[13] Berens, P., Cranmer, K., Lawrence, N. D., von Luxburg, U. & Montgomery, J. AI for science: An emerging agenda. arXiv preprint (2023). URL https://arxiv.org/abs/2303.04217

[14] Lafferty, K. D. The ecology of climate change and infectious diseases. Ecology 90, 888–900 (2009). URL https://esajournals.onlinelibrary.wiley.com/doi/abs/10.1890/08-0079.1

[16] Medlock, J. M. & Leach, S. A. Effect of climate change on vector-borne disease risk in the UK. The Lancet Infectious Diseases 15, 721–730 (2015). URL https://doi.org/10.1016/S1473-3099(15)70091-5

[17] Semenza, J. C., Rocklöv, J. & Ebi, K. L. Climate change and cascading risks from infectious disease. Infect Dis Ther 11, 1371–1390 (2022). doi: 10.1007/s40121-022-00647-

[18] Kraemer, M. U. G. et al. Past and future spread of the arbovirus vectors Aedes aegypti and Aedes albopictus. Nature Microbiology 4, 854–863 (2019). URL https://doi.org/10.1038/s41564-019-0376-y

[19] Thompson, S. E. et al. Planetary candidates observed by Kepler. VIII. A fully automated catalog with measured completeness and reliability based on data release 25. The Astrophysical Journal Supplement Series 235, 38 (2018). URL https://doi.org/10.3847/1538-4365/aab4f9

[20] Coughlin, J. L. et al. Planetary candidates observed by Kepler. VII. The first fully uniform catalog based on the entire 48-month data set (Q1–Q17 DR24). The Astrophysical Journal Supplement Series 224, 12 (2016). URL https://doi.org/10.3847/0067-0049/224/1/12

[21] Morton, T. D. et al. False positive probabilities for all Kepler objects of interest: 1284 newly validated planets and 428 likely false positives. The Astrophysical Journal 822, 86 (2016). URL https://doi.org/10.3847/0004-637X/822/2/86

[22] Buehler, M. J. Multiscale modeling at the interface of molecular mechanics and natural language through attention neural networks. Accounts of Chemical Research 55, 3387–3403 (2022). URL https://doi.org/10.1021/acs.accounts.2c00330

[23] RDKit: Open-source cheminformatics. https://www.rdkit.org. Accessed: 2026-05-20.

[24] Cuthbert, M. & Ariza, C. Music21: A toolkit for computer-aided musicology and symbolic music data. In International Society for Music Information Retrieval Conference (2010). URL https://api.semanticscholar.org/CorpusID:6411706

[25] Kuhn, T. S. The Structure of Scientific Revolutions (University of Chicago Press, 2012), 4 edn.
[26] Nash, D. et al. The outbreak of West Nile virus infection in the New York City area in 1999. New England Journal of Medicine 344, 1807–1814 (2001). URL https://www.nejm.org/doi/full/10.1056/NEJM200106143442401

[27] Ryan, S. J., Carlson, C. J., Mordecai, E. A. & Johnson, L. R. Global expansion and redistribution of Aedes-borne virus transmission risk with climate change. PLOS Neglected Tropical Diseases 13, e0007213 (2019). URL https://doi.org/10.1371/journal.pntd.0007213

[28] Quintana, E. V. et al. An Earth-sized planet in the habitable zone of a cool star. Science 344, 277–280 (2014).

[29] Luu, R. K. & Buehler, M. J. BioinspiredLLM: Conversational large language model for the mechanics of biological and bio-inspired materials. Advanced Science 11, 2306724 (2024). URL https://advanced.onlinelibrary.wiley.com/doi/abs/10.1002/advs.202306724

[30] Lu, W., Luu, R. K. & Buehler, M. J. Fine-tuning large language models for domain adaptation: exploration of training strategies, scaling, model merging and synergistic capabilities. npj Computational Materials 11, 84 (2025). URL https://doi.org/10.1038/s41524-025-01564-y

[31] Buehler, M. J. PRefLexOR: preference-based recursive language modeling for exploratory optimization of reasoning and agentic thinking. npj Artificial Intelligence 1, 4 (2025). URL https://doi.org/10.1038/s44387-025-00003-z

[32] Yang, Z., Yorke, S. K., Knowles, T. P. J. & Buehler, M. J. Learning the rules of peptide self-assembly through data mining with large language models. Science Advances 11, eadv1971 (2025). URL https://doi.org/10.1126/sciadv.adv1971

[33] Ghafarollahi, A. & Buehler, M. J. Automating alloy design and discovery with physics-aware multimodal multiagent AI. Proceedings of the National Academy of Sciences 122, e2414074122 (2025). URL https://doi.org/10.1073/pnas.2414074122

[34] Wang, F. Y. et al. Autonomous agents coordinating distributed discovery through emergent artifact exchange. arXiv preprint arXiv:2603.14312 (2026). URL https://arxiv.org/abs/2603.14312

[35] Mordecai, E. A. et al. Detecting the impact of temperature on transmission of Zika, dengue, and chikungunya using mechanistic models. PLOS Neglected Tropical Diseases 11, e0005568 (2017). URL https://doi.org/10.1371/journal.pntd.0005568

[36] Crossfield, I. J. M. et al. 197 candidates and 104 validated planets in K2's first five fields. The Astrophysical Journal Supplement Series 226, 7 (2016). URL https://doi.org/10.3847/0067-0049/226/1/7

[37] Mayo, A. W. et al. 275 candidates and 149 validated planets orbiting bright stars in K2 campaigns 0–10. The Astronomical Journal 155, 136 (2018). URL https://doi.org/10.3847/1538-3881/aaadff

[38] Reisen, W. K. Epidemiology of St. Louis encephalitis virus. Adv Virus Res 61, 139–83 (2003). doi: 10.1016/s0065-3527(03)61004-3

[39] Paz, S. Climate change impacts on West Nile virus transmission in a global context. Philos Trans R Soc Lond B Biol Sci 370 (2015). doi: 10.1098/rstb.2013.0561

[40] Batalha, N. M. et al. Planetary candidates observed by Kepler. III. Analysis of the first 16 months of data. The Astrophysical Journal Supplement Series 204, 24 (2013).

[1] [1]

URL https: //doi.org/10.1038/s41586-026-10265-5

Lu, C.et al.Towards end-to-end automation of ai re- search.Nature651, 914–919 (2026). URL https: //doi.org/10.1038/s41586-026-10265-5

work page doi:10.1038/s41586-026-10265-5 2026

[2] [2]

Sciagents: Automating scientific discovery through bioinspired multi-agent intelligent graph reasoning,

Ghafarollahi, A. & Buehler, M. J. Sciagents: Automating scientific discovery through bioinspired multi-agent intel- ligent graph reasoning.Advanced Materials37, 2413523 (2025). URL https://advanced.onlinelibrary. wiley.com/doi/abs/10.1002/adma.202413523. https://advanced.onlinelibrary.wiley.com/ doi/pdf/10.1002/adma.202413523

work page doi:10.1002/adma.202413523 2025

[3] [3]

URL https://doi.org/ 10.1038/s41586-026-10644-y

Gottweis, J.et al.Accelerating scientific discovery with co-scientist.Nature(2026). URL https://doi.org/ 10.1038/s41586-026-10644-y

work page doi:10.1038/s41586-026-10644-y 2026

[4] [4]

URL https://doi.org/10.1038/ s41586-026-10658-6

Aygün, E.et al.An ai system to help sci- entists write expert-level empirical software.Na- ture(2026). URL https://doi.org/10.1038/ s41586-026-10658-6

work page 2026

[5] [5]

E.et al.A multi-agent system for automating scientific discovery.Nature(2026)

Ghareeb, A. E.et al.A multi-agent system for automating scientific discovery.Nature(2026). URL https://doi. org/10.1038/s41586-026-10652-y

work page doi:10.1038/s41586-026-10652-y 2026

[6] [6]

Buehler, M. J. Accelerating scientific discovery with generative knowledge extraction, graph-based repre- sentation, and multimodal intelligent graph reason- ing.Machine Learning: Science and Technology5 (2024). URL https://api.semanticscholar.org/ CorpusID:268531443

work page 2024

[7] [7]

& Buehler, M

Stewart, I. & Buehler, M. J. Molecular analysis and design using generative artificial intelligence via multi- agent modeling.Molecular Systems Design & Engineer- ing10, 314–337 (2025). URL http://dx.doi.org/ 10.1039/D4ME00174E

work page doi:10.1039/d4me00174e 2025

[8] [8]

Y ., Lee, D

Wang, F. Y ., Lee, D. S., Kaplan, D. L. & Buehler, M. J. Swarms of large language model agents for protein sequence design with experimental validation (2025). URL https://arxiv.org/abs/2511.22311. 2511.22311

work page arXiv 2025

[9] [9]

& Buehler, M

Ghafarollahi, A. & Buehler, M. J. Sparks: Multi-agent artificial intelligence model discovers protein design prin- ciples (2025). URL https://arxiv.org/abs/2504. 19017.2504.19017

work page arXiv 2025

[10] [10]

& Buehler, M

Ghafarollahi, A. & Buehler, M. J. Protagents: protein discovery via large language model multi-agent collabo- rations combining physics and machine learning.Digital Discovery3, 1389–1409 (2024)

work page 2024

[11] [11]

GraphAgents: Knowledge Graph-Guided Agentic AI for Cross-Domain Materials Design

Stewart, I. A., Hage, T. P., Hsu, Y .-C. & Buehler, M. J. Graphagents: Knowledge graph-guided agentic ai for cross-domain materials design (2026). URL https:// arxiv.org/abs/2602.07491.2602.07491

work page arXiv 2026

[12] [12]

Wang, H. et al. Scientific discovery in the age of artificial intelligence. Nature 620, 47–60 (2023). URL https://doi.org/10.1038/s41586-023-06221-2

[13] [13]

Berens, P., Cranmer, K., Lawrence, N. D., von Luxburg, U. & Montgomery, J. AI for science: An emerging agenda. arXiv preprint (2023). URL https://arxiv.org/abs/2303.04217

[14] [14]

Lafferty, K. D. The ecology of climate change and infectious diseases. Ecology 90, 888–900 (2009). URL https://esajournals.onlinelibrary.wiley.com/doi/abs/10.1890/08-0079.1

[15] [16]

Medlock, J. M. & Leach, S. A. Effect of climate change on vector-borne disease risk in the UK. The Lancet Infectious Diseases 15, 721–730 (2015). URL https://doi.org/10.1016/S1473-3099(15)70091-5

[16] [17]

Semenza, J. C., Rocklöv, J. & Ebi, K. L. Climate change and cascading risks from infectious disease. Infect Dis Ther 11, 1371–1390 (2022). doi: 10.1007/s40121-022-00647-

[17] [18]

Kraemer, M. U. G. et al. Past and future spread of the arbovirus vectors Aedes aegypti and Aedes albopictus. Nature Microbiology 4, 854–863 (2019). URL https://doi.org/10.1038/s41564-019-0376-y

[18] [19]

Thompson, S. E. et al. Planetary candidates observed by Kepler. VIII. A fully automated catalog with measured completeness and reliability based on data release 25. The Astrophysical Journal Supplement Series 235, 38 (2018). URL https://doi.org/10.3847/1538-4365/aab4f9

[19] [20]

Coughlin, J. L. et al. Planetary candidates observed by Kepler. VII. The first fully uniform catalog based on the entire 48-month data set (Q1–Q17 DR24). The Astrophysical Journal Supplement Series 224, 12 (2016). URL https://doi.org/10.3847/0067-0049/224/1/12

[20] [21]

Morton, T. D. et al. False positive probabilities for all Kepler objects of interest: 1284 newly validated planets and 428 likely false positives. The Astrophysical Journal 822, 86 (2016). URL https://doi.org/10.3847/0004-637X/822/2/86

[21] [22]

Buehler, M. J. Multiscale modeling at the interface of molecular mechanics and natural language through attention neural networks. Accounts of Chemical Research 55, 3387–3403 (2022). URL https://doi.org/10.1021/acs.accounts.2c00330

[22] [23]

RDKit: Open-source cheminformatics. https://www.rdkit.org. Accessed: 2026-05-20.

[23] [24]

Cuthbert, M. & Ariza, C. Music21: A toolkit for computer-aided musicology and symbolic music data. In International Society for Music Information Retrieval Conference (2010). URL https://api.semanticscholar.org/CorpusID:6411706

[24] [25]

Kuhn, T. S. The Structure of Scientific Revolutions (University of Chicago Press, 2012), 4 edn

[25] [26]

Nash, D. et al. The outbreak of West Nile virus infection in the New York City area in 1999. New England Journal of Medicine 344, 1807–1814 (2001). URL https://www.nejm.org/doi/full/10.1056/NEJM200106143442401

[26] [27]

Ryan, S. J., Carlson, C. J., Mordecai, E. A. & Johnson, L. R. Global expansion and redistribution of Aedes-borne virus transmission risk with climate change. PLOS Neglected Tropical Diseases 13, e0007213 (2019). URL https://doi.org/10.1371/journal.pntd.0007213

[27] [28]

Quintana, E. V. et al. An Earth-sized planet in the habitable zone of a cool star. Science 344, 277–280 (2014)

[28] [29]

Luu, R. K. & Buehler, M. J. BioinspiredLLM: Conversational large language model for the mechanics of biological and bio-inspired materials. Advanced Science 11, 2306724 (2024). URL https://advanced.onlinelibrary.wiley.com/doi/abs/10.1002/advs.202306724

[29] [30]

Lu, W., Luu, R. K. & Buehler, M. J. Fine-tuning large language models for domain adaptation: exploration of training strategies, scaling, model merging and synergistic capabilities. npj Computational Materials 11, 84 (2025). URL https://doi.org/10.1038/s41524-025-01564-y

[30] [31]

Buehler, M. J. PRefLexOR: preference-based recursive language modeling for exploratory optimization of reasoning and agentic thinking. npj Artificial Intelligence 1, 4 (2025). URL https://doi.org/10.1038/s44387-025-00003-z

[31] [32]

Yang, Z., Yorke, S. K., Knowles, T. P. J. & Buehler, M. J. Learning the rules of peptide self-assembly through data mining with large language models. Science Advances 11, eadv1971 (2025). URL https://doi.org/10.1126/sciadv.adv1971

[32] [33]

Ghafarollahi, A. & Buehler, M. J. Automating alloy design and discovery with physics-aware multimodal multiagent AI. Proceedings of the National Academy of Sciences 122, e2414074122 (2025). URL https://doi.org/10.1073/pnas.2414074122

[33] [34]

Wang, F. Y. et al. Autonomous agents coordinating distributed discovery through emergent artifact exchange. arXiv preprint arXiv:2603.14312 (2026). URL https://arxiv.org/abs/2603.14312

[34] [35]

Mordecai, E. A. et al. Detecting the impact of temperature on transmission of Zika, dengue, and chikungunya using mechanistic models. PLOS Neglected Tropical Diseases 11, e0005568 (2017). URL https://doi.org/10.1371/journal.pntd.0005568

[35] [36]

Crossfield, I. J. M. et al. 197 candidates and 104 validated planets in K2’s first five fields. The Astrophysical Journal Supplement Series 226, 7 (2016). URL https://doi.org/10.3847/0067-0049/226/1/7

[36] [37]

Mayo, A. W. et al. 275 candidates and 149 validated planets orbiting bright stars in K2 campaigns 0–10. The Astronomical Journal 155, 136 (2018). URL https://doi.org/10.3847/1538-3881/aaadff

[37] [38]

Reisen, W. K. Epidemiology of St. Louis encephalitis virus. Adv Virus Res 61, 139–83 (2003). doi: 10.1016/s0065-3527(03)61004-3

[38] [39]

Paz, S. Climate change impacts on West Nile virus transmission in a global context. Philos Trans R Soc Lond B Biol Sci 370, 20130561 (2015). doi: 10.1098/rstb.2013.0561

[39] [40]

Batalha, N. M. et al. Planetary candidates observed by Kepler. III. Analysis of the first 16 months of data. The Astrophysical Journal Supplement Series 204, 24 (2013).