
arxiv: 2509.06503 · v3 · pith:2QRJUZEZnew · submitted 2025-09-08 · 💻 cs.AI · q-bio.QM

An AI system to help scientists write expert-level empirical software

Pith reviewed 2026-05-22 13:17 UTC · model grok-4.3

classification 💻 cs.AI q-bio.QM
keywords AI for science · empirical software generation · tree search · large language models · single-cell analysis · COVID-19 forecasting · scientific discovery automation

The pith

An AI system uses tree search over LLM-generated code to produce scientific software that outperforms human experts on real leaderboards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Empirical Research Assistance, or ERA, which pairs a large language model with tree search to automatically write and refine software for computational experiments. The system searches through many possible code solutions, keeping only those that improve a chosen quality metric, and pulls in ideas from outside papers to create novel approaches. If successful, this would shorten the time scientists spend writing custom code and let them test more ideas faster. The authors show ERA discovering dozens of new analysis methods that beat the current best entries on public benchmarks in biology and producing better forecasts than official models in public health.

Core claim

ERA is an AI system that uses a large language model guided by tree search to generate, evaluate, and iteratively improve scientific software whose goal is to maximize a domain-specific quality metric. When the system is allowed to explore and integrate complex research ideas from external sources, it produces runnable code that achieves expert-level performance, including 40 novel methods for single-cell data analysis that outperformed the top human-developed entries on a public leaderboard and 14 forecasting models that outperformed the CDC ensemble for COVID-19 hospitalizations.

What carries the argument

Tree search over variants of code generated by the large language model, with each candidate evaluated directly by the target quality metric to decide which branches to expand.
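The loop described above can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: it uses a greedy best-first expansion rather than whatever selection rule ERA actually applies, and the names `tree_search`, `propose`, `toy_score`, and `make_toy_proposer` are hypothetical. The LLM is replaced by a seeded random perturbation of a single number so that the sketch runs without any model, and the "quality metric" is a toy function that peaks at x = 3.

```python
import heapq
import random
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(order=True)
class Node:
    # heapq is a min-heap, so the negated score is stored to pop the best node first.
    neg_score: float
    code: str = field(compare=False)
    depth: int = field(compare=False, default=0)


def tree_search(
    root_code: str,
    propose: Callable[[str, int], List[str]],  # stand-in for the LLM rewrite step
    score: Callable[[str], float],             # stand-in for the quality metric
    budget: int = 50,
    branching: int = 3,
) -> Node:
    """Greedy best-first search: always expand the highest-scoring unexpanded node."""
    root = Node(-score(root_code), root_code)
    frontier = [root]
    best = root
    evaluated = 1
    while frontier and evaluated < budget:
        parent = heapq.heappop(frontier)
        for child_code in propose(parent.code, branching):
            if evaluated >= budget:
                break
            child = Node(-score(child_code), child_code, parent.depth + 1)
            evaluated += 1
            heapq.heappush(frontier, child)
            if child.neg_score < best.neg_score:
                best = child
    return best


def toy_score(code: str) -> float:
    # Hypothetical metric: a "program" is the string "x=<float>", best at x = 3.
    x = float(code.split("=")[1])
    return -((x - 3.0) ** 2)


def make_toy_proposer(seed: int = 0) -> Callable[[str, int], List[str]]:
    rng = random.Random(seed)

    def propose(code: str, k: int) -> List[str]:
        # Hypothetical "rewrite": nudge x by a random amount in [-1, 1].
        x = float(code.split("=")[1])
        return [f"x={x + rng.uniform(-1.0, 1.0):.4f}" for _ in range(k)]

    return propose


best = tree_search("x=0.0", make_toy_proposer(), toy_score, budget=60)
print(best.code, -best.neg_score, best.depth)
```

The point of the sketch is structural: the only signal steering the search is the scalar returned by `score`, which is why the referee report below asks how that scalar relates to the numbers finally reported.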

If this is right

  • The same tree-search approach can be applied to other domains such as geospatial analysis and zebrafish neural prediction, yielding expert-level code without manual coding.
  • ERA can produce entirely new rule-based constructions for time series forecasting that improve on existing techniques.
  • By repeatedly integrating ideas from published literature, the system generates solutions that human developers had not previously combined.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the quality metric can be defined for a new field, the same machinery could accelerate software development in that field without requiring new training of the underlying model.
  • The approach opens the possibility of chaining multiple such systems, where one ERA instance writes code that another instance then uses as input for a downstream analysis.
  • A natural next test would be whether human scientists can steer the search by occasionally editing the quality metric or injecting new constraints mid-process.

Load-bearing premise

The chosen quality metric truly measures expert-level scientific performance and the language model can turn external research ideas into correct, runnable code without human fixes.

What would settle it

Running the 40 single-cell methods discovered by ERA on the same public leaderboard and finding that none of them rank above the previous top human entry.

Original abstract

The cycle of scientific discovery is frequently bottlenecked by the slow, manual creation of software to support computational experiments\cite{hannay2009how}. To address this, we present Empirical Research Assistance (ERA), an AI system that creates expert-level scientific software whose goal is to maximize a quality metric. The system uses a Large Language Model (LLM) and Tree Search (TS)\cite{silver2016mastering} to systematically improve the quality metric and intelligently navigate the large space of possible solutions. ERA achieves expert-level results when it explores and integrates complex research ideas from external sources. The effectiveness of tree search is demonstrated across a diverse range of tasks. In bioinformatics, ERA discovered 40 novel methods for single-cell data analysis that outperformed the top human-developed methods on a public leaderboard. In epidemiology, ERA generated 14 models that outperformed the CDC ensemble and all other individual models for forecasting COVID-19 hospitalizations. ERA also produced expert-level software for geospatial analysis, neural activity prediction in zebrafish, and numerical solution of integrals, and a novel rule-based construction for time series forecasting. By devising and implementing novel solutions to diverse tasks, ERA represents a significant step towards accelerating scientific progress.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Empirical Research Assistance (ERA), an AI system that uses a large language model together with tree search to generate and iteratively improve expert-level empirical software whose objective is to maximize a user-specified quality metric. The central claims are that ERA discovers and implements novel solutions by integrating complex ideas from external sources, with concrete demonstrations including 40 novel methods for single-cell data analysis that outperform the top human-developed entries on a public leaderboard and 14 models for COVID-19 hospitalization forecasting that surpass the CDC ensemble and all other individual models.

Significance. If the outperformance claims are shown to rest on an independently validated quality metric rather than direct optimization of the reported scores, the work would constitute a meaningful advance in automating the creation of domain-specific scientific code. The combination of LLM-based idea integration with tree search for systematic exploration across diverse tasks (bioinformatics, epidemiology, geospatial analysis, neural activity prediction, and numerical methods) is a clear strength, as is the emphasis on producing runnable, expert-level software rather than isolated code snippets.

major comments (2)
  1. [Abstract / Results (bioinformatics and epidemiology)] Abstract and results on bioinformatics/epidemiology: the headline claims of 40 outperforming methods and 14 outperforming models are load-bearing for the assertion of expert-level performance, yet no definition of the quality metric, exact baseline implementations, error bars, or controls against post-hoc selection of the reported solutions are supplied. Without these, it is impossible to determine whether tree search produced genuinely novel expert software or simply optimized the scalar used for both guidance and final reporting.
  2. [Results sections on single-cell analysis and COVID-19 forecasting] The central claim that ERA 'achieves expert-level results when it explores and integrates complex research ideas from external sources' requires evidence that the quality metric is independent of the public leaderboard or forecast accuracy used for evaluation. No independent validation set, expert review protocol, or failure-case analysis is described that would decouple the metric from the reported wins.
minor comments (2)
  1. [Abstract] The abstract cites tree search but does not briefly indicate how the search is adapted to the space of code solutions and external literature integration.
  2. [Results] Ensure that all statements of novelty are accompanied by explicit comparison to the closest prior human or automated methods rather than only to leaderboard rank.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We are encouraged by the recognition of ERA's potential to advance automated scientific software development. Below, we provide point-by-point responses to the major comments, clarifying our approach and outlining revisions to address the concerns about metric definition and independence.

Point-by-point responses
  1. Referee: Abstract and results on bioinformatics/epidemiology: the headline claims of 40 outperforming methods and 14 outperforming models are load-bearing for the assertion of expert-level performance, yet no definition of the quality metric, exact baseline implementations, error bars, or controls against post-hoc selection of the reported solutions are supplied. Without these, it is impossible to determine whether tree search produced genuinely novel expert software or simply optimized the scalar used for both guidance and final reporting.

    Authors: We agree that these details are essential for rigorous evaluation. In the revised manuscript, we will add a dedicated section defining the quality metrics: for single-cell analysis, it is the composite score from the public leaderboard (e.g., based on clustering accuracy metrics like ARI and NMI on test data); for COVID-19 forecasting, it follows the CDC's evaluation protocol using mean absolute error or similar on reported hospitalizations. Exact baseline implementations will be described by referencing the top leaderboard entries and noting how we reproduced or compared against them. Error bars will be included from repeated ERA runs with different random seeds. For controls against post-hoc selection, we will report the number of solutions explored and the distribution of scores, showing that the reported ones are the top performers from the search rather than selected after the fact. While the metric guides the search, the novelty comes from the LLM proposing and implementing integrated ideas from external literature. revision: yes

  2. Referee: The central claim that ERA 'achieves expert-level results when it explores and integrates complex research ideas from external sources' requires evidence that the quality metric is independent of the public leaderboard or forecast accuracy used for evaluation. No independent validation set, expert review protocol, or failure-case analysis is described that would decouple the metric from the reported wins.

    Authors: The quality metric is indeed the performance on the respective benchmarks, as ERA is designed to maximize user-specified metrics for practical scientific tasks. However, the key contribution is the systematic exploration via tree search that allows integration of complex ideas (e.g., from recent papers on single-cell methods) into runnable code, leading to solutions that surpass existing ones. We will revise to include a failure-case analysis, describing instances where the search converged to suboptimal solutions or failed to integrate ideas effectively. We will also detail the expert review by noting that the generated code was validated for correctness and novelty through comparison to literature. An independent validation set separate from the leaderboard is not described because the leaderboards serve as the standard evaluation; we will add a limitations section discussing potential overfitting to public benchmarks and the value of future private test sets. revision: partial
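The exchange above turns on whether the score that guides the search is also the score that gets reported. A small simulation makes the concern concrete. It is a sketch under stated assumptions, not an analysis of the paper's data: every candidate is given the same hypothetical true quality (0.70), and the public benchmark and an independent held-out set each add independent Gaussian noise. The function name `simulate_selection` and all numeric values are illustrative.

```python
import random
import statistics


def simulate_selection(
    n_candidates: int = 200,
    true_quality: float = 0.70,  # hypothetical: all candidates are equally good
    noise: float = 0.05,         # hypothetical evaluation noise (standard deviation)
    seed: int = 0,
) -> tuple:
    """Pick the candidate with the best public score; return its public and held-out scores."""
    rng = random.Random(seed)
    public = [true_quality + rng.gauss(0.0, noise) for _ in range(n_candidates)]
    heldout = [true_quality + rng.gauss(0.0, noise) for _ in range(n_candidates)]
    winner = max(range(n_candidates), key=public.__getitem__)
    return public[winner], heldout[winner]


# Repeat with different seeds, as the rebuttal proposes for error bars.
runs = [simulate_selection(seed=s) for s in range(20)]
public_mean = statistics.mean(p for p, _ in runs)
public_sd = statistics.stdev(p for p, _ in runs)
heldout_mean = statistics.mean(h for _, h in runs)
heldout_sd = statistics.stdev(h for _, h in runs)

print(f"public score of winner:   {public_mean:.3f} +/- {public_sd:.3f}")
print(f"held-out score of winner: {heldout_mean:.3f} +/- {heldout_sd:.3f}")
```

Because the winner is chosen by its public score, that score is biased upward even though no candidate is actually better than any other; the held-out score of the same winner is not. Reporting the number of candidates explored, as the authors offer, lets a reader estimate the size of this effect, but only an evaluation set that played no part in the search removes it.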

Circularity Check

0 steps flagged

No significant circularity: results grounded in external public benchmarks

Full rationale

The paper presents ERA as using LLM+tree search to maximize an internal quality metric, then reports outperformance on independent public leaderboards (bioinformatics) and the CDC ensemble (epidemiology). These external benchmarks are not shown to be identical to the search metric by any quoted equation or definition, and no self-citation chain or ansatz is invoked to force the headline results. The derivation chain therefore remains self-contained against external validation sets rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper relies on standard LLM code-generation capabilities and tree-search navigation but introduces the integrated ERA system and the specific quality-metric-driven loop as its main addition; no new physical entities or mathematical axioms are postulated.

free parameters (1)
  • Quality metric
    The metric that tree search maximizes is central to all reported wins yet is not given an explicit functional form in the abstract.
axioms (1)
  • Domain assumption: tree search combined with an LLM can systematically explore and improve code solutions by integrating external research ideas.
    Invoked to explain how ERA reaches expert-level performance across tasks.

pith-pipeline@v0.9.0 · 5921 in / 1466 out tokens · 51743 ms · 2026-05-22T13:17:42.083645+00:00 · methodology



Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search

    cs.AI 2026-05 unverdicted novelty 7.0

    An LLM-guided tree search system autonomously creates diverse forecasting models that match or beat CDC human-curated ensembles in a 2025-2026 prospective multi-pathogen evaluation.

  2. Probabilistic Seasonal Streamflow Forecasting Across California's Sierra Nevada Watersheds with Agentic AI

    physics.ao-ph 2026-05 unverdicted novelty 7.0

    An agentic AI workflow evolves an adaptive XGBoost quantile regression ensemble that reduces watershed-averaged forecast error by up to 29% versus California's operational forecasts for April-July runoff at 1-6 month ...

  3. Optimized Three-Dimensional Photovoltaic Structures with LLM guided Tree Search

    cs.CL 2026-05 conditional novelty 6.0

    LLM-guided tree search with coding agents optimizes 3D photovoltaic designs for higher diurnal energy yield after correcting for simulation exploits.

  4. Glia: A Human-Inspired AI for Automated Systems Design and Optimization

    cs.AI 2025-10 unverdicted novelty 6.0

    Glia deploys a multi-agent LLM workflow with reasoning, experimentation, and analysis agents to generate interpretable algorithms for request routing, scheduling, and auto-scaling in distributed GPU clusters, reaching...

  5. ATHENA: Agentic Team for Hierarchical Evolutionary Numerical Algorithms

    cs.LG 2025-12 unverdicted novelty 5.0

    ATHENA introduces an agentic team framework that autonomously manages the end-to-end computational research lifecycle via a knowledge-driven HENA loop to achieve validation errors of 10^{-14} in scientific computing a...

  6. TusoAI: Agentic Optimization for Scientific Methods

    cs.AI 2025-09 unverdicted novelty 5.0

    TusoAI is an LLM-based agent that builds and iteratively optimizes domain-specific computational methods for scientific data analysis, outperforming expert baselines on RNA-seq denoising and earth monitoring while rep...

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · cited by 6 Pith papers · 10 internal anchors

  1. Fortin, J. A., Cardille, J. A. & Perez, E. Multi-sensor detection of forest-cover change across 45 years in Mato Grosso, Brazil. Remote Sens. Environ. 238, 111266 (2020)
  2. Hohenberg, P. & Kohn, W. Inhomogeneous electron gas. Phys. Rev. 136, B864 (1964)
  3. Kohn, W. & Sham, L. J. Self-consistent equations including exchange and correlation effects. Phys. Rev. 140, A1133 (1965)
  4. Warshel, A. & Levitt, M. Theoretical studies of enzymic reactions: dielectric, electrostatic and steric stabilization of the carbonium ion in the reaction of lysozyme. J. Mol. Biol. 103, 227–249 (1976)
  5. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021)
  6. Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021)
  7. Hourdin, F. et al. The art and science of climate model tuning. Bull. Am. Meteorol. Soc. 98, 589–602 (2017)
  8. Anderson Jr., J. Basic philosophy of CFD. In Computational Fluid Dynamics, 3–14 (Springer, 2009)
  9. Silver, N. The signal and the noise: why so many predictions fail-but some don’t (Penguin, 2012)
  10. Farmer, J. D. Making sense of chaos: a better economics for a better world (Yale Univ. Press, 2024)
  11. Bernanke, B. & Blanchard, O. What caused the US pandemic-era inflation? Am. Econ. J. Macroecon. 17, 1–35 (2025)
  12. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016)
  13. Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017)
  14. Jiang, Z. et al. AIDE: AI-driven exploration in the space of code. arXiv preprint arXiv:2502.13138 (2025)
  15. Novikov, A. et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131 (2025)
  16. Romera-Paredes, B. et al. Mathematical discoveries from program search with large language models. Nature 625, 468–475 (2024)
  17. Wu, X., Wu, S.-h., Wu, J., Feng, L. & Tan, K. C. Evolutionary computation in the era of large language model: survey and roadmap. IEEE Trans. Evol. Comput. (2024)
  18. Hu, S., Lu, C. & Clune, J. Automated design of agentic systems. arXiv preprint arXiv:2408.08435 (2024)
  19. Xu, C. et al. Automatic cell-type harmonization and integration across Human Cell Atlas datasets. Cell 186, 5876–5891.e20 (2023)
  20. Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017)
  21. Centers for Disease Control and Prevention. COVID-19 forecast hub (2025). URL https://github.com/cdcgov/covid19-forecast-hub?tab=readme-ov-file
  22. Shao, Z., Yang, K. & Zhou, W. Performance evaluation of single-label and multi-label remote sensing image retrieval using a dense labeling dataset. Remote Sens. 10, 964 (2018)
  23. Lueckmann, J.-M. et al. ZAPBench: a benchmark for whole-brain activity prediction in zebrafish. arXiv preprint arXiv:2503.02618 (2025)
  24. Aksu, T. et al. GIFT-Eval: a benchmark for general time series forecasting model evaluation. arXiv preprint arXiv:2410.10393 (2024). URL https://huggingface.co/spaces/Salesforce/GIFT-Eval
  25. Jovic, D. et al. Single-cell RNA sequencing technologies and applications: a brief overview. Clin. and Transl. Med. 12, e694 (2022)
  26. Svensson, V., Vento-Tormo, R. & Teichmann, S. A. Exponential scaling of single-cell RNA-seq in the past decade. Nat. Protoc. 13, 599–604 (2018)
  27. CZI Cell Science Program et al. CZ CELLxGENE Discover: a single-cell data platform for scalable exploration, analysis and modeling of aggregated data. Nucleic Acids Res. 53, D886–D900 (2025)
  28. Zhang, J. et al. Tahoe-100M: a giga-scale single-cell perturbation atlas for context-dependent gene function and cellular modeling. bioRxiv 2025–02 (2025)
  29. Stuart, T. & Satija, R. Integrative single-cell analysis. Nat. Rev. Genet. 20, 257–272 (2019)
  30. Zappia, L., Phipson, B. & Oshlack, A. Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database. PLoS Comput. Biol. 14, e1006245 (2018)
  31. Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 1–32 (2020)
  32. Chazarra-Gil, R., van Dongen, S., Kiselev, V. Y. & Hemberg, M. Flexible comparison of batch correction methods for single-cell RNA-seq using BatchBench. Nucleic Acids Res. 49, e42 (2021)
  33. Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022)
  34. Luecken, M. D. et al. Defining and benchmarking open problems in single-cell analysis. Nat. Biotechnol. 43, 1035–1040 (2025)
  35. Google. Gemini Deep Research (2025). URL https://gemini.google/overview/deep-research/?hl=en
  36. Gottweis, J. et al. Towards an AI co-scientist. arXiv preprint arXiv:2502.18864 (2025)
  37. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007)
  38. Polański, K. et al. BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics 36, 964–965 (2019)
  39. Chandrashekar, A. et al. TabVI: leveraging lightweight transformer architectures to learn biologically meaningful cellular representations. bioRxiv 2025–02 (2025)
  40. Yang, Y. & Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proc. 18th SIGSPATIAL Int. Conf. on Adv. in Geogr. Inf. Syst., 270–279 (Association for Computing Machinery, 2010)
  41. Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015)
  42. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25 (2012)
  43. Zhong, B., Du, J., Liu, M., Yang, A. & Wu, J. Region-enhancing network for semantic segmentation of remote-sensing imagery. Sensors 21 (2021)
  44. Zhang, Z., Liu, B. & Li, Y. FURSformer: semantic segmentation network for remote sensing images with fused heterogeneous features. Electronics 12 (2023)
  45. Atiampo, A. K. & Diédié, G. H. F. New fusion approach of spatial and channel attention for semantic segmentation of very high spatial resolution remote sensing images. Open J. Appl. Sci. 14, 288–319 (2024)
  46. Sun, Y., Bi, F., Gao, Y., Chen, L. & Feng, S. A multi-attention UNet for semantic segmentation in remote sensing images. Symmetry 14, 906 (2022)
  47. Elgamily, K. M., Mohamed, M. A., Abou-Taleb, A. M. & Ata, M. M. A novel W13 deep CNN structure for improved semantic segmentation of multiple objects in remote sensing imagery. Neural Comput. Appl. 37, 5397–5427 (2025)
  48. Immer, A. et al. Forecasting whole-brain neuronal activity from volumetric video. arXiv preprint arXiv:2503.00073 (2025)
  49. Zeng, A., Chen, M., Zhang, L. & Xu, Q. Are transformers effective for time series forecasting? In Proc AAAI Conf. Artif. Intell., vol. 37, 11121–11128 (2023)
  50. Das, A. et al. Long-term forecasting with TiDE: Time-series Dense Encoder. Trans. Mach. Learn. Res. (2023)
  51. Chen, S.-A., Li, C.-L., Yoder, N., Arik, S. O. & Pfister, T. TSMixer: An All-MLP architecture for time series forecasting. Trans. Mach. Learn. Res. (2023)
  52. Perez, E., Strub, F., De Vries, H., Dumoulin, V. & Courville, A. FiLM: Visual reasoning with a general conditioning layer. In Proc AAAI Conf. Artif. Intell., vol. 32 (2018)
  53. Deistler, M. et al. Differentiable simulation enables large-scale training of detailed biophysical models of neural dynamics. bioRxiv 2024–08 (2024)
  54. Hoo, S. B., Müller, S., Salinas, D. & Hutter, F. From tables to time: how TabPFN-v2 outperforms specialized time series forecasting models. arXiv preprint arXiv:2501.02945 (2025)
  55. Liu, Y. et al. Sundial: A family of highly capable time series foundation models. arXiv preprint arXiv:2502.00816 (2025)
  56. Ansari, A. F. et al. Chronos: learning the language of time series. Trans. Mach. Learn. Res. (2024)
  57. Oreshkin, B. N., Carpov, D., Chapados, N. & Bengio, Y. N-BEATS: neural basis expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437 (2019)
  58. Ho, S. L. & Xie, M. The use of ARIMA models for reliability forecasting and analysis. Comput. Ind. Eng. 35, 213–216 (1998)
  59. Piessens, R., de Doncker-Kapenga, E., Überhuber, C. W. & Kahaner, D. QUADPACK: a subroutine package for automatic integration (Springer-Verlag, 1983)
  60. Gradshteyn, I. & Ryzhik, I. Table of integrals, series, and products, 8th edn (Academic Press, 1994)
  61. Koza, J. R. Genetic programming as a means for programming computers by natural selection. Stat. Comput. 4, 87–112 (1994)
  62. Mernik, M., Heering, J. & Sloane, A. M. When and how to develop domain-specific languages. ACM computing surveys (CSUR) 37, 316–344 (2005)
  63. Czarnecki, K. Generative programming: Methods, techniques, and applications tutorial abstract. In International Conference on Software Reuse, 351–352 (Springer, 2002)
  64. Chen, M. et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)
  65. Li, Y. et al. Competition-level code generation with AlphaCode. Science 378, 1092–1097 (2022)
  66. Hutter, F., Kotthoff, L. & Vanschoren, J. Automated machine learning: methods, systems, challenges (Springer Nature, 2019)
  67. Merchant, A. et al. Scaling deep learning for materials discovery. Nature 624, 80–85 (2023)
  68. Xiao, Y. et al. CellAgent: An LLM-driven multi-agent framework for automated single-cell data analysis. arXiv preprint arXiv:2407.09811 (2024)
  69. Zhang, H. et al. CompBioAgent: An LLM-powered agent for single-cell RNA-seq data exploration. bioRxiv 2025–03 (2025)
  70. Zhou, J. et al. An AI agent for fully automated multi-omic analyses. Adv. Sci. 11, 2407094 (2024)
  71. Xin, Q. et al. BioInformatics Agent (BIA): unleashing the power of large language models to reshape bioinformatics workflow. bioRxiv 2024–05 (2024)
  72. Alber, S. et al. CellVoyager: AI compbio agent generates new insights by autonomously analyzing biological data. bioRxiv 2025–06 (2025)
  73. Baek, J., Jauhar, S. K., Cucerzan, S. & Hwang, S. J. ResearchAgent: iterative research idea generation over scientific literature with large language models. arXiv preprint arXiv:2404.07738 (2024)
  74. Lu, C. et al. The AI Scientist: towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292 (2024)
  75. Du, M., Xu, B., Zhu, C., Wang, X. & Mao, Z. DeepResearch Bench: a comprehensive benchmark for deep research agents. arXiv preprint arXiv:2506.11763 (2025)
  76. Perplexity. Perplexity Deep Research (2025). URL https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research
  77. Coelho, J. et al. DeepResearchGym: A free, transparent, and reproducible evaluation sandbox for deep research. arXiv preprint arXiv:2505.19253 (2025)
  78. Xu, R. & Peng, J. A comprehensive survey of deep research: Systems, methodologies, and applications. arXiv preprint arXiv:2506.12594 (2025)
  79. Lee, J. et al. Gemini Embedding: Generalizable embeddings from Gemini. arXiv preprint arXiv:2503.07891 (2025)
  80. Gigante, S., Cannoodt, R. et al. openproblems (2025). URL https://github.com/openproblems-bio/openproblems

Showing first 80 references.