An AI system to help scientists write expert-level empirical software

Anastasiya Belyaeva; Anna Bulanova; Brian P. Williams; Chris Co; Chujun He; Cory Y. McLean; Dan Liebling; David Smalling; Erica Brand; Eser Aygün

arxiv: 2509.06503 · v3 · pith:2QRJUZEZnew · submitted 2025-09-08 · 💻 cs.AI · q-bio.QM

Eser Aygün, Anastasiya Belyaeva, Gheorghe Comanici, Marc Coram, Hao Cui, Jake Garrison, Renee Johnston, Anton Kast, Cory Y. McLean, Peter Norgaard, Zahra Shamsi, David Smalling, James Thompson, Subhashini Venugopalan, Brian P. Williams, Chujun He, Sarah Martinson, Martyna Plomecka, Lai Wei, Yuchen Zhou, Qian-Ze Zhu, Matthew Abraham, Erica Brand, Anna Bulanova, Jeffrey A. Cardille, Chris Co, Scott Ellsworth, Grace Joseph, Malcolm Kane, Ryan Krueger, Johan Kartiwa, Dan Liebling, Jan-Matthis Lueckmann, Paul Raccuglia, Xuefei (Julie) Wang, Katherine Chou, James Manyika, Yossi Matias, John C. Platt, Lizzie Dorfman, Shibl Mourad, Michael P. Brenner

Pith reviewed 2026-05-22 13:17 UTC · model grok-4.3

classification 💻 cs.AI q-bio.QM

keywords AI for science · empirical software generation · tree search · large language models · single-cell analysis · COVID-19 forecasting · scientific discovery automation

The pith

An AI system uses tree search over LLM-generated code to produce scientific software that outperforms human experts on real leaderboards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Empirical Research Assistance, or ERA, which pairs a large language model with tree search to automatically write and refine software for computational experiments. The system searches through many possible code solutions, keeping only those that improve a chosen quality metric, and pulls in ideas from outside papers to create novel approaches. If successful, this would shorten the time scientists spend writing custom code and let them test more ideas faster. The authors show ERA discovering dozens of new analysis methods that beat the current best entries on public benchmarks in biology and producing better forecasts than official models in public health.

Core claim

ERA is an AI system that uses a large language model guided by tree search to generate, evaluate, and iteratively improve scientific software whose goal is to maximize a domain-specific quality metric. When the system is allowed to explore and integrate complex research ideas from external sources, it produces runnable code that achieves expert-level performance, including 40 novel methods for single-cell data analysis that outperformed the top human-developed entries on a public leaderboard and 14 forecasting models that outperformed the CDC ensemble for COVID-19 hospitalizations.

What carries the argument

Tree search over variants of code generated by the large language model, with each candidate evaluated directly by the target quality metric to decide which branches to expand.
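The loop described above can be sketched as a generic best-first search. This is a minimal illustration, not ERA's actual algorithm, which this page does not specify: `tree_search`, `toy_propose`, and `toy_score` are hypothetical names, the "LLM" is replaced by a stub that appends a digit to a string, and the quality metric is a toy function.

```python
import heapq
import itertools
from typing import Callable, List, Tuple


def tree_search(
    root: str,
    propose: Callable[[str], List[str]],
    score: Callable[[str], float],
    budget: int,
) -> Tuple[str, float]:
    """Best-first search: always expand the highest-scoring candidate seen so far."""
    counter = itertools.count()  # tie-breaker so the heap never compares candidates
    root_score = score(root)
    best, best_score = root, root_score
    # heapq is a min-heap, so scores are negated to pop the best candidate first.
    frontier = [(-root_score, next(counter), root)]
    evaluations = 1
    while frontier and evaluations < budget:
        _, _, node = heapq.heappop(frontier)
        for child in propose(node):
            if evaluations >= budget:
                break
            child_score = score(child)  # each candidate is scored by the metric itself
            evaluations += 1
            heapq.heappush(frontier, (-child_score, next(counter), child))
            if child_score > best_score:
                best, best_score = child, child_score
    return best, best_score


# Hypothetical stand-ins: a "program" is a string of digits, the stubbed
# "LLM" proposes variants by appending one digit, and the metric rewards
# digit sums close to 10.
def toy_propose(program: str) -> List[str]:
    return [program + d for d in "123"]


def toy_score(program: str) -> float:
    return -abs(10 - sum(int(c) for c in program))


best_program, best_value = tree_search("", toy_propose, toy_score, budget=200)
print(best_program, best_value)
```

In a real setting, `propose` would call a language model to rewrite code and `score` would run that code against the benchmark, so each evaluation is far more expensive than here and the evaluation budget becomes the main cost control.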

If this is right

The same tree-search approach can be applied to other domains such as geospatial analysis and zebrafish neural prediction, yielding expert-level code without manual coding.
ERA can produce entirely new rule-based constructions for time series forecasting that improve on existing techniques.
By repeatedly integrating ideas from published literature, the system generates solutions that human developers had not previously combined.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the quality metric can be defined for a new field, the same machinery could accelerate software development in that field without requiring new training of the underlying model.
The approach opens the possibility of chaining multiple such systems, where one ERA instance writes code that another instance then uses as input for a downstream analysis.
A natural next test would be whether human scientists can steer the search by occasionally editing the quality metric or injecting new constraints mid-process.

Load-bearing premise

The chosen quality metric truly measures expert-level scientific performance and the language model can turn external research ideas into correct, runnable code without human fixes.

What would settle it

Running the 40 single-cell methods discovered by ERA on the same public leaderboard and finding that none of them rank above the previous top human entry.

Original abstract

The cycle of scientific discovery is frequently bottlenecked by the slow, manual creation of software to support computational experiments\cite{hannay2009how}. To address this, we present Empirical Research Assistance (ERA), an AI system that creates expert-level scientific software whose goal is to maximize a quality metric. The system uses a Large Language Model (LLM) and Tree Search (TS)\cite{silver2016mastering} to systematically improve the quality metric and intelligently navigate the large space of possible solutions. ERA achieves expert-level results when it explores and integrates complex research ideas from external sources. The effectiveness of tree search is demonstrated across a diverse range of tasks. In bioinformatics, ERA discovered 40 novel methods for single-cell data analysis that outperformed the top human-developed methods on a public leaderboard. In epidemiology, ERA generated 14 models that outperformed the CDC ensemble and all other individual models for forecasting COVID-19 hospitalizations. ERA also produced expert-level software for geospatial analysis, neural activity prediction in zebrafish, and numerical solution of integrals, and a novel rule-based construction for time series forecasting. By devising and implementing novel solutions to diverse tasks, ERA represents a significant step towards accelerating scientific progress.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ERA shows LLM plus tree search can generate code that tops public leaderboards in single-cell analysis and COVID forecasting, but the results depend on an unspecified quality metric whose independence from those same scores is not demonstrated.

The main point is that this system uses an LLM to propose code changes and tree search to explore them, guided by a quality metric, and ends up with 40 new single-cell methods that beat the top human entries on a public leaderboard plus 14 forecast models that beat the CDC ensemble. It also reports results on a few other tasks like geospatial work and integral solving. That concrete scale on external benchmarks is the part worth noting, since most LLM code papers stop at toy problems or internal tests.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Empirical Research Assistance (ERA), an AI system that uses a large language model together with tree search to generate and iteratively improve expert-level empirical software whose objective is to maximize a user-specified quality metric. The central claims are that ERA discovers and implements novel solutions by integrating complex ideas from external sources, with concrete demonstrations including 40 novel methods for single-cell data analysis that outperform the top human-developed entries on a public leaderboard and 14 models for COVID-19 hospitalization forecasting that surpass the CDC ensemble and all other individual models.

Significance. If the outperformance claims are shown to rest on an independently validated quality metric rather than direct optimization of the reported scores, the work would constitute a meaningful advance in automating the creation of domain-specific scientific code. The combination of LLM-based idea integration with tree search for systematic exploration across diverse tasks (bioinformatics, epidemiology, geospatial analysis, neural activity prediction, and numerical methods) is a clear strength, as is the emphasis on producing runnable, expert-level software rather than isolated code snippets.

major comments (2)

[Abstract / Results (bioinformatics and epidemiology)] Abstract and results on bioinformatics/epidemiology: the headline claims of 40 outperforming methods and 14 outperforming models are load-bearing for the assertion of expert-level performance, yet no definition of the quality metric, exact baseline implementations, error bars, or controls against post-hoc selection of the reported solutions are supplied. Without these, it is impossible to determine whether tree search produced genuinely novel expert software or simply optimized the scalar used for both guidance and final reporting.
[Results sections on single-cell analysis and COVID-19 forecasting] The central claim that ERA 'achieves expert-level results when it explores and integrates complex research ideas from external sources' requires evidence that the quality metric is independent of the public leaderboard or forecast accuracy used for evaluation. No independent validation set, expert review protocol, or failure-case analysis is described that would decouple the metric from the reported wins.
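The independence concern in the major comments can be made concrete with a small sketch. Everything here is hypothetical and for illustration only: the candidates are constant predictors, the two target lists stand in for a search split and a held-out split, and `neg_mae` and `select_then_report` are invented names, not anything from the paper.

```python
from typing import Dict, Sequence


def neg_mae(prediction: float, targets: Sequence[float]) -> float:
    """Negative mean absolute error, so that higher is better."""
    return -sum(abs(prediction - t) for t in targets) / len(targets)


def select_then_report(
    candidates: Sequence[float],
    search_targets: Sequence[float],
    holdout_targets: Sequence[float],
) -> Dict[str, float]:
    """Pick the candidate by its search-split score, then score it once on held-out data."""
    best = max(candidates, key=lambda c: neg_mae(c, search_targets))
    return {
        "candidate": best,
        "search_score": neg_mae(best, search_targets),
        "holdout_score": neg_mae(best, holdout_targets),
    }


# Hypothetical toy data: the search split and the held-out split are disjoint.
report = select_then_report(
    candidates=[1.0, 2.0, 3.0, 4.0],
    search_targets=[2.0, 2.0, 3.0],
    holdout_targets=[3.0, 4.0, 4.0],
)
print(report)
```

In this toy case the selected candidate looks better on the split used for selection than on the held-out split. Reporting both numbers side by side is one way to show how much of a win survives once the guiding metric and the reporting metric are separated.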

minor comments (2)

[Abstract] The abstract cites tree search but does not briefly indicate how the search is adapted to the space of code solutions and external literature integration.
[Results] Ensure that all statements of novelty are accompanied by explicit comparison to the closest prior human or automated methods rather than only to leaderboard rank.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We are encouraged by the recognition of ERA's potential to advance automated scientific software development. Below, we provide point-by-point responses to the major comments, clarifying our approach and outlining revisions to address the concerns about metric definition and independence.

Referee: Abstract and results on bioinformatics/epidemiology: the headline claims of 40 outperforming methods and 14 outperforming models are load-bearing for the assertion of expert-level performance, yet no definition of the quality metric, exact baseline implementations, error bars, or controls against post-hoc selection of the reported solutions are supplied. Without these, it is impossible to determine whether tree search produced genuinely novel expert software or simply optimized the scalar used for both guidance and final reporting.

Authors: We agree that these details are essential for rigorous evaluation. In the revised manuscript, we will add a dedicated section defining the quality metrics: for single-cell analysis, it is the composite score from the public leaderboard (e.g., based on clustering accuracy metrics like ARI and NMI on test data); for COVID-19 forecasting, it follows the CDC's evaluation protocol using mean absolute error or similar on reported hospitalizations. Exact baseline implementations will be described by referencing the top leaderboard entries and noting how we reproduced or compared against them. Error bars will be included from repeated ERA runs with different random seeds. For controls against post-hoc selection, we will report the number of solutions explored and the distribution of scores, showing that the reported ones are the top performers from the search rather than selected after the fact. While the metric guides the search, the novelty comes from the LLM proposing and implementing integrated ideas from external literature. revision: yes
Referee: The central claim that ERA 'achieves expert-level results when it explores and integrates complex research ideas from external sources' requires evidence that the quality metric is independent of the public leaderboard or forecast accuracy used for evaluation. No independent validation set, expert review protocol, or failure-case analysis is described that would decouple the metric from the reported wins.

Authors: The quality metric is indeed the performance on the respective benchmarks, as ERA is designed to maximize user-specified metrics for practical scientific tasks. However, the key contribution is the systematic exploration via tree search that allows integration of complex ideas (e.g., from recent papers on single-cell methods) into runnable code, leading to solutions that surpass existing ones. We will revise to include a failure-case analysis, describing instances where the search converged to suboptimal solutions or failed to integrate ideas effectively. We will also detail the expert review by noting that the generated code was validated for correctness and novelty through comparison to literature. An independent validation set separate from the leaderboard is not described because the leaderboards serve as the standard evaluation; we will add a limitations section discussing potential overfitting to public benchmarks and the value of future private test sets. revision: partial
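The error bars promised in the rebuttal could be computed along the following lines. This is a minimal sketch under stated assumptions: `summarize_runs` is a hypothetical helper, and the three scores are made-up placeholders for the final metric of repeated runs with different random seeds, not results from the paper.

```python
import statistics
from typing import Dict, Sequence


def summarize_runs(scores: Sequence[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, and standard error over repeated runs."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
    sem = stdev / len(scores) ** 0.5  # standard error of the mean
    return {"n": len(scores), "mean": mean, "stdev": stdev, "sem": sem}


# Placeholder scores (assumed, not from the paper): one value per random seed.
summary = summarize_runs([0.70, 0.72, 0.74])
print(summary)
```

Reporting the spread next to the best score also helps with the post-hoc selection question, since a reader can see whether the headline result is typical of the search or an outlier.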

Circularity Check

0 steps flagged

No significant circularity: results grounded in external public benchmarks

The paper presents ERA as using LLM+tree search to maximize an internal quality metric, then reports outperformance on independent public leaderboards (bioinformatics) and the CDC ensemble (epidemiology). These external benchmarks are not shown to be identical to the search metric by any quoted equation or definition, and no self-citation chain or ansatz is invoked to force the headline results. The derivation chain therefore remains self-contained against external validation sets rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper relies on standard LLM code-generation capabilities and tree-search navigation but introduces the integrated ERA system and the specific quality-metric-driven loop as its main addition; no new physical entities or mathematical axioms are postulated.

free parameters (1)

Quality metric
The metric that tree search maximizes is central to all reported wins yet is not given an explicit functional form in the abstract.

axioms (1)

Domain assumption: Tree search combined with an LLM can systematically explore and improve code solutions by integrating external research ideas.
Invoked to explain how ERA reaches expert-level performance across tasks.

pith-pipeline@v0.9.0 · 5921 in / 1466 out tokens · 51743 ms · 2026-05-22T13:17:42.083645+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: The system uses a Large Language Model (LLM) and Tree Search (TS) to systematically improve the quality metric and intelligently navigate the large space of possible solutions.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: ERA discovered 40 novel methods for single-cell data analysis that outperformed the top human-developed methods on a public leaderboard.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search
cs.AI · 2026-05 · unverdicted · novelty 7.0
An LLM-guided tree search system autonomously creates diverse forecasting models that match or beat CDC human-curated ensembles in a 2025-2026 prospective multi-pathogen evaluation.

Probabilistic Seasonal Streamflow Forecasting Across California's Sierra Nevada Watersheds with Agentic AI
physics.ao-ph · 2026-05 · unverdicted · novelty 7.0
An agentic AI workflow evolves an adaptive XGBoost quantile regression ensemble that reduces watershed-averaged forecast error by up to 29% versus California's operational forecasts for April-July runoff at 1-6 month ...

Optimized Three-Dimensional Photovoltaic Structures with LLM guided Tree Search
cs.CL · 2026-05 · conditional · novelty 6.0
LLM-guided tree search with coding agents optimizes 3D photovoltaic designs for higher diurnal energy yield after correcting for simulation exploits.

Glia: A Human-Inspired AI for Automated Systems Design and Optimization
cs.AI · 2025-10 · unverdicted · novelty 6.0
Glia deploys a multi-agent LLM workflow with reasoning, experimentation, and analysis agents to generate interpretable algorithms for request routing, scheduling, and auto-scaling in distributed GPU clusters, reaching...

ATHENA: Agentic Team for Hierarchical Evolutionary Numerical Algorithms
cs.LG · 2025-12 · unverdicted · novelty 5.0
ATHENA introduces an agentic team framework that autonomously manages the end-to-end computational research lifecycle via a knowledge-driven HENA loop to achieve validation errors of 10^{-14} in scientific computing a...

TusoAI: Agentic Optimization for Scientific Methods
cs.AI · 2025-09 · unverdicted · novelty 5.0
TusoAI is an LLM-based agent that builds and iteratively optimizes domain-specific computational methods for scientific data analysis, outperforming expert baselines on RNA-seq denoising and earth monitoring while rep...

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · cited by 6 Pith papers · 10 internal anchors

[1] Fortin, J. A., Cardille, J. A. & Perez, E. Multi-sensor detection of forest-cover change across 45 years in Mato Grosso, Brazil. Remote Sens. Environ. 238, 111266 (2020).

[2] Hohenberg, P. & Kohn, W. Inhomogeneous electron gas. Phys. Rev. 136, B864 (1964).

[3] Kohn, W. & Sham, L. J. Self-consistent equations including exchange and correlation effects. Phys. Rev. 140, A1133 (1965).

[4] Warshel, A. & Levitt, M. Theoretical studies of enzymic reactions: dielectric, electrostatic and steric stabilization of the carbonium ion in the reaction of lysozyme. J. Mol. Biol. 103, 227–249 (1976).

[5] Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

[6] Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).

[7] Hourdin, F. et al. The art and science of climate model tuning. Bull. Am. Meteorol. Soc. 98, 589–602 (2017).

[8] Anderson Jr., J. Basic philosophy of CFD. In Computational Fluid Dynamics, 3–14 (Springer, 2009).

[9] Silver, N. The signal and the noise: why so many predictions fail-but some don’t (Penguin, 2012).

[10] Farmer, J. D. Making sense of chaos: a better economics for a better world (Yale Univ. Press, 2024).
[11] Bernanke, B. & Blanchard, O. What caused the US pandemic-era inflation? Am. Econ. J. Macroecon. 17, 1–35 (2025).

[12] Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).

[13] Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).

[14] Jiang, Z. et al. AIDE: AI-driven exploration in the space of code. arXiv preprint arXiv:2502.13138 (2025).

[15] Novikov, A. et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131 (2025).

[16] Romera-Paredes, B. et al. Mathematical discoveries from program search with large language models. Nature 625, 468–475 (2024).

[17] Wu, X., Wu, S.-h., Wu, J., Feng, L. & Tan, K. C. Evolutionary computation in the era of large language model: survey and roadmap. IEEE Trans. Evol. Comput. (2024).

[18] Hu, S., Lu, C. & Clune, J. Automated design of agentic systems. arXiv preprint arXiv:2408.08435 (2024).

[19] Xu, C. et al. Automatic cell-type harmonization and integration across Human Cell Atlas datasets. Cell 186, 5876–5891.e20 (2023).

[20] Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017).
[21] Centers for Disease Control and Prevention. COVID-19 forecast hub (2025). URL https://github.com/cdcgov/covid19-forecast-hub?tab=readme-ov-file

[22] Shao, Z., Yang, K. & Zhou, W. Performance evaluation of single-label and multi-label remote sensing image retrieval using a dense labeling dataset. Remote Sens. 10, 964 (2018).

[23] Lueckmann, J.-M. et al. ZAPBench: a benchmark for whole-brain activity prediction in zebrafish. arXiv preprint arXiv:2503.02618 (2025).

[24] Aksu, T. et al. GIFT-Eval: a benchmark for general time series forecasting model evaluation. arXiv preprint arXiv:2410.10393 (2024). URL https://huggingface.co/spaces/Salesforce/GIFT-Eval

[25] Jovic, D. et al. Single-cell RNA sequencing technologies and applications: a brief overview. Clin. and Transl. Med. 12, e694 (2022).

[26] Svensson, V., Vento-Tormo, R. & Teichmann, S. A. Exponential scaling of single-cell RNA-seq in the past decade. Nat. Protoc. 13, 599–604 (2018).

[27] CZI Cell Science Program et al. CZ CELLxGENE Discover: a single-cell data platform for scalable exploration, analysis and modeling of aggregated data. Nucleic Acids Res. 53, D886–D900 (2025).

[28] Zhang, J. et al. Tahoe-100M: a giga-scale single-cell perturbation atlas for context-dependent gene function and cellular modeling. bioRxiv 2025–02 (2025).

[29] Stuart, T. & Satija, R. Integrative single-cell analysis. Nat. Rev. Genet. 20, 257–272 (2019).

[30] Zappia, L., Phipson, B. & Oshlack, A. Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database. PLoS Comput. Biol. 14, e1006245 (2018).
[31] Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 1–32 (2020).

[32] Chazarra-Gil, R., van Dongen, S., Kiselev, V. Y. & Hemberg, M. Flexible comparison of batch correction methods for single-cell RNA-seq using BatchBench. Nucleic Acids Res. 49, e42 (2021).

[33] Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).

[34] Luecken, M. D. et al. Defining and benchmarking open problems in single-cell analysis. Nat. Biotechnol. 43, 1035–1040 (2025).

[35] Google. Gemini Deep Research (2025). URL https://gemini.google/overview/deep-research/?hl=en

[36] Gottweis, J. et al. Towards an AI co-scientist. arXiv preprint arXiv:2502.18864 (2025).

[37] Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).

[38] Polański, K. et al. BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics 36, 964–965 (2019).

[39] Chandrashekar, A. et al. TabVI: leveraging lightweight transformer architectures to learn biologically meaningful cellular representations. bioRxiv 2025–02 (2025).

[40] Yang, Y. & Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proc. 18th SIGSPATIAL Int. Conf. on Adv. in Geogr. Inf. Syst., 270–279 (Association for Computing Machinery, 2010).
[41] Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).

[42] Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25 (2012).

[43] Zhong, B., Du, J., Liu, M., Yang, A. & Wu, J. Region-enhancing network for semantic segmentation of remote-sensing imagery. Sensors 21 (2021).

[44] Zhang, Z., Liu, B. & Li, Y. FURSformer: semantic segmentation network for remote sensing images with fused heterogeneous features. Electronics 12 (2023).

[45] Atiampo, A. K. & Diédié, G. H. F. New fusion approach of spatial and channel attention for semantic segmentation of very high spatial resolution remote sensing images. Open J. Appl. Sci. 14, 288–319 (2024).

[46] Sun, Y., Bi, F., Gao, Y., Chen, L. & Feng, S. A multi-attention UNet for semantic segmentation in remote sensing images. Symmetry 14, 906 (2022).

[47] Elgamily, K. M., Mohamed, M. A., Abou-Taleb, A. M. & Ata, M. M. A novel W13 deep CNN structure for improved semantic segmentation of multiple objects in remote sensing imagery. Neural Comput. Appl. 37, 5397–5427 (2025).

[48] Immer, A. et al. Forecasting whole-brain neuronal activity from volumetric video. arXiv preprint arXiv:2503.00073 (2025).

[49] Zeng, A., Chen, M., Zhang, L. & Xu, Q. Are transformers effective for time series forecasting? In Proc. AAAI Conf. Artif. Intell., vol. 37, 11121–11128 (2023).

[50] Das, A. et al. Long-term forecasting with TiDE: Time-series Dense Encoder. Trans. Mach. Learn. Res. (2023).
[51] Chen, S.-A., Li, C.-L., Yoder, N., Arik, S. O. & Pfister, T. TSMixer: An All-MLP architecture for time series forecasting. Trans. Mach. Learn. Res. (2023).

[52] Perez, E., Strub, F., De Vries, H., Dumoulin, V. & Courville, A. FiLM: Visual reasoning with a general conditioning layer. In Proc. AAAI Conf. Artif. Intell., vol. 32 (2018).

[53] Deistler, M. et al. Differentiable simulation enables large-scale training of detailed biophysical models of neural dynamics. bioRxiv 2024–08 (2024).

[54] Hoo, S. B., Müller, S., Salinas, D. & Hutter, F. From tables to time: how TabPFN-v2 outperforms specialized time series forecasting models. arXiv preprint arXiv:2501.02945 (2025).

[55] Liu, Y. et al. Sundial: A family of highly capable time series foundation models. arXiv preprint arXiv:2502.00816 (2025).

[56] Ansari, A. F. et al. Chronos: learning the language of time series. Trans. Mach. Learn. Res. (2024).

[57] Oreshkin, B. N., Carpov, D., Chapados, N. & Bengio, Y. N-BEATS: neural basis expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437 (2019).

[58] Ho, S. L. & Xie, M. The use of ARIMA models for reliability forecasting and analysis. Comput. Ind. Eng. 35, 213–216 (1998).

[59] Piessens, R., de Doncker-Kapenga, E., Überhuber, C. W. & Kahaner, D. QUADPACK: a subroutine package for automatic integration (Springer-Verlag, 1983).

[60] Gradshteyn, I. & Ryzhik, I. Table of integrals, series, and products, 8th edn (Academic Press, 1994).
[61] Koza, J. R. Genetic programming as a means for programming computers by natural selection. Stat. Comput. 4, 87–112 (1994).

[62] Mernik, M., Heering, J. & Sloane, A. M. When and how to develop domain-specific languages. ACM computing surveys (CSUR) 37, 316–344 (2005).

[63] Czarnecki, K. Generative programming: Methods, techniques, and applications tutorial abstract. In International Conference on Software Reuse, 351–352 (Springer, 2002).

[64] Chen, M. et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).

[65] Li, Y. et al. Competition-level code generation with AlphaCode. Science 378, 1092–1097 (2022).

[66] Hutter, F., Kotthoff, L. & Vanschoren, J. Automated machine learning: methods, systems, challenges (Springer Nature, 2019).

[67] Merchant, A. et al. Scaling deep learning for materials discovery. Nature 624, 80–85 (2023).

[68] Xiao, Y. et al. CellAgent: An LLM-driven multi-agent framework for automated single-cell data analysis. arXiv preprint arXiv:2407.09811 (2024).

[69] Zhang, H. et al. CompBioAgent: An LLM-powered agent for single-cell RNA-seq data exploration. bioRxiv 2025–03 (2025).

[70] Zhou, J. et al. An AI agent for fully automated multi-omic analyses. Adv. Sci. 11, 2407094 (2024).
[71]

Xin, Q.et al.BioInformatics Agent (BIA): unleashing the power of large language models to reshape bioinformatics workflow.bioRxiv2024–05 (2024)

work page 2024
[72]

Alber, S.et al.CellVoyager: AI compbio agent generates new insights by autonomously analyzing biological data.bioRxiv2025–06 (2025)

work page 2025
[73]

K., Cucerzan, S

Baek, J., Jauhar, S. K., Cucerzan, S. & Hwang, S. J. ResearchAgent: iterative research idea generationoverscientificliteraturewithlargelanguagemodels.arXivpreprintarXiv:2404.07738 (2024)

work page arXiv 2024
[74]

Lu, C.et al.The AI Scientist: towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[75]

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

Du, M., Xu, B., Zhu, C., Wang, X. & Mao, Z. DeepResearch Bench: a comprehensive benchmark for deep research agents.arXiv preprint arXiv:2506.11763(2025)

[76] Perplexity. Perplexity Deep Research (2025). URL https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research

[77] Coelho, J. et al. DeepResearchGym: A free, transparent, and reproducible evaluation sandbox for deep research. arXiv preprint arXiv:2505.19253 (2025).

[78] Xu, R. & Peng, J. A comprehensive survey of deep research: Systems, methodologies, and applications. arXiv preprint arXiv:2506.12594 (2025).

[79] Lee, J. et al. Gemini Embedding: Generalizable embeddings from Gemini. arXiv preprint arXiv:2503.07891 (2025).

[80] Gigante, S., Cannoodt, R. et al. openproblems (2025). URL https://github.com/openproblems-bio/openproblems.


[1] Fortin, J. A., Cardille, J. A. & Perez, E. Multi-sensor detection of forest-cover change across 45 years in Mato Grosso, Brazil. Remote Sens. Environ. 238, 111266 (2020).

[2] Hohenberg, P. & Kohn, W. Inhomogeneous electron gas. Phys. Rev. 136, B864 (1964).

[3] Kohn, W. & Sham, L. J. Self-consistent equations including exchange and correlation effects. Phys. Rev. 140, A1133 (1965).

[4] Warshel, A. & Levitt, M. Theoretical studies of enzymic reactions: dielectric, electrostatic and steric stabilization of the carbonium ion in the reaction of lysozyme. J. Mol. Biol. 103, 227–249 (1976).

[5] Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

[6] Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).

[7] Hourdin, F. et al. The art and science of climate model tuning. Bull. Am. Meteorol. Soc. 98, 589–602 (2017).

[8] Anderson Jr., J. Basic philosophy of CFD. In Computational Fluid Dynamics, 3–14 (Springer, 2009).

[9] Silver, N. The signal and the noise: why so many predictions fail-but some don’t (Penguin, 2012).

[10] Farmer, J. D. Making sense of chaos: a better economics for a better world (Yale Univ. Press, 2024).

[11] Bernanke, B. & Blanchard, O. What caused the US pandemic-era inflation? Am. Econ. J. Macroecon. 17, 1–35 (2025).

[12] Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).

[13] Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).

[14] Jiang, Z. et al. AIDE: AI-driven exploration in the space of code. arXiv preprint arXiv:2502.13138 (2025).

[15] Novikov, A. et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131 (2025).

[16] Romera-Paredes, B. et al. Mathematical discoveries from program search with large language models. Nature 625, 468–475 (2024).

[17] Wu, X., Wu, S.-h., Wu, J., Feng, L. & Tan, K. C. Evolutionary computation in the era of large language model: survey and roadmap. IEEE Trans. Evol. Comput. (2024).

[18] Hu, S., Lu, C. & Clune, J. Automated design of agentic systems. arXiv preprint arXiv:2408.08435 (2024).

[19] Xu, C. et al. Automatic cell-type harmonization and integration across Human Cell Atlas datasets. Cell 186, 5876–5891.e20 (2023).

[20] Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017).

[21] Centers for Disease Control and Prevention. COVID-19 forecast hub (2025). URL https://github.com/cdcgov/covid19-forecast-hub?tab=readme-ov-file

[22] Shao, Z., Yang, K. & Zhou, W. Performance evaluation of single-label and multi-label remote sensing image retrieval using a dense labeling dataset. Remote Sens. 10, 964 (2018).

[23] Lueckmann, J.-M. et al. ZAPBench: a benchmark for whole-brain activity prediction in zebrafish. arXiv preprint arXiv:2503.02618 (2025).

[24] Aksu, T. et al. GIFT-Eval: a benchmark for general time series forecasting model evaluation. arXiv preprint arXiv:2410.10393 (2024). URL https://huggingface.co/spaces/Salesforce/GIFT-Eval.

[25] Jovic, D. et al. Single-cell RNA sequencing technologies and applications: a brief overview. Clin. and Transl. Med. 12, e694 (2022).

[26] Svensson, V., Vento-Tormo, R. & Teichmann, S. A. Exponential scaling of single-cell RNA-seq in the past decade. Nat. Protoc. 13, 599–604 (2018).

[27] CZI Cell Science Program et al. CZ CELLxGENE Discover: a single-cell data platform for scalable exploration, analysis and modeling of aggregated data. Nucleic Acids Res. 53, D886–D900 (2025).

[28] Zhang, J. et al. Tahoe-100M: a giga-scale single-cell perturbation atlas for context-dependent gene function and cellular modeling. bioRxiv 2025–02 (2025).

[29] Stuart, T. & Satija, R. Integrative single-cell analysis. Nat. Rev. Genet. 20, 257–272 (2019).

[30] Zappia, L., Phipson, B. & Oshlack, A. Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database. PLoS Comput. Biol. 14, e1006245 (2018).

[31] Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 1–32 (2020).

[32] Chazarra-Gil, R., van Dongen, S., Kiselev, V. Y. & Hemberg, M. Flexible comparison of batch correction methods for single-cell RNA-seq using BatchBench. Nucleic Acids Res. 49, e42 (2021).

[33] Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).

[34] Luecken, M. D. et al. Defining and benchmarking open problems in single-cell analysis. Nat. Biotechnol. 43, 1035–1040 (2025).

[35] Google. Gemini Deep Research (2025). URL https://gemini.google/overview/deep-research/?hl=en

[36] Gottweis, J. et al. Towards an AI co-scientist. arXiv preprint arXiv:2502.18864 (2025).

[37] Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).

[38] Polański, K. et al. BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics 36, 964–965 (2019).

[39] Chandrashekar, A. et al. TabVI: leveraging lightweight transformer architectures to learn biologically meaningful cellular representations. bioRxiv 2025–02 (2025).

[40] Yang, Y. & Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proc. 18th SIGSPATIAL Int. Conf. on Adv. in Geogr. Inf. Syst., 270–279 (Association for Computing Machinery, 2010).

[41] Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).

[42] Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25 (2012).

[43] Zhong, B., Du, J., Liu, M., Yang, A. & Wu, J. Region-enhancing network for semantic segmentation of remote-sensing imagery. Sensors 21 (2021).

[44] Zhang, Z., Liu, B. & Li, Y. FURSformer: semantic segmentation network for remote sensing images with fused heterogeneous features. Electronics 12 (2023).

[45] Atiampo, A. K. & Diédié, G. H. F. New fusion approach of spatial and channel attention for semantic segmentation of very high spatial resolution remote sensing images. Open J. Appl. Sci. 14, 288–319 (2024).

[46] Sun, Y., Bi, F., Gao, Y., Chen, L. & Feng, S. A multi-attention UNet for semantic segmentation in remote sensing images. Symmetry 14, 906 (2022).

[47] Elgamily, K. M., Mohamed, M. A., Abou-Taleb, A. M. & Ata, M. M. A novel W13 deep CNN structure for improved semantic segmentation of multiple objects in remote sensing imagery. Neural Comput. Appl. 37, 5397–5427 (2025).

[48] Immer, A. et al. Forecasting whole-brain neuronal activity from volumetric video. arXiv preprint arXiv:2503.00073 (2025).

[49] Zeng, A., Chen, M., Zhang, L. & Xu, Q. Are transformers effective for time series forecasting? In Proc. AAAI Conf. Artif. Intell., vol. 37, 11121–11128 (2023).

[50] Das, A. et al. Long-term forecasting with TiDE: Time-series Dense Encoder. Trans. Mach. Learn. Res. (2023).

[51] Chen, S.-A., Li, C.-L., Yoder, N., Arik, S. O. & Pfister, T. TSMixer: An All-MLP architecture for time series forecasting. Trans. Mach. Learn. Res. (2023).

[52] Perez, E., Strub, F., De Vries, H., Dumoulin, V. & Courville, A. FiLM: Visual reasoning with a general conditioning layer. In Proc. AAAI Conf. Artif. Intell., vol. 32 (2018).

[53] Deistler, M. et al. Differentiable simulation enables large-scale training of detailed biophysical models of neural dynamics. bioRxiv 2024–08 (2024).

[54] Hoo, S. B., Müller, S., Salinas, D. & Hutter, F. From tables to time: how TabPFN-v2 outperforms specialized time series forecasting models. arXiv preprint arXiv:2501.02945 (2025).

[55] Liu, Y. et al. Sundial: A family of highly capable time series foundation models. arXiv preprint arXiv:2502.00816 (2025).

[56] Ansari, A. F. et al. Chronos: learning the language of time series. Trans. Mach. Learn. Res. (2024).

[57] Oreshkin, B. N., Carpov, D., Chapados, N. & Bengio, Y. N-BEATS: neural basis expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437 (2019).

[58] Ho, S. L. & Xie, M. The use of ARIMA models for reliability forecasting and analysis. Comput. Ind. Eng. 35, 213–216 (1998).

[59] Piessens, R., de Doncker-Kapenga, E., Überhuber, C. W. & Kahaner, D. QUADPACK: a subroutine package for automatic integration (Springer-Verlag, 1983).

[60] Gradshteyn, I. & Ryzhik, I. Table of integrals, series, and products, 8th edn (Academic Press, 1994).

[61] Koza, J. R. Genetic programming as a means for programming computers by natural selection. Stat. Comput. 4, 87–112 (1994).

[62] Mernik, M., Heering, J. & Sloane, A. M. When and how to develop domain-specific languages. ACM computing surveys (CSUR) 37, 316–344 (2005).

[63] Czarnecki, K. Generative programming: Methods, techniques, and applications tutorial abstract. In International Conference on Software Reuse, 351–352 (Springer, 2002).

[64] Chen, M. et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
