GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis

Haohan Wang; Haoyang Liu; Yijiang Li

arxiv: 2507.21035 · v3 · pith:5BDYTGWOnew · submitted 2025-07-28 · 💻 cs.AI · cs.LG · cs.MA · q-bio.GN

Pith reviewed 2026-05-22 00:33 UTC · model grok-4.3

keywords: gene expression analysis · multi-agent systems · large language models · transcriptomic data · data preprocessing · gene identification · scientific discovery automation

The pith

A team of six LLM agents uses typed messaging and guided action units to automate gene expression analysis from raw transcriptomic files.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a collaborative setup in which specialized large language model agents share an analytic workspace and communicate through typed messages to handle the full pipeline of gene expression work. At the center is a planning process that turns broad task instructions into discrete Action Units, letting agents at any point advance, revise, bypass, or backtrack so the sequence stays logically sound while fitting the quirks of real genomic data files. This structure is shown to deliver 89.13 percent composite similarity correlation on data preprocessing and 60.48 percent F1 on gene identification, both well above the previous best automated results. A reader would care because the approach suggests a practical middle path between rigid scripts that fail on edge cases and free-running agents that lose precision, potentially letting more labs extract reliable biological signals without constant expert oversight.

Core claim

By coordinating six LLM-based agents through typed message-passing on a shared canvas, the system lets programming agents convert high-level guidelines into Action Units and then choose at each step to advance, revise, bypass, or backtrack, preserving overall coherence while adapting to the particular demands of large semi-structured transcriptomic datasets; the result is higher benchmark performance than prior automation methods together with gene-phenotype links that match literature reports after latent confounders are taken into account.

What carries the argument

The guided-planning framework that decomposes tasks into Action Units and supplies explicit decision points for agents to advance, revise, bypass, or backtrack.
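The four-way decision point can be pictured as a small control loop. The following is a minimal sketch only, not the authors' implementation: `ActionUnit`, `Decision`, `execute_plan`, the `decide` callback, and the three toy units are all hypothetical names introduced for illustration, and the exact semantics of each move (for example, what backtracking discards) are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Decision(Enum):
    ADVANCE = "advance"      # accept the output and move to the next unit
    REVISE = "revise"        # rerun the current unit
    BYPASS = "bypass"        # skip the current unit without keeping its output
    BACKTRACK = "backtrack"  # drop the previous unit's output and redo it


@dataclass
class ActionUnit:
    name: str
    run: Callable[[List[str]], str]  # receives accepted outputs so far


def execute_plan(units, decide, max_steps=50):
    """Run units in order; after each run, `decide` picks one of four moves."""
    committed: List[Tuple[int, str]] = []  # (unit index, accepted output)
    trace: List[Tuple[str, str]] = []      # (unit name, decision taken)
    i = 0
    for _step in range(max_steps):         # hard cap so revise loops terminate
        if i >= len(units):
            break
        unit = units[i]
        output = unit.run([out for _idx, out in committed])
        decision = decide(unit, output)
        trace.append((unit.name, decision.value))
        if decision is Decision.ADVANCE:
            committed.append((i, output))
            i += 1
        elif decision is Decision.BYPASS:
            i += 1
        elif decision is Decision.BACKTRACK:
            i = max(i - 1, 0)
            committed = [(k, out) for k, out in committed if k < i]
        # Decision.REVISE: leave i unchanged so the same unit runs again
    return [out for _idx, out in committed], trace


# Toy plan: the second unit fails once, the third is judged unnecessary.
attempts = {"map_ids": 0}


def load_matrix(previous):
    return "matrix loaded"


def map_ids(previous):
    attempts["map_ids"] += 1
    return "ids mapped" if attempts["map_ids"] > 1 else "error: unknown probe ids"


def optional_qc(previous):
    return "qc not applicable"


def decide(unit, output):
    if output.startswith("error"):
        return Decision.REVISE
    if unit.name == "optional_qc":
        return Decision.BYPASS
    return Decision.ADVANCE


plan = [
    ActionUnit("load_matrix", load_matrix),
    ActionUnit("map_ids", map_ids),
    ActionUnit("optional_qc", optional_qc),
]
results, trace = execute_plan(plan, decide)
print(results)
print(trace)
```

In this sketch the decision policy is a hand-written function; in the system described above that role is played by an LLM agent, which is where the load-bearing premise about code correctness enters.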

If this is right

The agents surface gene-phenotype associations that align with published findings while adjusting for latent confounders.
The method processes multiple large semi-structured files without the breakdowns typical of fixed workflows.
Benchmark gains of roughly eleven points in preprocessing correlation and seventeen points in gene identification F1 follow directly from the collaborative structure.
Logical coherence is maintained across steps even when individual agents act with some autonomy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same agent-collaboration pattern could be tested on other high-dimensional biological datasets such as proteomics or single-cell RNA profiles.
Adding explicit checks for biological plausibility at backtrack points might further lower the chance of downstream errors.
If the planning decisions generalize, the approach could shorten the time between raw data arrival and publishable biological insight in many labs.

Load-bearing premise

The LLM agents will generate correct analysis code and keep the multi-step process logically consistent without introducing errors that produce invalid biological conclusions.

What would settle it

Execute the code produced by the agents on a publicly available gene-expression dataset whose correct preprocessing steps and gene-phenotype associations have already been established by independent expert analysis, then check whether the outputs match those established results within expected tolerance.
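A check of this kind reduces to set-overlap scoring between the agents' output and the expert reference. The sketch below uses the standard definitions of Jaccard similarity and F1 on sets; it is an assumption that these match the benchmark's own scoring, which this page does not spell out, and the gene lists are made-up placeholders rather than results from the paper.

```python
def jaccard(predicted, reference):
    """Jaccard similarity of two collections, treated as sets."""
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0  # two empty sets are treated as identical
    return len(predicted & reference) / len(predicted | reference)


def f1_score(predicted, reference):
    """F1 of a predicted set against a reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: an expert-curated gene list versus agent output.
reference_genes = {"TP53", "BRCA1", "EGFR", "MYC"}
predicted_genes = {"TP53", "BRCA1", "KRAS"}

overlap = jaccard(predicted_genes, reference_genes)   # 2 shared / 5 total
score = f1_score(predicted_genes, reference_genes)    # precision 2/3, recall 1/2
print(f"Jaccard = {overlap:.3f}, F1 = {score:.3f}")
```

What counts as "within expected tolerance" would still have to be fixed in advance, for instance as a minimum F1 against the expert list.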

Figures

Figures reproduced from arXiv: 2507.21035 by Haohan Wang, Haoyang Liu, Yijiang Li.

**Figure 2.** Planning, memory, and self-correction mechanisms of a single programming agent in our…

**Figure 3.** Individual task performance of GenoMAS and various baselines on dataset filtering, selection, and preprocessing tasks. (a) Dataset filtering (DF) and selection (DS) accuracy of different methods, and different ablation settings of our GenoMAS method. (b) Data preprocessing performance across three data types (Linked, Gene, Trait) and three metrics (AJ: Attribute Jaccard, SJ: Sample Jaccard, CSC: Composite…

**Figure 4.** Memory reuse efficiency in GenoMAS. (a) Cumulative time savings through memory reuse and (b) memory reuse rate evolution across programming steps. The system rapidly achieves high efficiency, stabilizing around 65% reuse rate after initial learning.

**Figure 5.** Agent collaboration patterns in GenoMAS. (a) Network topology showing agent communication structure with node size proportional to message volume. Edge thickness indicates interaction frequency. (b) Distribution of message types across agent pairs revealing asymmetric communication patterns, with programming agents predominantly sending validation requests while advisory agents respond with feedback.

Original abstract

Gene expression analysis holds the key to many biomedical discoveries, yet extracting insights from raw transcriptomic data remains formidable due to the complexity of multiple large, semi-structured files and the need for extensive domain expertise. Current automation approaches are often limited by either inflexible workflows that break down in edge cases or by fully autonomous agents that lack the necessary precision for rigorous scientific inquiry. GenoMAS charts a different course by presenting a team of LLM-based scientists that integrates the reliability of structured workflows with the adaptability of autonomous agents. GenoMAS orchestrates six specialized LLM agents through typed message-passing protocols, each contributing complementary strengths to a shared analytic canvas. At the heart of GenoMAS lies a guided-planning framework: programming agents unfold high-level task guidelines into Action Units and, at each juncture, elect to advance, revise, bypass, or backtrack, thereby maintaining logical coherence while bending gracefully to the idiosyncrasies of genomic data. On the GenoTEX benchmark, GenoMAS reaches a Composite Similarity Correlation of 89.13% for data preprocessing and an F$_1$ of 60.48% for gene identification, surpassing the best prior art by 10.61% and 16.85% respectively. Beyond metrics, GenoMAS surfaces biologically plausible gene-phenotype associations corroborated by the literature, all while adjusting for latent confounders. Code is available at https://github.com/Liu-Hy/GenoMAS.
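The abstract gives the margins over prior art but not the prior scores themselves. Assuming the stated margins are absolute percentage-point differences (the abstract does not say whether they are absolute or relative), the implied prior-best scores work out as follows; the subscript "prior" is notation introduced here.

```latex
\begin{align*}
\mathrm{CSC}_{\text{prior}} &= 89.13 - 10.61 = 78.52 \\
F_{1,\text{prior}} &= 60.48 - 16.85 = 43.63
\end{align*}
```

Under a relative reading the implied baselines would be higher, so the interpretation matters when comparing against other reported numbers.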

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GenoMAS gives a concrete six-agent LLM setup with typed messages and a guided planner for transcriptomic workflows, plus benchmark gains and released code, but the evaluation leaves key controls and error checks unclear.

The main takeaway is that GenoMAS coordinates six specialized LLM agents through typed message passing and a guided planning loop. The agents break tasks into Action Units and can advance, revise, bypass, or backtrack, which lets the system handle the quirks of raw transcriptomic files better than rigid scripts or fully free agents. On GenoTEX it reports 89.13% composite similarity correlation for preprocessing and 60.48% F1 for gene identification, beating the prior best by roughly 10-17 points, and the authors say the outputs include biologically plausible associations after confounder adjustment. They also release the code on GitHub, which is straightforward to check.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces GenoMAS, a multi-agent framework with six specialized LLM agents that collaborate via typed message-passing protocols to perform gene expression analysis on raw transcriptomic data. Central to the system is a guided-planning framework in which programming agents decompose high-level guidelines into Action Units and dynamically choose to advance, revise, bypass, or backtrack. On the GenoTEX benchmark the framework reports 89.13% Composite Similarity Correlation for data preprocessing and 60.48% F1 for gene identification, exceeding the best prior art by 10.61% and 16.85% respectively, while surfacing biologically plausible gene-phenotype associations after latent-confounder adjustment. Code is released at the cited GitHub repository.

Significance. If the performance claims hold under rigorous controls, the work offers a practical middle path between rigid pipelines and fully autonomous agents for scientific code generation in genomics. The explicit code release constitutes a clear strength for reproducibility and community scrutiny.

major comments (2)

§4 (Experimental evaluation): the reported 89.13% CSC and 60.48% F1 scores are presented without any description of baseline re-implementations, hyper-parameter settings for the compared methods, statistical significance tests, or error bars; these omissions render the claimed margins (10.61% and 16.85%) impossible to assess for robustness.
§3.2 (Guided-planning framework): the Action Unit mechanism and the four-way decision rule (advance/revise/bypass/backtrack) are described only at a high level; without pseudocode, formal invariants, or concrete traces showing how hallucinations are detected and corrected, it is unclear whether the framework actually guarantees logical coherence across multi-step genomic analyses.

minor comments (2)

[Abstract] The abstract states that associations are 'corroborated by the literature' yet provides no citation list or overlap statistics; a supplementary table mapping discovered genes to supporting PubMed IDs would strengthen the biological-plausibility claim.
[§3.1] Notation for the typed message-passing protocol is introduced without an explicit schema or example message; adding a small table of message types and their fields would improve clarity.
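To make the last request concrete, an explicit message schema could be as small as the sketch below. It is illustrative only: the message kinds are borrowed from the Figure 5 caption ("validation requests", "feedback"), while the class names, field names, agent names, and the `route` helper are hypothetical and not taken from the paper or its repository.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class MessageKind(Enum):
    VALIDATION_REQUEST = "validation_request"  # programming agent asks for review
    FEEDBACK = "feedback"                      # advisory agent replies


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    kind: MessageKind
    payload: str

    def __post_init__(self):
        # Reject untyped messages at construction time.
        if not isinstance(self.kind, MessageKind):
            raise TypeError(f"kind must be a MessageKind, got {type(self.kind).__name__}")
        if not self.sender or not self.receiver:
            raise ValueError("sender and receiver must be non-empty")


def route(inboxes: Dict[str, List[Message]], message: Message) -> None:
    """Deliver a message to the receiver's inbox, creating it if needed."""
    inboxes.setdefault(message.receiver, []).append(message)


inboxes: Dict[str, List[Message]] = {}
route(inboxes, Message("programmer", "advisor", MessageKind.VALIDATION_REQUEST,
                       "please check the probe-to-gene mapping step"))
route(inboxes, Message("advisor", "programmer", MessageKind.FEEDBACK,
                       "mapping looks consistent with the platform annotation"))
print({agent: [m.kind.value for m in msgs] for agent, msgs in inboxes.items()})
```

A table in the manuscript listing each message kind with its fields would serve the same purpose as the enum and dataclass here.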

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledging where additional information is warranted, and indicate the revisions planned for the next version of the manuscript.

Referee: §4 (Experimental evaluation): the reported 89.13% CSC and 60.48% F1 scores are presented without any description of baseline re-implementations, hyper-parameter settings for the compared methods, statistical significance tests, or error bars; these omissions render the claimed margins (10.61% and 16.85%) impossible to assess for robustness.

Authors: We agree that the current presentation of results in Section 4 lacks sufficient implementation details to allow independent assessment of robustness. In the revised manuscript we will add: (i) explicit descriptions of how each baseline was re-implemented (including any adaptations required for the GenoTEX benchmark), (ii) the hyper-parameter values and search ranges used for all compared methods, (iii) results of statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests with reported p-values), and (iv) error bars or standard deviations obtained from multiple independent runs. These additions will make the reported margins (10.61% and 16.85%) directly evaluable.
revision: yes

Referee: §3.2 (Guided-planning framework): the Action Unit mechanism and the four-way decision rule (advance/revise/bypass/backtrack) are described only at a high level; without pseudocode, formal invariants, or concrete traces showing how hallucinations are detected and corrected, it is unclear whether the framework actually guarantees logical coherence across multi-step genomic analyses.

Authors: Section 3.2 currently emphasizes the conceptual design to keep the exposition accessible. We acknowledge that this leaves the operational details underspecified. In the revision we will insert: (i) pseudocode for Action Unit decomposition and the four-way decision procedure, (ii) the key invariants the framework is designed to maintain (e.g., type consistency of messages and non-regression of data-preprocessing state), and (iii) two or three concrete execution traces drawn from our GenoTEX runs that illustrate how the revise or backtrack actions detect and mitigate hallucinations or logical inconsistencies. These additions will clarify how coherence is preserved without claiming formal guarantees beyond the empirical behavior.
revision: yes
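The significance testing promised in the first response could take several forms. As one hedged illustration, the sketch below runs an exact paired sign-flip permutation test using only the standard library; this is a swapped-in alternative to the paired t-test and Wilcoxon test the authors name, and the per-run scores are invented placeholders, not numbers from the paper.

```python
from itertools import product
from statistics import mean, stdev


def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided paired permutation test on the mean difference."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis each paired difference is equally likely
    # to have either sign, so enumerate every sign assignment.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)) / len(diffs))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total


# Hypothetical scores from five independent runs of two methods.
system_runs = [0.80, 0.78, 0.82, 0.79, 0.81]
baseline_runs = [0.70, 0.73, 0.71, 0.69, 0.72]

mean_diff = mean(a - b for a, b in zip(system_runs, baseline_runs))
spread = stdev(system_runs)  # sample standard deviation, usable as an error bar
p_value = paired_sign_flip_test(system_runs, baseline_runs)
print(f"mean difference = {mean_diff:.3f}, stdev = {spread:.3f}, p = {p_value:.4f}")
```

With only five paired runs the smallest achievable two-sided p-value for this test is 2/32 = 0.0625, which is one reason the number of independent runs needs to be reported alongside any p-value.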

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The manuscript describes a multi-agent LLM framework for gene expression analysis and reports empirical performance on the external GenoTEX benchmark (89.13% CSC for preprocessing, 60.48% F1 for gene identification). No equations, derivations, or first-principles claims appear that could reduce to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The guided-planning and typed message-passing components are presented as design choices whose validity is assessed via external benchmark comparison and released code, not by internal construction. This is the most common honest finding for applied systems papers whose central results are benchmark gains rather than closed-form derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the unproven assumption that current LLMs can reliably act as domain-expert coding agents for genomic data; it introduces the guided-planning mechanism and the six-agent division as new constructs without independent falsifiable evidence beyond the single benchmark.

axioms (1)

domain assumption: LLM agents can be specialized via prompting and coordinated through typed messages to perform rigorous scientific data analysis without introducing critical errors
This assumption underpins the entire multi-agent orchestration described in the abstract.

invented entities (1)

Guided-planning framework with Action Units (no independent evidence)
purpose: To let agents decompose tasks and dynamically choose advance, revise, bypass, or backtrack actions
New planning construct introduced to balance structure and adaptability

pith-pipeline@v0.9.0 · 5803 in / 1473 out tokens · 69191 ms · 2026-05-22T00:33:04.925041+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: GenoMAS orchestrates six specialized LLM agents through typed message-passing protocols... guided-planning framework: programming agents unfold high-level task guidelines into Action Units and, at each juncture, elect to advance, revise, bypass, or backtrack

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: On the GenoTEX benchmark, GenoMAS reaches a Composite Similarity Correlation of 89.13% for data preprocessing and an F1 of 60.48% for gene identification

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
cs.CL · 2026-05 · unverdicted · novelty 7.0
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

Heterogeneous Scientific Foundation Model Collaboration
cs.AI · 2026-04 · unverdicted · novelty 5.0
Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.

Reference graph

Works this paper leans on

203 extracted references · 203 canonical work pages · cited by 2 Pith papers · 33 internal anchors

[1]

Abusamra

H. Abusamra. A comparative study of feature selection and classification methods for gene ex- pression data of glioma. Procedia Computer Science, 23:5–14, 2013

work page 2013
[2]

Aittokallio

T. Aittokallio. Dealing with missing values in large-scale studies: microarray data imputation and beyond. Briefings in bioinformatics, 11(2):253–264, 2010

work page 2010
[3]

Angermueller, T

C. Angermueller, T. P ¨arnamaa, L. Parts, and O. Stegle. Deep learning for computational biology. Molecular systems biology, 12(7):878, 2016

work page 2016
[4]

Claude code: Agentic coding tool, 2024

Anthropic. Claude code: Agentic coding tool, 2024. URL https://www.anthropic.com/ claude/code. Command line tool for agentic coding

work page 2024
[5]

Introducing claude 4: Our most intelligent model, 2024

Anthropic. Introducing claude 4: Our most intelligent model, 2024. URL https://www. anthropic.com/claude. Accessed: 2025-01-22

work page 2024
[6]

Cursor: The ai code editor, 2024

Anysphere. Cursor: The ai code editor, 2024. URL https://cursor.com. AI-powered code editor

work page 2024
[7]

K. Baba, C. Liu, S. Kurita, and A. Sannai. Prover agent: An agent-based framework for formal mathematical proofs. arXiv preprint arXiv:2506.19923, 2025

work page arXiv 2025
[8]

J. Baek, S. K. Jauhar, S. Cucerzan, and S. J. Hwang. Researchagent: Iterative research idea gener- ation over scientific literature with large language models. arXiv preprint arXiv:2404.07738, 2024

work page arXiv 2024
[9]

J. L. Ballard, Z. Wang, W. Li, L. Shen, and Q. Long. Deep learning-based approaches for multi-omics data integration and analysis. BioData Mining , 17(1):38, 2024. doi: 10.1186/ s13040-024-00391-z

work page 2024
[10]

Besta, N

M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, L. Gianinazzi, J. Gajda, T. Lehmann, M. Pod- stawski, H. Niewiadomski, P . Nyczyk, and T. Hoefler. Graph of thoughts: Solving elaborate problems with large language models. arXiv preprint arXiv: 2308.09687, 2023

work page arXiv 2023
[11]

A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, and P . Schwaller. Chemcrow: Aug- menting large-language models with chemistry tools. arXiv preprint arXiv: 2304.05376, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

G. R. Brown, V . Hem, K. S. Katz, M. Ovetsky, C. Wallin, O. Ermolaeva, I. Tolstoy, T. Tatusova, K. D. Pruitt, and D. R. Maglott. Gene: a gene-centered information resource at NCBI. Nucleic Acids Research, 43(D1):D36–D42, 2015. doi: 10.1093/nar/gku1055. URL https://doi.org/ 10.1093/nar/gku1055

work page doi:10.1093/nar/gku1055 2015
[13]

Bruning, W

O. Bruning, W. Rodenburg, P . F. Wackers, C. Van Oostrom, M. J. Jonker, R. J. Dekker, H. Rauwerda, W. A. Ensink, A. De Vries, and T. M. Breit. Confounding factors in the transcriptome analysis of an in-vivo exposure experiment. PLoS One, 11(1):e0145252, 2016

work page 2016
[14]

S. A. Byron, K. R. Van Keuren-Jensen, D. M. Engelthaler, J. D. Carpten, and D. W. Craig. Translat- ing rna sequencing into clinical diagnostics: opportunities and challenges. Nature Reviews Genet- ics, 17(5):257–271, 2016. doi: 10.1038/nrg.2016.10

work page doi:10.1038/nrg.2016.10 2016
[15]

Why Do Multi-Agent LLM Systems Fail?

M. Cemri, M. Z. Pan, S. Yang, L. A. Agrawal, B. Chopra, R. Tiwari, K. Keutzer, A. Parameswaran, D. Klein, K. Ramchandran, M. Zaharia, J. E. Gonzalez, and I. Stoica. Why do multi-agent llm systems fail? arXiv preprint arXiv:2503.13657, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

C ¸ etin and O

V . C ¸ etin and O. YILDIZ. A comprehensive review on data preprocessing techniques in data anal- ysis. Pamukkale ¨Universitesi M¨ uhendislik Bilimleri Dergisi, 28(2):299–312, 2022

work page 2022
[17]

I. S. Chan and G. S. Ginsburg. Personalized medicine: progress and promise. Annual review of genomics and human genetics, 12:217–244, 2011

work page 2011
[18]

K. Chen, Y. Zhou, X. Zhang, and H. Wang. Prompt stability matters: Evaluating and optimizing auto-generated prompt in general-purpose systems. arXiv preprint arXiv:2505.13546, 2025

work page arXiv 2025
[19]

X. Chen, M. Lin, N. Sch ¨arli, and D. Zhou. Teaching large language models to self-debug. arXiv preprint arXiv: 2304.05128, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

X. Chen, T. Wang, T. Liu, Z. Guo, X. Li, M. Qu, and T. Zhao. A survey on hypothesis generation for scientific discovery in the era of large language models. arXiv preprint arXiv:2504.05496, 2025. 14

work page arXiv 2025
[21]

Z. Chen, L. Cao, S. Madden, T. Kraska, Z. Shang, J. Fan, N. Tang, Z. Gu, C. Liu, and M. Cafarella. Seed: Domain-specific data curation with large language models. arXiv preprint arXiv:2310.00749, 2023

work page arXiv 2023
[22]

Clough and T

E. Clough and T. Barrett. The gene expression omnibus database. Methods in Molecular Biology , 1418:93–110, 2016. doi: 10.1007/978-1-4939-3578-9 5

work page doi:10.1007/978-1-4939-3578-9 2016
[23]

and Gaffney, Daniel J

A. Conesa, P . Madrigal, S. Tarazona, D. Gomez-Cabrero, A. Cervera, A. McPherson, M. W. Szcze´sniak, D. J. Gaffney, L. L. Elo, X. Zhang, and A. Mortazavi. A survey of best practices for rna-seq data analysis. Genome Biology, 17:13, 2016. doi: 10.1186/s13059-016-0881-8

work page doi:10.1186/s13059-016-0881-8 2016
[24]

J. P . Cook, A. Mahajan, and A. P . Morris. Guidance for the utility of linear models in meta-analysis of genetic association studies of binary phenotypes.European Journal of Human Genetics, 25(2):240– 245, 2017

work page 2017
[25]

T. Dai, S. Vijayakrishnan, F. T. Szczypi ´nski, J.-F. Ayme, E. Simaei, T. Fellowes, R. Clowes, L. Ko- topanov, C. E. Shields, Z. Zhou, J. W. Ward, and A. I. Cooper. Autonomous mobile robots for exploratory synthetic chemistry. Nature, pages 1–8, Nov. 2024. ISSN 1476-4687. doi: 10.1038/s41586-024-08173-7

work page doi:10.1038/s41586-024-08173-7 2024
[26]

DeepMind

G. DeepMind. Gemini 2.5 flash, 2025. URL https://deepmind.google/models/gemini/ flash/. Fast performance thinking model for everyday tasks

work page 2025
[27]

DeepMind

G. DeepMind. Gemini 2.5 pro, 2025. URL https://deepmind.google/models/gemini/. Advanced thinking model with Deep Think mode for complex reasoning tasks

work page 2025
[28]

DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P . Wang, X. Bi, X. Zhang, X. Yu, Y. Wu, Z. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H. Xu, H. Wang...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Y. Dong, X. Jiang, Z. Jin, and G. Li. Self-collaboration code generation via chatgpt. arXiv preprint arXiv: 2304.07590, 2023

work page arXiv 2023
[30]

Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv: 2305.14325, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

Edgar, M

R. Edgar, M. Domrachev, and A. E. Lash. Ncbi geo: archive for gene expression and epigenomics data sets: 23-year update. Nucleic Acids Research, 52(D1):D138–D144, 2024

work page 2024
[32]

Esp ´ın-P´erez, C

A. Esp ´ın-P´erez, C. Portier, M. Chadeau-Hyam, K. van Veldhoven, J. C. Kleinjans, and T. M. de Kok. Comparison of statistical methods and the use of quality control samples for batch effect correction in human transcriptome data. PloS one, 13(8):e0202947, 2018

work page 2018
[33]

G. Feng, B. Zhang, Y. Gu, H. Ye, D. He, and L. Wang. Towards revealing the mystery behind chain of thought: A theoretical perspective. NEURIPS, 2023

work page 2023
[34]

J. A. Gagnon-Bartsch and T. P . Speed. Using control genes to correct for unwanted variation in microarray data. Biostatistics, 13(3):539–552, 2012. doi: 10.1093/biostatistics/kxr034

work page doi:10.1093/biostatistics/kxr034 2012
[35]

Ghosh and A

D. Ghosh and A. M. Chinnaiyan. Classification and selection of biomarkers in genomic data using lasso. Journal of Biomedicine and Biotechnology, 2005(2):147, 2005

work page 2005
[36]

G. S. Ginsburg and K. A. Phillips. Precision medicine: from science to value. Health Affairs, 37(5): 694–701, 2018. doi: 10.1377/hlthaff.2017.1624

work page doi:10.1377/hlthaff.2017.1624 2018
[37]

Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, N. Duan, and W. Chen. Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

T. Guo, K. Guo, B. Nan, Z. Liang, Z. Guo, N. V . Chawla, O. Wiest, and X. Zhang. What can large language models do in chemistry? a comprehensive benchmark on eight tasks. arXiv preprint 15 arXiv:2305.18365, 2023

work page arXiv 2023
[39]

M. A. Hamburg and F. S. Collins. The path to personalized medicine. New England Journal of Medicine, 363(4):301–304, 2010

work page 2010
[40]

J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating char- acteristic (roc) curve. Radiology, 143(1):29–36, 1982

work page 1982
[41]

S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Wang, and Z. Hu. Reasoning with language model is planning with world model. Conference on Empirical Methods in Natural Language Processing, 2023. doi: 10.48550/arXiv.2305.14992

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.14992 2023
[42]

C. R. Henderson. Estimation of genetic parameters. Biometrics, 6(2):186–190, 1950. doi: 10.2307/ 3001414

work page 1950
[43]

Hong and S

L. Hong and S. E. Page. Groups of diverse problem solvers can outperform groups of high-ability problem solvers. Proceedings of the National Academy of Sciences, 101(46):16385–16389, 2004

work page 2004
[44]

S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, C. Zhang, J. Wang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber. Metagpt: Meta programming for a multi- agent collaborative framework. arXiv preprint arXiv: 2308.00352, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

S. Hong, Y. Lin, B. Liu, B. Liu, B. Wu, C. Zhang, C. Wei, D. Li, J. Chen, J. Zhang, J. Wang, L. Zhang, L. Zhang, M. Yang, M. Zhuge, T. Guo, T. Zhou, W. Tao, X. Tang, X. Lu, X. Zheng, X. Liang, Y. Fei, Y. Cheng, Z. Gou, Z. Xu, and C. Wu. Data interpreter: An llm agent for data science.arXiv preprint arXiv:2402.18679, 2024

work page arXiv 2024
[46]

S. Hu, C. Lu, and J. Clune. Automated design of agentic systems. arXiv preprint arXiv:2408.08435, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

Large Language Models Cannot Self-Correct Reasoning Yet

J. Huang, X. Chen, S. Mishra, H. S. Zheng, A. W. Yu, X. Song, and D. Zhou. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

Huang, S

K. Huang, S. Zhang, H. Wang, A. Bhattacharjee, Y. Lu, M. Wen, J. Yang, and J. Ye. Biomni: A general-purpose biomedical ai agent. bioRxiv preprint bioRxiv:2025.05.30.656746, 2025

work page 2025
[49]

Huang, J

Y. Huang, J. Shi, Y. Li, C. Fan, S. Wu, Q. Zhang, Y. Liu, P . Zhou, Y. Wan, N. Z. Gong, and L. Sun. Metatool benchmark for large language models: Deciding whether to use tools and which to use,

work page
[50]

URL https://arxiv.org/abs/2310.03128

work page arXiv
[51]

S. Jia, T. Huo, and Y. Zeng. Llmatdesign: Autonomous materials discovery with large language models. arXiv preprint arXiv:2406.13163, 2024

work page arXiv 2024
[52]

W. E. Johnson, C. Li, and A. Rabinovic. Adjusting batch effects in microarray expression data using empirical bayes methods. Biostatistics, 8(1):118–127, 2007. doi: 10.1093/biostatistics/kxj037

work page doi:10.1093/biostatistics/kxj037 2007
[53]

I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. The Annals of statistics, 29(2):295–327, 2001

work page 2001
[54]

H. B. Kang, N. Soliman, M. Latzke, J. C. Chang, and J. Bragg. Comlittee: Literature discovery with personal elected author committees. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2023

work page 2023
[55]

M. M. R. Khondoker. Statistical methods for pre-processing microarray gene expression data . PhD thesis, University of Edinburgh, 2006

work page 2006
[56]

Kyalwazi, C

B. Kyalwazi, C. Yau, M. J. Campbell, T. F. Yoshimatsu, A. J. Chien, A. M. Wallace, A. Forero-Torres, L. Pusztai, E. D. Ellis, K. S. Albain, et al. Race, gene expression signatures, and clinical outcomes of patients with high-risk early breast cancer. JAMA Network Open, 6(12):e2349646–e2349646, 2023

work page 2023
[57]

Latif, R

E. Latif, R. Parasuraman, and X. Zhai. Physicsassistant: An llm-powered interactive learning robot for physics lab investigations. In 2024 33rd IEEE International Conference on Robot and Human Interactive Communication (ROMAN), pages 864–871. IEEE, 2024

work page 2024
[58]

J. T. Leek, R. B. Scharpf, H. C. Bravo, D. Simcha, B. Langmead, W. E. Johnson, D. Geman, K. Bag- gerly, and R. A. Irizarry. Tackling the widespread and critical impact of batch effects in high- throughput data. Nature Reviews Genetics, 11(10):733–739, 2010

work page 2010
[59]

Li and C

B. Li and C. N. Dewey. Rsem: accurate transcript quantification from rna-seq data with or without 16 a reference genome. BMC Bioinformatics, 12:323, 2011. doi: 10.1186/1471-2105-12-323

work page doi:10.1186/1471-2105-12-323 2011
[60]

H. Li, Y. Q. Chong, S. Stepputtis, J. Campbell, D. Hughes, M. Lewis, and K. Sycara. Theory of mind for multi-agent collaboration via large language models. arXiv preprint arXiv:2310.10701 , 2023

work page arXiv 2023
[61]

L. Li, W. Xu, J. Guo, R. Zhao, X. Li, Y. Yuan, B. Zhang, Y. Jiang, Y. Xin, R. Dang, et al. Chain of ideas: Revolutionizing research via novel idea development with llm agents. arXiv preprint arXiv:2410.13185, 2024

work page arXiv 2024
[62]

Y. Li, Y. Zhang, and X. Chen. Hle-bench: A holistic evaluation benchmark for large language models in higher-level reasoning. arXiv preprint arXiv:2406.10833, 2024

[63]

A. W.-C. Liew, N.-F. Law, and H. Yan. Missing value imputation for gene expression data: computational techniques to recover missing data from available information. Briefings in bioinformatics, 12(5):498–513, 2011

[64]

C. Lippert, J. Listgarten, Y. Liu, C. M. Kadie, R. I. Davidson, and D. Heckerman. Fast linear mixed models for genome-wide association studies. Nature methods, 8(10):833–835, 2011

[65]

B. Liu, Y. Jiang, X. Zhang, Q. Liu, S. Zhang, J. Biswas, and P. Stone. Llm+p: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477, 2023

[66]

B. Liu, X. Li, J. Zhang, J. Wang, T. He, S. Hong, H. Liu, S. Zhang, K. Song, K. Zhu, Y. Cheng, S. Wang, X. Wang, Y. Luo, H. Jin, P. Zhang, O. Liu, J. Chen, H. Zhang, Z. Yu, H. Shi, B. Li, D. Wu, F. Teng, X. Jia, J. Xu, J. Xiang, Y. Lin, T. Liu, T. Liu, Y. Su, H. Sun, G. Berseth, J. Nie, I. Foster, L. Ward, Q. Wu, Y. Gu, M. Zhuge, X. Tang, H. Wang, J. You...

[67]

H. Liu, S. Chen, Y. Zhang, and H. Wang. Genotex: A benchmark for automated gene expression data analysis in alignment with bioinformaticians, 2025. URL https://arxiv.org/abs/2406.15341

[68]

S. Liu, Y. Lu, S. Chen, X. Hu, J. Zhao, Y. Lu, and Y. Zhao. Drugagent: Automating ai-aided drug discovery programming through llm multi-agent collaboration. arXiv preprint arXiv:2411.15692, 2024

[69]

Z. Liu, Y. Zhang, P. Li, Y. Liu, and D. Yang. Dynamic LLM-agent network: An LLM-agent collaboration framework with agent team optimization. arXiv preprint arXiv:2310.02170, 2023

[70]

Z. Liu, Y. Huang, S. Raman, A. Anandamurthy, V. Makeeva, V. Subbotin, D. Grushevskaya, K. Raman, E. Kalabusheva, J. Bagaitkar, T. Cui, B. Ren, M. Shvedova, J. Attie, C. Weng, P. Dolzhenko, M. J. Martinez, and K. Zhang. Transcriptomics and epigenetic data integration learning module on google cloud. Briefings in Bioinformatics, 25(Supplement 1):bbae352...

[71]

M. I. Love, W. Huber, and S. Anders. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12):550, 2014

[72]

C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha. The ai scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024

[73]

H. Ma, T. Hu, Z. Pu, L. Boyin, X. Ai, Y. Liang, and M. Chen. Coevolving with the other you: Fine-tuning llm with sequential cooperative multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 37:15497–15525, 2024

[74]

P. Ma, T.-H. Wang, M. Guo, Z. Sun, J. B. Tenenbaum, D. Rus, C. Gan, and W. Matusik. Llm and simulation as bilevel optimizers: A new paradigm to advance physical scientific discovery. arXiv preprint arXiv:2405.09783, 2024

[75]

A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023

[76]

A. Madani, B. Krause, E. R. Greene, S. Subramanian, B. P. Mohr, J. M. Holton, J. L. Olmos Jr, C. Xiong, Z. Z. Sun, R. Socher, et al. Large language models generate functional protein sequences across diverse families. Nature Biotechnology, pages 1–8, 2023

[77]

J. D. Martin-Rufino, A. Caulier, L. E. Torres, A. Babu, S. Li, S. H. Jung, D. B. Keskin, X. Wang, S. Saori, P. Giuliana, M. Gu, A. A. Thompson, V. G. Sankaran, and E. S. Lander. Transcription factor networks disproportionately enrich for heritability of blood cell phenotypes. Science, 388(6666):52–59, 2025. doi: 10.1126/science.ads7951

[78]

X. Ning, Z. Lin, Z. Zhou, H. Yang, and Y. Wang. Skeleton-of-thought: Large language models can do parallel decoding. arXiv preprint arXiv:2307.15337, 2023

[79]

Novita AI. Novita ai: Deploy ai models effortlessly with our simple api. https://novitaai.com, 2025. Accessed: 2025-02-17

[80]

OpenAI. Gpt-4 technical report. PREPRINT, 2023


Showing first 80 references.
