arXiv: 2507.21035 · v3 · pith:5BDYTGWOnew · submitted 2025-07-28 · 💻 cs.AI · cs.LG · cs.MA · q-bio.GN

GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis

Pith reviewed 2026-05-22 00:33 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.MA · q-bio.GN
keywords gene expression analysis · multi-agent systems · large language models · transcriptomic data · data preprocessing · gene identification · scientific discovery automation

The pith

A team of six LLM agents uses typed messaging and guided action units to automate gene expression analysis from raw transcriptomic files.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a collaborative setup in which specialized large language model agents share an analytic workspace and communicate through typed messages to handle the full pipeline of gene expression work. At the center is a planning process that turns broad task instructions into discrete Action Units, letting agents at any point advance, revise, bypass, or backtrack so the sequence stays logically sound while fitting the quirks of real genomic data files. This structure is shown to deliver 89.13 percent composite similarity correlation on data preprocessing and 60.48 percent F1 on gene identification, both well above the previous best automated results. A reader would care because the approach suggests a practical middle path between rigid scripts that fail on edge cases and free-running agents that lose precision, potentially letting more labs extract reliable biological signals without constant expert oversight.

Core claim

Six LLM-based agents coordinate through typed message-passing on a shared canvas. Programming agents convert high-level guidelines into Action Units and choose at each step to advance, revise, bypass, or backtrack, which preserves overall coherence while adapting to the particular demands of large semi-structured transcriptomic datasets. The result is higher benchmark performance than prior automation methods, together with gene-phenotype links that match literature reports after latent confounders are taken into account.

What carries the argument

The guided-planning framework that decomposes tasks into Action Units and supplies explicit decision points for agents to advance, revise, bypass, or backtrack.
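
The decision loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `ActionUnit`, `Decision`, and `run_plan` are hypothetical, and the `decide` callback stands in for whatever the LLM agent does when it inspects the result of a step.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]


class Decision(Enum):
    ADVANCE = "advance"      # accept the result and move to the next unit
    REVISE = "revise"        # retry the current unit
    BYPASS = "bypass"        # skip the current unit, keeping the old state
    BACKTRACK = "backtrack"  # undo the previous unit and return to it


@dataclass
class ActionUnit:
    # Hypothetical container: a named step that maps a state to a new state.
    name: str
    run: Callable[[State], State]


def run_plan(
    units: List[ActionUnit],
    decide: Callable[[ActionUnit, State], Decision],
    max_steps: int = 50,
) -> Tuple[State, List[Tuple[str, Decision]]]:
    state: State = {}
    history: List[State] = []  # history[k] is the state before unit k ran
    trace: List[Tuple[str, Decision]] = []
    i = 0
    steps = 0
    # max_steps bounds the loop so repeated REVISE decisions cannot run forever.
    while i < len(units) and steps < max_steps:
        steps += 1
        unit = units[i]
        candidate = unit.run(dict(state))  # run on a copy of the shared state
        decision = decide(unit, candidate)
        trace.append((unit.name, decision))
        if decision is Decision.ADVANCE:
            history.append(state)
            state = candidate
            i += 1
        elif decision is Decision.BYPASS:
            history.append(state)
            i += 1
        elif decision is Decision.BACKTRACK and i > 0:
            state = history.pop()
            i -= 1
        # REVISE (or BACKTRACK at the first unit) repeats the same unit.
    return state, trace


# Toy plan with a scripted decision sequence standing in for the agent.
units = [
    ActionUnit("load", lambda s: {**s, "rows": 3}),
    ActionUnit("map_gene_ids", lambda s: {**s, "mapped": True}),
    ActionUnit("normalize", lambda s: {**s, "normalized": True}),
]
script = iter([Decision.ADVANCE, Decision.REVISE, Decision.BYPASS, Decision.ADVANCE])
final_state, trace = run_plan(units, lambda unit, result: next(script))
print(final_state)
print([(name, d.value) for name, d in trace])
```

Keeping one state snapshot per completed or bypassed unit is one simple way to make backtracking well defined; a real system would presumably also regenerate code on a revise rather than rerun the same function.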

If this is right

  • The agents surface gene-phenotype associations that align with published findings while adjusting for latent confounders.
  • The method processes multiple large semi-structured files without the breakdowns typical of fixed workflows.
  • Benchmark gains of 10.61% in preprocessing Composite Similarity Correlation and 16.85% in gene identification F1 over the best prior method follow directly from the collaborative structure.
  • Logical coherence is maintained across steps even when individual agents act with some autonomy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same agent-collaboration pattern could be tested on other high-dimensional biological datasets such as proteomics or single-cell RNA profiles.
  • Adding explicit checks for biological plausibility at backtrack points might further lower the chance of downstream errors.
  • If the planning decisions generalize, the approach could shorten the time between raw data arrival and publishable biological insight in many labs.

Load-bearing premise

The LLM agents will generate correct analysis code and keep the multi-step process logically consistent without introducing errors that produce invalid biological conclusions.

What would settle it

Execute the code produced by the agents on a publicly available gene-expression dataset whose correct preprocessing steps and gene-phenotype associations have already been established by independent expert analysis, then check whether the outputs match those established results within expected tolerance.
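
A check of this kind could be scripted along the following lines. This is a hedged sketch only: it uses the standard set-based Jaccard index and F1 score, which may differ from the exact metric definitions used by the GenoTEX benchmark, and the gene sets, the function names, and the `min_f1` threshold are made-up placeholders rather than results or settings from the paper.

```python
from typing import Iterable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Standard Jaccard index between two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def f1_score(predicted: Iterable[str], reference: Iterable[str]) -> float:
    """Set-based F1: harmonic mean of precision and recall."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def matches_reference(predicted, reference, min_f1: float = 0.9) -> bool:
    # min_f1 is an illustrative tolerance; the evaluator would choose it.
    return f1_score(predicted, reference) >= min_f1


# Toy inputs for illustration only (not taken from any dataset or the paper).
reference_genes = {"TP53", "BRCA1", "EGFR", "MYC"}
agent_genes = {"TP53", "BRCA1", "KRAS"}

print("Jaccard:", jaccard(agent_genes, reference_genes))
print("F1:", f1_score(agent_genes, reference_genes))
print("Within tolerance:", matches_reference(agent_genes, reference_genes))
```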

Figures

Figures reproduced from arXiv: 2507.21035 by Haohan Wang, Haoyang Liu, Yijiang Li.

Figure 1. Multi-agent collaboration in our GenoMAS method.
Figure 2. Planning, memory, and self-correction mechanisms of a single programming agent in our …
Figure 3. Individual task performance of GenoMAS and various baselines on dataset filtering, selection, and preprocessing tasks. (a) Dataset filtering (DF) and selection (DS) accuracy of different methods, and different ablation settings of our GenoMAS method. (b) Data preprocessing performance across three data types (Linked, Gene, Trait) and three metrics (AJ: Attribute Jaccard, SJ: Sample Jaccard, CSC: Composite…
Figure 4. Memory reuse efficiency in GenoMAS. (a) Cumulative time savings through memory reuse and (b) memory reuse rate evolution across programming steps. The system rapidly achieves high efficiency, stabilizing around 65% reuse rate after initial learning.
Figure 5. Agent collaboration patterns in GenoMAS. (a) Network topology showing agent communication structure with node size proportional to message volume. Edge thickness indicates interaction frequency. (b) Distribution of message types across agent pairs revealing asymmetric communication patterns, with programming agents predominantly sending validation requests while advisory agents respond with feedback.

Original abstract

Gene expression analysis holds the key to many biomedical discoveries, yet extracting insights from raw transcriptomic data remains formidable due to the complexity of multiple large, semi-structured files and the need for extensive domain expertise. Current automation approaches are often limited by either inflexible workflows that break down in edge cases or by fully autonomous agents that lack the necessary precision for rigorous scientific inquiry. GenoMAS charts a different course by presenting a team of LLM-based scientists that integrates the reliability of structured workflows with the adaptability of autonomous agents. GenoMAS orchestrates six specialized LLM agents through typed message-passing protocols, each contributing complementary strengths to a shared analytic canvas. At the heart of GenoMAS lies a guided-planning framework: programming agents unfold high-level task guidelines into Action Units and, at each juncture, elect to advance, revise, bypass, or backtrack, thereby maintaining logical coherence while bending gracefully to the idiosyncrasies of genomic data. On the GenoTEX benchmark, GenoMAS reaches a Composite Similarity Correlation of 89.13% for data preprocessing and an F$_1$ of 60.48% for gene identification, surpassing the best prior art by 10.61% and 16.85% respectively. Beyond metrics, GenoMAS surfaces biologically plausible gene-phenotype associations corroborated by the literature, all while adjusting for latent confounders. Code is available at https://github.com/Liu-Hy/GenoMAS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces GenoMAS, a multi-agent framework with six specialized LLM agents that collaborate via typed message-passing protocols to perform gene expression analysis on raw transcriptomic data. Central to the system is a guided-planning framework in which programming agents decompose high-level guidelines into Action Units and dynamically choose to advance, revise, bypass, or backtrack. On the GenoTEX benchmark the framework reports 89.13% Composite Similarity Correlation for data preprocessing and 60.48% F1 for gene identification, exceeding the best prior art by 10.61% and 16.85% respectively, while surfacing biologically plausible gene-phenotype associations after latent-confounder adjustment. Code is released at the cited GitHub repository.

Significance. If the performance claims hold under rigorous controls, the work offers a practical middle path between rigid pipelines and fully autonomous agents for scientific code generation in genomics. The explicit code release constitutes a clear strength for reproducibility and community scrutiny.

major comments (2)
  1. [§4, Experimental evaluation] The reported 89.13% CSC and 60.48% F1 scores are presented without any description of baseline re-implementations, hyper-parameter settings for the compared methods, statistical significance tests, or error bars; these omissions render the claimed margins (10.61% and 16.85%) impossible to assess for robustness.
  2. [§3.2, Guided-planning framework] The Action Unit mechanism and the four-way decision rule (advance/revise/bypass/backtrack) are described only at a high level; without pseudocode, formal invariants, or concrete traces showing how hallucinations are detected and corrected, it is unclear whether the framework actually guarantees logical coherence across multi-step genomic analyses.
minor comments (2)
  1. [Abstract] The abstract states that associations are 'corroborated by the literature' yet provides no citation list or overlap statistics; a supplementary table mapping discovered genes to supporting PubMed IDs would strengthen the biological-plausibility claim.
  2. [§3.1] Notation for the typed message-passing protocol is introduced without an explicit schema or example message; adding a small table of message types and their fields would improve clarity.
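
To make the last request concrete, here is a minimal sketch of what an explicit message schema could look like. Every name in it (`MessageType`, `Message`, the field names, the agent names) is a hypothetical placeholder. The two message kinds are borrowed from the Figure 5 caption (validation requests and feedback); the paper's actual protocol is not specified in the text reviewed here.

```python
from dataclasses import dataclass
from enum import Enum


class MessageType(Enum):
    # Illustrative kinds only, named after the Figure 5 caption.
    VALIDATION_REQUEST = "validation_request"
    FEEDBACK = "feedback"


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    kind: MessageType
    payload: str

    def __post_init__(self) -> None:
        # Reject untyped messages early instead of letting them propagate.
        if not isinstance(self.kind, MessageType):
            raise TypeError(f"kind must be a MessageType, got {type(self.kind).__name__}")
        if not self.sender or not self.receiver:
            raise ValueError("sender and receiver must be non-empty")


def reply_with_feedback(request: Message, feedback: str) -> Message:
    """Build a FEEDBACK message addressed back to the sender of a request."""
    if request.kind is not MessageType.VALIDATION_REQUEST:
        raise ValueError("can only reply to a validation request")
    return Message(
        sender=request.receiver,
        receiver=request.sender,
        kind=MessageType.FEEDBACK,
        payload=feedback,
    )


# Hypothetical agent names, used purely for the example.
request = Message("programming_agent", "advisory_agent",
                  MessageType.VALIDATION_REQUEST, "Please check the gene ID mapping step.")
reply = reply_with_feedback(request, "Mapping looks consistent.")
print(reply)
```

Validating the message kind at construction time is the design choice being illustrated: a malformed message fails where it is created rather than somewhere downstream in the pipeline.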

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledging where additional information is warranted, and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [§4, Experimental evaluation] The reported 89.13% CSC and 60.48% F1 scores are presented without any description of baseline re-implementations, hyper-parameter settings for the compared methods, statistical significance tests, or error bars; these omissions render the claimed margins (10.61% and 16.85%) impossible to assess for robustness.

    Authors: We agree that the current presentation of results in Section 4 lacks sufficient implementation details to allow independent assessment of robustness. In the revised manuscript we will add: (i) explicit descriptions of how each baseline was re-implemented (including any adaptations required for the GenoTEX benchmark), (ii) the hyper-parameter values and search ranges used for all compared methods, (iii) results of statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests with reported p-values), and (iv) error bars or standard deviations obtained from multiple independent runs. These additions will make the reported margins (10.61% and 16.85%) directly evaluable.

    Revision: yes

  2. Referee: [§3.2, Guided-planning framework] The Action Unit mechanism and the four-way decision rule (advance/revise/bypass/backtrack) are described only at a high level; without pseudocode, formal invariants, or concrete traces showing how hallucinations are detected and corrected, it is unclear whether the framework actually guarantees logical coherence across multi-step genomic analyses.

    Authors: Section 3.2 currently emphasizes the conceptual design to keep the exposition accessible. We acknowledge that this leaves the operational details underspecified. In the revision we will insert: (i) pseudocode for Action Unit decomposition and the four-way decision procedure, (ii) the key invariants the framework is designed to maintain (e.g., type consistency of messages and non-regression of data-preprocessing state), and (iii) two or three concrete execution traces drawn from our GenoTEX runs that illustrate how the revise or backtrack actions detect and mitigate hallucinations or logical inconsistencies. These additions will clarify how coherence is preserved without claiming formal guarantees beyond the empirical behavior.

    Revision: yes
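
The error bars and significance tests promised in the first response could be produced with something like the sketch below. It is an illustration under stated assumptions: the rebuttal names paired t-tests and Wilcoxon signed-rank tests, whereas this example swaps in a paired sign-flip permutation test so that it needs only the Python standard library. The per-run scores are invented toy numbers, not results from the paper, and the function names are hypothetical.

```python
import random
import statistics
from typing import List, Tuple


def mean_and_std(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across independent runs."""
    return statistics.fmean(scores), statistics.stdev(scores)


def paired_sign_flip_test(a: List[float], b: List[float],
                          n_resamples: int = 10_000, seed: int = 0) -> float:
    """Two-sided paired permutation test on the mean of per-run differences.

    Under the null hypothesis the sign of each paired difference is
    arbitrary, so signs are flipped at random and the resulting mean
    differences are compared with the observed one.
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(statistics.fmean(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Invented per-run scores for two methods on the same five runs.
method_runs = [0.71, 0.69, 0.74, 0.70, 0.72]
baseline_runs = [0.62, 0.65, 0.63, 0.60, 0.66]

mean, std = mean_and_std(method_runs)
print(f"method: {mean:.3f} ± {std:.3f}")
print("p-value:", paired_sign_flip_test(method_runs, baseline_runs))
```

With only five paired runs the smallest two-sided p-value this test can reach is 2/32, which is one reason reporting the number of runs alongside any p-value matters.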

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The manuscript describes a multi-agent LLM framework for gene expression analysis and reports empirical performance on the external GenoTEX benchmark (89.13% CSC for preprocessing, 60.48% F1 for gene identification). No equations, derivations, or first-principles claims appear that could reduce to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The guided-planning and typed message-passing components are presented as design choices whose validity is assessed via external benchmark comparison and released code, not by internal construction. This is the most common honest finding for applied systems papers whose central results are benchmark gains rather than closed-form derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the unproven assumption that current LLMs can reliably act as domain-expert coding agents for genomic data; it introduces the guided-planning mechanism and the six-agent division as new constructs without independent falsifiable evidence beyond the single benchmark.

axioms (1)
  • [domain assumption] LLM agents can be specialized via prompting and coordinated through typed messages to perform rigorous scientific data analysis without introducing critical errors.
    This assumption underpins the entire multi-agent orchestration described in the abstract.
invented entities (1)
  • Guided-planning framework with Action Units [no independent evidence]
    Purpose: to let agents decompose tasks and dynamically choose advance, revise, bypass, or backtrack actions.
    New planning construct introduced to balance structure and adaptability.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

    cs.CL · 2026-05 · unverdicted · novelty 7.0

    An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

  2. Heterogeneous Scientific Foundation Model Collaboration

    cs.AI · 2026-04 · unverdicted · novelty 5.0

    Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
