SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding

Chuan Qin; Hengshu Zhu; Jinmiao Chen; Meng Xiao; Qingqing Long; Xiaohan Huang; Yuanchun Zhou

arxiv: 2601.12805 · v3 · pith:EIKF5R2Enew · submitted 2026-01-19 · 🧬 q-bio.GN · cs.AI · cs.CL

Xiaohan Huang, Meng Xiao, Chuan Qin, Qingqing Long, Jinmiao Chen, Yuanchun Zhou, Hengshu Zhu

Pith reviewed 2026-05-25 07:02 UTC · model grok-4.3

classification 🧬 q-bio.GN · cs.AI · cs.CL

keywords LLM benchmarking · gene function inference · biomedical AI · hallucination evaluation · cell atlas interpretation · biological reasoning · literature grounding · SciHorizon-GENE

The pith

LLMs display wide variation in gene reasoning and often fail to produce complete, literature-grounded functional interpretations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SciHorizon-GENE, a benchmark built from authoritative databases that integrates curated knowledge on more than 190,000 human genes and comprises over 540,000 questions spanning gene-to-function reasoning. It tests models on four targeted dimensions: sensitivity to research attention, hallucination tendency, answer completeness, and literature influence. Systematic runs across many general-purpose and biomedical LLMs uncover large differences in performance and repeated shortfalls in faithful, complete outputs. These results matter because accurate gene-level reasoning is required for safe application of LLMs to cell atlas interpretation and related biological tasks. The benchmark supplies a concrete way to measure progress toward reliable knowledge-enhanced pipelines.

Core claim

SciHorizon-GENE integrates knowledge for over 190K human genes into more than 540K questions and evaluates LLMs on research attention sensitivity, hallucination tendency, answer completeness, and literature influence, exposing substantial heterogeneity in gene-level reasoning capabilities together with persistent shortfalls in faithful, complete, and literature-grounded functional interpretations.

What carries the argument

The SciHorizon-GENE benchmark, which organizes questions around four biologically critical perspectives to expose failure modes in gene-to-function reasoning.
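To make the four-perspective organization concrete, here is a minimal sketch of how per-question outcomes could be rolled up into one score per perspective. Only the four perspective names come from the paper. The `QuestionResult` record, the binary `correct` flag, and the `score_by_perspective` helper are hypothetical simplifications, and the paper's actual metrics may differ.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four perspective names come from the paper; the record layout and the
# pass/fail scoring below are illustrative assumptions, not the paper's method.
PERSPECTIVES = (
    "research_attention_sensitivity",
    "hallucination_tendency",
    "answer_completeness",
    "literature_influence",
)


@dataclass(frozen=True)
class QuestionResult:
    gene: str          # identifier of the gene the question is about
    perspective: str   # one of PERSPECTIVES
    correct: bool      # assumed binary judgment of the model's answer


def score_by_perspective(results):
    """Return the fraction of correct answers for each perspective present."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        if r.perspective not in PERSPECTIVES:
            raise ValueError(f"unknown perspective: {r.perspective}")
        totals[r.perspective] += 1
        hits[r.perspective] += int(r.correct)
    return {p: hits[p] / totals[p] for p in totals}


if __name__ == "__main__":
    # Toy records with placeholder gene names; not data from the benchmark.
    demo = [
        QuestionResult("GENE_A", "answer_completeness", True),
        QuestionResult("GENE_B", "answer_completeness", False),
        QuestionResult("GENE_A", "hallucination_tendency", False),
    ]
    print(score_by_perspective(demo))
```

Keeping the perspective as a field on each record, rather than as separate datasets, is one way to let a single evaluation pass report all four views at once.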

If this is right

LLMs require explicit model selection and validation before deployment in knowledge-enhanced cell atlas interpretation.
Development efforts should target improvements in faithfulness, completeness, and literature grounding for gene-level outputs.
The benchmark supplies a reusable test bed for tracking progress in biological reasoning capabilities.
Heterogeneity across models indicates that current general-purpose and biomedical LLMs are not interchangeable for gene-function tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the benchmark to multi-gene or pathway-level questions could reveal whether the observed gaps scale with task complexity.
The four evaluation perspectives might be adapted to measure similar reasoning limits in other scientific domains that rely on curated knowledge bases.
If the heterogeneity persists after retrieval augmentation, it would point to deeper architectural constraints rather than simple knowledge gaps.

Load-bearing premise

The questions drawn from authoritative databases accurately represent the range of gene-to-function scenarios and failure modes that would affect LLM use in biological interpretation pipelines.

What would settle it

A single new LLM that scores uniformly high on all four evaluation perspectives across the full set of 540K questions without extra training or retrieval would contradict the reported persistent challenges.
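The falsification test above can be written as a small check. This is a sketch under stated assumptions: every perspective is reported as a score in [0, 1] where higher is better (so a hallucination rate would first be converted to 1 minus the rate), and the 0.9 threshold for "uniformly high" is a hypothetical placeholder, since the paper does not define one.

```python
# Perspective names follow the paper. The threshold value and the
# "higher is better, scores in [0, 1]" convention are assumptions.
REQUIRED_PERSPECTIVES = frozenset({
    "research_attention_sensitivity",
    "hallucination_tendency",
    "answer_completeness",
    "literature_influence",
})


def scores_uniformly_high(scores, threshold=0.9):
    """Return True only if all four perspectives meet the threshold."""
    missing = REQUIRED_PERSPECTIVES - set(scores)
    if missing:
        raise ValueError(f"missing perspectives: {sorted(missing)}")
    return all(scores[name] >= threshold for name in REQUIRED_PERSPECTIVES)
```

A model would count as a counterexample only when the function returns True on scores computed over the full question set, with no extra training or retrieval.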

Figures

Figures reproduced from arXiv: 2601.12805 by Chuan Qin, Hengshu Zhu, Jinmiao Chen, Meng Xiao, Qingqing Long, Xiaohan Huang, Yuanchun Zhou.

**Figure 1.** Observations of LLM behavior on gene-related tasks, motivating the need for our gene-centric benchmark. (a) Model …

**Figure 2.** The benchmark integrates curated biological databases and verified literature sources to construct gene nodes. These …

**Figure 3.** PubMed reference count distribution for human …

**Figure 4.** Model performance on three tasks for high- and …

**Figure 6.** Completeness evaluation of LLMs. All questions …

**Figure 7.** Comparison of Gene Ontology answering perfor…

**Figure 8.** Comparison of functional summary answering per…

**Figure 10.** Each example corresponds to a specific evaluation perspective. A and B indicate variants within the same genomic …

**Figure 11.** Prompt templates for all question types, including the unified system prompt and task-specific instruction prompts.

Original abstract

Large language models (LLMs) have shown growing promise in biomedical research, particularly for knowledge-driven interpretation tasks. However, their ability to reliably reason from gene-level knowledge to functional understanding, a core requirement for knowledge-enhanced cell atlas interpretation, remains largely underexplored. To address this gap, we introduce SciHorizon-GENE, a large-scale gene-centric benchmark constructed from authoritative biological databases. The benchmark integrates curated knowledge for over 190K human genes and comprises more than 540K questions covering diverse gene-to-function reasoning scenarios relevant to cell type annotation, functional interpretation, and mechanism-oriented analysis. Motivated by behavioral patterns observed in preliminary examinations, SciHorizon-GENE evaluates LLMs along four biologically critical perspectives: research attention sensitivity, hallucination tendency, answer completeness, and literature influence, explicitly targeting failure modes that limit the safe adoption of LLMs in biological interpretation pipelines. We systematically evaluate a wide range of state-of-the-art general-purpose and biomedical LLMs, revealing substantial heterogeneity in gene-level reasoning capabilities and persistent challenges in generating faithful, complete, and literature-grounded functional interpretations. Our benchmark establishes a systematic foundation for analyzing LLM behavior at the gene scale and offers insights for model selection and development, with direct relevance to knowledge-enhanced biological interpretation.
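As an illustration of what "research attention sensitivity" could look like in practice, the sketch below compares accuracy on heavily studied genes against accuracy on sparsely studied ones. Using a PubMed reference count as the attention proxy follows the figure captions above; the median split, the binary `correct` flag, and the `attention_gap` helper are assumptions made for this example, not the paper's stated procedure.

```python
from statistics import median


def attention_gap(records):
    """Accuracy on high-attention genes minus accuracy on low-attention genes.

    `records` is an iterable of (pubmed_count, correct) pairs, one per
    question. Genes are split at the median count, which is an assumed
    cutoff rather than one taken from the paper.
    """
    records = list(records)
    if not records:
        raise ValueError("no records given")
    cutoff = median(count for count, _ in records)
    high = [ok for count, ok in records if count > cutoff]
    low = [ok for count, ok in records if count <= cutoff]
    if not high or not low:
        raise ValueError("median split left one group empty")
    return sum(high) / len(high) - sum(low) / len(low)
```

A gap near zero would suggest the model answers equally well regardless of how much literature exists for a gene, while a large positive gap would suggest it leans on well-studied genes.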

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SciHorizon-GENE offers a large new benchmark for LLM gene reasoning but the abstract leaves the question curation and validation steps unshown.

The paper's main contribution is SciHorizon-GENE itself, a benchmark built from external databases that covers over 190K genes and more than 540K questions and is aimed at four angles: research attention sensitivity, hallucination, completeness, and literature influence. That scale and those targeted perspectives are new for gene-to-function LLM testing and directly address a stated gap in biomedical interpretation tasks like cell type annotation. Credit to the authors for grounding it in authoritative sources rather than made-up examples and for framing the evaluation around failure modes that matter for safe use in life sciences pipelines. The abstract also states they ran a wide set of general and biomedical LLMs and saw heterogeneity plus ongoing problems with faithful outputs. That setup could be useful for model selection work if the numbers hold.

The soft spot is the lack of any reported detail on how the questions were generated, sampled, or checked for representativeness. There is no mention of expert review, coverage of mechanism-oriented cases, or controls for database artifacts. Without that, the central claims about LLM limitations rest on unverified assumptions about what the benchmark actually measures. The stress-test note on real-world workflow alignment is fair based on what's shown.

This is the sort of paper that belongs in a reading group for people doing LLM evaluation in biology, mainly to see the methods and raw results sections. It deserves peer review because the scale and framing fill an underexplored spot, even though the current description is too high-level to judge soundness yet.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces SciHorizon-GENE, a large-scale benchmark constructed from authoritative biological databases containing curated knowledge for over 190K human genes and more than 540K questions. It evaluates state-of-the-art LLMs on four perspectives—research attention sensitivity, hallucination tendency, answer completeness, and literature influence—revealing substantial heterogeneity in gene-level reasoning capabilities and persistent challenges in generating faithful, complete, and literature-grounded functional interpretations relevant to cell atlas interpretation.

Significance. If the benchmark questions validly capture real-world failure modes in gene-to-function reasoning, the findings would provide a valuable systematic foundation for analyzing LLM behavior at the gene scale and informing model selection and development in biomedical applications.

major comments (1)

[Abstract] The central claims of substantial heterogeneity and persistent challenges in LLM gene-to-function reasoning rest on SciHorizon-GENE accurately representing diverse real-world scenarios. The description provides no details on question validation, bias controls, sampling strategy, coverage of mechanism-oriented analysis, or validation of the curation process against expert workflows in cell-type annotation tasks.

minor comments (1)

[Abstract] Specific performance metrics, example questions, and quantitative results are absent, making it difficult to assess the scale of reported heterogeneity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback emphasizing the need for greater methodological transparency to support the benchmark's claims. We address the single major comment below and will incorporate clarifications in the revised manuscript.

Referee: [Abstract] The central claims of substantial heterogeneity and persistent challenges in LLM gene-to-function reasoning rest on SciHorizon-GENE accurately representing diverse real-world scenarios. The description provides no details on question validation, bias controls, sampling strategy, coverage of mechanism-oriented analysis, or validation of the curation process against expert workflows in cell-type annotation tasks.

Authors: The abstract is intentionally brief. The full manuscript (Section 3) describes construction from authoritative databases covering >190K genes and >540K questions, with explicit inclusion of mechanism-oriented scenarios via pathway, interaction, and regulatory data relevant to cell-type annotation. Sampling is exhaustive (all curated entries) rather than subsampled. We acknowledge the abstract and main text lack explicit subsections on question validation procedures, bias controls, and direct comparison to expert cell-type annotation workflows. We will add a dedicated 'Benchmark Validation and Bias Controls' subsection detailing curation validation steps, relevance checks for cell atlas tasks, and any bias mitigation (e.g., source diversity), plus a brief reference in the abstract. This addresses the concern without changing results.

Revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark constructed from external databases; LLM evaluations independent of paper inputs

The paper introduces SciHorizon-GENE as a benchmark built directly from authoritative external biological databases covering 190K genes and 540K questions. The central claims concern observed heterogeneity in LLM performance across four perspectives (research attention sensitivity, hallucination tendency, answer completeness, literature influence) when evaluated on this benchmark. No equations, parameter fits, self-citations, or ansatzes are invoked as load-bearing steps in the derivation chain. The benchmark construction and evaluation results do not reduce to the paper's own inputs by definition or construction, satisfying the criteria for a self-contained, non-circular analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; main unverified premise is that database-curated questions capture real biological reasoning failure modes. No free parameters or invented entities are described.

axioms (1)

domain assumption: Authoritative biological databases provide accurate and representative gene knowledge sufficient for constructing a benchmark that reflects real-world functional interpretation needs.
The benchmark integrates curated knowledge for over 190K human genes from these databases.

pith-pipeline@v0.9.0 · 5781 in / 1168 out tokens · 44278 ms · 2026-05-25T07:02:27.323232+00:00 · methodology

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 12 internal anchors

[1]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report.arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Ajay Agrawal, John McHale, and Alexander Oettl. 2024. Artificial intelligence and scientific discovery: A model of prioritized search.Research Policy53, 5 (2024), 104989

work page 2024
[3]

Mistral AI. 2024. Ministral 8B. https://mistral.ai/news/ministraux

work page 2024
[4]

Mistral AI. 2024. Mistral Large. https://mistral.ai/news/mistral-large

work page 2024
[5]

Mistral AI. 2025. Mistral Medium 3.1. https://mistral.ai/news/mistral-medium-3- 1/

work page 2025
[6]

Mistral AI. 2025. Mistral Small 3.1. https://mistral.ai/news/mistral-small-3-1

work page 2025
[7]

Anthropic. 2024. Claude 3.5 Model Family. https://www.anthropic.com. Accessed: 2025-02-01

work page 2024
[8]

Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. 2023. Accurate medium-range global weather forecasting with 3D neural networks. Nature619, 7970 (2023), 533–538

work page 2023
[9]

Xunxin Cai, Chengrui Wang, Qingqing Long, Yuanchun Zhou, and Meng Xiao

work page
[10]

Knowledge hierarchy guided biological-medical dataset distillation for domain llm training.arXiv preprint arXiv:2501.15108(2025)

work page arXiv 2025
[11]

Zhiyuan Cao, Vipina K Keloth, Qianqian Xie, Lingfei Qian, Yuntian Liu, Yan Wang, Rui Shi, Weipeng Zhou, Gui Yang, Jeffrey Zhang, et al. 2025. The development landscape of large language models for biomedical applications.Annual Review of Biomedical Data Science8 (2025)

work page 2025
[12]

Qingyu Chen, Yan Hu, Xueqing Peng, Qianqian Xie, Qiao Jin, Aidan Gilson, Maxwell B Singer, Xuguang Ai, Po-Ting Lai, Zhizheng Wang, et al. 2025. Bench- marking large language models for biomedical natural language processing applications and recommendations.Nature communications16, 1 (2025), 3280

work page 2025
[13]

Zhijian Chen, Chuan Hu, Min Wu, Qingqing Long, Xuezhi Wang, Yuanchun Zhou, and Meng Xiao. 2024. GeneSum: Large Language Model-based Gene Summary Extraction. In2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 1438–1443

work page 2024
[14]

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multi- modality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

ContactDoctor. 2024. ContactDoctor-Bio-Medical: A High-Performance Biomed- ical Language Model. https://huggingface.co/ContactDoctor/Bio-Medical-Llama- 3-8B

work page 2024
[16]

ContactDoctor. 2025. Bio-Medical-CoT: Advanced Reasoning for Healthcare Applications. https://huggingface.co/ContactDoctor/Bio-Medical-Llama-3-8B- CoT-012025

work page 2025
[17]

John Dagdelen, Alexander Dunn, Sanghoon Lee, Nicholas Walker, Andrew S Rosen, Gerbrand Ceder, Kristin A Persson, and Anubhav Jain. 2024. Structured information extraction from scientific text with large language models.Nature communications15, 1 (2024), 1418

work page 2024
[18]

Gene Ontology Consortium. [n. d.]. Gene Ontology Resource. http:// geneontology.org/. Accessed: 2025-07-29

work page 2025
[19]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Tianyu Han, Lisa C Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexander Löser, Daniel Truhn, and Keno K Bressem. 2023. MedAlpaca–an open-source collection of medical conversational AI models and training data.arXiv preprint arXiv:2304.08247(2023)

work page arXiv 2023
[21]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Lan- guage Understanding. InInternational Conference on Learning Representations. https://openreview.net/forum?id=d7KBjmI3GmQ

work page 2021
[22]

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card.arXiv preprint arXiv:2410.21276(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Maria Jackson, Leah Marks, Gerhard HW May, and Joanna B Wilson. 2018. The genetic basis of disease.Essays in biochemistry62, 5 (2018), 643–723

work page 2018
[24]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, De- vendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv:2310.068...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al . 2024. Mixtral of experts.arXiv preprint arXiv:2401.04088(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Yuxiang Jiang, Tal Ronnen Oron, Wyatt T Clark, Asma R Bankapur, Daniel D’Andrea, Rosalba Lepore, Christopher S Funk, Indika Kahanda, Karin M Ver- spoor, Asa Ben-Hur, et al. 2016. An expanded evaluation of protein function prediction methods shows an improvement in accuracy.Genome biology17, 1 (2016), 184

work page 2016
[27]

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2021. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences11, 14 (2021), 6421

work page 2021
[28]

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu

work page
[29]

PubMedQA: A Dataset for Biomedical Research Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Pro- cessing (EMNLP-IJCNLP), Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association for Computational Linguistics, Hong K...

work page doi:10.18653/v1/d19-1259 2019
[30]

Anastasia Krithara, Anastasios Nentidis, Konstantinos Bougiatiotis, and Georgios Paliouras. 2023. BioASQ-QA: A manually curated corpus for Biomedical Question Answering.Scientific Data10, 1 (2023), 170

work page 2023
[31]

Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-antoine Gourraud, Mick- aël Rouvier, and Richard Dufour. 2024. BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains. In62th Annual Meeting of the Association for Computational Linguistics (ACL’24). Bangkok, Thailand

work page 2024
[32]

Tuuli Lappalainen, Yang I Li, Sohini Ramachandran, and Alexander Gusev. 2024. Genetic and molecular architecture of complex traits.Cell187, 5 (2024), 1059– 1075

work page 2024
[33]

Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. InText summarization branches out. 74–81

work page 2004
[34]

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Cheng- gang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Chengwu Liu, Ye Yuan, Yichun Yin, Yan Xu, Xin Xu, Zaoyu Chen, Yasheng Wang, Lifeng Shang, Qun Liu, and Ming Zhang. 2025. Safe: Enhancing Mathematical Rea- soning in Large Language Models via Retrospective Step-aware Formal Verifica- tion. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wan...

work page doi:10.18653/v1/2025.acl-long.594 2025
[36]

Mohammad Lotfollahi, Yuhan Hao, Fabian J Theis, and Rahul Satija. 2024. The future of rapid and automated single-cell data analysis using reference mapping. Cell187, 10 (2024), 2343–2358

work page 2024
[37]

Jiarui Lu, Xiaoyin Chen, Stephen Zhewen Lu, Chence Shi, Hongyu Guo, Yoshua Bengio, and Jian Tang. 2025. Structure Language Models for Protein Conformation Generation. InThe Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=OzUNDnpQyd

work page 2025
[38]

Minghai Lu, Benjamin Delaware, and Tianyi Zhang. 2024. Proof automation with large language models. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 1509–1520

work page 2024
[39]

Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, et al. 2025. Large language model agent: A survey on methodology, applications and challenges.arXiv preprint arXiv:2503.21460(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie-Yan Liu. 2022. BioGPT: generative pre-trained transformer for biomedical text generation and mining.Briefings in bioinformatics23, 6 (2022), bbac409

work page 2022
[41]

Pingchuan Ma, Tsun-Hsuan Wang, Minghao Guo, Zhiqing Sun, Joshua B Tenen- baum, Daniela Rus, Chuang Gan, and Wojciech Matusik. 2024. LLM and sim- ulation as bilevel optimizers: a new paradigm to advance physical scientific discovery. InProceedings of the 41st International Conference on Machine Learning. 33940–33962

work page 2024
[42]

1999.Foundations of statistical natural language processing

Christopher Manning and Hinrich Schutze. 1999.Foundations of statistical natural language processing. MIT press

work page 1999
[43]

National Center for Biotechnology Information. [n. d.]. NCBI Gene. https: //www.ncbi.nlm.nih.gov/gene. Accessed: 2025-07-29

work page 2025
[44]

National Library of Medicine. [n. d.]. PubMed. https://pubmed.ncbi.nlm.nih.gov/. Accessed: 2025-07-29

work page 2025
[45]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback.Advances in neural information processing systems35 (2022), 27730–27744

work page 2022
[46]

Liu Pai, Wenyang Gao, Wenjie Dong, Lin Ai, Ziwei Gong, Songfang Huang, Li Zongsheng, Ehsan Hoque, Julia Hirschberg, and Yue Zhang. 2024. A survey on open information extraction from rule-based model to large language model. Findings of the association for computational linguistics: EMNLP 2024(2024), 9586– 9608

work page 2024
[47]

Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. 2022. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. InConference on health, inference, and learning. PMLR, 248– 260. Conference acronym ’XX, June 03–05, 2018, Woodstock, NY SciHorizon Consortium et al

work page 2022
[48]

Chuan Qin, Xin Chen, Chengrui Wang, Pengmin Wu, Xi Chen, Yihang Cheng, Jingyi Zhao, Meng Xiao, Xiangchao Dong, Qingqing Long, et al. 2025. Scihorizon: Benchmarking ai-for-science readiness from scientific data to large language mod- els. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2. 5754–5765

work page 2025
[49]

Chandan K Reddy and Parshin Shojaee. 2025. Towards scientific discovery with generative ai: Progress, opportunities, and challenges. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 28601–28609

work page 2025
[50]

Dmitry Scherbakov, Nina Hubig, Vinita Jansari, Alexander Bakumenko, and Leslie A Lenert. 2025. The emergence of large language models as tools in literature reviews: a large language model-assisted systematic review.Journal of the American Medical Informatics Association32, 6 (2025), 1071–1086

work page 2025
[51]

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. 2025. Medgemma technical report.arXiv preprint arXiv:2507.05201(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[52]

Xinyi Shang, Xu Liao, Zhicheng Ji, and Wenpin Hou. 2025. Benchmarking large language models for genomic knowledge with GeneTuring.Briefings in Bioinformatics26, 5 (2025), bbaf492

work page 2025
[53]

Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al

work page
[54]

Large language models encode clinical knowledge.Nature620, 7972 (2023), 172–180

work page 2023
[55]

Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al . 2025. Toward expert-level medical question answering with large language models. Nature Medicine(2025), 1–8

work page 2025
[56]

Gemma Team, Aishwarya Kamath, Johan Ferret, and etc. 2025. Gemma 3 Techni- cal Report. arXiv:2503.19786 [cs.CL] https://arxiv.org/abs/2503.19786

work page internal anchor Pith review Pith/arXiv arXiv 2025
[57]

Chong Wang, Mengyao Li, Junjun He, Zhongruo Wang, Erfan Darzi, Zan Chen, Jin Ye, Tianbin Li, Yanzhou Su, Jing Ke, et al. 2025. A survey for large language models in biomedicine.Artificial Intelligence in Medicine(2025), 103268

work page 2025
[58]

Chengrui Wang, Qingqing Long, Meng Xiao, Xunxin Cai, Chengjun Wu, Zhen Meng, Xuezhi Wang, and Yuanchun Zhou. 2024. Biorag: A rag-llm framework for biological question reasoning.arXiv preprint arXiv:2408.01107(2024)

work page arXiv 2024
[59]

Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al . 2023. Scientific discovery in the age of artificial intelligence.Nature620, 7972 (2023), 47–60

work page 2023
[60]

Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, et al. 2024. Large language models are not fair evaluators. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 9440–9450

work page 2024
[61]

Chaoyi Wu, Weixiong Lin, Xiaoman Zhang, Ya Zhang, Weidi Xie, and Yanfeng Wang. 2024. PMC-LLaMA: toward building open-source language models for medicine.Journal of the American Medical Informatics Association31, 9 (2024), 1833–1843

work page 2024
[62]

Shican Wu, Xiao Ma, Dehui Luo, Lulu Li, Xiangcheng Shi, Xin Chang, Xiaoyun Lin, Ran Luo, Chunlei Pei, Changying Du, et al. 2025. Automated literature research and review-generation method based on large language models.National Science Review12, 6 (2025), nwaf169

work page 2025
[63]

Meng Xiao, Xunxin Cai, Qingqing Long, Chengrui Wang, Yuanchun Zhou, and Hengshu Zhu. 2025. Knowledge-Driven Agentic Scientific Corpus Dis- tillation Framework for Biomedical Large Language Models Training. (2025). arXiv:2504.19565 [cs.CL] https://arxiv.org/abs/2504.19565

work page arXiv 2025
[64]

Fengli Xu, Qianyue Hao, Chenyang Shao, Zefang Zong, Yu Li, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Jie Feng, Chen Gao, and Yong Li. 2025. Toward Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models. 6, 10 (...

work page doi:10.1016/j.patter.2025.101370 2025
[65]

An Yang, Baosong Yang, Beichen Zhang, and etc. 2024. Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[66]

Dan Zhang, Ziniu Hu, Sining Zhoubian, Zhengxiao Du, Kaiyu Yang, Zihan Wang, Yisong Yue, Yuxiao Dong, and Jie Tang. 2024. Sciinstruct: a self-reflective in- struction annotated dataset for training scientific language models.Advances in Neural Information Processing Systems37 (2024), 1443–1473

work page 2024
[67]

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert.arXiv preprint arXiv:1904.09675(2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019
[68]

Yanbo Zhang, Sumeer A Khan, Adnan Mahmud, Huck Yang, Alexander Lavin, Michael Levin, Jeremy Frey, Jared Dunnmon, James Evans, Alan Bundy, et al

work page
[69]

Exploring the role of large language models in the scientific method: from hypothesis to discovery.npj Artificial Intelligence1, 1 (2025), 14

work page 2025
[70]

Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. 2023. Large language models are not robust multiple choice selectors.arXiv preprint arXiv:2309.03882(2023)

work page arXiv 2023
[71]

Juexiao Zhou, Haoyang Li, Siyuan Chen, Zhangtianyi Chen, Zhongyi Han, and Xin Gao. 2025. Large language models in biomedicine and healthcare.npj Artificial Intelligence1, 1 (2025), 44

work page 2025
[72]

Xuechao Zou, Kai Li, Junliang Xing, Yu Zhang, Shiying Wang, Lei Jin, and Pin Tao

work page
[73]

DiffCR: A Fast Conditional Diffusion Framework for Cloud Removal From Optical Satellite Images.IEEE Transactions on Geoscience and Remote Sensing62 (2024), 1–14. doi:10.1109/TGRS.2024.3365806 SciHorizon-Gene: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding Conference acronym ’XX, June 03–05, 2018, Woodstock, NY...

work page doi:10.1109/tgrs.2024.3365806 2024
