Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation

Biqing Qi; Bowen Zhou; Ermo Hua; Haoxiang Li; Hu Jinfang; Kai Tian; Kaiyan Zhang; Sihang Zeng; Zhang-Ren Chen

arxiv: 2407.08940 · v2 · pith:4QOLF2QMnew · submitted 2024-07-12 · 💻 cs.CL

keywords biomedical · hypothesis · generation · knowledge · llms · tool · evaluation · hypotheses

The rapid growth of biomedical knowledge has outpaced our ability to efficiently extract insights and generate novel hypotheses. Large language models (LLMs) have emerged as a promising tool to revolutionize knowledge interaction and potentially accelerate biomedical discovery. In this paper, we present a comprehensive evaluation of LLMs as biomedical hypothesis generators. We construct a dataset of background-hypothesis pairs from biomedical literature, carefully partitioned into training, seen, and unseen test sets based on publication date to mitigate data contamination. Using this dataset, we assess the hypothesis generation capabilities of top-tier instructed models in zero-shot, few-shot, and fine-tuning settings. To enhance the exploration of uncertainty, a crucial aspect of scientific discovery, we incorporate tool use and multi-agent interactions in our evaluation framework. Furthermore, we propose four novel metrics grounded in extensive literature review to evaluate the quality of generated hypotheses, considering both LLM-based and human assessments. Our experiments yield two key findings: 1) LLMs can generate novel and validated hypotheses, even when tested on literature unseen during training, and 2) Increasing uncertainty through multi-agent interactions and tool use can facilitate diverse candidate generation and improve zero-shot hypothesis generation performance. However, we also observe that the integration of additional knowledge through few-shot learning and tool use may not always lead to performance gains, highlighting the need for careful consideration of the type and scope of external knowledge incorporated. These findings underscore the potential of LLMs as powerful aids in biomedical hypothesis generation and provide valuable insights to guide further research in this area.
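
The date-based partition described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the `BackgroundHypothesisPair` type, the `split_by_date` helper, the cutoff date, and the `seen_fraction` parameter are all hypothetical names and values chosen for illustration, since the abstract does not state them. The idea is that pairs published before a cutoff feed the training set and a held-out "seen" test set, while pairs published on or after the cutoff form the "unseen" test set.

```python
import random
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BackgroundHypothesisPair:
    """One background-hypothesis pair with its publication date (hypothetical schema)."""
    background: str
    hypothesis: str
    published: date


def split_by_date(pairs, cutoff, seen_fraction=0.1, seed=0):
    """Partition pairs into (train, seen_test, unseen_test).

    Assumption: `cutoff` stands in for the latest date a model could have
    seen during training; the real cutoff is not given in the abstract.
    """
    before = [p for p in pairs if p.published < cutoff]
    unseen_test = [p for p in pairs if p.published >= cutoff]

    # Shuffle the pre-cutoff pairs reproducibly, then hold out a slice
    # as the "seen" test set; the remainder is used for training.
    rng = random.Random(seed)
    rng.shuffle(before)
    n_seen = int(len(before) * seen_fraction)
    seen_test = before[:n_seen]
    train = before[n_seen:]
    return train, seen_test, unseen_test


if __name__ == "__main__":
    demo = [
        BackgroundHypothesisPair("bg A", "hyp A", date(2022, 5, 1)),
        BackgroundHypothesisPair("bg B", "hyp B", date(2022, 9, 1)),
        BackgroundHypothesisPair("bg C", "hyp C", date(2023, 8, 1)),
    ]
    train, seen, unseen = split_by_date(demo, cutoff=date(2023, 1, 1), seen_fraction=0.5)
    print(len(train), len(seen), len(unseen))
```

Splitting on publication date rather than at random is what lets the unseen test set approximate literature the model could not have encountered, which is the contamination concern the abstract raises.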

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DN-Hypo-Pipeline: An AI-Driven Workflow for Generating Hypotheses using Large Language Models and Scientific Explanations
cs.AI 2026-06 unverdicted novelty 6.0

DN-Hypo-Pipeline operationalizes three philosophy-of-science accounts to direct LLMs toward principle-based hypothesis generation, claims superior performance over direct prompting, and derives two new transformer alg...

"Excuse me, may I say something..." CoLabScience, A Proactive AI Assistant for Biomedical Discovery and LLM-Expert Collaborations
cs.CL 2026-04 unverdicted novelty 6.0

The paper introduces CoLabScience with PULI, a positive-unlabeled RL framework for proactive interventions in streaming biomedical dialogues, plus the BSDD benchmark dataset, claiming superior performance over baselines.

Large Language Model Agent: A Survey on Methodology, Applications and Challenges
cs.CL 2025-03 accept novelty 3.0

A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.

Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning
cs.CL 2025-02 unverdicted novelty 2.0

Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.