EvidenceBench: A Benchmark for Extracting Evidence from Biomedical Papers

Ashish Dalvi; Christopher D. Rosin; David E. Neal; Gene W. Yeo; Gino Prasad; Hsuan-lin Her; Jianyou Wang; Kaicheng Wang; Leon Bergen; Maxim Khan

arxiv: 2504.18736 · v2 · pith:ION6ONKCnew · submitted 2025-04-25 · 💻 cs.CL

Jianyou Wang, Weili Cao, Kaicheng Wang, Xiaoyue Wang, Ashish Dalvi, Gino Prasad, Qishan Liang, Hsuan-lin Her, Ming Wang, Qin Yang, Gene W. Yeo, David E. Neal, Maxim Khan, Christopher D. Rosin, Ramamohan Paturi, Leon Bergen

classification: cs.CL

keywords: evidence, evidencebench, biomedical, hypotheses, pipeline, relevant, task, benchmark

Abstract

We study the task of automatically finding evidence relevant to hypotheses in biomedical papers. Finding relevant evidence is an important step when researchers investigate scientific hypotheses. We introduce EvidenceBench to measure model performance on this task. The benchmark is created by a novel pipeline that consists of hypothesis generation and sentence-by-sentence annotation of biomedical papers for relevant evidence, guided entirely by, and faithful to, the judgment of human experts. We demonstrate the pipeline's validity and accuracy with multiple sets of human-expert annotations. We evaluated a diverse set of language models and retrieval systems on the benchmark and found that model performance still falls significantly short of expert level on this task. To show the scalability of the pipeline, we create a larger dataset, EvidenceBench-100k, with 107,461 fully annotated papers with hypotheses to facilitate model training and development. Both datasets are available at https://github.com/EvidenceBench/EvidenceBench
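The task described in the abstract (ranking a paper's sentences by relevance to a hypothesis and comparing against sentence-level annotations) can be illustrated with a small scoring sketch. This is not the benchmark's official metric or data format, which the abstract does not specify; the function name `recall_at_k`, the toy sentence indices, and the ranking below are all hypothetical and chosen only for illustration.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of annotated evidence sentences found among the top-k retrieved.

    Hypothetical helper for illustration; not taken from EvidenceBench.
    """
    if not relevant:
        raise ValueError("relevant must be non-empty")
    top_k = set(ranked[:k])
    return len(top_k & relevant) / len(relevant)


# Assumed toy setup: a paper with 6 sentences (indices 0..5), of which
# sentences 1 and 4 are annotated as evidence for some hypothesis.
relevant_sentences = {1, 4}

# Assumed retriever output: sentence indices, most relevant first.
ranking = [4, 0, 2, 1, 5, 3]

score_at_2 = recall_at_k(ranking, relevant_sentences, k=2)  # only sentence 4 is in the top 2
score_at_4 = recall_at_k(ranking, relevant_sentences, k=4)  # sentences 4 and 1 are both in the top 4
```

A set-based recall like this ignores the order within the top k, which keeps the sketch short; a real evaluation would follow whatever protocol the dataset repository documents.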

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining
cs.CL · 2026-04 · unverdicted · novelty 5.0

DeepImagine trains LLMs on counterfactual pairs from clinical trials using supervised fine-tuning and reinforcement learning to improve outcome prediction by approximating causal mechanisms.

MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts
cs.CL · 2026-04 · unverdicted · novelty 5.0

MedConclusion is a 5.7M-instance benchmark dataset for generating biomedical conclusions from structured PubMed abstracts, with LLM evaluations showing conclusion writing differs from summarization and that judge choi...