arxiv: 2504.18736 · v2 · pith:ION6ONKCnew · submitted 2025-04-25 · 💻 cs.CL

EvidenceBench: A Benchmark for Extracting Evidence from Biomedical Papers

classification 💻 cs.CL
keywords: evidence, evidencebench, biomedical, hypotheses, pipeline, relevant, task, benchmark

We study the task of automatically finding evidence relevant to hypotheses in biomedical papers. Finding relevant evidence is an important step when researchers investigate scientific hypotheses. We introduce EvidenceBench to measure models' performance on this task. The benchmark is created by a novel pipeline that consists of hypothesis generation and sentence-by-sentence annotation of biomedical papers for relevant evidence, completely guided by and faithfully following existing human experts' judgment. We demonstrate the pipeline's validity and accuracy with multiple sets of human-expert annotations. We evaluate a diverse set of language models and retrieval systems on the benchmark and find that model performance still falls significantly short of the expert level on this task. To show the scalability of our proposed pipeline, we create a larger EvidenceBench-100k with 107,461 fully annotated papers with hypotheses to facilitate model training and development. Both datasets are available at https://github.com/EvidenceBench/EvidenceBench

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining

    cs.CL · 2026-04 · unverdicted · novelty 5.0

    DeepImagine trains LLMs on counterfactual pairs from clinical trials using supervised fine-tuning and reinforcement learning to improve outcome prediction by approximating causal mechanisms.

  2. MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

    cs.CL · 2026-04 · unverdicted · novelty 5.0

    MedConclusion is a 5.7M-instance benchmark dataset for generating biomedical conclusions from structured PubMed abstracts, with LLM evaluations showing conclusion writing differs from summarization and that judge choi...