MedProbeBench: Systematic Benchmarking at Deep Evidence Integration for Expert-level Medical Guideline

2026 · cs.CV · arXiv 2604.18418

1 Pith paper cites this work. Polarity classification is still indexing.



abstract

Recent advances in deep research systems enable large language models to retrieve, synthesize, and reason over large-scale external knowledge. In medicine, developing clinical guidelines critically depends on such deep evidence integration. However, existing benchmarks fail to evaluate this capability in realistic workflows requiring multi-step evidence integration and expert-level judgment. To address this gap, we introduce MedProbeBench, the first benchmark leveraging high-quality clinical guidelines as expert-level references. Medical guidelines, with their rigorous standards in neutrality and verifiability, represent the pinnacle of medical expertise and pose substantial challenges for deep research agents. For evaluation, we propose MedProbe-Eval, a comprehensive evaluation framework featuring: (1) Holistic Rubrics with 1,200+ task-adaptive rubric criteria for comprehensive quality assessment, and (2) Fine-grained Evidence Verification for rigorous validation of evidence precision, grounded in 5,130+ atomic claims. Evaluation of 17 LLMs and deep research agents reveals critical gaps in evidence integration and guideline generation, underscoring the substantial distance between current capabilities and expert-level clinical guideline development. Project: https://github.com/uni-medical/MedProbeBench

representative citing papers

Workflow Closure Is Not Scientific Closure in Auto-Research Systems

cs.SE · 2026-05-25 · unverdicted · novelty 5.0

Survey of auto-research systems identifies objective, validation, and acceptance collapses, concluding that workflow closure does not equal scientific closure and advocating non-autonomous epistemic control.

citing papers explorer


Workflow Closure Is Not Scientific Closure in Auto-Research Systems cs.SE · 2026-05-25 · unverdicted · none · ref 73 · internal anchor
