R2MED: A Benchmark for Reasoning-Driven Medical Retrieval

Lei Li; Xiangxu Zhang; Xiao Zhou; Zheng Liu

arxiv: 2505.14558 · v2 · submitted 2025-05-20 · 💻 cs.IR

Pith reviewed 2026-05-22 13:48 UTC · model grok-4.3

classification 💻 cs.IR

keywords medical retrieval · reasoning benchmark · clinical evidence · information retrieval · nDCG evaluation · diagnosis support · query-document mismatch · clinical decision making

The pith

R2MED benchmark shows current retrieval models achieve only 31.4 nDCG@10 on reasoning-intensive medical tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a benchmark called R2MED to evaluate how well retrieval systems can find medical evidence that supports a diagnosis inferred from patient symptoms. In real clinical work, relevant documents often share little wording or surface meaning with the query, because they relate to the diagnosis rather than to the symptoms themselves. The benchmark includes 876 queries across three retrieval tasks drawn from five medical scenarios and twelve body systems. Of the 15 retrieval systems tested, the top performer reached only 31.4 nDCG@10, and even methods that first generate reasoning steps reached only 41.4 nDCG@10. This points to a clear shortfall in how today's techniques handle the kind of reasoning doctors do.

Core claim

We present R2MED, the first benchmark designed specifically for reasoning-driven medical retrieval. It features 876 queries across Q&A reference retrieval, clinical evidence retrieval, and clinical case retrieval, sourced from five medical scenarios and twelve body systems. Testing reveals that the strongest retrieval model reaches just 31.4 nDCG@10, while incorporating large reasoning models for intermediate inference generation improves results only up to 41.4 nDCG@10, highlighting the gap between existing methods and the demands of clinical reasoning.

What carries the argument

R2MED benchmark with its three tasks that require retrieving documents aligned with inferred diagnoses rather than direct lexical or semantic matches to symptoms.

Load-bearing premise

That the core challenge in medical retrieval stems from evidence matching an inferred diagnosis instead of the patient's reported symptoms, creating low overlap.
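
The premise can be made concrete with a small lexical-retrieval sketch. The code below is an illustration only, not the paper's pipeline: the three documents and both queries are invented toy strings, `bm25_scores` and `top_doc` are hypothetical helper names, and the "augmented" query is written by hand to stand in for the intermediate inference that a reasoning model might generate.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a plain Okapi BM25.

    Query term frequency is ignored (each distinct query term counts once).
    """
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    n_docs = len(docs)
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def top_doc(query, docs):
    """Index of the highest-scoring document."""
    scores = bm25_scores(query, docs)
    return max(range(len(docs)), key=scores.__getitem__)


# Toy corpus (invented): index 0 echoes the symptoms, index 1 is the
# evidence a clinician would want once the diagnosis has been inferred.
docs = [
    "Melena is black tarry stool and hypotension is low blood pressure",
    "Video capsule endoscopy visualizes the small bowel to find occult bleeding sources",
    "Aortic stenosis narrows the valve opening",
]

symptom_query = "Woman with melena and hypotension after negative upper and lower endoscopy"
# Hand-written stand-in for a model-generated inference step.
augmented_query = symptom_query + " suspected small bowel bleeding capsule endoscopy"

print("symptom-only top document:", top_doc(symptom_query, docs))
print("augmented top document:", top_doc(augmented_query, docs))
```

On this toy corpus the symptom-only query ranks the symptom-echoing document first, while the augmented query ranks the diagnosis-aligned document first. That is the mechanism the premise describes, although a three-document example says nothing about how large the effect is on R2MED itself.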

What would settle it

Development of a retrieval model that scores above 70 nDCG@10 across the R2MED tasks without relying on additional medical knowledge injection would challenge the identified gap.

Figures

Figures reproduced from arXiv: 2505.14558 by Lei Li, Xiangxu Zhang, Xiao Zhou, Zheng Liu.

**Figure 2.** R2MED benchmark construction pipeline.

**Figure 3.** Average reranking performance on R2MED using three classic rerankers: MonoBERT, BGEReranker, and RankLLaMA.

**Figure 4.** Correlation between reasoning answer accu… (caption truncated in source)

**Figure 6.** Instruction for filtering questions based on… (caption truncated in source)

**Figure 5.** Instruction for filtering questions based on the… (caption truncated in source)

**Figure 8.** Instruction for filtering cases based on quality.

**Figure 9.** Instruction for rewriting question.

**Figure 11.** Instruction for relevance assessment on Q&A… (caption truncated in source)

**Figure 10.** Instruction for generating reasoning path.

**Figure 12.** Instruction for relevance assessment on clini… (caption truncated in source)

**Figure 13.** Instruction for relevance assessment on clini… (caption truncated in source)

**Figure 14.** Annotation interface of R2MED.

**Figure 15.** Attribute distributions of R2MED showcase its diversity and comprehensiveness.

**Figure 16.** Instruction for body system annotation.

Original abstract

Current medical retrieval benchmarks primarily emphasize lexical or shallow semantic similarity, overlooking the reasoning-intensive demands that are central to clinical decision-making. In practice, physicians often retrieve authoritative medical evidence to support diagnostic hypotheses. Such evidence typically aligns with an inferred diagnosis rather than the surface form of a patient's symptoms, leading to low lexical or semantic overlap between queries and relevant documents. To address this gap, we introduce R2MED, the first benchmark explicitly designed for reasoning-driven medical retrieval. It comprises 876 queries spanning three tasks: Q&A reference retrieval, clinical evidence retrieval, and clinical case retrieval. These tasks are drawn from five representative medical scenarios and twelve body systems, capturing the complexity and diversity of real-world medical information needs. We evaluate 15 widely-used retrieval systems on R2MED and find that even the best model achieves only 31.4 nDCG@10, demonstrating the benchmark's difficulty. Classical re-ranking and generation-augmented retrieval methods offer only modest improvements. Although large reasoning models improve performance via intermediate inference generation, the best results still peak at 41.4 nDCG@10. These findings underscore a substantial gap between current retrieval techniques and the reasoning demands of real clinical tasks. We release R2MED as a challenging benchmark to foster the development of next-generation medical retrieval systems with enhanced reasoning capabilities. Data and code are available at https://github.com/R2MED/R2MED

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

R2MED gives a practical new benchmark for medical retrieval that targets inference over surface match, with clear performance gaps shown across 15 systems.

Colleague, the core takeaway is that this paper delivers a benchmark of 876 queries across three tasks where retrieval must support inferred diagnoses rather than direct symptom matching, and even the strongest systems only reach 31.4 nDCG@10 while reasoning models top out at 41.4. That gap is the main signal they want to send about current tools falling short on clinical reasoning demands. They draw the queries from five scenarios and twelve body systems, which adds some real-world spread, and they test a range of classical and augmented retrievers to show that re-ranking and generation tricks give only modest lifts. Releasing the data and code is a straightforward plus for anyone who wants to run their own experiments or extend the set. The evaluation is broad enough to make the difficulty claim credible at first glance.

The stress-test concern about overlap is worth a look in the full text. If the queries were not built by starting from diagnoses and then confirming low lexical or semantic match through explicit checks or statistics, the low scores could trace to domain sparsity or terminology issues instead of reasoning per se. More detail on how the three tasks were generated, how relevance was judged, and any inter-annotator numbers would tighten that part. Without those, the interpretation that the results specifically isolate unmet reasoning needs stays a bit loose.

This paper is aimed at groups working on medical IR, clinical decision support, or benchmark design. Readers who need test sets that push past simple similarity will find usable material here. It has enough structure and released artifacts to merit a serious referee, even if the construction section needs expansion for full calibration.

I would send it to peer review. The new dataset and the consistent underperformance numbers give it enough weight to justify referee time, with revisions focused on transparency around query design.

Referee Report

1 major / 1 minor

Summary. The paper introduces R2MED, a benchmark with 876 queries across three tasks (Q&A reference retrieval, clinical evidence retrieval, and clinical case retrieval) drawn from five medical scenarios and twelve body systems. It evaluates 15 retrieval systems and finds that even the best achieves only 31.4 nDCG@10, with large reasoning models reaching at most 41.4 nDCG@10 via intermediate inference generation, and claims this demonstrates a substantial gap between current techniques and the reasoning demands of real clinical tasks.

Significance. If the benchmark construction ensures that queries target inferred diagnoses with demonstrably low lexical/semantic overlap to gold documents, the work would be significant for medical IR by providing a challenging, reasoning-focused testbed that exposes limitations in lexical, semantic, and even reasoning-augmented retrieval. The release of data and code supports reproducibility and future work.

major comments (1)

[§3] §3 (or equivalent section on query construction for the three tasks): The central interpretation that low nDCG@10 scores (31.4 baseline, 41.4 with reasoning models) specifically demonstrate unmet reasoning requirements rests on queries having low overlap with relevant documents because they target inferred diagnoses rather than surface symptoms. The manuscript does not report lexical/semantic overlap statistics, detail how queries were generated to start from diagnoses, or describe expert review confirming the mismatch. Without this, alternative explanations such as domain terminology, data sparsity, or general medical knowledge gaps cannot be ruled out.

minor comments (1)

Provide more explicit details on the relevance judgment process and inter-annotator agreement to allow readers to calibrate the benchmark's difficulty and reliability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on R2MED. The concern about substantiating the low-overlap claim is well-taken, and we outline below how we will strengthen the manuscript while preserving the core contribution.

Referee: [§3] §3 (or equivalent section on query construction for the three tasks): The central interpretation that low nDCG@10 scores (31.4 baseline, 41.4 with reasoning models) specifically demonstrate unmet reasoning requirements rests on queries having low overlap with relevant documents because they target inferred diagnoses rather than surface symptoms. The manuscript does not report lexical/semantic overlap statistics, detail how queries were generated to start from diagnoses, or describe expert review confirming the mismatch. Without this, alternative explanations such as domain terminology, data sparsity, or general medical knowledge gaps cannot be ruled out.

Authors: We agree that quantitative evidence of low lexical and semantic overlap would strengthen the central claim. In the revised manuscript we will expand §3 to include: (1) explicit description of the query-generation pipeline, in which medical experts first selected target diagnoses or clinical inferences from the source materials and then formulated queries that describe the preceding symptoms or presentation; (2) a table reporting average lexical overlap (token overlap, BM25 scores) and semantic overlap (cosine similarity of embeddings from a medical BERT model) between each query and its gold document(s), showing that the majority of pairs fall below the thresholds observed in existing medical IR benchmarks; and (3) a brief account of the expert review process, including the number of reviewers, their qualifications, and the criteria used to confirm that queries target inferred diagnoses rather than surface-level symptoms. These additions will help rule out alternative explanations such as domain terminology or data sparsity. We do not claim the new statistics will be exhaustive, but they will provide direct empirical support for the reasoning-driven nature of the benchmark.

Revision: yes
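
The kind of overlap statistic discussed in this exchange could be computed along the following lines. This is a minimal sketch under stated assumptions, not the authors' analysis: the function names are hypothetical, the query and document strings are invented for illustration, and the bag-of-words cosine is only a lexical stand-in for the embedding-based semantic similarity the rebuttal mentions, which would require an actual embedding model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard_overlap(query, document):
    """Share of distinct tokens that the query and document have in common."""
    q, d = set(tokenize(query)), set(tokenize(document))
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def cosine_bow(query, document):
    """Cosine similarity of raw term-count vectors (lexical, not semantic)."""
    q, d = Counter(tokenize(query)), Counter(tokenize(document))
    dot = sum(count * d[term] for term, count in q.items())
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


# Invented query / gold-document pair, for illustration only.
query = "Elderly woman on aspirin with melena and negative upper and lower endoscopy"
gold = "Capsule endoscopy is used to diagnose unexplained bleeding in the small bowel"

print(f"Jaccard token overlap: {jaccard_overlap(query, gold):.3f}")
print(f"Bag-of-words cosine:   {cosine_bow(query, gold):.3f}")
```

Averaging such per-pair values over all query and gold-document pairs, and comparing them with the same statistics on an existing medical IR benchmark, would give the table the referee asks for. Stopword handling and the choice of tokenizer would noticeably affect the numbers, so both should be reported alongside the results.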

Circularity Check

0 steps flagged

No significant circularity in benchmark construction or performance claims

R2MED is an empirical benchmark paper whose central results are nDCG@10 scores obtained by running 15 external retrieval systems on a curated set of 876 queries. These scores are computed directly from standard evaluation protocols applied to the released dataset; they do not reduce to any fitted parameter, self-defined quantity, or prior self-citation inside the paper. The motivating claim of low lexical/semantic overlap is presented as a design rationale for the three tasks rather than a derived prediction, and the benchmark itself is independently verifiable through the public data and code. No equations, ansatzes, or uniqueness theorems appear that would create a self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard information-retrieval evaluation practices and the domain assumption that the selected queries reflect genuine clinical reasoning needs.

axioms (1)

domain assumption: nDCG@10 is an appropriate metric for ranking quality in medical retrieval.
Invoked when reporting all performance numbers; standard in IR but assumes reliable graded relevance labels.
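
For readers unfamiliar with the metric, the sketch below shows one common way to compute nDCG@10 for a single query. It is an illustration rather than the paper's evaluation code: it assumes the linear-gain form of DCG (some toolkits use an exponential gain instead), the document ids and relevance labels are invented, and the function names are hypothetical. The reported scores such as 31.4 appear to be on a 0 to 100 scale, so the result is multiplied by 100 before printing.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k graded relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document ids in the order the retriever returned them.
    qrels: dict mapping document id to graded relevance (missing ids count as 0).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Invented toy query: two relevant documents, only one retrieved, at rank 3.
qrels = {"d7": 1, "d9": 1}
ranking = ["d1", "d2", "d7", "d4", "d5"]
score = ndcg_at_k(ranking, qrels)
print(f"nDCG@10 = {100 * score:.1f}")  # prints "nDCG@10 = 30.7"
```

A benchmark-level number is then the mean of this per-query score over all queries. The toy case also shows how a system that finds one of two relevant documents at rank 3 already lands near the best score reported for plain retrievers, which gives a rough feel for what 31.4 means in practice.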

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models
cs.IR · 2026-04 · unverdicted · novelty 7.0

MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

A Survey of Reasoning-Intensive Retrieval: Progress and Challenges
cs.IR · 2026-04 · unverdicted · novelty 6.0

A survey that categorizes RIR benchmarks by domain and modality, proposes a taxonomy for integrating reasoning into retrieval pipelines, and outlines key challenges.

GroupRank: A Groupwise Paradigm for Effective and Efficient Passage Reranking with LLMs
cs.IR · 2025-11 · unverdicted · novelty 5.0

GroupRank uses groupwise LLM reranking with answer-free data synthesis and a group-ranking reward to reach 65.2 NDCG@10 on BRIGHT while providing 6.4x faster inference than listwise baselines.

[1] [1]

Ai hospital: Benchmarking large language models in a multi-agent medical interaction simulator,

Ai hospital: Benchmarking large language models in a multi-agent medical interaction simulator. arXiv preprint arXiv:2402.09742. Giacomo Frisoni, Miki Mizutani, Gianluca Moro, and Lorenzo Valgimigli. 2022. Bioreader: a retrieval- enhanced text-to-text transformer for biomedical lit- erature. InProceedings of the 2022 conference on empirical methods in nat...

work page arXiv 2022

[2] [2]

Precise zero-shot dense retrieval without relevance labels,

Precise zero-shot dense retrieval without rele- vance labels.arXiv preprint arXiv:2212.10496. Lorraine Goeuriot, Gareth JF Jones, Liadh Kelly, Hen- ning Müller, and Justin Zobel. 2016. Medical in- formation retrieval: introduction to the special issue. Information Retrieval Journal, 19:1–5. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pande...

work page arXiv 2016

[3] [3]

Measuring Massive Multitask Language Understanding

Measuring massive multitask language under- standing.arXiv preprint arXiv:2009.03300. Ruihui Hou, Shencheng Chen, Yongqi Fan, Lifeng Zhu, Jing Sun, Jingping Liu, and Tong Ruan. 2024. Msdi- agnosis: An emr-based dataset for clinical multi-step diagnosis.arXiv preprint arXiv:2408.10039. IIYi. 2026. Iiyi online consultation platform. https: //bingli.iiyi.com...

work page internal anchor Pith review Pith/arXiv arXiv 2009

[4] [4]

Shuyang Jiang, Yusheng Liao, Zhe Chen, Ya Zhang, Yanfeng Wang, and Yu Wang

medikal: Integrating knowledge graphs as assistants of llms for enhanced clinical diagnosis on emrs.arXiv preprint arXiv:2406.14326. Shuyang Jiang, Yusheng Liao, Zhe Chen, Ya Zhang, Yanfeng Wang, and Yu Wang. 2025. Meds 3: To- wards medical slow thinking with self-evolved soft dual-sided process supervision.arXiv preprint arXiv:2501.12051. Bowen Jin, Hans...

work page arXiv 2025

[5] [5]

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

Medcpt: Contrastive pre-trained transformers with large-scale pubmed search logs for zero-shot biomedical information retrieval.Bioinformatics, 39(11):btad651. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Effi- cient memory management for large language model serving w...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

Latent Retrieval for Weakly Supervised Open Domain Question Answering

Latent retrieval for weakly supervised open domain question answering.arXiv preprint arXiv:1906.00300. Chaofan Li, Zheng Liu, Jianlyv Chen, Defu Lian, and Yingxia Shao. 2025a. Reinforced information re- trieval.arXiv preprint arXiv:2502.11562. Lei Li, Xiangxu Zhang, Xiao Zhou, and Zheng Liu

[7]

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou

Automir: Effective zero-shot medical information retrieval without relevance labels. arXiv preprint arXiv:2410.20050. Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. 2025b. Search-o1: Agentic search-enhanced large reasoning models. arXiv preprint arXiv:2501.05366. Jimmy Lin, Xueguang Ma, Sheng...

[8]

In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2356–2362

Pyserini: A python toolkit for reproducible information retrieval research with sparse and dense representations. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2356–2362. Hao Liu, Zhengren Wang, Xi Chen, Zhiyu Li, Feiyu Xiong, Qinhan Yu, and Wentao Zhang. 2025a. HopRAG: Multi-...

[9]

In Proceedings of the 17th ACM conference on Information and knowledge management, pages 143–152

Medsearch: a specialized search engine for medical information retrieval. In Proceedings of the 17th ACM conference on Information and knowledge management, pages 143–152. Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, and Jimmy Lin. 2024. Fine-tuning llama for multi-stage text retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Rese...

[10]

Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin

Ms marco: A human-generated machine reading comprehension dataset. Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. 2020. Document ranking with a pretrained sequence-to-sequence model. arXiv preprint arXiv:2003.06713. Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019. Multi-stage document ranking with bert. arXiv preprint arXiv:1910.14424. Op...

[11]

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, and 1 others

Quantifying the reasoning abilities of llms on real-world clinical cases. arXiv preprint arXiv:2503.04691. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, and 1 others. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9. Stephen Robertson, Hugo Zaragoza, and 1 others. 2009. The probabilistic rel...

[12]

Qwen Team

Bright: A realistic and challenging benchmark for reasoning-intensive retrieval. arXiv preprint arXiv:2407.12883. Qwen Team. 2024. Qwen2.5: A party of foundation models. Qwen Team. 2025a. Qwen3: Think deeper, act faster. Qwen Team. 2025b. Qwq-32b: Embracing the power of reinforcement learning. Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Sr...

[13]

Ran Xu, Wenqi Shi, Yue Yu, Yuchen Zhuang, Yanqiao Zhu, May D Wang, Joyce C Ho, Chao Zhang, and Carl Yang

World Scientific. Ran Xu, Wenqi Shi, Yue Yu, Yuchen Zhuang, Yanqiao Zhu, May D Wang, Joyce C Ho, Chao Zhang, and Carl Yang. 2024. Bmretriever: Tuning large language models as better biomedical text retrievers. arXiv preprint arXiv:2404.18443. Wen-wai Yim, Asma Ben Abacha, Yujuan Fu, Zhaoyi Sun, Fei Xia, Meliha Yetisgen-Yildiz, and Martin Krallinger. 2024. ...

[14]

All documents, from external webpages and Wikipedia, are segmented into smaller passages by sentence-level splitting and regrouped into chunks of approximately 128 tokens

for the Bioinformatics dataset. All documents, from external webpages and Wikipedia, are segmented into smaller passages by sentence-level splitting and regrouped into chunks of approximately 128 tokens. Clinical Evidence Retrieval Datasets: The clinical evidence retrieval task comprises three datasets, each representing a critical stage in clinical d...
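The chunking step described above (sentence-level splitting, then regrouping into chunks of roughly 128 tokens) could be sketched as follows. This is only an illustrative sketch, not the authors' pipeline: the regex sentence splitter, the whitespace-based token count, and the names `split_sentences` and `chunk_document` are all assumptions, since the excerpt does not state which splitter or tokenizer was used.

```python
import re


def split_sentences(text):
    # Assumption: a naive splitter that breaks after '.', '!' or '?'
    # followed by whitespace. The source does not name its splitter.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_document(text, max_tokens=128):
    # Greedily regroup whole sentences into chunks of at most
    # `max_tokens` tokens. Assumption: tokens are approximated by
    # whitespace-separated words rather than a model tokenizer.
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        # Close the current chunk if adding this sentence would overflow it.
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because sentences are never split, a single sentence longer than `max_tokens` ends up as its own oversized chunk, which is one reason the resulting chunks are only approximately 128 tokens long.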

[15]

The answer should be Laboratory Tests, Imaging Examinations, Endoscopic Examinations, or Other Examinations

Examinations Recommendation (EXM): The question asks for the most appropriate diagnostic test or examination to confirm a suspected condition. The answer should be Laboratory Tests, Imaging Examinations, Endoscopic Examinations, or Other Examinations

[16]

The answer should be Disease Diagnosis, Syndrome Diagnosis, Etiological Diagnosis, or Functional Disorder Diagnosis

Diagnostic Reasoning (DIA): The question asks for the most likely disease, syndrome, etiology, or functional disorder affecting the patient. The answer should be Disease Diagnosis, Syndrome Diagnosis, Etiological Diagnosis, or Functional Disorder Diagnosis

[17]

The answer should be Pharmacological Treatment, Surgical Treatment, Other Therapies, and Preventive Measures

Treatment Planning (TRT): The question asks for the best treatment plan, including pharmacological, surgical, or preventive measures. The answer should be Pharmacological Treatment, Surgical Treatment, Other Therapies, and Preventive Measures. **Task:** For each given Question-Answer (QA) pair, determine the most appropriate classification from the three ...

[18]

Too Simple

Depth of Reasoning: The question should require deeper reasoning. If the question appears too simple, mark it as "Too Simple"

[19]

incorrect options

Unambiguous Correct Answer: The question must have a unique and unambiguous correct answer. If the question asks for "incorrect options" or allows for multiple correct answers, mark it as "Ambiguous Answer"

[20]

Not Reformulatable

Open-Ended Reformulation Feasibility: The question should be suitable for reformatting into an open-ended format. If the question cannot be easily reformulated into an open-ended problem and a clear ground-truth answer, mark it as "Not Reformulatable"

[21]

Non-Medical Entity

Medical Entity as the Correct Answer: The correct answer must be a medical entity, such as a disease, drug, symptom, anatomical structure, laboratory test, imaging examination, or treatment method. If the correct option is an abstract concept, behavior, tool, or any non-medical entity, mark it as "Non-Medical Entity". For each question, provide one of the...

[22]

Remove the multiple-choice options from the original question

[23]

Which of the following

If the original question contains phrases like “Which of the following...”, rewrite it into a self-contained open-ended form, but only minimally modify the wording required to make it complete without the options. **Output Format:** Your output should follow the following format, do not output any additional content: - Open-ended Question: [your rewritten...

[24]

input_type

augments queries by incorporating potential in-domain answers and prompting an LLM to rewrite the query in a retrieval-friendly form. For search-enhanced large reasoning models, we explore two recent approaches. Search-R1 (Jin et al., 2025) extends DeepSeek-R1 by employing reinforcement learning to enable the model to autonomously generate multiple se...

[25]

She presents with symptomatic hypotension and melena, which are suggestive of gastrointestinal bleeding

Patient Profile and Presentation: The patient is an 82-year-old woman with a history of moderate aortic stenosis and coronary artery disease (with a drug-eluting stent), currently on dual antiplatelet therapy (aspirin and clopidogrel). She presents with symptomatic hypotension and melena, which are suggestive of gastrointestinal bleeding

[26]

High-dose proton-pump inhibitors were started, and her bleeding stabilized

Initial Management: She was resuscitated with intravenous fluids and blood products. High-dose proton-pump inhibitors were started, and her bleeding stabilized. Upper and lower endoscopies were performed but did not reveal any bleeding source

[27]

The small bowel is not accessible via standard endoscopy, which is why further evaluation is necessary

Diagnostic Challenge: When both upper and lower endoscopies are unremarkable in a patient with suspected GI bleeding, the most likely source is the small bowel. The small bowel is not accessible via standard endoscopy, which is why further evaluation is necessary

[28]

Next Diagnostic Step: The next appropriate test for evaluating small bowel bleeding is video-capsule endoscopy. This non-invasive procedure allows visualization of the entire small intestine and can detect lesions like angioectasia, tumors, or other small-bowel abnormalities that could be causing occult bleeding

[29]

Final Answer: Video-capsule endoscopy

Conclusion: In summary, once upper and lower endoscopies are negative in a patient with GI bleeding, video-capsule endoscopy is the next best step in evaluation. Final Answer: Video-capsule endoscopy. Positive document (ID: wiki20220301en160_42780) Common reasons for using capsule endoscopy include diagnosis of unexplained bleeding, iron deficiency, or abd...

[30]

Achieve tumor volume reduction through neoadjuvant chemotherapy and interval debulking surgery

[31]

- Rationale: - The patient had advanced-stage ovarian cancer with extensive metastases requiring aggressive primary treatment using neoadjuvant chemotherapy followed by surgery

Prevent recurrence and extend progression-free survival through maintenance therapy. - Rationale: - The patient had advanced-stage ovarian cancer with extensive metastases requiring aggressive primary treatment using neoadjuvant chemotherapy followed by surgery. - Adjuvant chemotherapy was employed to further reduce residual disease. - Given the high cost...

[32]

This sets the stage for a disorder of sexual development

The patient is a 19-year-old with primary amenorrhea (no menstruation by age 15–16 is concerning). This sets the stage for a disorder of sexual development

[33]

Normal breasts indicate that estrogen is present, which often results from the aromatization of androgens

The patient is phenotypically female with normal external female genitalia and breast development. Normal breasts indicate that estrogen is present, which often results from the aromatization of androgens

[34]

The development of this hair is dependent on androgens

The marked absence of pubic and axillary hair is an important clue. The development of this hair is dependent on androgens. This suggests there is a problem with androgen action despite normal or elevated androgen levels

[35]

These are later confirmed by ultrasonography to be immature testes

Bilateral palpable masses in the inguinal regions are identified on physical exam. These are later confirmed by ultrasonography to be immature testes. Finding testes in a patient with a female phenotype is a significant finding. ... Putting all this together: The patient has a 46,XY karyotype, presence of testes, normal breast development (due to aroma-...
