Benchmarking Retrieval Strategies for Biomedical Retrieval-Augmented Generation: A Controlled Empirical Study

Devi Prasad Bal; Subhashree Puhan

arXiv: 2605.02520 · v1 · submitted 2026-05-04 · cs.CL · cs.AI · cs.IR


Pith reviewed 2026-05-09 15:57 UTC · model grok-4.3


keywords Retrieval-Augmented Generation · Biomedical question answering · Cross-encoder reranking · Dense retrieval · BioASQ benchmark · Contextual precision


The pith

Cross-encoder reranking delivers the highest precision and best overall score when retrieving documents for biomedical question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The study fixes the language model, embeddings, and vector database, then tests five different ways to pull relevant passages for a RAG system on biomedical questions. It finds that letting the ranking model read the query and each candidate document together, as cross-encoder reranking does, produces the strongest results on contextual precision and on the composite quality score. Simple dense vector search comes within 0.005 composite points of the leader, while methods that expand or diversify queries add noise. All retrieval approaches beat the no-context baseline on answer relevancy (0.658-0.701 vs. 0.287), showing that grounding helps even when the generator is held constant.

Core claim

With GPT-4o-mini, text-embedding-3-small, and a 250-pair subset of BioASQ held fixed, cross-encoder reranking reaches a composite score of 0.827 and contextual precision of 0.852; dense retrieval is nearly as strong at 0.822; multi-query expansion shows the lowest precision at 0.671; and every retrieval condition substantially improves answer relevancy over the no-context case.

What carries the argument

Cross-Encoder Reranking scores each query-document pair by passing the two texts together through a single model that captures their interaction, then keeps the top-scoring passages.
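The rerank step described above can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the function names `rerank` and `toy_overlap_score` are hypothetical, and the scorer is a deliberately simple token-overlap stand-in rather than a trained cross-encoder. In a real system `score_pair` would wrap a model that reads the query and passage jointly.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair jointly and keep the top_k passages."""
    scored = [(passage, score_pair(query, passage)) for passage in candidates]
    # Highest score first. Python's sort is stable, so ties keep the order
    # produced by the first-stage retriever.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer (NOT a cross-encoder): share of query tokens found in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


# Hypothetical first-stage candidates, in the order a retriever returned them.
candidates = [
    "aspirin is used for pain relief",
    "metformin lowers blood glucose in type 2 diabetes",
    "insulin regulates glucose uptake",
]
top = rerank("which drug lowers blood glucose", candidates, toy_overlap_score, top_k=2)
for passage, score in top:
    print(f"{score:.2f}  {passage}")
```

Keeping the scorer pluggable mirrors the two-stage design: a cheap retriever proposes candidates, and a more expensive pairwise model only has to score that short list.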

If this is right

Direct query-document interaction improves precision more reliably than query expansion or diversity penalties.
Dense vector search alone is competitive enough that added reranking steps may not always justify their cost.
Retrieval remains valuable even when the downstream generator is strong, as shown by large gains in answer relevancy over the no-context baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The small performance gap between dense search and reranking suggests that, for many biomedical queries, simpler pipelines may deliver most of the benefit without extra latency.
Noise from query expansion points to a possible need for tighter filtering or learned expansion models in domains with specialized vocabulary.
Because the test set is modest and fixed, extending the comparison to full BioASQ or other medical corpora would test whether the observed ordering holds at scale.

Load-bearing premise

That the ranking among strategies stays the same when the generator model, embedding model, or size and diversity of the biomedical corpus change.

What would settle it

A follow-up run on the same 250 questions but with a different generator or embedding model in which multi-query expansion or MMR records a higher composite score than cross-encoder reranking.

Figures

Figures reproduced from arXiv: 2605.02520 by Devi Prasad Bal, Subhashree Puhan.

**Figure 2.** Evaluation metric contributions to composite score (unweighted mean of four metrics, each contributing 1/4).

**Figure 3.** Per-metric improvement over the Dense Vector Search baseline (n = 250 QA pairs). (A) Contextual precision (baseline: 0.809): Cross-Encoder gains +0.043; Multi-Query suffers the largest loss (−0.138). (B) Contextual recall (baseline: 0.887): all strategies decline, with MMR posting the steepest loss (−0.112). (C) Faithfulness (baseline: 0.897): differences are negligible across strategies (range −0.004 to +…

**Figure 4.** Multi-metric performance profile (radar chart) comparing all six strategies across the four evaluation dimensions. Cross-Encoder Re-ranking occupies the largest total area, driven by its contextual precision advantage (0.852). Multi-Query Expansion shows the most asymmetric profile with the weakest contextual precision (0.671) of any retrieval strategy. MMR records the lowest contextual recall (0.775) and …
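The numbers quoted in the captions can be cross-checked with simple arithmetic. The sketch below adds the Figure 3 deltas to the Dense baselines and applies the unweighted-mean composite from the Figure 2 caption. One value is an assumption: the Dense answer relevancy is not quoted in this excerpt, so it is back-solved here from the rounded composite (0.822) and should be read as an implied figure, not a reported one.

```python
from statistics import mean

# Dense Vector Search baselines quoted in the Figure 3 caption.
baseline = {"contextual_precision": 0.809, "contextual_recall": 0.887}

# Signed changes relative to that baseline, also from the Figure 3 caption.
deltas = {
    ("Cross-Encoder", "contextual_precision"): +0.043,
    ("Multi-Query", "contextual_precision"): -0.138,
    ("MMR", "contextual_recall"): -0.112,
}

# Absolute score = baseline + delta, rounded to the three decimals reported.
absolute = {
    key: round(baseline[key[1]] + delta, 3) for key, delta in deltas.items()
}


def composite(scores):
    """Unweighted mean of the metric scores, as described in the Figure 2 caption."""
    return mean(scores)


# Dense precision, recall, and faithfulness as quoted; the fourth metric is
# back-solved from the rounded composite of 0.822 (an implied value only).
dense_known = [0.809, 0.887, 0.897]
implied_answer_relevancy = 4 * 0.822 - sum(dense_known)

for (strategy, metric), value in absolute.items():
    print(f"{strategy:13s} {metric:21s} {value:.3f}")
print(f"implied Dense answer relevancy: {implied_answer_relevancy:.3f}")
```

The three reconstructed scores match the values stated elsewhere on this page (0.852, 0.671, 0.775), and the implied answer relevancy of roughly 0.695 sits inside the 0.658-0.701 range the abstract gives for the RAG conditions.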

Original abstract

Retrieval-Augmented Generation (RAG) offers a well-established path to grounding large language model (LLM) outputs in external knowledge, yet the question of which retrieval strategy works best in a high-stakes domain such as biomedicine has not received the controlled, multi-metric treatment it deserves. This paper presents a systematic empirical comparison of five retrieval strategies -- Dense Vector Search, Hybrid BM25 + Dense retrieval, Cross-Encoder Reranking, Multi-Query Expansion, and Maximal Marginal Relevance (MMR) -- within a biomedical question-answering RAG pipeline. All strategies share a fixed generation model (GPT-4o-mini), a common vector store (ChromaDB), and OpenAI's text-embedding-3-small embeddings, ensuring that observed differences are attributable to retrieval alone. Evaluation is conducted on 250 question-answer pairs drawn from a preprocessed subset of the BioASQ benchmark (rag-mini-bioasq) using four DeepEval metrics: contextual precision, contextual recall, faithfulness, and answer relevancy, each reported with 95% confidence intervals. A no-context ablation is included as a lower bound. Cross-Encoder Reranking achieves the best composite score (0.827) and highest contextual precision (0.852), confirming that query-document interaction yields measurable retrieval gains. Multi-Query Expansion, despite its recall-oriented design, produces the weakest contextual precision (0.671), suggesting naive query diversification introduces retrieval noise. MMR sacrifices answer relevancy for diversity, while the Dense baseline (composite 0.822) falls within 0.005 points of the top strategy. All RAG conditions dramatically outperform the no-context ablation on answer relevancy (0.658-0.701 vs. 0.287), confirming the practical value of retrieval. The full pipeline, hyperparameters, and evaluation code are publicly available.
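To make the diversity trade-off attributed to MMR concrete, here is a minimal sketch of greedy Maximal Marginal Relevance selection over toy two-dimensional vectors. The function name `mmr_select`, the parameter name `lambda_mult`, and the example vectors are all assumptions for illustration; the diversity setting actually used in the paper is not stated on this page.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 2,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedily pick k document indices, trading relevance against redundancy."""
    remaining = list(range(len(doc_vecs)))
    selected: List[int] = []
    while remaining and len(selected) < k:

        def mmr_score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy vectors: documents 0 and 1 are near-duplicates, document 2 is distinct.
query = [1.0, 1.0]
docs = [[1.0, 0.1], [1.0, 0.05], [0.0, 1.0]]

print(mmr_select(query, docs, k=2, lambda_mult=0.5))  # balances relevance and diversity
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # pure relevance ranking
```

With `lambda_mult=1.0` the selection reduces to plain similarity ranking and returns the two near-duplicates; at 0.5 the second slot goes to the less similar but non-redundant document. That swap is the mechanism by which a diversity penalty can displace passages that are more relevant to the question.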

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a clean, controlled head-to-head on five retrieval methods in a fixed biomedical RAG pipeline that shows cross-encoder reranking edging out the rest by a small margin.


The paper's main contribution is running five established retrieval strategies through an identical biomedical QA setup and reporting the numbers with confidence intervals. Cross-encoder reranking tops the composite score at 0.827 and contextual precision at 0.852, while plain dense retrieval sits right behind at 0.822. The no-context ablation makes the retrieval benefit clear on answer relevancy. They keep the generator, embedder, and store fixed, which lets the differences trace back to the retrieval step alone, and they release the code and pipeline details.

Referee Report

0 major / 4 minor

Summary. This paper performs a systematic, controlled empirical study benchmarking five retrieval strategies in a biomedical RAG setup for question-answering tasks. The strategies are Dense Vector Search, Hybrid BM25 + Dense, Cross-Encoder Reranking, Multi-Query Expansion, and MMR. The experimental design freezes the LLM (GPT-4o-mini), embeddings (text-embedding-3-small), and vector store (ChromaDB), evaluates on 250 BioASQ pairs using DeepEval metrics with 95% CIs, and includes a no-context baseline. The main result is that Cross-Encoder Reranking attains the highest composite score of 0.827 and contextual precision of 0.852, with all RAG methods outperforming the no-context condition on answer relevancy.

Significance. Assuming the findings are robust, this study delivers actionable evidence favoring cross-encoder reranking for improving retrieval quality in biomedical RAG applications. The controlled isolation of retrieval effects, inclusion of confidence intervals, no-context ablation, and public release of code and pipeline are notable strengths that support reproducibility and allow the community to build upon the work. It addresses a practical question in a critical domain with direct implications for system design.

minor comments (4)

The manuscript does not specify the exact hyperparameter settings used for each retrieval strategy, such as the number of expanded queries in Multi-Query Expansion or the diversity parameter in MMR. Although code is released, explicit values in the text would improve clarity.
Preprocessing details for deriving the 250-pair rag-mini-bioasq subset from the full BioASQ benchmark are not described, which is important for understanding the data characteristics and enabling replication.
The composite score used to rank strategies is referenced but its calculation method (e.g., how the four metrics are combined) is not defined; this should be clarified in the methods or results section.
A summary table presenting all metric scores with their 95% confidence intervals for each strategy and the baseline would enhance the presentation and allow readers to assess overlaps directly.
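As a rough guide to what such a table would let readers check, the sketch below computes a 95% confidence interval for a mean metric score and tests whether two intervals overlap. It assumes a normal approximation (mean ± 1.96 standard errors); the page does not say how the paper computed its intervals, and the per-question scores here are synthetic placeholders, not results from the study.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple

Interval = Tuple[float, float, float]  # (mean, lower, upper)


def mean_ci95(scores: Sequence[float]) -> Interval:
    """Return (mean, lower, upper) using the normal approximation mean ± 1.96·SE."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    m = mean(scores)
    se = stdev(scores) / math.sqrt(n)  # sample standard deviation (n − 1 denominator)
    half_width = 1.96 * se
    return m, m - half_width, m + half_width


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True when the [lower, upper] ranges of the two intervals intersect."""
    return a[1] <= b[2] and b[1] <= a[2]


# Synthetic per-question scores for two hypothetical strategies.
strategy_a = [0.8, 0.9, 0.7, 1.0, 0.6]
strategy_b = [0.7, 0.9, 0.8, 0.9, 0.6]

ci_a = mean_ci95(strategy_a)
ci_b = mean_ci95(strategy_b)
print("A: %.3f [%.3f, %.3f]" % ci_a)
print("B: %.3f [%.3f, %.3f]" % ci_b)
print("overlap:", intervals_overlap(ci_a, ci_b))
```

Overlapping intervals are only an informal signal, not a significance test, but they are the kind of check that matters when two strategies are separated by 0.005 composite points.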

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our work, the recognition of its strengths in controlled design, confidence intervals, ablation, and reproducibility, and the recommendation for minor revision. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity; results are direct empirical measurements


The paper performs a controlled empirical benchmarking of five retrieval strategies in a biomedical RAG pipeline, holding fixed the generator (GPT-4o-mini), embeddings (text-embedding-3-small), vector store (ChromaDB), and evaluation dataset (250-pair rag-mini-bioasq subset). All reported scores, rankings, and confidence intervals follow directly from running the same evaluation metrics (contextual precision, recall, faithfulness, answer relevancy) on each strategy and comparing the outputs. No equations, fitted parameters, or self-citations are used to derive or justify the central claims; the ordering (Cross-Encoder highest at 0.827 composite) is a straightforward measurement outcome rather than a reduction to prior inputs or definitions. The study is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study rests on the assumption that DeepEval metrics are valid proxies for biomedical RAG utility and that the chosen benchmark subset and fixed components isolate retrieval effects without hidden interactions.

axioms (2)

domain assumption: DeepEval metrics (contextual precision, recall, faithfulness, answer relevancy) are appropriate and sufficient for evaluating biomedical RAG performance.
The paper uses these four metrics as the sole evaluation criteria without additional domain-specific validation or human judgment correlation reported in the abstract.
domain assumption: The 250-pair preprocessed BioASQ subset is representative of real biomedical question-answering needs.
Evaluation is performed exclusively on this subset; no external validation or sensitivity analysis to subset selection is described.


