A Systematic Study of Biomedical Retrieval Pipeline Trade-offs in Performance and Efficiency

Hayk Stepanyan; Matthew McDermott

arxiv: 2604.20853 · v1 · submitted 2026-02-23 · 💻 cs.IR

Pith reviewed 2026-05-15 20:48 UTC · model grok-4.3

keywords biomedical retrieval · corpus aggregation · vector indexing · HNSW · FAISS · retrieval efficiency · LLM-as-a-judge · MedRAG

The pith

Aggregating multiple biomedical corpora delivers the highest retrieval quality, while MedRAG/pubmed stands out as the best single-corpus option for speed and accuracy trade-offs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper runs a large-scale empirical comparison of retrieval pipelines for biomedical text by varying the underlying corpora, how finely the text is split into chunks, and the vector indexing methods used for fast search. It shows that pooling several public datasets into one combined corpus consistently beats any individual dataset on relevance metrics across exam questions, conversational queries, and other query styles. Among single corpora, MedRAG/pubmed paired with graph-based HNSW indexing, suitable chunk sizes, and FAISS configuration emerges as the strongest performer on the speed-quality frontier. The results are scored via an LLM acting as judge, with human checks on a subset of cases to support the rankings. Practitioners building clinical search or question-answering systems can apply these findings to pick components that improve answer quality without excessive compute cost.
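Chunk granularity can be made concrete with a minimal sketch. It assumes fixed-size word windows with an optional overlap, which is one common strategy and not necessarily the one the paper uses; `chunk_words`, `size`, and `overlap` are hypothetical names introduced here for illustration.

```python
def chunk_words(text, size, overlap=0):
    """Split text into chunks of at most `size` words.

    Consecutive chunks share `overlap` words. Hypothetical helper for
    illustration; the paper's own chunking procedure is not shown here.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Smaller `size` values give finer chunks and therefore more vectors to index, which is one way chunk granularity feeds into the speed side of the trade-off.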

Core claim

Corpus aggregation yields superior absolute retrieval quality across diverse biomedical query types, while the MedRAG/pubmed corpus is Pareto-optimal among single corpora when retrieval uses HNSW graph indexing, appropriate chunking, and FAISS vector indexing, as measured by LLM-as-a-judge win-rate comparisons validated on a human subset.
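One way to picture corpus aggregation is as a merge of per-corpus search results into a single ranking. The sketch below assumes every corpus is embedded with the same model and scored with the same metric, so scores are directly comparable; the paper may instead build one combined index. The function name `merge_top_k` and the corpus labels in the usage note are hypothetical.

```python
import heapq


def merge_top_k(per_corpus_hits, k):
    """Merge per-corpus hit lists into one global top-k ranking.

    `per_corpus_hits` maps a corpus name to a list of (score, passage_id)
    pairs, where a higher score means a closer match. Scores are assumed
    to be comparable across corpora (same embedding model and metric).
    """
    pooled = (
        (score, corpus, passage_id)
        for corpus, hits in per_corpus_hits.items()
        for score, passage_id in hits
    )
    # Keep the k highest-scoring hits, best first.
    return heapq.nlargest(k, pooled, key=lambda hit: hit[0])
```

For example, `merge_top_k({"pubmed": [(0.9, "p1"), (0.4, "p2")], "textbooks": [(0.7, "t1")]}, 2)` keeps `p1` and `t1`. Under exact search, merging each corpus's own top-k this way yields the same top-k as searching the union, which is why a pooled corpus can never return a lower-scoring result set than any single member.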

What carries the argument

Retrieval pipeline settings that combine corpus selection (aggregation versus single sets such as MedRAG/pubmed), chunk granularity choices, and vector index configuration (HNSW and FAISS), with performance measured by win-rate comparisons from an LLM-as-a-judge.
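The win-rate measurement can be sketched as a simple tally. This is an illustration under assumptions, not the paper's scoring code: each judgment is taken to be one of the strings "A", "B", or "draw", and `win_rates` is a hypothetical helper name.

```python
from collections import Counter


def win_rates(judgments):
    """Tally pairwise verdicts into win, loss, and draw rates for system A.

    Each verdict is "A", "B", or "draw", as a judge might return after
    comparing the passages two pipelines retrieved for the same query.
    """
    counts = Counter(judgments)
    unknown = set(counts) - {"A", "B", "draw"}
    if unknown:
        raise ValueError(f"unexpected verdicts: {sorted(unknown)}")
    total = len(judgments)
    if total == 0:
        raise ValueError("no judgments to tally")
    return {
        "win": counts["A"] / total,
        "loss": counts["B"] / total,
        "draw": counts["draw"] / total,
    }
```

Reporting the draw rate separately matters because a high draw rate between two corpora means the win rate alone overstates how different they are.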

If this is right

Applications that prioritize maximum relevance should combine available biomedical corpora rather than select one.
Efficiency-sensitive deployments should adopt MedRAG/pubmed with HNSW and FAISS indexing plus tuned chunk sizes.
Chunk granularity must be matched to the chosen corpus and index to realize the reported speed-quality balance.
LLM-as-a-judge provides a scalable way to compare retrieval pipelines when human annotation budgets are limited.
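The speed-quality balance in the points above amounts to a Pareto frontier over pipeline configurations. The following sketch shows how such a frontier could be computed; it assumes each configuration is summarized by a latency in milliseconds (lower is better) and a win rate (higher is better). The function name and the tuple layout are hypothetical.

```python
def pareto_frontier(configs):
    """Return the names of configs not dominated on (latency_ms, win_rate).

    `configs` is a list of (name, latency_ms, win_rate) tuples. A config is
    dominated when another one is at least as fast and at least as good,
    and strictly better on at least one of the two axes.
    """
    frontier = []
    for name, latency, quality in configs:
        dominated = any(
            other_latency <= latency
            and other_quality >= quality
            and (other_latency < latency or other_quality > quality)
            for _, other_latency, other_quality in configs
        )
        if not dominated:
            frontier.append(name)
    return frontier
```

A configuration on the frontier is not necessarily the best on either axis; it only means no other configuration beats it on both at once.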

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same aggregation approach could be tested in other narrow domains such as legal documents or materials science literature to check whether it produces similar quality gains.
Downstream clinical tasks like summarization or diagnosis support could be measured with these retrieval pipelines to quantify end-to-end impact.
Standard benchmark collections might be pre-aggregated following the paper's recipe so future papers report comparable numbers.
Hybrid index designs that blend HNSW with other structures could be explored to push the efficiency frontier further.

Load-bearing premise

That LLM-as-a-judge relevance scores, even after partial human validation, reliably match what real users would consider relevant across all biomedical query types and settings.

What would settle it

A direct head-to-head human expert study that rates relevance of passages from aggregated versus MedRAG/pubmed corpora and finds that the ordering of systems reverses on more than half the queries compared with the LLM judgments.
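The "more than half" criterion above can be checked mechanically once both sets of preferences exist. A minimal sketch, assuming one preference per query encoded as "A", "B", or "draw" for both the LLM judge and the human experts; `reversal_fraction` is a hypothetical name.

```python
def reversal_fraction(llm_prefs, human_prefs):
    """Fraction of queries where humans prefer the opposite system.

    A reversal is counted only when the LLM picks one system and the human
    picks the other; a draw on either side is not a reversal.
    """
    if len(llm_prefs) != len(human_prefs):
        raise ValueError("preference lists must have the same length")
    if not llm_prefs:
        raise ValueError("no queries to compare")
    opposite = {"A": "B", "B": "A"}
    reversals = sum(
        1 for llm, human in zip(llm_prefs, human_prefs)
        if opposite.get(llm) == human
    )
    return reversals / len(llm_prefs)
```

The test described above would be met when `reversal_fraction(...) > 0.5`.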

Figures

Figures reproduced from arXiv: 2604.20853 by Hayk Stepanyan, Matthew McDermott.

**Figure 2.** Pairwise corpus comparison using rank-aligned win rates aggregated across all query sources. Each ...

**Figure 3.** Pairwise corpus comparison across query categories showing that while the aggregated "All" corpus ...

**Figure 4.** Averaged win rates across varying chunk sizes stratified by query category. The experiment is run on the PMC Open Access dataset only ...

**Figure 5.** Pareto frontier illustrating the trade-off ...

**Figure 7.** Pareto frontier illustrating the trade-off ...

**Figure 8.** Pairwise draw rates for corpus comparison ...

**Figure 9.** Averaged pairwise chunk comparison using ...

Original abstract

Retrieval systems are increasingly used in biomedical and clinical natural language processing applications, yet practical guidance for researchers building such systems is limited. In this work, we provide such guidance through an empirical study of how retrieval pipeline design choices affect performance and efficiency at scale. In particular, we examine retrieval over a variety of existing, public biomedical text datasets, leveraging a variety of disparate types of queries, including exam-style questions, conversational medical queries, community-asked questions, and non-question formulations across various retrieval pipeline settings spanning corpus selection, chunk granularity, and vector index configuration. Retrieval results are judged using a robust, win-rate comparison assessment via an LLM-as-a-judge setting with human validation. Across these experiments, we identify several points of concrete guidance for researchers, including the superiority of corpus aggregation for absolute retrieval quality, and the emergence of MedRAG/pubmed as the Pareto-optimal singleton corpus under graph-based (HNSW) indexing, appropriate chunking strategies, and FAISS indexing choices that offer the best trade-offs in speed and efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This gives practical benchmarks showing corpus aggregation improves biomedical retrieval and MedRAG/pubmed works well for speed-quality trade-offs, but the LLM judge lacks enough human validation details to fully trust the rankings.

This paper runs a systematic set of experiments on public biomedical datasets with varied query types, from exam questions to conversational ones. It measures how choices in corpus selection, chunking, and indexing affect both retrieval quality and runtime, and it lands on two main results: combining corpora lifts absolute performance, and MedRAG/pubmed comes out as the best single-corpus option under HNSW indexing and FAISS setups that keep things fast.

That kind of concrete guidance is what people building clinical search tools actually need, and the work covers enough ground to make the comparisons credible on their own terms. The experiments use multiple datasets and track efficiency metrics alongside accuracy, which is a step up from narrower prior studies. The LLM-as-a-judge approach with some human validation on a subset supports the claims without obvious circularity.

The main soft spot is the evaluation channel. The abstract mentions human validation but gives no numbers on subset size, selection, or agreement rates. If the LLM has any systematic tilt toward longer contexts or PubMed-style text, the Pareto rankings and aggregation benefits could look different. That is a real but contained issue rather than a fatal one, since the design is otherwise straightforward and the datasets are public.

This is aimed at practitioners in biomedical NLP who need to pick pipelines without starting from scratch. It does not introduce new algorithms, but the empirical coverage is new enough to be worth citing if you work in this area. I would bring it to a reading group to discuss the trade-off numbers, and a serious editor should send it for peer review so the judge validation can be tightened up.

Referee Report

1 major / 0 minor

Summary. The manuscript conducts a systematic empirical study of biomedical retrieval pipelines, evaluating the effects of corpus selection (including aggregation), chunking granularity, and vector indexing (HNSW/FAISS) on performance and efficiency. Using diverse public datasets and query types (exam-style, conversational, community questions), it employs LLM-as-a-judge win-rate comparisons with human validation to conclude that corpus aggregation yields superior absolute retrieval quality and that MedRAG/pubmed is the Pareto-optimal singleton corpus under graph-based indexing and appropriate chunking/FAISS choices.

Significance. If the results hold, the work provides valuable practical guidance for researchers building biomedical retrieval systems, identifying concrete design choices that optimize quality-efficiency trade-offs at scale. The breadth across multiple datasets and query formulations strengthens the applicability of the recommendations for the IR and biomedical NLP communities.

major comments (1)

[Abstract] The manuscript states that LLM-as-a-judge assessments include human validation on a subset, but reports no quantitative details on subset size, selection method, agreement rate, or disagreement analysis. Since every headline claim (corpus aggregation superiority; MedRAG/pubmed Pareto optimality) and all reported trade-offs flow exclusively through the LLM win-rate channel, this omission is load-bearing for the reliability of the central findings.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the manuscript's practical value for biomedical retrieval system design. We address the single major comment below and will revise the manuscript accordingly.

Referee: [Abstract] The manuscript states that LLM-as-a-judge assessments include human validation on a subset, but reports no quantitative details on subset size, selection method, agreement rate, or disagreement analysis. Since every headline claim (corpus aggregation superiority; MedRAG/pubmed Pareto optimality) and all reported trade-offs flow exclusively through the LLM win-rate channel, this omission is load-bearing for the reliability of the central findings.

Authors: We agree that the absence of quantitative details on the human validation subset is a significant omission that affects the interpretability of the LLM-as-a-judge results. In the revised manuscript we will report the exact subset size, the selection procedure (random sampling stratified by query type and corpus), the agreement rate between LLM and human judgments, and a brief disagreement analysis. These additions will be placed in the evaluation section and referenced from the abstract to directly substantiate the reliability of the reported corpus aggregation benefits and the Pareto optimality of MedRAG/pubmed.

Revision: yes
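The agreement rate promised in this response could be reported alongside a chance-corrected statistic. The sketch below computes raw agreement and Cohen's kappa from paired labels; it is an assumption about how such a check might be run, not the authors' procedure, and `agreement_and_kappa` is a hypothetical name.

```python
from collections import Counter


def agreement_and_kappa(llm_labels, human_labels):
    """Return (observed agreement, Cohen's kappa) for two label sequences.

    Labels can be any hashable verdicts, for example "A", "B", or "draw".
    """
    if len(llm_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    n = len(llm_labels)
    if n == 0:
        raise ValueError("no labels to compare")
    observed = sum(l == h for l, h in zip(llm_labels, human_labels)) / n
    llm_counts = Counter(llm_labels)
    human_counts = Counter(human_labels)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(llm_counts[c] * human_counts[c] for c in llm_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, so report perfect agreement by convention.
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)
```

Kappa is worth reporting next to the raw rate because two raters who both favor one system most of the time will agree often by chance alone.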

Circularity Check

0 steps flagged

No circularity: purely empirical measurements on retrieval pipelines

The paper reports an empirical study comparing biomedical retrieval configurations (corpus selection, chunking, indexing) via direct performance measurements and LLM-as-a-judge win rates with human validation on a subset. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All headline claims (corpus aggregation superiority, MedRAG/pubmed Pareto optimality) rest on experimental comparisons rather than reducing to inputs by definition or self-reference. The LLM-judge reliability concern is a validity issue, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claims rest on standard information-retrieval evaluation assumptions and the representativeness of the chosen public datasets and query types; no new entities or fitted constants are introduced.

axioms (1)

Domain assumption: LLM-as-a-judge produces relevance scores that correlate with human judgments when validated on a sample.
Invoked to justify using LLM judgments for the main win-rate comparisons.
