pith. machine review for the scientific record.

arxiv: 2604.05190 · v2 · submitted 2026-04-06 · 💻 cs.CL · cs.AI · cs.IR

Recognition: no theorem link

Retrieval-Augmented LLMs for Evidence Localization in Clinical Trial Recruitment from Longitudinal EHR Narratives

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:52 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR
keywords retrieval-augmented generation · large language models · clinical trial recruitment · electronic health records · evidence localization · patient eligibility screening · MedGemma

The pith

Retrieval-augmented MedGemma reaches 89.05% micro-F1 by localizing evidence for trial eligibility in long EHR narratives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests generative large language models on the task of screening longitudinal electronic health records to match patients against clinical trial eligibility criteria. It compares plain long-context processing, named-entity summarization, and retrieval-augmented generation across both general and medically adapted models, using the 2018 N2C2 Track 1 dataset. The MedGemma model combined with dynamic evidence retrieval performs best overall and shows the clearest gains on criteria that require reasoning across multiple notes over time. Short-context criteria such as laboratory values see only modest gains. Practical use therefore requires choosing the right method for each criterion type to balance accuracy against computational cost.

Core claim

Generative LLMs equipped with retrieval-augmented generation localize relevant evidence segments within lengthy patient narratives more effectively than encoder-based models or unassisted long-context reading. On the 2018 N2C2 benchmark the MedGemma plus RAG combination records the highest micro-F1 of 89.05 percent. The same approach delivers its largest improvements precisely on eligibility criteria that span extended temporal reasoning across many notes, whereas criteria anchored to a single short passage improve only incrementally.
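
For orientation, micro-F1 pools correct and incorrect decisions across every (patient, criterion) pair before computing a single F1, so frequent criteria weigh more than rare ones. A minimal sketch, assuming binary MET / NOT MET labels with MET as the positive class (the paper's exact pooling convention is not restated on this page):

```python
# Micro-F1: pool TP/FP/FN over all (patient, criterion) decisions before
# computing precision and recall, rather than averaging per-criterion scores.
def micro_f1(gold: dict, pred: dict) -> float:
    """gold, pred: dicts mapping (patient_id, criterion) -> 'MET' / 'NOT MET'."""
    tp = fp = fn = 0
    for key, g in gold.items():
        p = pred.get(key, "NOT MET")
        if g == "MET" and p == "MET":
            tp += 1
        elif g == "NOT MET" and p == "MET":
            fp += 1
        elif g == "MET" and p == "NOT MET":
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```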

What carries the argument

Retrieval-augmented generation that dynamically pulls criterion-relevant segments from the full longitudinal EHR before feeding them to a medical-adapted generative LLM such as MedGemma, thereby mitigating the lost-in-the-middle problem.
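
The paper's exact retriever and prompt are not reproduced on this page, so the following is a hedged illustration only: a generic dense-retrieval loop that embeds note segments, scores them against each eligibility criterion, and forwards the top-k hits to the generative model. The sentence-transformers embedding model and the top-10 default are assumptions, not the authors' configuration.

```python
# Sketch of criterion-driven evidence retrieval: embed note segments once,
# retrieve the top-k segments most similar to an eligibility criterion, and
# pass only those to the generative model. Model choice is an assumption.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

def retrieve_evidence(criterion: str, segments: list[str], k: int = 10) -> list[str]:
    seg_emb = encoder.encode(segments, convert_to_tensor=True)
    crit_emb = encoder.encode(criterion, convert_to_tensor=True)
    hits = util.semantic_search(crit_emb, seg_emb, top_k=k)[0]
    return [segments[hit["corpus_id"]] for hit in hits]

# The retrieved segments would then be assembled into a prompt along the lines
# of: "Criterion: ...\nEvidence: ...\nAnswer MET or NOT MET."
```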

If this is right

  • Criteria that require long-term reasoning across patient histories obtain substantial accuracy gains from generative LLMs.
  • Criteria limited to short contexts, such as individual lab tests, receive only incremental benefit.
  • Real-world systems must select among rule-based queries, encoder models, and generative LLMs according to the specific criterion type to keep compute costs reasonable (a routing sketch follows this list).
  • Automated evidence localization can reduce the manual screening burden that currently limits trial enrollment.
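
As a concrete illustration of that per-criterion selection, a dispatch table might look like the sketch below. The criterion names come from the 2018 N2C2 set, but the method assignments and all helper functions are hypothetical, not taken from the paper.

```python
# Hypothetical per-criterion router: cheap pattern-like criteria go to rules,
# short-context criteria to an encoder classifier, and longitudinal criteria
# to a generative LLM with RAG. Assignments and helpers are illustrative only.
def rule_based_check(criterion: str, record: str) -> str:
    return "MET" if criterion.lower() in record.lower() else "NOT MET"  # toy stub

def encoder_classify(criterion: str, record: str) -> str:
    raise NotImplementedError("plug in an encoder classifier here")

def llm_rag_decide(criterion: str, record: str) -> str:
    raise NotImplementedError("plug in the retrieval + generation pipeline here")

ROUTES = {
    "CREATININE": "rule_based",   # lab threshold: a structured query suffices
    "HBA1C": "rule_based",
    "ENGLISH": "encoder",         # short, single-passage classification
    "DRUG-ABUSE": "llm_rag",      # requires reading across notes over time
    "ADVANCED-CAD": "llm_rag",
}

def screen(criterion: str, record: str) -> str:
    method = ROUTES.get(criterion, "llm_rag")  # default to the strongest method
    handler = {"rule_based": rule_based_check,
               "encoder": encoder_classify,
               "llm_rag": llm_rag_decide}[method]
    return handler(criterion, record)
```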

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Embedding the same retrieval-plus-generation pipeline inside hospital EHR interfaces could shorten screening time enough to increase the fraction of eligible patients offered trials.
  • The same evidence-localization technique may transfer to other long-document medical tasks such as extracting timelines from full patient charts or summarizing research articles for clinicians.
  • Testing the pipeline on multi-institution, multi-year EHR collections that include non-English notes would reveal whether performance holds outside the benchmark distribution.

Load-bearing premise

The 2018 N2C2 Track 1 dataset and the chosen micro-F1 metric sufficiently represent the variability of real-world longitudinal EHRs and the range of eligibility criteria used in ongoing trials.

What would settle it

Re-running the MedGemma RAG pipeline on a newer collection of de-identified EHR notes drawn from currently recruiting trials and comparing its screening decisions against expert manual review would directly test whether the reported performance generalizes.

Figures

Figures reproduced from arXiv: 2604.05190 by Cheng Peng, Mengxian Lyu, Yonghui Wu, Ziyi Chen.

Figure 2. Overall performance comparison of decoder-based models across NER, original long-context, and RAG strategies.

Input Type  | Mean Tokens [Min, Max] | Applied Token Limit
NER-problem | 2955 [235, 10120]      | 512 / 2048 / 8192
Ori Text    | 5290 [1670, 15186]     | 8192
RAG-top10   | 1403 [777, 2777]       | 2048

[PITH_FULL_IMAGE:figures/full_fig_p005_2.png]
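
To make the "Applied Token Limit" column concrete, a minimal sketch of enforcing a per-strategy token budget before prompting. The tiktoken tokenizer is a stand-in for illustration; MedGemma and the other evaluated models use their own tokenizers.

```python
# Sketch: truncate an input to a fixed token budget before prompting a model.
# tiktoken's cl100k_base encoding is a stand-in, not the models' own tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fit_to_budget(text: str, limit: int) -> str:
    tokens = enc.encode(text)
    return enc.decode(tokens[:limit]) if len(tokens) > limit else text

# e.g. fit_to_budget(rag_context, 2048) for RAG-top10 inputs,
#      fit_to_budget(full_record, 8192) for the original long-context setting.
```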
read the original abstract

Screening patients for enrollment is a well-known, labor-intensive bottleneck that leads to under-enrollment and, ultimately, trial failures. Recent breakthroughs in large language models (LLMs) offer a promising opportunity to use artificial intelligence to improve screening. This study systematically explored both encoder- and decoder-based generative LLMs for screening clinical narratives to facilitate clinical trial recruitment. We examined both general-purpose LLMs and medical-adapted LLMs and explored three strategies to alleviate the "Lost in the Middle" issue when handling long documents, including 1) Original long-context: using the default context windows of LLMs, 2) NER-based extractive summarization: converting the long document into summarizations using named entity recognition, 3) RAG: dynamic evidence retrieval based on eligibility criteria. The 2018 N2C2 Track 1 benchmark dataset is used for evaluation. Our experimental results show that the MedGemma model with the RAG strategy achieved the best micro-F1 score of 89.05%, outperforming other models. Generative LLMs have remarkably improved trial criteria that require long-term reasoning across long documents, whereas trial criteria that span a short piece of context (e.g., lab tests) show incremental improvements. The real-world adoption of LLMs for trial recruitment must consider specific criteria for selecting among rule-based queries, encoder-based LLMs, and generative LLMs to maximize efficiency within reasonable computing costs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a systematic exploration of encoder- and decoder-based generative LLMs for screening patients for clinical trial enrollment using longitudinal EHR narratives. It compares general-purpose and medical-adapted LLMs across three strategies for managing long documents—original long-context processing, NER-based extractive summarization, and RAG-based evidence retrieval—evaluated on the 2018 N2C2 Track 1 benchmark dataset. The key finding is that the MedGemma model combined with the RAG strategy achieves the highest micro-F1 score of 89.05%, with particular gains on eligibility criteria requiring long-term reasoning across documents.

Significance. If the performance claims hold and generalize beyond the specific benchmark, this work could have significant impact on reducing the labor-intensive process of clinical trial recruitment by automating evidence localization in EHRs. The differentiation between improvements on long-term vs. short-context criteria provides actionable guidance for choosing between rule-based, encoder, and generative approaches in practice, potentially improving trial enrollment rates which are currently hindered by screening bottlenecks.

major comments (2)
  1. [Abstract and Results] The reported best performance of 89.05% micro-F1 for MedGemma with RAG is presented without any baseline comparisons (e.g., to rule-based systems or prior SOTA on N2C2), statistical significance testing, or details on how the three strategies were implemented, which is necessary to substantiate the central performance claim.
  2. [Results and Discussion] The claim that 'Generative LLMs have remarkably improved trial criteria that require long-term reasoning across long documents' is load-bearing for the paper's contribution but lacks supporting evidence such as a breakdown of F1 scores by criterion type, confirmation of the longitudinal depth in the N2C2 dataset (e.g., average number of notes per patient), or error analysis showing specific gains in cross-document temporal reasoning.
minor comments (2)
  1. [Methods] Clarify the specific models used under 'general-purpose LLMs' and 'medical-adapted LLMs' and provide implementation details for the NER-based summarization strategy.
  2. [Abstract] The abstract could benefit from mentioning the number of criteria or patients in the benchmark for context on the scale of the evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights areas where additional context and evidence will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [Abstract and Results] The reported best performance of 89.05% micro-F1 for MedGemma with RAG is presented without any baseline comparisons (e.g., to rule-based systems or prior SOTA on N2C2), statistical significance testing, or details on how the three strategies were implemented, which is necessary to substantiate the central performance claim.

    Authors: We agree that the central performance claim would be more robust with explicit baselines and implementation details. In the revised manuscript, we will add a rule-based baseline using keyword and regex matching on eligibility criteria, reference the prior best-reported micro-F1 on the 2018 N2C2 Track 1 dataset, include a dedicated subsection detailing the implementation of the original long-context, NER-based summarization, and RAG strategies, and report statistical significance tests (e.g., McNemar’s test; a sketch of such a test follows these responses) comparing MedGemma+RAG against the other configurations. These additions will be placed in the Results and Methods sections. revision: yes

  2. Referee: [Results and Discussion] The claim that 'Generative LLMs have remarkably improved trial criteria that require long-term reasoning across long documents' is load-bearing for the paper's contribution but lacks supporting evidence such as a breakdown of F1 scores by criterion type, confirmation of the longitudinal depth in the N2C2 dataset (e.g., average number of notes per patient), or error analysis showing specific gains in cross-document temporal reasoning.

    Authors: We acknowledge that the claim requires quantitative backing. In the revision, we will add a table breaking down micro-F1 scores by criterion category (long-term reasoning vs. short-context criteria such as lab tests), report dataset statistics including the average number of notes per patient to confirm its longitudinal character, and include a qualitative error analysis with examples of cases where RAG improves cross-document temporal reasoning. These elements will be integrated into the Results and Discussion sections to directly support the statement. revision: yes
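
For reference, the McNemar's test proposed in response 1 compares two paired classifiers using only the decisions on which they disagree. A minimal sketch with statsmodels; the contingency counts below are placeholders, not numbers from the paper.

```python
# McNemar's test on paired predictions: only the discordant cells matter,
# i.e. instances where exactly one of the two systems is correct.
from statsmodels.stats.contingency_tables import mcnemar

# Rows: system A correct / incorrect; columns: system B correct / incorrect.
# Counts below are placeholders for illustration only.
table = [[730, 25],
         [12, 133]]

result = mcnemar(table, exact=True)  # exact binomial test on the 25-vs-12 split
print(f"statistic={result.statistic}, p-value={result.pvalue:.4f}")
```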

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation on a fixed public benchmark

full rationale

The paper reports measured micro-F1 scores from applying off-the-shelf and fine-tuned LLMs (including MedGemma with RAG) to the 2018 N2C2 Track 1 dataset. No equations, parameter fits, or derivations are presented; claims of improvement on long-term reasoning criteria are direct experimental outcomes on the chosen benchmark rather than quantities constructed from the model's own inputs or prior self-citations. No self-definitional loops, fitted-input-as-prediction, or load-bearing uniqueness theorems appear.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLMs can reliably extract and reason over eligibility criteria from unstructured longitudinal notes when given retrieval augmentation or summarization, plus the representativeness of the chosen benchmark.

axioms (1)
  • domain assumption Named entity recognition and retrieval methods can produce faithful summaries or evidence snippets without losing critical eligibility information.
    Invoked when comparing the NER-based and RAG strategies to the original long-context baseline.

pith-pipeline@v0.9.0 · 5567 in / 1238 out tokens · 36624 ms · 2026-05-10T18:52:11.936365+00:00 · methodology

discussion (0)

