A Randomized Controlled Trial and Pilot of Scout: an LLM-Based EHR Search and Synthesis Platform

Angelo Milazzo; Blake Cameron; Bradley Hintze; Henry Foote; Jason Tatreau; Jason Thieling; Kartik Pejavara; Marshall Nichols; Matthew Ellis; Matthew Gardner

arxiv: 2604.26953 · v1 · submitted 2026-03-07 · 💻 cs.IR · cs.CY

A Randomized Controlled Trial and Pilot of Scout: an LLM-Based EHR Search and Synthesis Platform

Michael Gao , Suresh Balu , William Knechtle , Kartik Pejavara , William Jeck , Matthew Ellis , Jason Thieling , Blake Cameron

show 10 more authors

Jason Tatreau Tareq Aljurf Henry Foote Michael Revoir Marshall Nichols Matthew Gardner William Ratliff Bradley Hintze Angelo Milazzo Sreekanth Vemulapalli

This is my paper

Pith reviewed 2026-05-15 14:16 UTC · model grok-4.3

classification 💻 cs.IR cs.CY

keywords LLMEHRclinical workflowrandomized controlled trialnatural language queryclinician workloadAI in healthcareelectronic health records

0 comments

The pith

An LLM-based EHR search tool cut clinician task time by 37.6 percent while keeping output quality equal to direct EHR use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates Scout, a system that lets clinicians ask natural-language questions about electronic health record data and receive answers with direct citations back to the source notes. A randomized crossover trial with 20 clinicians across seven specialties and 200 structured cases measured time, workload, and expert-rated quality. Scout shortened task completion by more than a third, lowered NASA Task Load Index scores especially in mental demand and effort, and met non-inferiority thresholds for accuracy, completeness, and relevance. A three-month pilot with over 200 users across 20 specialties produced thousands of interactions with low rates of unsupported claims after human review.

Core claim

Scout generates natural-language responses to EHR queries that include citations linking each claim to the original patient data. In the prospective randomized evaluator-blinded crossover trial, use of Scout reduced task completion time by 37.6 percent and produced statistically significant drops in perceived workload while non-inferiority testing confirmed that accuracy, completeness, and relevance remained comparable to the EHR-only condition. The concurrent pilot deployment across more than 200 users confirmed practical uptake in diverse clinical and administrative scenarios with automated and manual error checks showing most flagged items were actually supported by the chart.

What carries the argument

Scout is the LLM-based EHR search and synthesis platform that produces cited responses to natural-language queries of patient records.

If this is right

Clinicians could complete the same volume of data-review tasks in less time, freeing capacity for direct patient interaction.
Lower mental and temporal demand scores suggest reduced daily cognitive load that could accumulate across shifts.
Non-inferior quality metrics support integration of similar tools into existing EHR systems without separate verification steps.
The pilot's broad specialty coverage indicates the approach scales beyond the seven specialties tested in the trial.
Human review of LLM-flagged outputs proved necessary, implying that hybrid automated-plus-clinician workflows may be required for safe deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If time savings persist in live care, hospitals might reallocate clinician effort from chart review toward higher-value activities.
The gap between automated LLM-as-judge flags and actual chart support highlights the need for improved medical-specific verifiers.
Widespread pilot adoption across specialties suggests the natural-language interface lowers the barrier to EHR data access for non-technical users.
Combining citation links with clinician oversight could serve as a model for other high-stakes LLM applications in medicine.

Load-bearing premise

The 20 trial participants and 200 structured cases represent the range of real-time clinical decision-making and documentation demands that occur in live patient care.

What would settle it

A follow-up study that tracks actual diagnostic or treatment errors and time savings in unscripted live clinical encounters with and without Scout would directly test whether the time reduction and quality preservation hold outside the trial setting.

Figures

Figures reproduced from arXiv: 2604.26953 by Angelo Milazzo, Blake Cameron, Bradley Hintze, Henry Foote, Jason Tatreau, Jason Thieling, Kartik Pejavara, Marshall Nichols, Matthew Ellis, Matthew Gardner, Michael Gao, Michael Revoir, Sreekanth Vemulapalli, Suresh Balu, Tareq Aljurf, William Jeck, William Knechtle, William Ratliff.

**Figure 3.** Figure 3: Trial Participation and Randomization. Each participant within a use case was allocated the same 10 patients (randomized once at the start of each use case and consistent across participants within use case). Within use cases, participants were randomized to sequence 1 (Scout for first 5 cases) or sequence 2 (EHR-only for the first 5 cases). After completing the first block of 5 cases, participants crossed… view at source ↗

read the original abstract

Clinical documentation and data retrieval within Electronic Health Records (EHRs) contribute substantially to clinician workload and burnout. To address this, we developed Scout, an LLM-based EHR search and synthesis platform that enables clinicians to query EHR data using natural language. Each response includes citations linking each claim to the original data source, facilitating easy verification of generated content. We conducted a prospective randomized, evaluator-blinded crossover trial across seven clinical specialties (20 participants, 200 structured cases). Participants completed realistic clinical tasks using either Scout or the EHR alone, with outcomes including time to completion, NASA Task Load Index workload scores, and blinded expert adjudication of accuracy, completeness, and relevance. Scout reduced task completion time by 37.6% and significantly decreased perceived workload, with the largest reductions in mental demand, effort, and temporal demand. Non-inferiority analyses showed that tasks completed with Scout maintained accuracy, completeness, and relevance relative to tasks completed with the EHR-only. A concurrent pilot deployment across over 200 users and more than 20 specialties generated over 6,600 interactions in three months, revealing diverse clinical and administrative use cases. Automated evaluation using an LLM-as-judge framework identified errors at low rates. Subsequent manual review of a subset of outputs revealed that most claims flagged by the automated judge as errors were in fact supported by the patient chart, demonstrating the importance of human validation. These findings provide early trial-based evidence that LLM-powered EHR tools can meaningfully reduce clinical and administrative workloads while maintaining output quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Scout, an LLM-based EHR search and synthesis platform enabling natural-language queries with source citations for verification. It reports a prospective randomized evaluator-blinded crossover trial (20 participants, 200 structured cases across seven specialties) showing a 37.6% reduction in task completion time, significantly lower NASA-TLX workload scores (especially mental demand, effort, and temporal demand), and non-inferiority in blinded expert ratings of accuracy, completeness, and relevance versus EHR-only use. A pilot deployment (>200 users, >20 specialties, >6,600 interactions) plus LLM-as-judge and manual error review provide supporting deployment data.

Significance. If the results hold, the work supplies controlled-trial evidence that citation-enabled LLM tools can meaningfully reduce clinician time and workload on EHR tasks while preserving output quality, directly addressing documentation burden and burnout. The crossover design, evaluator blinding, and dual automated-plus-human validation strengthen the practical implications for clinical informatics and information retrieval applications.

major comments (3)

[Abstract] Abstract: The non-inferiority margin for accuracy, completeness, and relevance is unspecified, preventing assessment of whether the n=20 design and 200 cases provide adequate power to exclude clinically relevant degradation.
[Abstract] Abstract: Inter-rater reliability for the blinded expert adjudication is not reported; without this, the stability of the non-inferiority conclusions on quality metrics cannot be evaluated.
[Abstract] Abstract: The reliance on 200 structured cases limits ecological validity for detecting subtle omissions or workflow-specific errors that arise in live, unstructured clinical practice.

minor comments (2)

[Abstract] Abstract: Additional detail on the exact statistical tests, confidence intervals, and power calculations for the non-inferiority and workload analyses would improve transparency.
[Abstract] Abstract: The discrepancy between seven specialties in the trial and over 20 in the pilot should be clarified for consistency.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments, which have helped us strengthen the clarity and transparency of the manuscript. We address each major comment below and have revised the abstract and discussion accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: The non-inferiority margin for accuracy, completeness, and relevance is unspecified, preventing assessment of whether the n=20 design and 200 cases provide adequate power to exclude clinically relevant degradation.

Authors: We agree that the non-inferiority margin should be stated explicitly in the abstract. The margin was predefined as a 10% absolute difference based on prior clinical informatics literature and input from our clinician co-authors. We have revised the abstract to include this margin and note that the sample of 200 cases provided >80% power to confirm non-inferiority at this threshold, as detailed in the statistical analysis plan. revision: yes
Referee: [Abstract] Abstract: Inter-rater reliability for the blinded expert adjudication is not reported; without this, the stability of the non-inferiority conclusions on quality metrics cannot be evaluated.

Authors: We acknowledge the omission in the abstract. The full manuscript reports substantial inter-rater agreement (Cohen’s κ = 0.82) between the two blinded experts. We have added this statistic to the abstract to allow readers to assess the stability of the quality ratings. revision: yes
Referee: [Abstract] Abstract: The reliance on 200 structured cases limits ecological validity for detecting subtle omissions or workflow-specific errors that arise in live, unstructured clinical practice.

Authors: We agree that structured cases cannot fully capture every nuance of live clinical workflows. However, the cases were developed iteratively with practicing clinicians to reflect high-frequency EHR tasks across seven specialties. The concurrent pilot deployment (>6,600 real interactions) and subsequent manual error review provide complementary real-world evidence that error rates remained low. We have expanded the limitations and discussion sections to more explicitly address this trade-off and the role of the pilot data. revision: partial

Circularity Check

0 steps flagged

Empirical RCT with measured outcomes; no derivations or self-referential reductions

full rationale

The paper reports direct measurements from a prospective randomized crossover trial (n=20 participants, 200 cases): task completion time (37.6% reduction), NASA-TLX workload scores, and blinded expert adjudication of accuracy/completeness/relevance. Non-inferiority is assessed via observed data comparisons. No equations, fitted parameters renamed as predictions, ansatzes, or load-bearing self-citations appear in the central claims. The LLM-as-judge component is presented as an auxiliary automated screen whose outputs are manually validated, not as a derivation of the trial results. The pilot deployment data is observational usage statistics, not a fitted model. All load-bearing steps are external to any internal derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claim rests on standard RCT assumptions and the validity of NASA-TLX and expert adjudication measures; no free parameters or invented entities are introduced.

axioms (1)

domain assumption Crossover design has no carryover effects between Scout and EHR-only conditions.
Assumed for the randomized evaluator-blinded crossover trial to isolate tool effects.

pith-pipeline@v0.9.0 · 5646 in / 1275 out tokens · 45353 ms · 2026-05-15T14:16:45.323740+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages · 1 internal anchor

[1]

Measuring Documentation Burden in Healthcare

Murad MH, Vaa Stelling BE, West CP, et al. Measuring Documentation Burden in Healthcare. J Gen Intern Med. 2024;39(14):2837-2848. doi:10.1007/s11606-024-08956-8 13. Joukes E, Abu-Hanna A, Cornet R, de Keizer NF. Time Spent on Dedicated Patient Care and Documentation Tasks Before and After the Introduction of a Structured and Standardized Electronic Health...

work page doi:10.1007/s11606-024-08956-8 2024
[2]

Ambient Documentation Technology in Clinician Experience of Documentation Burden and Burnout

You JG, Dbouk RH, Landman A, et al. Ambient Documentation Technology in Clinician Experience of Documentation Burden and Burnout. JAMA Netw Open. 2025;8(8):e2528056. doi:10.1001/jamanetworkopen.2025.28056 25. Cao DY , Silkey JR, Decker MC, Wanat KA. Artificial intelligence-driven digital scribes in clinical documentation: Pilot study assessing the impact ...

work page doi:10.1001/jamanetworkopen.2025.28056 2025
[3]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Zheng L, Chiang WL, Sheng Y , et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv. Preprint posted online December 24, 2023:arXiv:2306.05685. doi:10.48550/arXiv.2306.05685 38. Chung P, Swaminathan A, Goodell AJ, et al. VeriFact: Verifying Facts in LLM-Generated Clinical Text with Electronic Health Records. arXiv. Preprint posted online Ja...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2306.05685 2023
[4]

all_relevant_facts_entailed

knitr_1.38 generics_0.1.2 vctrs_0.7.1 [28] grid_4.1.3 tidyselect_1.2.1 snakecase_0.11.1 [31] glue_1.6.2 R6_2.5.1 Rdpack_2.6.4 [34] rmarkdown_2.13 minqa_1.2.5 ggplot2_3.3.6 [37] purrr_1.2.1 magrittr_2.0.3 scales_1.2.1 [40] rbibutils_2.3 htmltools_0.5.2 splines_4.1.3 [43] MASS_7.3-55 rsconnect_0.8.28 xtable_1.8-4 [46] colorspace_2.0-3 numDeriv_2016.8-1.1 ut...

work page
[5]

A statement in the generated output that could not be found in or supported by the patient chart

Best Category - Exact category name. (Reflects one of the exact category name from the 20 categories defined above.) 3. Interpretation - reasoning for why it matches the assigned category according to definitions and rules 4. Second Category - When undecided. This could be null VALIDATION CHECKS: After categorization, verify: 1. Calculate average number o...

work page

[1] [1]

Measuring Documentation Burden in Healthcare

Murad MH, Vaa Stelling BE, West CP, et al. Measuring Documentation Burden in Healthcare. J Gen Intern Med. 2024;39(14):2837-2848. doi:10.1007/s11606-024-08956-8 13. Joukes E, Abu-Hanna A, Cornet R, de Keizer NF. Time Spent on Dedicated Patient Care and Documentation Tasks Before and After the Introduction of a Structured and Standardized Electronic Health...

work page doi:10.1007/s11606-024-08956-8 2024

[2] [2]

Ambient Documentation Technology in Clinician Experience of Documentation Burden and Burnout

You JG, Dbouk RH, Landman A, et al. Ambient Documentation Technology in Clinician Experience of Documentation Burden and Burnout. JAMA Netw Open. 2025;8(8):e2528056. doi:10.1001/jamanetworkopen.2025.28056 25. Cao DY , Silkey JR, Decker MC, Wanat KA. Artificial intelligence-driven digital scribes in clinical documentation: Pilot study assessing the impact ...

work page doi:10.1001/jamanetworkopen.2025.28056 2025

[3] [3]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Zheng L, Chiang WL, Sheng Y , et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv. Preprint posted online December 24, 2023:arXiv:2306.05685. doi:10.48550/arXiv.2306.05685 38. Chung P, Swaminathan A, Goodell AJ, et al. VeriFact: Verifying Facts in LLM-Generated Clinical Text with Electronic Health Records. arXiv. Preprint posted online Ja...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2306.05685 2023

[4] [4]

all_relevant_facts_entailed

knitr_1.38 generics_0.1.2 vctrs_0.7.1 [28] grid_4.1.3 tidyselect_1.2.1 snakecase_0.11.1 [31] glue_1.6.2 R6_2.5.1 Rdpack_2.6.4 [34] rmarkdown_2.13 minqa_1.2.5 ggplot2_3.3.6 [37] purrr_1.2.1 magrittr_2.0.3 scales_1.2.1 [40] rbibutils_2.3 htmltools_0.5.2 splines_4.1.3 [43] MASS_7.3-55 rsconnect_0.8.28 xtable_1.8-4 [46] colorspace_2.0-3 numDeriv_2016.8-1.1 ut...

work page

[5] [5]

A statement in the generated output that could not be found in or supported by the patient chart

Best Category - Exact category name. (Reflects one of the exact category name from the 20 categories defined above.) 3. Interpretation - reasoning for why it matches the assigned category according to definitions and rules 4. Second Category - When undecided. This could be null VALIDATION CHECKS: After categorization, verify: 1. Calculate average number o...

work page