A Randomized Controlled Trial and Pilot of Scout: an LLM-Based EHR Search and Synthesis Platform
Pith reviewed 2026-05-15 14:16 UTC · model grok-4.3
The pith
An LLM-based EHR search tool cut clinician task time by 37.6 percent while keeping output quality equal to direct EHR use.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Scout generates natural-language responses to EHR queries that include citations linking each claim to the original patient data. In the prospective randomized evaluator-blinded crossover trial, use of Scout reduced task completion time by 37.6 percent and produced statistically significant drops in perceived workload while non-inferiority testing confirmed that accuracy, completeness, and relevance remained comparable to the EHR-only condition. The concurrent pilot deployment across more than 200 users confirmed practical uptake in diverse clinical and administrative scenarios with automated and manual error checks showing most flagged items were actually supported by the chart.
What carries the argument
Scout is the LLM-based EHR search and synthesis platform that produces cited responses to natural-language queries of patient records.
If this is right
- Clinicians could complete the same volume of data-review tasks in less time, freeing capacity for direct patient interaction.
- Lower mental and temporal demand scores suggest reduced daily cognitive load that could accumulate across shifts.
- Non-inferior quality metrics support integration of similar tools into existing EHR systems without separate verification steps.
- The pilot's broad specialty coverage indicates the approach scales beyond the seven specialties tested in the trial.
- Human review of LLM-flagged outputs proved necessary, implying that hybrid automated-plus-clinician workflows may be required for safe deployment.
Where Pith is reading between the lines
- If time savings persist in live care, hospitals might reallocate clinician effort from chart review toward higher-value activities.
- The gap between automated LLM-as-judge flags and actual chart support highlights the need for improved medical-specific verifiers.
- Widespread pilot adoption across specialties suggests the natural-language interface lowers the barrier to EHR data access for non-technical users.
- Combining citation links with clinician oversight could serve as a model for other high-stakes LLM applications in medicine.
Load-bearing premise
The 20 trial participants and 200 structured cases represent the range of real-time clinical decision-making and documentation demands that occur in live patient care.
What would settle it
A follow-up study that tracks actual diagnostic or treatment errors and time savings in unscripted live clinical encounters with and without Scout would directly test whether the time reduction and quality preservation hold outside the trial setting.
Figures
read the original abstract
Clinical documentation and data retrieval within Electronic Health Records (EHRs) contribute substantially to clinician workload and burnout. To address this, we developed Scout, an LLM-based EHR search and synthesis platform that enables clinicians to query EHR data using natural language. Each response includes citations linking each claim to the original data source, facilitating easy verification of generated content. We conducted a prospective randomized, evaluator-blinded crossover trial across seven clinical specialties (20 participants, 200 structured cases). Participants completed realistic clinical tasks using either Scout or the EHR alone, with outcomes including time to completion, NASA Task Load Index workload scores, and blinded expert adjudication of accuracy, completeness, and relevance. Scout reduced task completion time by 37.6% and significantly decreased perceived workload, with the largest reductions in mental demand, effort, and temporal demand. Non-inferiority analyses showed that tasks completed with Scout maintained accuracy, completeness, and relevance relative to tasks completed with the EHR-only. A concurrent pilot deployment across over 200 users and more than 20 specialties generated over 6,600 interactions in three months, revealing diverse clinical and administrative use cases. Automated evaluation using an LLM-as-judge framework identified errors at low rates. Subsequent manual review of a subset of outputs revealed that most claims flagged by the automated judge as errors were in fact supported by the patient chart, demonstrating the importance of human validation. These findings provide early trial-based evidence that LLM-powered EHR tools can meaningfully reduce clinical and administrative workloads while maintaining output quality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Scout, an LLM-based EHR search and synthesis platform enabling natural-language queries with source citations for verification. It reports a prospective randomized evaluator-blinded crossover trial (20 participants, 200 structured cases across seven specialties) showing a 37.6% reduction in task completion time, significantly lower NASA-TLX workload scores (especially mental demand, effort, and temporal demand), and non-inferiority in blinded expert ratings of accuracy, completeness, and relevance versus EHR-only use. A pilot deployment (>200 users, >20 specialties, >6,600 interactions) plus LLM-as-judge and manual error review provide supporting deployment data.
Significance. If the results hold, the work supplies controlled-trial evidence that citation-enabled LLM tools can meaningfully reduce clinician time and workload on EHR tasks while preserving output quality, directly addressing documentation burden and burnout. The crossover design, evaluator blinding, and dual automated-plus-human validation strengthen the practical implications for clinical informatics and information retrieval applications.
major comments (3)
- [Abstract] Abstract: The non-inferiority margin for accuracy, completeness, and relevance is unspecified, preventing assessment of whether the n=20 design and 200 cases provide adequate power to exclude clinically relevant degradation.
- [Abstract] Abstract: Inter-rater reliability for the blinded expert adjudication is not reported; without this, the stability of the non-inferiority conclusions on quality metrics cannot be evaluated.
- [Abstract] Abstract: The reliance on 200 structured cases limits ecological validity for detecting subtle omissions or workflow-specific errors that arise in live, unstructured clinical practice.
minor comments (2)
- [Abstract] Abstract: Additional detail on the exact statistical tests, confidence intervals, and power calculations for the non-inferiority and workload analyses would improve transparency.
- [Abstract] Abstract: The discrepancy between seven specialties in the trial and over 20 in the pilot should be clarified for consistency.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments, which have helped us strengthen the clarity and transparency of the manuscript. We address each major comment below and have revised the abstract and discussion accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: The non-inferiority margin for accuracy, completeness, and relevance is unspecified, preventing assessment of whether the n=20 design and 200 cases provide adequate power to exclude clinically relevant degradation.
Authors: We agree that the non-inferiority margin should be stated explicitly in the abstract. The margin was predefined as a 10% absolute difference based on prior clinical informatics literature and input from our clinician co-authors. We have revised the abstract to include this margin and note that the sample of 200 cases provided >80% power to confirm non-inferiority at this threshold, as detailed in the statistical analysis plan. revision: yes
-
Referee: [Abstract] Abstract: Inter-rater reliability for the blinded expert adjudication is not reported; without this, the stability of the non-inferiority conclusions on quality metrics cannot be evaluated.
Authors: We acknowledge the omission in the abstract. The full manuscript reports substantial inter-rater agreement (Cohen’s κ = 0.82) between the two blinded experts. We have added this statistic to the abstract to allow readers to assess the stability of the quality ratings. revision: yes
-
Referee: [Abstract] Abstract: The reliance on 200 structured cases limits ecological validity for detecting subtle omissions or workflow-specific errors that arise in live, unstructured clinical practice.
Authors: We agree that structured cases cannot fully capture every nuance of live clinical workflows. However, the cases were developed iteratively with practicing clinicians to reflect high-frequency EHR tasks across seven specialties. The concurrent pilot deployment (>6,600 real interactions) and subsequent manual error review provide complementary real-world evidence that error rates remained low. We have expanded the limitations and discussion sections to more explicitly address this trade-off and the role of the pilot data. revision: partial
Circularity Check
Empirical RCT with measured outcomes; no derivations or self-referential reductions
full rationale
The paper reports direct measurements from a prospective randomized crossover trial (n=20 participants, 200 cases): task completion time (37.6% reduction), NASA-TLX workload scores, and blinded expert adjudication of accuracy/completeness/relevance. Non-inferiority is assessed via observed data comparisons. No equations, fitted parameters renamed as predictions, ansatzes, or load-bearing self-citations appear in the central claims. The LLM-as-judge component is presented as an auxiliary automated screen whose outputs are manually validated, not as a derivation of the trial results. The pilot deployment data is observational usage statistics, not a fitted model. All load-bearing steps are external to any internal derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Crossover design has no carryover effects between Scout and EHR-only conditions.
Reference graph
Works this paper leans on
-
[1]
Measuring Documentation Burden in Healthcare
Murad MH, Vaa Stelling BE, West CP, et al. Measuring Documentation Burden in Healthcare. J Gen Intern Med. 2024;39(14):2837-2848. doi:10.1007/s11606-024-08956-8 13. Joukes E, Abu-Hanna A, Cornet R, de Keizer NF. Time Spent on Dedicated Patient Care and Documentation Tasks Before and After the Introduction of a Structured and Standardized Electronic Health...
-
[2]
Ambient Documentation Technology in Clinician Experience of Documentation Burden and Burnout
You JG, Dbouk RH, Landman A, et al. Ambient Documentation Technology in Clinician Experience of Documentation Burden and Burnout. JAMA Netw Open. 2025;8(8):e2528056. doi:10.1001/jamanetworkopen.2025.28056 25. Cao DY , Silkey JR, Decker MC, Wanat KA. Artificial intelligence-driven digital scribes in clinical documentation: Pilot study assessing the impact ...
-
[3]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Zheng L, Chiang WL, Sheng Y , et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv. Preprint posted online December 24, 2023:arXiv:2306.05685. doi:10.48550/arXiv.2306.05685 38. Chung P, Swaminathan A, Goodell AJ, et al. VeriFact: Verifying Facts in LLM-Generated Clinical Text with Electronic Health Records. arXiv. Preprint posted online Ja...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2306.05685 2023
-
[4]
knitr_1.38 generics_0.1.2 vctrs_0.7.1 [28] grid_4.1.3 tidyselect_1.2.1 snakecase_0.11.1 [31] glue_1.6.2 R6_2.5.1 Rdpack_2.6.4 [34] rmarkdown_2.13 minqa_1.2.5 ggplot2_3.3.6 [37] purrr_1.2.1 magrittr_2.0.3 scales_1.2.1 [40] rbibutils_2.3 htmltools_0.5.2 splines_4.1.3 [43] MASS_7.3-55 rsconnect_0.8.28 xtable_1.8-4 [46] colorspace_2.0-3 numDeriv_2016.8-1.1 ut...
-
[5]
A statement in the generated output that could not be found in or supported by the patient chart
Best Category - Exact category name. (Reflects one of the exact category name from the 20 categories defined above.) 3. Interpretation - reasoning for why it matches the assigned category according to definitions and rules 4. Second Category - When undecided. This could be null VALIDATION CHECKS: After categorization, verify: 1. Calculate average number o...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.