
arxiv: 2602.00074 · v2 · submitted 2026-01-21 · 💻 cs.CY · cs.AI

Adoption and Use of LLMs at an Academic Medical Center

Pith reviewed 2026-05-16 12:48 UTC · model grok-4.3

keywords: large language models, electronic health records, clinical documentation, AI adoption, ChatEHR, healthcare automation, cost savings, user training

The pith

A medical center built ChatEHR to embed LLMs directly into electronic health records, turning model access into a trained and monitored institutional tool that 1075 trained users engaged in 23,000 sessions during its first three months.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes development of ChatEHR to overcome manual data entry friction when applying LLMs to clinical documentation. The system gives users access to the full multi-year patient timeline inside the EHR for both fixed automations and interactive sessions. After training, 1075 users completed 23,000 sessions in the first three months, with summaries as the dominant task. The authors report 0.73 hallucinations and 1.60 inaccuracies per summary generation and estimate $6 million in first-year savings. They present the internal build as a way for health systems to keep control while continuously evaluating performance.

Core claim

ChatEHR redefines LLM use as an institutional capability by giving trained users access to the entire patient timeline spanning several years through a combination of static automations and an interactive user interface in the EHR, which supported 7 automations, 1075 routine users, 23,000 sessions in three months, and initial estimates of $6M savings in the first year without yet quantifying care improvements.
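The adoption figures above imply a rough usage intensity. The sketch below works through that arithmetic. It assumes that all 23,000 sessions came from the 1075 trained users and that usage was spread evenly, neither of which the paper states; the function name is illustrative.

```python
# Figures quoted in the core claim above.
TRAINED_USERS = 1075
SESSIONS_FIRST_3_MONTHS = 23_000
MONTHS = 3


def sessions_per_user(sessions: int, users: int) -> float:
    """Average sessions per user, assuming every session belongs to a counted user."""
    if users <= 0:
        raise ValueError("users must be positive")
    return sessions / users


per_user = sessions_per_user(SESSIONS_FIRST_3_MONTHS, TRAINED_USERS)
per_user_per_month = per_user / MONTHS  # assumes even usage across the period

print(f"{per_user:.1f} sessions per user over {MONTHS} months")
print(f"{per_user_per_month:.1f} sessions per user per month")
```

Under those assumptions the average is about 21 sessions per user over the three months, or roughly 7 per month. An average says nothing about how usage is distributed; a small group of heavy users could account for most sessions.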

What carries the argument

ChatEHR, the integrated system that pairs LLMs with the complete patient medical record timeline inside the EHR to support both fixed automations and interactive use after user training.

Load-bearing premise

The reported hallucination rates, inaccuracy counts, and $6M savings estimates are accurately measured and directly attributable to ChatEHR rather than other changes in workflows or documentation practices.

What would settle it

An independent audit that re-examines the error annotation protocols for summaries and the baseline cost calculation methods to confirm or revise the figures of 0.73 hallucinations and 1.60 inaccuracies per generation and the annual savings total.
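One way such an audit might recompute the per-generation figures is sketched below, assuming each reviewed summary carries annotated counts of hallucinations and inaccuracies. The `AnnotatedSummary` type, the function name, and the sample counts are hypothetical; they are not the paper's annotation protocol or data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class AnnotatedSummary:
    """Reviewer-assigned error counts for one generated summary (hypothetical schema)."""
    hallucinations: int
    inaccuracies: int


def per_generation_rates(annotations: list[AnnotatedSummary]) -> tuple[float, float]:
    """Return (mean hallucinations, mean inaccuracies) per generation."""
    if not annotations:
        raise ValueError("at least one annotated summary is required")
    return (
        mean(a.hallucinations for a in annotations),
        mean(a.inaccuracies for a in annotations),
    )


# Made-up counts for illustration only; not taken from the paper.
sample = [
    AnnotatedSummary(hallucinations=0, inaccuracies=1),
    AnnotatedSummary(hallucinations=1, inaccuracies=2),
    AnnotatedSummary(hallucinations=2, inaccuracies=0),
    AnnotatedSummary(hallucinations=0, inaccuracies=3),
]

hallucination_rate, inaccuracy_rate = per_generation_rates(sample)
print(hallucination_rate, inaccuracy_rate)
```

A real audit would also need the sampling scheme, the rubric that separates a hallucination from an inaccuracy, and agreement between reviewers, all of which the referee report notes are missing from the abstract.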

Original abstract

While large language models (LLMs) can support clinical documentation needs, standalone tools struggle with "workflow friction" from manual data entry. We developed ChatEHR, a system that enables the use of LLMs with the entire patient timeline spanning several years. ChatEHR enables automations - which are static combinations of prompts and data that perform a fixed task - and interactive use in the electronic health record (EHR) via a user interface (UI). The resulting ability to sift through patient medical records for diverse use-cases such as pre-visit chart review, screening for transfer eligibility, monitoring for surgical site infections, and chart abstraction, redefines LLM use as an institutional capability. This system, accessible after user-training, enables continuous monitoring and evaluation of LLM use. In 1.5 years, we built 7 automations and 1075 users have trained to become routine users of the UI, engaging in 23,000 sessions in the first 3 months of launch. For automations, being model-agnostic and accessing multiple types of data was essential for matching specific clinical or administrative tasks with the most appropriate LLM. Benchmark-based evaluations proved insufficient for monitoring and evaluation of the UI, requiring new methods to monitor performance. Generation of summaries was the most frequent task in the UI, with an estimated 0.73 hallucinations and 1.60 inaccuracies per generation. The resulting mix of cost savings, time savings, and revenue growth required a value assessment framework to prioritize work as well as quantify the impact of using LLMs. Initial estimates are $6M savings in the first year of use, without quantifying the benefit of the better care offered. Such a "build-from-within" strategy provides an opportunity for health systems to maintain agency via a vendor-agnostic, internally governed LLM platform.
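The abstract defines an automation as a static combination of prompts and data that performs a fixed task, and describes the automations as model-agnostic. A minimal sketch of that idea follows, assuming a plain prompt template and named data sources. The class, field names, record keys, and stub model are all hypothetical and do not reflect ChatEHR's actual design.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Any callable that maps a prompt to text; keeps the automation model-agnostic.
LLM = Callable[[str], str]


@dataclass(frozen=True)
class Automation:
    """A fixed prompt template paired with a fixed set of data sources (hypothetical)."""
    name: str
    prompt_template: str          # must contain a "{context}" placeholder
    data_sources: Sequence[str]   # keys to pull from the patient record

    def build_prompt(self, record: dict[str, str]) -> str:
        # Concatenate the selected record sections in a stable order.
        context = "\n".join(
            f"[{source}]\n{record.get(source, '')}" for source in self.data_sources
        )
        return self.prompt_template.format(context=context)

    def run(self, record: dict[str, str], model: LLM) -> str:
        return model(self.build_prompt(record))


def stub_model(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return f"(stub output for a prompt of {len(prompt)} characters)"


# Illustrative automation; the task name echoes a use case listed in the abstract.
transfer_screen = Automation(
    name="transfer-eligibility-screen",
    prompt_template="Assess transfer eligibility from the record below.\n{context}",
    data_sources=("notes", "labs"),
)

record = {"notes": "Example note text.", "labs": "Example lab values."}
prompt = transfer_screen.build_prompt(record)
result = transfer_screen.run(record, stub_model)
print(result)
```

Because the model is passed in as a callable, the same automation can be paired with whichever LLM suits the task, which is the property the abstract calls essential.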

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript describes the development and deployment of ChatEHR, an internally built system that integrates LLMs into the electronic health record (EHR) to support both static automations (fixed prompt-data combinations) and interactive UI-based queries over multi-year patient timelines. It reports adoption metrics (7 automations built, 1075 trained users, 23,000 sessions in the first 3 months), performance estimates for summary generation (0.73 hallucinations and 1.60 inaccuracies per generation), and an initial $6M first-year cost-savings figure, framing the work as a vendor-agnostic, institutionally governed approach to LLM use in clinical settings.

Significance. If the quantified outcomes can be substantiated, the paper provides a concrete case study of scaling LLM capabilities within a health system while retaining internal control and continuous monitoring. The model-agnostic design for automations and the shift from benchmark-only evaluation to custom monitoring methods address practical barriers in clinical deployment; the reported usage volume and savings estimate, if verified, would offer a rare data point on real-world institutional impact.

major comments (2)
  1. [Abstract] The headline performance figures (0.73 hallucinations and 1.60 inaccuracies per summary generation) and the $6M first-year savings estimate are presented without any description of the underlying measurement protocols, annotation rubrics, inter-rater reliability checks, baseline comparisons, time-motion data, or arithmetic linking session volume to dollar savings. These quantities are load-bearing for the central claim of institutional value and cannot be evaluated as reported.
  2. [Abstract] The statement that 'benchmark-based evaluations proved insufficient' and that 'new methods' were required for UI monitoring is asserted but not accompanied by any concrete description of those methods, their validation, or how they differ from standard approaches, leaving the evaluation framework opaque.
minor comments (2)
  1. [Abstract] The term 'automations' is introduced as 'static combinations of prompts and data' but receives no further operational definition or examples of the seven built automations, which would help readers understand the scope of tasks addressed.
  2. [Abstract] The claim that ChatEHR 'redefines LLM use as an institutional capability' is interpretive; a more neutral phrasing such as 'demonstrates LLM use as an institutional capability' would better separate observation from framing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments below and will revise the manuscript to provide the requested methodological details and transparency.

Point-by-point responses
  1. Referee: [Abstract] The headline performance figures (0.73 hallucinations and 1.60 inaccuracies per summary generation) and the $6M first-year savings estimate are presented without any description of the underlying measurement protocols, annotation rubrics, inter-rater reliability checks, baseline comparisons, time-motion data, or arithmetic linking session volume to dollar savings. These quantities are load-bearing for the central claim of institutional value and cannot be evaluated as reported.

    Authors: We agree that the abstract lacks sufficient detail on these measurement protocols. In the revised manuscript we will expand the abstract to briefly describe the annotation rubrics, inter-rater reliability checks, time-motion studies, and the arithmetic used to link session volume to the $6M savings estimate. A dedicated methods subsection will also be added or expanded to fully document these protocols and any baseline comparisons.

    Revision: yes

  2. Referee: [Abstract] The statement that 'benchmark-based evaluations proved insufficient' and that 'new methods' were required for UI monitoring is asserted but not accompanied by any concrete description of those methods, their validation, or how they differ from standard approaches, leaving the evaluation framework opaque.

    Authors: We acknowledge that the manuscript does not yet provide concrete descriptions of the new UI monitoring methods. In revision we will detail these methods, including how they use real-time logging of clinical outputs, custom error-detection rules tailored to multi-year patient timelines, validation against expert review, and explicit contrasts with standard NLP benchmarks to show why the latter proved insufficient for our clinical use cases.

    Revision: yes

Circularity Check

0 steps flagged

Observational deployment report contains no circular derivations or self-referential predictions.

Full rationale

The manuscript reports direct observational outcomes from building and deploying ChatEHR: user training counts, session volumes, task frequencies, and rough estimates of hallucinations, inaccuracies, and dollar savings. No equations, fitted parameters, uniqueness theorems, or ansatzes are invoked; the central claims are presented as measured results from system use rather than predictions derived from prior inputs. Absence of protocols for annotation or cost calculation affects verifiability but does not create circularity, as nothing reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a descriptive case study of system implementation and usage. It introduces no mathematical free parameters, relies on no explicit axioms, and postulates no new entities. All claims rest on reported deployment statistics and internal estimates.

pith-pipeline@v0.9.0 · 5887 in / 1215 out tokens · 36403 ms · 2026-05-16T12:48:04.655134+00:00 · methodology
