pith. machine review for the scientific record.

arxiv: 2605.08955 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: no theorem link

Outlier detection for patient monitoring and alerting

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:46 UTC · model grok-4.3

classification 💻 cs.LG
keywords: outlier detection · electronic health records · anomaly detection · patient monitoring · clinical alerting · medical error detection · postoperative care

The pith

Unusual patient management decisions in electronic health records can be detected as outliers and alerted with true alert rates of 25% to 66%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether spotting unusual patient care choices by comparing them to past electronic health records can help catch errors. The authors built an outlier detection system on data from 4486 post-cardiac surgery patients, generated alerts for 222 cases, and had experts review whether the flagged decisions were indeed mistakes. Experts judged 25% to 66% of the alerts to be true, with the best performance on the most extreme outliers. This points to a way of using existing patient data to support real-time monitoring and reduce medical errors.

Core claim

A data-driven outlier detection approach applied to patient-management decisions in electronic health records can identify potential errors. When evaluated on cases from 4486 post-cardiac surgical patients using expert opinions as ground truth, the method achieved true alert rates ranging from 25% to 66%, with the highest rates for the strongest outliers. This supports the hypothesis that generating alerts for unusual decisions is worthwhile for patient monitoring.

What carries the argument

Outlier detection model trained on historical EHR patient cases to score the unusualness of current patient-management decisions.
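In outline, this kind of state-conditioned outlier detection can be sketched as: estimate P(action | patient state) from past cases and score a decision by how improbable it is in context. The summary does not specify the paper's models (its references suggest SVM classifiers with calibrated probabilities); the frequency-count estimator, Laplace smoothing, and toy state/action labels below are illustrative assumptions only.

```python
from collections import Counter, defaultdict

def fit_conditional_model(history):
    """Estimate P(action | state) from historical (state, action) pairs
    using frequency counts with Laplace smoothing (an illustrative choice)."""
    counts = defaultdict(Counter)
    actions = set()
    for state, action in history:
        counts[state][action] += 1
        actions.add(action)

    def prob(state, action):
        c = counts[state]
        total = sum(c.values())
        return (c[action] + 1) / (total + len(actions))

    return prob

def outlier_score(prob, state, action):
    """Higher score means the decision is more unusual given the state."""
    return 1.0 - prob(state, action)

# Toy history: for low potassium, replacement is the norm, withholding is rare
history = [("low_K", "give_KCl")] * 40 + [("low_K", "withhold_KCl")] * 2
prob = fit_conditional_model(history)
print(round(outlier_score(prob, "low_K", "withhold_KCl"), 2))  # 0.93
print(round(outlier_score(prob, "low_K", "give_KCl"), 2))      # 0.07
```

A deployed system would replace the discrete state with feature vectors derived from the EHR and a learned classifier, but the scoring logic is the same: actions that are rare for a given state earn high outlier scores.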

Load-bearing premise

Unusual decisions with respect to past patient care are likely to be errors, and expert opinions provide a valid measure of whether an alert is true.

What would settle it

A follow-up study with more patients and multiple independent expert reviews finding true alert rates below 20% for most alerts would undermine the claim that the approach yields promising alert rates.

Figures

Figures reproduced from arXiv: 2605.08955 by Gilles Clermont, Gregory F. Cooper, Iyad Batal, Michal Valko, Miloš Hauskrecht, Shyam Visweswaran.

Figure 1
Figure 1. Outlier-based alerting framework and its two stages: the model-building stage (top) and the model-application stage (bottom). (M. Hauskrecht et al., Journal of Biomedical Informatics 46 (2013) 47–55.) [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. The segmentation of a patient's EHR into four patient state–action instances. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Examples of temporal features for time series of continuous laboratory test values. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Calculation of the alert score from the two anomaly scores. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Distributions of alert scores for the 222 cases used in the evaluation (top panel) and for the 4870 initially generated alert candidates (bottom panel). [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Relation of the alert score to the true alert rate, analyzed by binning the alert scores (in intervals of width 0.2) and presenting the true alert rate per bin. The true alert rates for responses to Item 1 vary from 25% for low alert scores to 66% for high alert scores, indicating that top action-specific alerts with higher alert scores are more likely associated w… view at source ↗
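The per-bin analysis behind Figure 6 reduces to grouping alerts by score interval and computing the expert-confirmed fraction in each bin. A minimal sketch using the figure's bin width of 0.2 (the scores and labels below are invented for illustration, not the paper's data):

```python
def true_alert_rate_by_bin(scores, labels, width=0.2):
    """Bin alert scores into intervals of `width` and return the
    fraction of expert-confirmed (true) alerts per bin."""
    bins = {}
    for s, y in zip(scores, labels):
        key = round(int(s / width) * width, 10)   # left edge of the bin
        hits, total = bins.get(key, (0, 0))
        bins[key] = (hits + y, total + 1)
    return {k: hits / total for k, (hits, total) in sorted(bins.items())}

# Invented alert scores and expert labels (1 = expert judged the alert true)
scores = [0.1, 0.15, 0.3, 0.35, 0.9, 0.95]
labels = [0, 0, 1, 0, 1, 1]
print(true_alert_rate_by_bin(scores, labels))  # {0.0: 0.0, 0.2: 0.5, 0.8: 1.0}
```

The gradient the paper reports (higher bins, higher true alert rates) corresponds to the per-bin rates rising with the bin's left edge.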
read the original abstract

We develop and evaluate a data-driven approach for detecting unusual (anomalous) patient-management decisions using past patient cases stored in electronic health records (EHRs). Our hypothesis is that a patient-management decision that is unusual with respect to past patient care may be due to an error and that it is worthwhile to generate an alert if such a decision is encountered. We evaluate this hypothesis using data obtained from EHRs of 4486 post-cardiac surgical patients and a subset of 222 alerts generated from the data. We base the evaluation on the opinions of a panel of experts. The results of the study support our hypothesis that the outlier-based alerting can lead to promising true alert rates. We observed true alert rates that ranged from 25% to 66% for a variety of patient-management actions, with 66% corresponding to the strongest outliers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript develops a data-driven outlier detection method to flag anomalous patient-management decisions in EHR data from 4486 post-cardiac surgery patients. It generates 222 alerts and evaluates them through expert panel review, reporting true alert rates of 25% to 66% (higher for stronger outliers) and concluding that the approach yields promising results for error detection and alerting.

Significance. If the evaluation methodology proves robust, the work offers a practical, data-driven complement to rule-based clinical decision support by identifying deviations from historical care patterns. Strengths include the use of real EHR data at scale and direct expert validation on actual alerts. The reported rate range provides an initial signal that outlier strength correlates with expert-flagged issues, which could inform alerting thresholds in monitoring systems.

major comments (3)
  1. [§3 (Methods)] The outlier detection procedure is described at a high level only. No specification is given for the feature set extracted from EHR management decisions, the distance/density measure used to quantify outlierness, preprocessing (normalization, missing-value handling, temporal alignment), or any multiple-testing correction. These omissions make it impossible to assess reproducibility or to determine whether the reported rates depend on particular modeling choices.
  2. [§4 (Evaluation)] True-alert rates rest entirely on expert panel judgments, yet no inter-rater reliability statistic (Cohen’s or Fleiss’ kappa), panel size, selection criteria, blinding protocol, or correlation with downstream patient outcomes is reported. Without these, the 25–66% figures cannot be interpreted as evidence that statistical outlierness corresponds to clinical error rather than legitimate practice variation.
  3. [§4.1 and Table 2] The subset of 222 alerts is presented without describing the sampling frame or selection criteria from the full set of outliers. If the 222 were chosen to include the strongest outliers, the observed rate gradient may be an artifact of selection rather than a general property of the method.
minor comments (2)
  1. [Abstract and §1] The abstract and §1 should explicitly define “true alert rate” (expert agreement that the decision was erroneous) and distinguish it from positive predictive value against objective outcomes.
  2. [Figure 5] Figure 5 (the alert-score distributions) would benefit from axis labels that include units and from an overlay of the expert-labeled subset.
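Major comment 2 asks for an inter-rater reliability statistic such as Cohen's kappa (reference [33] in the paper's bibliography). For two raters over the same alerts it is straightforward to compute; a minimal sketch with hypothetical ratings:

```python
def cohens_kappa(r1, r2):
    """Cohen's kappa: agreement between two raters beyond chance."""
    n = len(r1)
    po = sum(a == b for a, b in zip(r1, r2)) / n                   # observed agreement
    cats = set(r1) | set(r2)
    pe = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)  # chance agreement
    return (po - pe) / (1 - pe)

# Hypothetical ratings of 8 alerts by two experts (1 = true alert)
expert_a = [1, 1, 0, 0, 1, 0, 1, 1]
expert_b = [1, 1, 0, 1, 1, 0, 1, 0]
print(round(cohens_kappa(expert_a, expert_b), 2))  # 0.47
```

For more than two raters, Fleiss' kappa generalizes the same observed-versus-chance comparison across the panel.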

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment below and indicate the revisions we will make to improve the manuscript.

read point-by-point responses
  1. Referee: [§3 (Methods)] The outlier detection procedure is described at a high level only. No specification is given for the feature set extracted from EHR management decisions, the distance/density measure used to quantify outlierness, preprocessing (normalization, missing-value handling, temporal alignment), or any multiple-testing correction. These omissions make it impossible to assess reproducibility or to determine whether the reported rates depend on particular modeling choices.

    Authors: We agree that the methods section would benefit from greater specificity to support reproducibility. We will revise §3 to provide a detailed description of the feature set extracted from the EHR (encompassing vital signs, laboratory values, medications, and procedural data), the specific outlier detection approach and distance/density measure, all preprocessing steps including normalization, missing-value handling, and temporal alignment, and confirmation that no multiple-testing correction was applied. These additions will allow readers to evaluate the dependence of results on modeling choices. revision: yes

  2. Referee: [§4 (Evaluation)] True-alert rates rest entirely on expert panel judgments, yet no inter-rater reliability statistic (Cohen’s or Fleiss’ kappa), panel size, selection criteria, blinding protocol, or correlation with downstream patient outcomes is reported. Without these, the 25–66% figures cannot be interpreted as evidence that statistical outlierness corresponds to clinical error rather than legitimate practice variation.

    Authors: We will expand §4 to include the expert panel size, selection criteria, and blinding protocol. Inter-rater reliability was not computed in the original study, and downstream patient outcomes were not tracked. We will explicitly note these as limitations and discuss the implications for interpreting the true-alert rates as potential indicators of error versus legitimate variation in practice. revision: partial

  3. Referee: [§4.1 and Table 2] The subset of 222 alerts is presented without describing the sampling frame or selection criteria from the full set of outliers. If the 222 were chosen to include the strongest outliers, the observed rate gradient may be an artifact of selection rather than a general property of the method.

    Authors: We will revise §4.1 and the Table 2 caption to explicitly state the sampling frame and selection criteria applied to obtain the 222 alerts from the complete set of outliers. This clarification will enable readers to assess whether the observed gradient in true-alert rates is influenced by selection or represents a broader property of the outlier detection method. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation or evaluation

full rationale

The paper develops an outlier detection method on EHR patient-management decisions from 4486 cases, generates 222 alerts, and evaluates true alert rates (25-66%) via separate expert panel review as ground truth. No equations, fitted parameters, or self-citations are shown to reduce the central result to its inputs by construction; the evaluation relies on independent expert judgments rather than reusing the same outcomes or data for both model fitting and performance claims. This keeps the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one key domain assumption with no free parameters or invented entities described in the abstract. The assumption links outlier status directly to error likelihood without external validation beyond experts.

axioms (1)
  • domain assumption A patient-management decision that is unusual with respect to past patient care may be due to an error
    This is the explicit hypothesis stated in the abstract that motivates the entire alerting system.

pith-pipeline@v0.9.0 · 5457 in / 1212 out tokens · 52168 ms · 2026-05-12T01:46:29.499636+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

  1. [1]

    To err is human: building a safer health system

    Kohn LT, Corrigan JM, et al. To err is human: building a safer health system. National Academy Press; 2000

  2. [2]

Is US health really the best in the world?

    Starfield B. Is US health really the best in the world? JAMA 2000;284(4):483–5

  3. [3]

    Costs of medical injuries in Utah and Colorado

    Thomas EJ, Studdert DM, Newhouse JP. Costs of medical injuries in Utah and Colorado. Inquiry 1999;36:255–64

  4. [4]

    ‘Global Trigger Tool’ shows that adverse events in hospitals may be ten times greater than previously measured

    Classen DC, Resar R, Griffin F, Federico F, Frankel T, Kimmel N, et al. ‘Global Trigger Tool’ shows that adverse events in hospitals may be ten times greater than previously measured. Health Aff 2011;30:581–9

  5. [5]

    Adverse events in hospitals: national incidence among Medicare beneficiaries

    Levinson DR. Adverse events in hospitals: national incidence among Medicare beneficiaries. Contract no.: Department of Health and Human Services, Office of the Inspector General, Report number OEI-06-09-00090; 2010

  6. [6]

    Temporal trends in rates of patient harm resulting from medical care

    Landrigan CP, Parry GJ, Bones CB, Hackbarth AD, Goldmann DA, Sharek PJ. Temporal trends in rates of patient harm resulting from medical care. New Engl J Med 2010;363:2124–34

  7. [7]

    Conditional outlier detection for clinical alerting

    Hauskrecht M, Valko M, Batal I, Clermont G, Visweswaran S, Cooper GF. Conditional outlier detection for clinical alerting. In: Proceedings of annual American Medical Informatics Association symposium; 2010. p. 286–90

  8. [8]

    Anomaly detection: a survey

    Chandola V, Banerjee A, Kumar V. Anomaly detection: a survey. ACM Comput Surv 2009;41(3)

  9. [9]

    Novelty detection: a review – part 1: statistical approaches

    Markou M, Singh S. Novelty detection: a review – part 1: statistical approaches. Signal Process 2003;83:2481–97

  10. [10]

    Evidence-based anomaly detection

    Hauskrecht M, Valko M, Kveton B, Visweswaran S, Cooper GF. Evidence-based anomaly detection. In: Proceedings of annual American Medical Informatics Association symposium; 2007. p. 319–324

  11. [11]

    Ten commandments for effective clinical decision support: making the practice of evidence-based medicine a reality

    Bates D et al. Ten commandments for effective clinical decision support: making the practice of evidence-based medicine a reality. J Am Med Inf Assoc 2003;10:523–30

  12. [12]

    Medical informatics: computer applications in health care and biomedicine

    Shortliffe EH, Fagan LM, Perreault LE, Wiederhold G. Medical informatics: computer applications in health care and biomedicine. 2nd ed. New York: Springer Verlag; 2000

  13. [13]

    Computerized surveillance of adverse drug events in hospital patients

    Classen DC, Pestotnik SL, Evans RS, Burke JP. Computerized surveillance of adverse drug events in hospital patients. JAMA 1991;266:2847–51

  14. [14]

    Medication-related clinical decision support in computerized provider order entry systems: a review

    Kuperman GJ, Bobb A, Payne TH, Avery AJ, Gandhi TK, Burns G, et al. Medication-related clinical decision support in computerized provider order entry systems: a review. JAMA 2007;14:29–40

  15. [15]

    Adverse drug event trigger tool: a practical methodology for measuring medication related harm

    Rozich JD, Haraden CR, Resar RK. Adverse drug event trigger tool: a practical methodology for measuring medication related harm. Qual Saf Health Care 2003;12:194–200

  16. [16]

    Identifying adverse drug events: development of a computer-based monitor and comparison with chart review and stimulated voluntary report

    Jha AK, Kuperman GJ, Teich JM, Leape L, Shea B, Rittenberg E, et al. Identifying adverse drug events: development of a computer-based monitor and comparison with chart review and stimulated voluntary report. JAMA 1998;5:305–14

  17. [17]

    A computer-assisted management program for antibiotics and other antiinfective agents

    Evans RS, Pestotnik SL, Classen DC, Clemmer TP, Weaver LK, Orme Jr JF, et al. A computer-assisted management program for antibiotics and other antiinfective agents. New Engl J Med 1998;338:232–8

  18. [18]

    Managing temporal worlds for medical trend diagnosis

    Haimowitz IJ, Kohane IS. Managing temporal worlds for medical trend diagnosis. Artif Intell Med 1996;8(3):299–321

  19. [19]

    Clinical monitoring using regression-based trend templates

    Haimowitz IJ, Le PP, et al. Clinical monitoring using regression-based trend templates. Artif Intell Med 1995;7(6):473–96

  20. [20]

    Temporal abstractions for interpreting diabetic patients monitoring data

    Bellazzi R, Larizza C, Riva A. Temporal abstractions for interpreting diabetic patients monitoring data. Intell Data Anal 1998;2:97–122

  21. [21]

    Analysis of a failed clinical decision support system for management of congestive heart failure

    Wadhwa RFD, Saul MI, Penrod LE, Visweswaran S, Cooper GF, Chapman W. Analysis of a failed clinical decision support system for management of congestive heart failure. In: Proceedings of the fall symposium of the American Medical Informatics Association; 2008. p. 773–777

  22. [22]

    Crying wolf: false alarms in a pediatric intensive care unit

    Lawless ST. Crying wolf: false alarms in a pediatric intensive care unit. Crit Care Med 1994;22:981–5

  23. [23]

    Physicians’ decisions to override computerized drug alerts in primary care

    Weingart SN, Toth M, Sands DZ, Aronson MD, Davis RB, Phillips RS. Physicians’ decisions to override computerized drug alerts in primary care. Arch Int Med 2003;163:2625–31

  24. [24]

    Characteristics and consequences of drug allergy alert overrides in a computerized physician order entry system

    Hsieh TC, Kuperman GJ, Jaggi T, Hojnowski-Diaz P, Fiskio J, Williams DH, et al. Characteristics and consequences of drug allergy alert overrides in a computerized physician order entry system. JAMA 2004;11:482–91

  25. [25]

    The nature of statistical learning theory

Vapnik VN. The nature of statistical learning theory. New York: Springer-Verlag; 1995

  26. [26]

    LIBSVM: A library for support vector machines

Chang C-C, Lin C-J. LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol 2011;2(3):1–27. <http://www.csie.ntu.edu.tw/~cjlin/libsvm>

  27. [27]

    Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods

    Platt JC. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in max margin classifiers. MIT Press; 1999. p. 61–74

  28. [28]

    Probabilistic methods for support vector machines

    Sollich P. Probabilistic methods for support vector machines. In: Advances in neural information processing systems; 2000. p. 349–55

  29. [29]

    Predicting good probabilities with supervised learning

Niculescu-Mizil A, Caruana R. Predicting good probabilities with supervised learning. In: Proceedings of the 22nd international conference on machine learning; 2005. p. 625–32

  30. [30]

    The meaning and use of the area under a receiver operating characteristic (ROC) curve

Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143(1):29–36

  31. [31]

    Feature importance analysis for patient management decisions

Valko M, Hauskrecht M. Feature importance analysis for patient management decisions. In: 13th International congress on medical informatics, Cape Town, South Africa; 2010. p. 861–5

  32. [32]

    Temporal data mining

    Post AR, Harrison JA. Temporal data mining. Clin Lab Med 2008;28(1):83–100

  33. [33]

    A coefficient of agreement for nominal scales

    Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Measur 1960;20(1):37–46

  34. [34]

    Overriding of drug safety alerts in computerized physician order entry

van der Sijs H, Aarts J, Vulto A, Berg M. Overriding of drug safety alerts in computerized physician order entry. J Am Med Inf Assoc 2006;13:138–47

  35. [35]

    Medication alert fatigue: the potential for compromised patient safety

    Baker DE. Medication alert fatigue: the potential for compromised patient safety. Hospital Pharmacy, vol. 44, no. 6. Wolters Kluwer Health, Inc.; 2009. p. 460–2

  36. [36]

    Improving acceptance of computerized prescribing alerts in ambulatory care

    Shah NR, Seger AC, Seger DL, Fiskio JM, Kuperman GJ, Blumenfeld B, et al. Improving acceptance of computerized prescribing alerts in ambulatory care. J Am Med Inf Assoc 2006;13(1):5–11

  37. [37]

    Factors influencing alert acceptance

Seidling HM, Phansalkar S, Seger DL, Paterno MD, Shaykevich S, Haefeli WE, et al. Factors influencing alert acceptance: a novel approach for predicting the success of clinical decision support. J Am Med Inf Assoc 2011;18(4):479–84

  38. [38]

    Monitor alarm fatigue: standardizing use of physiological monitors and decreasing nuisance alarms

    Graham KC, Cvach M. Monitor alarm fatigue: standardizing use of physiological monitors and decreasing nuisance alarms. Am J Crit Care 2010;19:28–34

  39. [39]

    Tiering drug–drug interaction alerts by severity increases compliance rates

    Paterno MD, Maviglia SM, Gorman PN, Seger DL, Yoshida E, Seger AC, et al. Tiering drug–drug interaction alerts by severity increases compliance rates. J Am Med Inf Assoc 2009;16(1):40–6

  40. [40]

    Improving patient safety through medical alert management: an automated decision tool to reduce alert fatigue

    Lee EK, Mejia AF, Senior T, Jose J. Improving patient safety through medical alert management: an automated decision tool to reduce alert fatigue. In: Proceedings of annual American Medical Informatics Association symposium. p. 417–21