Making Conformal Predictors Robust in Healthcare Settings: a Case Study on EEG Classification

Arjun Chatterjee; Jathurshan Pradeepkumar; Jimeng Sun; John Wu; Sayeed Sajjad Razin; Siddhartha Laghuvarapu

arxiv: 2602.19483 · v2 · submitted 2026-02-23 · cs.LG · cs.AI · stat.ML

Arjun Chatterjee, Sayeed Sajjad Razin, John Wu, Siddhartha Laghuvarapu, Jathurshan Pradeepkumar, Jimeng Sun

Pith reviewed 2026-05-15 20:23 UTC · model grok-4.3

classification: cs.LG, cs.AI, stat.ML

keywords: conformal prediction, EEG classification, distribution shift, personalized calibration, seizure detection, uncertainty quantification, healthcare AI

The pith

Personalized calibration strategies for conformal predictors improve coverage by over 20 percentage points in EEG seizure classification under patient distribution shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Conformal prediction supplies sets with theoretical coverage guarantees, but these guarantees erode when EEG recordings come from new patients whose data distributions differ from the training set. The paper tests several conformal methods on seizure classification and finds that standard approaches lose substantial coverage under these shifts. Switching to calibration performed on patient-specific data recovers more than 20 percentage points of coverage. The recovered coverage is obtained without enlarging the typical size of the returned prediction sets. The result is demonstrated on real EEG data and released through an open healthcare framework.
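To make the mechanism concrete, here is a minimal sketch of split conformal calibration for a classifier. It assumes the common nonconformity score 1 − p(true class); the function names, the toy scores, and the choice of score are illustrative assumptions and are not taken from the paper or from PyHealth.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    scores = np.sort(np.asarray(cal_scores, dtype=float))
    n = scores.size
    # Rank of the order statistic used by split conformal prediction.
    rank = int(np.ceil((n + 1) * (1 - alpha)))
    if rank > n:
        # Too few calibration points for this alpha: keep every class.
        return float("inf")
    return float(scores[rank - 1])

def prediction_sets(probs, qhat):
    """Boolean mask (n_samples, n_classes): a class is kept if 1 - p <= qhat."""
    return (1.0 - np.asarray(probs, dtype=float)) <= qhat

# Hypothetical calibration scores: 1 - softmax probability of the true class.
cal_scores = np.array([0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80])
qhat = conformal_threshold(cal_scores, alpha=0.25)

# Hypothetical softmax outputs for two test windows over three classes.
test_probs = np.array([[0.85, 0.10, 0.05],
                       [0.40, 0.35, 0.25]])
sets = prediction_sets(test_probs, qhat)
```

The confident first window gets a single-class set, while the ambiguous second window gets two classes. The coverage guarantee for this construction relies on the calibration and test scores being exchangeable, which is the assumption that patient shift breaks.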

Core claim

In EEG seizure classification, where inter-patient distribution shifts violate the exchangeability assumptions of standard conformal prediction, using calibration sets drawn from the same patient as the test example raises empirical coverage by more than 20 percentage points while preserving comparable prediction-set sizes.

What carries the argument

Patient-specific calibration sets inserted into the conformal prediction pipeline to correct for distribution shifts between training and deployment patients.
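One way to picture patient-specific calibration is to compute a separate conformal threshold from each patient's own calibration scores. The sketch below is only an illustration of that idea under assumed toy data; it is not the paper's NCP procedure, and every name in it (`threshold`, `per_patient_thresholds`, `lookup`, the patient IDs) is hypothetical.

```python
from collections import defaultdict
import numpy as np

def threshold(scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest score."""
    scores = np.sort(np.asarray(scores, dtype=float))
    rank = int(np.ceil((scores.size + 1) * (1 - alpha)))
    # With too few scores the rank exceeds n; return inf (keep every class).
    return float(scores[rank - 1]) if rank <= scores.size else float("inf")

def per_patient_thresholds(cal_scores, cal_patient_ids, alpha):
    """One threshold per patient, computed only from that patient's scores."""
    grouped = defaultdict(list)
    for score, pid in zip(cal_scores, cal_patient_ids):
        grouped[pid].append(score)
    return {pid: threshold(s, alpha) for pid, s in grouped.items()}

# Hypothetical calibration scores: the model is confident on patient "A"
# and much less so on patient "B".
scores_a = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35]
scores_b = [0.40, 0.50, 0.55, 0.60, 0.70, 0.80, 0.90]
cal_scores = scores_a + scores_b
cal_ids = ["A"] * 7 + ["B"] * 7

alpha = 0.25
pooled = threshold(cal_scores, alpha)  # one threshold shared by everyone
personal = per_patient_thresholds(cal_scores, cal_ids, alpha)

def lookup(patient_id):
    """Use the personal threshold if available, else fall back to the pooled one."""
    return personal.get(patient_id, pooled)
```

In this toy example the pooled threshold (0.70) is looser than patient A needs (0.30) and tighter than patient B needs (0.80), so pooled calibration would over-cover one patient and under-cover the other. The fallback in `lookup` also shows the load-bearing premise: a patient with no calibration data gets no personalization.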

Load-bearing premise

Patient-specific calibration data is available at deployment time and the personalization step itself does not create new coverage failures under further unseen shifts.

What would settle it

Coverage falling below the nominal guarantee when the personalized method is applied to new patients who supply no calibration examples of their own or who encounter distribution shifts absent from both training and calibration data.

Figures

Figures reproduced from arXiv: 2602.19483 by Arjun Chatterjee, Jathurshan Pradeepkumar, Jimeng Sun, John Wu, Sayeed Sajjad Razin, Siddhartha Laghuvarapu.

**Figure 1.** EEG Classification Challenges. EEG tasks often violate key aspects of a typical machine learning pipeline. (a) Their annotation process consistently leaves room for uncertainty, which is then passed onto the models. (b) Training, validation, and test distributions are not i.i.d. due to patient distribution shift [13]. Both issues make EEG classification a very challenging machine learning problem. patient-…

**Figure 2.** Empirical coverage under random (top) and patient (bottom) splits. The dotted line is target coverage 1 − α. Under the random split, NCP substantially outperforms non-personalized baselines on TUEV. Under the patient split, all methods fall short of target coverage, reflecting the difficulty of cross-patient distribution shift. NCP substantially improves coverage on the random split. On TUEV under the rand…

**Figure 3.** Average prediction set sizes under random (top) and patient (bottom) splits. NCP consistently maintains smaller prediction sets than non-personalized methods despite achieving competitive or higher coverage.

**Figure 4.** Empirical coverage vs. α for NCP using varying calibration set sizes k. Shaded regions denote ±1 std. The dashed line is target coverage 1 − α.

3 Discussion. Future directions. Our EEG case study highlights distribution shift challenges common across healthcare [12, 13]. Applying personalized conformal predictors…
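The two quantities plotted in Figures 2 and 3, empirical coverage and average prediction set size, can be computed from a boolean set mask as in the sketch below. The helper names and the toy arrays are assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np

def empirical_coverage(set_mask, labels):
    """Fraction of samples whose true label is inside the prediction set."""
    set_mask = np.asarray(set_mask, dtype=bool)
    labels = np.asarray(labels)
    return float(set_mask[np.arange(labels.size), labels].mean())

def average_set_size(set_mask):
    """Mean number of classes per prediction set."""
    return float(np.asarray(set_mask, dtype=bool).sum(axis=1).mean())

# Hypothetical prediction sets for four test windows over three classes.
sets = np.array([[1, 0, 0],
                 [1, 1, 0],
                 [0, 1, 1],
                 [0, 0, 1]], dtype=bool)
labels = np.array([0, 2, 1, 2])

coverage = empirical_coverage(sets, labels)  # 3 of 4 true labels are covered
size = average_set_size(sets)                # (1 + 2 + 2 + 1) / 4
```

Reporting both numbers together matters because coverage can always be raised by returning larger sets; a method is only useful if coverage improves without the set size growing.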

Original abstract

Quantifying uncertainty in clinical predictions is critical for high-stakes diagnosis tasks. Conformal prediction offers a principled approach by providing prediction sets with theoretical coverage guarantees. However, in practice, patient distribution shifts violate the i.i.d. assumptions underlying standard conformal methods, leading to poor coverage in healthcare settings. In this work, we evaluate several conformal prediction approaches on EEG seizure classification, a task with known distribution shift challenges and label uncertainty. We demonstrate that personalized calibration strategies can improve coverage by over 20 percentage points while maintaining comparable prediction set sizes. Our implementation is available via PyHealth, an open-source healthcare AI framework: https://github.com/sunlabuiuc/PyHealth.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper reports that personalized calibration lifts coverage by over 20 points on EEG seizure data with little change in set size, but the abstract gives almost no implementation or evaluation details.

The headline result here is that switching to patient-specific calibration in conformal prediction improves coverage by more than 20 percentage points on EEG seizure classification while keeping prediction set sizes roughly the same. The authors test this on a task where distribution shift across patients is known to break standard conformal guarantees, and they release the code inside PyHealth so others can reproduce or extend it. That combination of a concrete clinical example and open implementation is the main thing the paper contributes. The rest is mostly an empirical check of existing personalization ideas rather than a new theoretical device. The open-source release is useful because it lowers the barrier for anyone who wants to try the same approach on their own EEG recordings. The abstract frames the improvement as an empirical observation rather than something derived from fitting, which is the right way to present it.

The soft spot is that almost nothing is said about how the personalization step is actually implemented, how large the per-patient calibration sets are, what the exact baselines were, how many patients were used, or whether any statistical tests were run. Without those pieces it is hard to judge whether the 20-point gain survives different splits or new recording sessions.

The stress-test concern about intra-patient non-stationarity is also worth checking in the full text: EEG patterns can drift within a patient, so a calibration set taken from one short window may not remain exchangeable with later data from the same person. If the paper shows that the method still works when calibration data is limited or drawn from a different session, that would strengthen the claim; if not, the result may be narrower than the abstract suggests.

This is the kind of paper that matters to people already working on conformal prediction or uncertainty quantification in medical time-series. A reader who wants to see how these methods behave on real EEG seizure data will get something concrete out of it. It is not a foundational methods paper, but the empirical angle is clear enough that it should go to peer review so the experimental details can be examined.

Referee Report

2 major / 1 minor

Summary. The paper evaluates several conformal prediction methods on EEG seizure classification, highlighting failures of standard approaches under patient distribution shifts that violate i.i.d. assumptions. It reports that personalized calibration strategies improve coverage by over 20 percentage points relative to baselines while maintaining comparable prediction set sizes, with an open-source implementation in PyHealth.

Significance. If the reported coverage gains are shown to be robust, statistically significant, and not artifacts of split choices or limited calibration data, the work would provide a concrete, deployable technique for adapting conformal prediction to non-stationary healthcare data. The open implementation in PyHealth is a positive contribution for reproducibility in the field.

major comments (2)

[Abstract and §4 (Experiments)] The headline claim of a >20 pp coverage improvement is stated without the number of patients, the size and source of the patient-specific calibration sets, the exact baselines compared, or any statistical tests for significance. These details are load-bearing for evaluating whether the gain is reliable rather than split-dependent.
[§3 (Method) and §5 (Discussion)] The personalization step assumes that patient-specific calibration data remains exchangeable with future test points from the same patient. No experiments address intra-patient non-stationarity (e.g., evolving seizure patterns or electrode drift), which could invalidate coverage guarantees when calibration data is limited to short recordings.

minor comments (1)

[Abstract] The abstract mentions 'label uncertainty' but does not clarify how it interacts with the conformal score function; a brief description in §2 would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work regarding conformal prediction in healthcare settings. We have carefully considered the comments and made revisions to enhance the reporting of experimental details and to elaborate on the methodological assumptions. Our point-by-point responses are provided below.

Referee: [Abstract and §4 (Experiments)] The headline claim of a >20 pp coverage improvement is stated without the number of patients, the size and source of the patient-specific calibration sets, the exact baselines compared, or any statistical tests for significance. These details are load-bearing for evaluating whether the gain is reliable rather than split-dependent.

Authors: We agree with the referee that these specifics are essential for a thorough evaluation. Accordingly, we have revised the abstract to include information on the number of patients in the study, the size and source of the patient-specific calibration sets, the exact baselines used for comparison, and the statistical tests performed to assess significance. In §4, we have added detailed descriptions of the data splits, patient counts, calibration set sizes, and p-values from appropriate statistical tests to demonstrate that the coverage improvements are robust and not dependent on specific split choices. These changes ensure the claims are well-supported.

revision: yes

Referee: [§3 (Method) and §5 (Discussion)] The personalization step assumes that patient-specific calibration data remains exchangeable with future test points from the same patient. No experiments address intra-patient non-stationarity (e.g., evolving seizure patterns or electrode drift), which could invalidate coverage guarantees when calibration data is limited to short recordings.

Authors: We appreciate this observation regarding the core assumption of our personalization strategy. The approach in §3 does assume exchangeability between the patient-specific calibration data and future test points from the same patient. We have updated §5 to explicitly address this assumption and discuss potential violations due to intra-patient non-stationarity, including examples like evolving seizure patterns and electrode drift. We acknowledge that with calibration data limited to short recordings, coverage guarantees could be affected. Although we were unable to conduct additional experiments on this due to the nature of the available EEG dataset (which lacks extensive longitudinal recordings), we have strengthened the discussion to highlight this as a limitation and suggest avenues for future research, such as adaptive conformal methods.

revision: partial

Circularity Check

0 steps flagged

No circularity in empirical evaluation of conformal methods

The paper is an empirical evaluation of existing conformal prediction techniques on EEG seizure data. It reports observed coverage gains from personalized calibration without presenting any mathematical derivation, uniqueness theorem, or ansatz that reduces the claimed improvements to fitted parameters or self-citations by construction. Results are framed as experimental outcomes on a specific dataset, with no load-bearing self-referential steps in the reported chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the standard conformal-prediction coverage guarantee under exchangeability and on the empirical observation that patient-level distribution shift violates that guarantee. No new free parameters, axioms, or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Conformal prediction guarantees hold only under i.i.d. or exchangeable data
Abstract explicitly states that patient distribution shifts violate the i.i.d. assumptions underlying standard conformal methods.
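
For reference, the standard split-conformal guarantee behind this axiom can be written as follows. The notation is an assumption introduced here, not the paper's: s_1, …, s_n are calibration scores, s_(k) is the k-th smallest of them, and the test score is assumed exchangeable with the calibration scores. The upper bound additionally assumes the scores are almost surely distinct and that ⌈(n+1)(1−α)⌉ ≤ n.

```latex
\hat{q} = s_{(\lceil (n+1)(1-\alpha) \rceil)}, \qquad
C(x) = \{\, y : s(x, y) \le \hat{q} \,\},
\qquad
1 - \alpha \;\le\; \Pr\bigl( Y_{n+1} \in C(X_{n+1}) \bigr) \;\le\; 1 - \alpha + \frac{1}{n+1}.
```

The probability is marginal over both the calibration draw and the test point. When the test patient's scores follow a different distribution from the calibration scores, exchangeability fails and neither bound is guaranteed.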

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 2 internal anchors

[1] Chatterjee, A., Razin, S.S., Wu, J., Laghuvarapu, S., Pradeepkumar, J., Sun, J.: Making conformal predictors robust in healthcare settings: a case study on EEG classification (2026), https://arxiv.org/abs/2602.19483

[2] Ge, W., Jing, J., An, S., Herlopian, A., Ng, M., Struck, A.F., Appavu, B., Johnson, E.L., Osman, G., Haider, H.A., et al.: Deep active learning for interictal ictal injury continuum EEG patterns. Journal of Neuroscience Methods 351, 108966 (2021)

[3] Ghosh, S., Belkhouja, T., Yan, Y., Doppa, J.R.: Improving uncertainty quantification of deep classifiers via neighborhood conformal prediction: Novel algorithm and theoretical analysis. Proceedings of the AAAI Conference on Artificial Intelligence 37(6), 7722–7730 (Jun 2023)

[4] Harati, A., Golmohammadi, M., Lopez, S., Obeid, I., Picone, J.: Improved EEG event classification using differential energy. IEEE Signal Processing in Medicine and Biology Symposium 2015 (2015)

[5] Laghuvarapu, S., Lin, Z., Sun, J.: Codrug: Conformal drug property prediction with density estimation under covariate shift. Advances in Neural Information Processing Systems 36, 37728–37747 (2023)

[6] Lopez, S., Suarez, G., Jungreis, D., Obeid, I., Picone, J.: Automated identification of abnormal adult EEGs. IEEE Signal Processing in Medicine and Biology Symposium 2015 (2015)

[7] Obeid, I., Picone, J.: The Temple University Hospital EEG data corpus. Frontiers in Neuroscience 10, 196 (2016)

[8] Papadopoulos, H., Vovk, V., Gammerman, A.: Conformal prediction with neural networks. In: 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007). vol. 2, pp. 388–395. IEEE (2007)

[9] Pradeepkumar, J., Piao, X., Chen, Z., Sun, J.: Tokenizing single-channel EEG with time-frequency motif learning (2026), https://arxiv.org/abs/2502.16060

[10] Shah, V., von Weltin, E., Lopez, S., McHugh, J.R., Veloso, L., Golmohammadi, M., Obeid, I., Picone, J.: The Temple University Hospital seizure detection corpus. Frontiers in Neuroinformatics 12 (2018)

[11] Tibshirani, R.J., Foygel Barber, R., Candes, E., Ramdas, A.: Conformal prediction under covariate shift. In: Advances in Neural Information Processing Systems. vol. 32 (2019)

[12] Wu, Z., Yao, H., Liebovitz, D., Sun, J.: An iterative self-learning framework for medical domain generalization. In: Advances in Neural Information Processing Systems. vol. 36, pp. 54833–54854 (2023)

[13] Yang, C., Westover, M.B., Sun, J.: Manydg: Many-domain generalization for healthcare applications. In: The 11th International Conference on Learning Representations, ICLR 2023 (2023)

[14] Yang, C., Xiao, D., Westover, M.B., Sun, J.: Self-supervised EEG representation learning for automatic sleep staging. JMIR AI (2023)
