arXiv: 2602.19483 · v2 · submitted 2026-02-23 · 💻 cs.LG · cs.AI · stat.ML

Making Conformal Predictors Robust in Healthcare Settings: a Case Study on EEG Classification

Pith reviewed 2026-05-15 20:23 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords conformal prediction, EEG classification, distribution shift, personalized calibration, seizure detection, uncertainty quantification, healthcare AI

The pith

Personalized calibration strategies for conformal predictors improve coverage by over 20 percentage points in EEG seizure classification under patient distribution shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Conformal prediction supplies sets with theoretical coverage guarantees, but these guarantees erode when EEG recordings come from new patients whose data distributions differ from the training set. The paper tests several conformal methods on seizure classification and finds that standard approaches lose substantial coverage under these shifts. Switching to calibration performed on patient-specific data recovers more than 20 percentage points of coverage. The recovered coverage is obtained without enlarging the typical size of the returned prediction sets. The result is demonstrated on real EEG data and released through an open healthcare framework.
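
As a point of reference, the standard split-conformal recipe that the paragraph above describes can be sketched as follows. This is a minimal illustration, not the paper's or PyHealth's implementation: the function names `conformal_threshold` and `prediction_sets` are hypothetical, and the nonconformity score (one minus the softmax probability of the true class) is an assumed, commonly used choice.

```python
import math
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal score threshold for miscoverage level alpha (hypothetical helper)."""
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels)
    n = len(cal_labels)
    # Assumed nonconformity score: one minus the probability of the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample rank: the ceil((n + 1)(1 - alpha))-th smallest calibration score.
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        # Too few calibration points for this alpha: every class must be included.
        return float("inf")
    return float(np.sort(scores)[k - 1])

def prediction_sets(test_probs, threshold):
    """Boolean mask (n_test, n_classes); True where the class is in the prediction set."""
    return (1.0 - np.asarray(test_probs, dtype=float)) <= threshold
```

When calibration and test examples are exchangeable, sets built this way contain the true label with probability at least 1 − α. The patient shift discussed above breaks exactly that condition.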

Core claim

In EEG seizure classification, where inter-patient distribution shifts violate the exchangeability assumptions of standard conformal prediction, using calibration sets drawn from the same patient as the test example raises empirical coverage by more than 20 percentage points while preserving comparable prediction-set sizes.

What carries the argument

Patient-specific calibration sets inserted into the conformal prediction pipeline to correct for distribution shifts between training and deployment patients.
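
One simple way to picture patient-specific calibration is to compute a separate conformal threshold per patient and fall back to a pooled threshold when a patient has no calibration data. The sketch below is an assumption-laden illustration of that idea only; the paper's actual personalization method may differ, and all names (`kth_smallest_threshold`, `per_patient_thresholds`, `threshold_for`) are hypothetical.

```python
import math
from collections import defaultdict

def kth_smallest_threshold(scores, alpha):
    """Conformal threshold: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    # With too few scores the only valid threshold is infinite (full prediction set).
    return math.inf if k > n else sorted(scores)[k - 1]

def per_patient_thresholds(scores, patient_ids, alpha):
    """Return ({patient_id: threshold}, pooled_threshold) from calibration scores."""
    by_patient = defaultdict(list)
    for score, pid in zip(scores, patient_ids):
        by_patient[pid].append(score)
    pooled = kth_smallest_threshold(list(scores), alpha)
    personal = {pid: kth_smallest_threshold(s, alpha) for pid, s in by_patient.items()}
    return personal, pooled

def threshold_for(pid, personal, pooled):
    """Use the patient's own threshold if available, else the pooled one (assumed fallback)."""
    return personal.get(pid, pooled)
```

The fallback branch is where the load-bearing premise below bites: a patient with no calibration examples gets the pooled threshold and therefore none of the personalization benefit.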

Load-bearing premise

Patient-specific calibration data is available at deployment time and the personalization step itself does not create new coverage failures under further unseen shifts.

What would settle it

Coverage falling below the nominal guarantee when the personalized method is applied to new patients who supply no calibration examples of their own or who encounter distribution shifts absent from both training and calibration data.
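
Checking this criterion comes down to measuring empirical coverage and average set size, overall and per patient. The following is a minimal sketch under the assumption that prediction sets are stored as a boolean mask of shape (examples, classes); the function names are hypothetical and not taken from the paper or PyHealth.

```python
import numpy as np

def empirical_coverage(set_mask, labels):
    """Fraction of examples whose true label lies inside the prediction set."""
    set_mask = np.asarray(set_mask, dtype=bool)
    labels = np.asarray(labels)
    return float(set_mask[np.arange(len(labels)), labels].mean())

def average_set_size(set_mask):
    """Mean number of classes per prediction set."""
    return float(np.asarray(set_mask, dtype=bool).sum(axis=1).mean())

def coverage_by_patient(set_mask, labels, patient_ids):
    """Empirical coverage computed separately for each patient."""
    set_mask = np.asarray(set_mask, dtype=bool)
    labels = np.asarray(labels)
    patient_ids = np.asarray(patient_ids)
    covered = set_mask[np.arange(len(labels)), labels]
    return {
        pid: float(covered[patient_ids == pid].mean())
        for pid in np.unique(patient_ids).tolist()
    }
```

Comparing `empirical_coverage` against the nominal 1 − α for patients with and without their own calibration examples is the comparison the criterion above calls for. The per-patient breakdown shows whether a shortfall is concentrated in a few patients.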

Figures

Figures reproduced from arXiv: 2602.19483 by Arjun Chatterjee, Jathurshan Pradeepkumar, Jimeng Sun, John Wu, Sayeed Sajjad Razin, Siddhartha Laghuvarapu.

Figure 1: EEG Classification Challenges. EEG tasks often violate key aspects of a typical machine learning pipeline. (a) Their annotation process consistently leaves room for uncertainty, which is then passed onto the models. (b) Training, validation, and test distributions are not i.i.d. due to patient distribution shift [13]. Both issues make EEG classification a very challenging machine learning problem.
Figure 2: Empirical coverage under random (top) and patient (bottom) splits. The dotted line is target coverage 1 − α. Under the random split, NCP substantially outperforms non-personalized baselines on TUEV. Under the patient split, all methods fall short of target coverage, reflecting the difficulty of cross-patient distribution shift.
Figure 3: Average prediction set sizes under random (top) and patient (bottom) splits. NCP consistently maintains smaller prediction sets than non-personalized methods despite achieving competitive or higher coverage.
Figure 4: Empirical coverage vs. α for NCP using varying calibration set sizes k. Shaded regions denote ±1 std. The dashed line is target coverage 1 − α.
Original abstract

Quantifying uncertainty in clinical predictions is critical for high-stakes diagnosis tasks. Conformal prediction offers a principled approach by providing prediction sets with theoretical coverage guarantees. However, in practice, patient distribution shifts violate the i.i.d. assumptions underlying standard conformal methods, leading to poor coverage in healthcare settings. In this work, we evaluate several conformal prediction approaches on EEG seizure classification, a task with known distribution shift challenges and label uncertainty. We demonstrate that personalized calibration strategies can improve coverage by over 20 percentage points while maintaining comparable prediction set sizes. Our implementation is available via PyHealth, an open-source healthcare AI framework: https://github.com/sunlabuiuc/PyHealth.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper evaluates several conformal prediction methods on EEG seizure classification, highlighting failures of standard approaches under patient distribution shifts that violate i.i.d. assumptions. It reports that personalized calibration strategies improve coverage by over 20 percentage points relative to baselines while maintaining comparable prediction set sizes, with an open-source implementation in PyHealth.

Significance. If the reported coverage gains are shown to be robust, statistically significant, and not artifacts of split choices or limited calibration data, the work would provide a concrete, deployable technique for adapting conformal prediction to non-stationary healthcare data. The open implementation in PyHealth is a positive contribution for reproducibility in the field.

major comments (2)
  1. [Abstract and §4 (Experiments)] The headline claim of a >20 pp coverage improvement is stated without the number of patients, the size and source of the patient-specific calibration sets, the exact baselines compared, or any statistical significance tests. These details are load-bearing for judging whether the gain is reliable rather than split-dependent.
  2. [§3 (Method) and §5 (Discussion)] The personalization step assumes that patient-specific calibration data remains exchangeable with future test points from the same patient. No experiments address intra-patient non-stationarity (e.g., evolving seizure patterns or electrode drift), which could invalidate coverage guarantees when calibration data is limited to short recordings.
minor comments (1)
  1. [Abstract] The abstract mentions 'label uncertainty' but does not clarify how it interacts with the conformal score function; a brief description in §2 would improve clarity.
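
One way to address the first major comment is to resample at the patient level, so that the uncertainty estimate reflects between-patient variability rather than the number of EEG segments. The sketch below computes a percentile-bootstrap confidence interval for the mean per-patient coverage gain. It is an illustrative assumption, not a procedure reported in the paper; the function name `bootstrap_gain_ci` and its defaults are hypothetical.

```python
import numpy as np

def bootstrap_gain_ci(cov_personal, cov_baseline, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean per-patient coverage gain.

    cov_personal and cov_baseline hold one coverage value per patient, in the same order.
    """
    gains = np.asarray(cov_personal, dtype=float) - np.asarray(cov_baseline, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample patients (not individual segments) with replacement.
    idx = rng.integers(0, len(gains), size=(n_boot, len(gains)))
    boot_means = gains[idx].mean(axis=1)
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(boot_means, [tail, 1.0 - tail])
    return float(gains.mean()), float(lo), float(hi)
```

An interval whose lower end stays well above zero across several random patient splits would support the claim that the gain is not split-dependent.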

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work regarding conformal prediction in healthcare settings. We have carefully considered the comments and made revisions to enhance the reporting of experimental details and to elaborate on the methodological assumptions. Our point-by-point responses are provided below.

Point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The headline claim of a >20 pp coverage improvement is stated without the number of patients, the size and source of the patient-specific calibration sets, the exact baselines compared, or any statistical significance tests. These details are load-bearing for judging whether the gain is reliable rather than split-dependent.

    Authors: We agree with the referee that these specifics are essential for a thorough evaluation. Accordingly, we have revised the abstract to include information on the number of patients in the study, the size and source of the patient-specific calibration sets, the exact baselines used for comparison, and the statistical tests performed to assess significance. In §4, we have added detailed descriptions of the data splits, patient counts, calibration set sizes, and p-values from appropriate statistical tests to demonstrate that the coverage improvements are robust and not dependent on specific split choices. These changes ensure the claims are well-supported. revision: yes

  2. Referee: [§3 (Method) and §5 (Discussion)] The personalization step assumes that patient-specific calibration data remains exchangeable with future test points from the same patient. No experiments address intra-patient non-stationarity (e.g., evolving seizure patterns or electrode drift), which could invalidate coverage guarantees when calibration data is limited to short recordings.

    Authors: We appreciate this observation regarding the core assumption of our personalization strategy. The approach in §3 does assume exchangeability between the patient-specific calibration data and future test points from the same patient. We have updated §5 to explicitly address this assumption and discuss potential violations due to intra-patient non-stationarity, including examples like evolving seizure patterns and electrode drift. We acknowledge that with calibration data limited to short recordings, coverage guarantees could be affected. Although we were unable to conduct additional experiments on this due to the nature of the available EEG dataset (which lacks extensive longitudinal recordings), we have strengthened the discussion to highlight this as a limitation and suggest avenues for future research, such as adaptive conformal methods. revision: partial
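
The rebuttal does not name a specific adaptive method. One well-known example of the family is adaptive conformal inference, which adjusts the working miscoverage level online after each observed outcome. The sketch below shows only that update rule, as an assumed illustration; `aci_step` and the step size `gamma` are hypothetical names and values, not taken from the paper.

```python
def aci_step(alpha_t, miscovered, alpha_target, gamma=0.01):
    """One adaptive-conformal update of the working miscoverage level.

    miscovered is 1 if the latest true label fell outside its prediction set, else 0.
    A miss lowers alpha_t (larger future sets); a hit raises it (smaller future sets).
    """
    return alpha_t + gamma * (alpha_target - miscovered)

def run_aci(errors, alpha_target, gamma=0.01):
    """Working alpha used at each step, for a precomputed error sequence.

    Simplification: in a real loop each error depends on the set built with the current alpha.
    """
    alpha_t = alpha_target
    history = []
    for err in errors:
        history.append(alpha_t)
        alpha_t = aci_step(alpha_t, err, alpha_target, gamma)
    return history
```

Because the working level tracks recent errors, a rule of this kind can react to within-patient drift such as the electrode drift mentioned above. The working level can leave the interval [0, 1], in which case the usual convention is to return the full label set (below 0) or the empty set (above 1).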

Circularity Check

0 steps flagged

No circularity in empirical evaluation of conformal methods

Full rationale

The paper is an empirical evaluation of existing conformal prediction techniques on EEG seizure data. It reports observed coverage gains from personalized calibration without presenting any mathematical derivation, uniqueness theorem, or ansatz that reduces the claimed improvements to fitted parameters or self-citations by construction. Results are framed as experimental outcomes on a specific dataset, with no load-bearing self-referential steps in the reported chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the standard conformal-prediction coverage guarantee under exchangeability and on the empirical observation that patient-level distribution shift violates that guarantee. No new free parameters, axioms, or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: conformal prediction guarantees hold only under i.i.d. or exchangeable data.
    The abstract explicitly states that patient distribution shifts violate the i.i.d. assumptions underlying standard conformal methods.
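
For reference, the guarantee this axiom refers to can be written out as follows. This is the standard split-conformal statement, given here as background rather than as a result of the paper; the symbols $s$, $\hat{q}$, and $C$ are notation introduced for this note.

```latex
% Assumption: (X_1, Y_1), \dots, (X_{n+1}, Y_{n+1}) are exchangeable.
% s(x, y) is a fixed nonconformity score; s_i = s(X_i, Y_i) for i = 1, \dots, n.
\hat{q} \;=\; \text{the } \lceil (n+1)(1-\alpha) \rceil\text{-th smallest of } s_1, \dots, s_n,
\qquad
C(x) \;=\; \{\, y : s(x, y) \le \hat{q} \,\},
\]
\[
\Pr\bigl( Y_{n+1} \in C(X_{n+1}) \bigr) \;\ge\; 1 - \alpha .
```

If the test patient's data is not exchangeable with the calibration data, the inequality no longer follows, which is the failure mode the paper studies.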


Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 2 internal anchors

  [1] Chatterjee, A., Razin, S.S., Wu, J., Laghuvarapu, S., Pradeepkumar, J., Sun, J.: Making conformal predictors robust in healthcare settings: a case study on eeg classification (2026), https://arxiv.org/abs/2602.19483

  [2] Ge, W., Jing, J., An, S., Herlopian, A., Ng, M., Struck, A.F., Appavu, B., Johnson, E.L., Osman, G., Haider, H.A., et al.: Deep active learning for interictal ictal injury continuum eeg patterns. Journal of neuroscience methods 351, 108966 (2021)

  [3] Ghosh, S., Belkhouja, T., Yan, Y., Doppa, J.R.: Improving uncertainty quantification of deep classifiers via neighborhood conformal prediction: Novel algorithm and theoretical analysis. Proceedings of the AAAI Conference on Artificial Intelligence 37(6), 7722–7730 (Jun 2023)

  [4] Harati, A., Golmohammadi, M., Lopez, S., Obeid, I., Picone, J.: Improved eeg event classification using differential energy. IEEE Signal Processing in Medicine and Biology Symposium 2015 (2015)

  [5] Laghuvarapu, S., Lin, Z., Sun, J.: Codrug: Conformal drug property prediction with density estimation under covariate shift. Advances in Neural Information Processing Systems 36, 37728–37747 (2023)

  [6] Lopez, S., Suarez, G., Jungreis, D., Obeid, I., Picone, J.: Automated identification of abnormal adult eegs. IEEE Signal Processing in Medicine and Biology Symposium 2015 (2015)

  [7] Obeid, I., Picone, J.: The temple university hospital eeg data corpus. Frontiers in neuroscience 10, 196 (2016)

  [8] Papadopoulos, H., Vovk, V., Gammerman, A.: Conformal prediction with neural networks. In: 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007). vol. 2, pp. 388–395. IEEE (2007)

  [9] Pradeepkumar, J., Piao, X., Chen, Z., Sun, J.: Tokenizing single-channel eeg with time-frequency motif learning (2026), https://arxiv.org/abs/2502.16060

  [10] Shah, V., von Weltin, E., Lopez, S., McHugh, J.R., Veloso, L., Golmohammadi, M., Obeid, I., Picone, J.: The temple university hospital seizure detection corpus. Frontiers in Neuroinformatics 12 (2018)

  [11] Tibshirani, R.J., Foygel Barber, R., Candes, E., Ramdas, A.: Conformal prediction under covariate shift. In: Advances in Neural Information Processing Systems. vol. 32 (2019)

  [12] Wu, Z., Yao, H., Liebovitz, D., Sun, J.: An iterative self-learning framework for medical domain generalization. In: Advances in Neural Information Processing Systems. vol. 36, pp. 54833–54854 (2023)

  [13] Yang, C., Westover, M.B., Sun, J.: Manydg: Many-domain generalization for healthcare applications. In: The 11th International Conference on Learning Representations, ICLR 2023 (2023)

  [14] Yang, C., Xiao, D., Westover, M.B., Sun, J.: Self-supervised eeg representation learning for automatic sleep staging. JMIR AI (2023)