arxiv: 2605.18008 · v1 · pith:RXNCRNWRnew · submitted 2026-05-18 · 💻 cs.LG · stat.ML

Uncertainty Reliability Under Domain Shift: An Investigation for Data-Driven Blood Pressure Estimation in Photoplethysmography

Pith reviewed 2026-05-20 12:39 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords uncertainty quantification · domain shift · photoplethysmography · blood pressure estimation · deep ensembles · Monte Carlo dropout · conformal prediction · recalibration

The pith

Deep ensembles with recalibrated Gaussian negative log-likelihood loss deliver stronger robustness and better calibrated uncertainty for PPG-based blood pressure estimates under domain shift than Monte Carlo dropout or MSE-based approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether common uncertainty quantification techniques remain reliable when deep learning models for blood pressure estimation from photoplethysmography signals encounter new data sources. It trains an XResNet1D-50 on PulseDB and evaluates performance plus uncertainty quality on four external datasets that introduce domain shifts. Deep ensembles prove more robust for predictions than Monte Carlo dropout once shifts are external, while methods built on Gaussian negative log-likelihood loss plus post-hoc recalibration produce the best uncertainty calibration. These patterns matter for safety-critical use because reliable uncertainty can indicate when a cuffless reading should be trusted or verified by other means. The work therefore stresses that both accuracy and calibration must be checked on external data before deployment.

Core claim

Deep ensembles provide stronger predictive robustness under domain shift than Monte Carlo dropout, with the advantage clearest under external shift. Recalibrated GNLL-based methods yield the best uncertainty calibration, for instance GNLL+DE+CP for systolic blood pressure and GNLL+DE+TS for diastolic blood pressure, whereas MSE-based uncertainty becomes practically useful only after recalibration. Across the tested settings, conformal prediction and temperature scaling deliver the most consistent gains.
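
As a rough illustration of the conformal prediction (CP) recalibration named in the core claim, the sketch below applies split conformal prediction to a predicted mean and standard deviation. The normalized-residual score, the function names, and the toy SBP values are assumptions made for illustration; the paper's actual nonconformity score and coverage level are not stated on this page.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    # Rank ceil((n + 1) * (1 - alpha)), clipped to n, as in split conformal prediction.
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    return sorted(scores)[rank - 1]

def conformal_intervals(mu_cal, sigma_cal, y_cal, mu_test, sigma_test, alpha=0.1):
    """Prediction intervals mu +/- q * sigma, with q fitted on a calibration set."""
    # Assumed nonconformity score: absolute residual scaled by the predicted std.
    scores = [abs(y - m) / s for y, m, s in zip(y_cal, mu_cal, sigma_cal)]
    q = conformal_quantile(scores, alpha)
    return [(m - q * s, m + q * s) for m, s in zip(mu_test, sigma_test)]

# Hypothetical calibration data (SBP in mmHg), not taken from the paper.
mu_cal = [120.0, 130.0, 110.0, 125.0, 140.0, 118.0, 135.0, 122.0, 128.0]
sigma_cal = [5.0] * 9
y_cal = [123.0, 126.0, 115.0, 124.0, 150.0, 117.0, 131.0, 128.0, 127.0]

intervals = conformal_intervals(mu_cal, sigma_cal, y_cal, [121.0], [4.0], alpha=0.2)
print(intervals)
```

The marginal coverage guarantee of this procedure assumes that calibration and test data are exchangeable. That assumption is exactly what an external domain shift can violate, so coverage still has to be checked empirically on each external dataset.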

What carries the argument

The argument rests on comparing deep ensembles (DE) with Monte Carlo dropout (MCD), each trained with either a Gaussian negative log-likelihood (GNLL) or a mean squared error (MSE) loss and optionally followed by conformal prediction (CP), temperature scaling (TS), or isotonic regression (IR) recalibration, all on an XResNet1D-50 network for PPG-to-BP regression.
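
To make the GNLL and deep-ensemble ingredients concrete, here is a minimal sketch of the Gaussian negative log-likelihood for one sample and of a common way to merge ensemble members into a single predictive mean and variance (moment matching of a uniform Gaussian mixture). The function names and the five toy members are illustrative assumptions, and the paper may aggregate its ensemble differently.

```python
import math

def gaussian_nll(y, mu, var):
    """Negative log-likelihood of target y under N(mu, var)."""
    return 0.5 * (math.log(2 * math.pi * var) + (y - mu) ** 2 / var)

def combine_ensemble(mus, variances):
    """Moment-match a uniform mixture of Gaussians to a single Gaussian."""
    m = len(mus)
    mean = sum(mus) / m
    # Total variance = average predicted variance + spread of the member means.
    var = sum(v + mu ** 2 for mu, v in zip(mus, variances)) / m - mean ** 2
    return mean, var

# Hypothetical outputs of five GNLL-trained members for one PPG segment (mmHg, mmHg^2).
member_mus = [118.0, 122.0, 120.0, 124.0, 116.0]
member_vars = [25.0, 30.0, 20.0, 28.0, 22.0]

mean, var = combine_ensemble(member_mus, member_vars)
print(mean, var, gaussian_nll(121.0, mean, var))
```

A network trained with MSE outputs only the mean, so in this scheme its uncertainty would come solely from the spread of the member means. That is one plausible reason why MSE-based uncertainty needs recalibration before it is useful.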

If this is right

  • DE-based methods are the most robust choice for predictive performance when models face domain shift.
  • GNLL supplies the strongest native uncertainty quantification before any recalibration.
  • Recalibration is required to make MSE-derived uncertainty estimates practically usable.
  • CP and TS produce the most consistent calibration improvements across both in-distribution (ID) and out-of-distribution (OOD) conditions.
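
For readers unfamiliar with temperature scaling in a regression setting, the following sketch shows one common variant: a single scalar T rescales every predicted standard deviation, and T is chosen to minimize the Gaussian NLL on held-out calibration data, which has a closed-form solution. This is an assumed formulation for illustration; the page does not specify which TS variant the paper uses.

```python
import math

def fit_temperature(mu, sigma, y):
    """Scalar T minimizing the Gaussian NLL of N(mu, (T * sigma)^2) on held-out data.

    Setting the derivative of the NLL with respect to T to zero gives
    T^2 = mean(((y - mu) / sigma)^2).
    """
    z2 = [((t - m) / s) ** 2 for m, s, t in zip(mu, sigma, y)]
    return math.sqrt(sum(z2) / len(z2))

def rescale(sigma, temperature):
    """Apply the fitted temperature to predicted standard deviations."""
    return [temperature * s for s in sigma]

# Hypothetical overconfident predictions (mmHg): residuals are 6 but sigma is 2.
mu_cal = [120.0, 130.0, 110.0, 125.0]
sigma_cal = [2.0, 2.0, 2.0, 2.0]
y_cal = [126.0, 124.0, 116.0, 119.0]

T = fit_temperature(mu_cal, sigma_cal, y_cal)
sigma_recal = rescale(sigma_cal, T)
print(T, sigma_recal)
```

Because only one parameter is fitted, this kind of recalibration cannot reorder samples by uncertainty; it only corrects a global over- or under-confidence.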

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the observed DE advantage persists in streaming clinical data, safety monitors could preferentially route uncertain readings to additional sensors or clinician review.
  • The results suggest testing whether GNLL+DE combinations also improve calibration when the model must adapt online to new patients rather than being evaluated on static external datasets.
  • Future extensions could measure how much of the robustness gain comes from ensemble diversity versus the specific loss and recalibration choices.

Load-bearing premise

The four external datasets used for testing adequately capture the kinds of domain shifts that would occur in real clinical deployment of PPG-based BP estimation.
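
One way to probe this premise is to measure how far each external dataset sits from the training data. The sketch below computes a squared Maximum Mean Discrepancy (MMD), the statistic the referee report on this page asks for, using an RBF kernel and the biased estimator. The feature vectors, the kernel bandwidth `gamma`, and the function names are hypothetical; in practice the inputs might be learned embeddings or BP label distributions.

```python
import math

def rbf(x, y, gamma):
    """RBF kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(xs, ys, gamma=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy

# Hypothetical one-dimensional features for a training set and two external sets.
train = [[0.0], [0.1], [-0.1], [0.2]]
external_near = [[0.05], [0.0], [-0.05], [0.15]]
external_far = [[2.0], [2.1], [1.9], [2.2]]

print(mmd2(train, external_near), mmd2(train, external_far))
```

A larger value indicates a larger distributional gap under the chosen kernel, but the number depends strongly on the bandwidth and on which features are compared.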

What would settle it

Apply the same DE, MCD, GNLL, MSE, and recalibration pipelines to PPG recordings from a new clinical population whose distribution differs from both PulseDB and the four external test sets, then check whether DE still outperforms MCD on predictive robustness and whether GNLL+DE+CP or GNLL+DE+TS remains the top-calibrated method.
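
A check of this kind needs a calibration metric that can be recomputed on the new population. The sketch below compares nominal and observed coverage of central Gaussian intervals at several levels; the metric, the chosen levels, and the toy DBP values are assumptions for illustration, not the paper's evaluation protocol.

```python
from statistics import NormalDist

def empirical_coverage(mu, sigma, y, level):
    """Fraction of targets inside the central Gaussian interval at `level`."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    hits = sum(1 for m, s, t in zip(mu, sigma, y) if abs(t - m) <= z * s)
    return hits / len(y)

def calibration_gap(mu, sigma, y, levels=(0.5, 0.8, 0.9, 0.95)):
    """Mean absolute gap between nominal and observed coverage."""
    gaps = [abs(empirical_coverage(mu, sigma, y, p) - p) for p in levels]
    return sum(gaps) / len(gaps)

# Hypothetical DBP targets and predictions (mmHg) from two methods.
y_true = [78.0, 85.0, 70.0, 92.0, 66.0, 81.0]
mu_pred = [80.0, 83.0, 72.0, 88.0, 69.0, 80.0]
sigma_a = [4.0] * 6   # method A: wider predicted uncertainty
sigma_b = [0.5] * 6   # method B: very narrow predicted uncertainty

gap_a = calibration_gap(mu_pred, sigma_a, y_true)
gap_b = calibration_gap(mu_pred, sigma_b, y_true)
print(gap_a, gap_b)
```

Both methods share the same point predictions here, so any difference in the gap comes from the uncertainty estimates alone, which is the separation the test described above would need.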

Figures

Figures reproduced from arXiv: 2605.18008 by Ciaran Bench, Mohammad Moulaeifard, Nils Strodthoff, Philip J. Aston.

Figure 1: Comparison of SBP and DBP probability density distributions between ID training data (CalibFree-VitalDB) …
Figure 2: Bar-plot visualization of top-performing method frequencies across all slices. The left panel summarizes ID …
Original abstract

Uncertainty quantification (UQ) is critical for safety-critical domains like healthcare, yet it is rarely evaluated under realistic out-of-distribution (OOD) conditions. Here, we assessed predictive performance and uncertainty reliability for deep learning-based blood pressure (BP) estimation from photoplethysmography (PPG) signals under both in-distribution (ID) and OOD settings. Using an XResNet1D-50 trained on PulseDB and tested on four external datasets, we compared deep ensembles (DE) and Monte Carlo dropout (MCD) with Gaussian negative log-likelihood (GNLL) and mean squared error (MSE) losses, optionally followed by post-hoc recalibration via conformal prediction (CP), temperature scaling (TS), and isotonic regression (IR). The key findings of our study are as follows: (1) DE provides stronger predictive robustness under domain shift than MCD, an advantage that becomes clear primarily under external shift. (2) Recalibrated GNLL-based methods yield the best uncertainty calibration (e.g., GNLL+DE+CP for systolic blood pressure (SBP), GNLL+DE+TS for diastolic blood pressure (DBP)), while MSE-based uncertainty requires recalibration to become practically useful. (3) Across settings, CP and TS offer the most consistent gains, with IR remaining competitive in several cases. Overall, our results identify DE-based methods as most robust for predictive performance under domain shift, GNLL as strongest for native UQ, and recalibration as essential for making MSE-based uncertainty practical. These findings highlight the need to jointly assess predictive accuracy and calibration on external data for trustworthy cuffless BP estimation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript evaluates uncertainty quantification methods for deep learning-based blood pressure estimation from PPG signals under domain shift. An XResNet1D-50 model is trained on PulseDB and tested on four external datasets, comparing deep ensembles (DE) versus Monte Carlo dropout (MCD), GNLL versus MSE losses, and post-hoc recalibration via conformal prediction (CP), temperature scaling (TS), and isotonic regression (IR). Central claims are that DE exhibits greater predictive robustness than MCD primarily under external shift, that recalibrated GNLL methods achieve the best uncertainty calibration, and that recalibration is required to make MSE-based uncertainty practically useful.

Significance. If the empirical results hold, the work provides a useful systematic comparison of UQ techniques in a safety-critical medical regression task, underscoring the value of external-dataset evaluation and recalibration for calibration under shift. The multi-dataset design and direct comparison of DE/MCD with native versus recalibrated uncertainty are strengths that could inform method selection for cuffless BP monitoring.

major comments (3)
  1. [Abstract and §4 (Results)] The headline claim that DE provides stronger robustness under domain shift than MCD rests on the assumption that the four external test sets induce shifts representative of clinical deployment. No quantification of these shifts (MMD, covariate/concept shift statistics, device/demographic breakdowns) is reported, so it is unclear whether the observed DE advantage generalizes or is specific to the chosen datasets.
  2. [§4.2 and §5 (Discussion)] Performance differences between DE and MCD, and between GNLL and MSE, are presented without statistical significance tests, confidence intervals, or multiple-comparison corrections. This weakens the reliability of the ranking statements (e.g., “GNLL+DE+CP for SBP”).
  3. [§3 (Methods)] Hyperparameter choices for the XResNet1D-50, ensemble size, dropout rate, and the exact implementation of CP/TS/IR recalibration are not fully specified, limiting reproducibility of the reported calibration and robustness gains.
minor comments (2)
  1. [Figures] Figure captions and axis labels in the calibration plots could more explicitly state whether the x-axis is predicted uncertainty or predicted BP value.
  2. [§3] The manuscript would benefit from a short table summarizing the four external datasets (size, demographics, acquisition device) to aid interpretation of shift severity.
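
The confidence intervals and multiple-comparison corrections requested in major comment 2 could look roughly like the sketch below: a paired percentile bootstrap for the difference in mean absolute error between two methods, plus a Bonferroni adjustment of p-values. The per-sample errors, the p-values, and the function names are hypothetical. A signed-rank test itself is better taken from a statistics library (for example `scipy.stats.wilcoxon`) than reimplemented.

```python
import random

def paired_bootstrap_ci(err_a, err_b, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for mean(err_a - err_b), resampling paired samples."""
    rng = random.Random(seed)
    n = len(err_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample pairs with replacement
        diffs.append(sum(err_a[i] - err_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((1 - level) / 2 * n_boot)]
    hi = diffs[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi

def bonferroni(p_values):
    """Bonferroni-adjusted p-values, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical per-sample absolute errors (mmHg) for MCD and DE on the same test samples.
err_mcd = [7.0, 7.9, 6.1, 8.4, 7.5, 7.0, 6.8, 7.2]
err_de = [6.1, 7.4, 5.2, 8.0, 6.6, 7.1, 5.9, 6.3]

lo, hi = paired_bootstrap_ci(err_mcd, err_de)
print(lo, hi, bonferroni([0.004, 0.03, 0.2]))
```

An interval that excludes zero would support a ranking statement for that pair of methods. With many method, target, and dataset combinations, the adjustment step matters because some differences will look significant by chance.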

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully addressed each major comment below and revised the paper accordingly to improve clarity, statistical rigor, and reproducibility. We believe these changes strengthen the work without altering its core contributions.

  1. Referee: [Abstract and §4 (Results)] The headline claim that DE provides stronger robustness under domain shift than MCD rests on the assumption that the four external test sets induce shifts representative of clinical deployment. No quantification of these shifts (MMD, covariate/concept shift statistics, device/demographic breakdowns) is reported, so it is unclear whether the observed DE advantage generalizes or is specific to the chosen datasets.

    Authors: We appreciate this observation. Our external datasets were selected based on their established use in prior PPG-based BP literature to reflect real-world variations in recording devices, patient demographics, and acquisition protocols. However, we agree that explicit quantification would better substantiate the generalizability of the DE advantage. In the revised manuscript, we will add a new subsection in §4 reporting Maximum Mean Discrepancy (MMD) distances between PulseDB and each external set, along with available device and demographic breakdowns to characterize the shifts.
    Revision: yes

  2. Referee: [§4.2 and §5 (Discussion)] Performance differences between DE and MCD, and between GNLL and MSE, are presented without statistical significance tests, confidence intervals, or multiple-comparison corrections. This weakens the reliability of the ranking statements (e.g., “GNLL+DE+CP for SBP”).

    Authors: We acknowledge the importance of statistical validation for comparative claims. In the revised version, we will augment §4.2 and §5 with bootstrap-derived 95% confidence intervals for all key metrics and apply paired non-parametric tests (Wilcoxon signed-rank) with Bonferroni correction for multiple comparisons to support the reported performance rankings and differences between methods.
    Revision: yes

  3. Referee: [§3 (Methods)] Hyperparameter choices for the XResNet1D-50, ensemble size, dropout rate, and the exact implementation of CP/TS/IR recalibration are not fully specified, limiting reproducibility of the reported calibration and robustness gains.

    Authors: We thank the referee for highlighting this gap. We will substantially expand §3 in the revision to provide complete hyperparameter details for XResNet1D-50 (learning rate, optimizer, batch size, epochs, weight decay), ensemble size (5 members), dropout probability (0.1), and precise implementations of conformal prediction (including nonconformity score and coverage level), temperature scaling, and isotonic regression. We will also release the full codebase upon acceptance to ensure reproducibility.
    Revision: yes
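
Of the three recalibration methods whose implementation the referee asks about, isotonic regression is the least self-explanatory. Its core fitting step is the pool-adjacent-violators algorithm (PAVA), sketched below for equally weighted points. How the paper builds the input pairs (for example predicted versus observed quantile levels) is not described on this page, so the toy input and the function name are assumptions.

```python
def pava(values):
    """Least-squares non-decreasing fit to a sequence (pool-adjacent-violators)."""
    blocks = []  # each block stores [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block mean exceeds the current one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)  # every point in a block gets the block mean
    return fitted

# Hypothetical observed frequencies, ordered by predicted confidence level.
observed = [0.10, 0.30, 0.25, 0.60, 0.55, 0.90]
print(pava(observed))
```

The fitted values form a monotone step function, which can then serve as a lookup from predicted to recalibrated confidence levels. Unlike a single temperature, this map has as many degrees of freedom as the data support, so it needs a reasonably large calibration set.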

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation on external held-out datasets

The paper conducts an empirical investigation: an XResNet1D-50 is trained on PulseDB and evaluated for predictive performance and uncertainty calibration on four external datasets under ID and OOD conditions. Comparisons are made between DE and MCD, GNLL and MSE losses, and post-hoc recalibrations (CP, TS, IR). Key findings are stated as direct observations from these experiments (e.g., DE robustness under external shift, GNLL+recalibration best calibrated). No equations, derivations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatzes appear in the provided text. All claims reduce to reported metrics on independent test sets rather than any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the empirical performance differences observed across the chosen datasets and methods rather than new theoretical axioms or invented entities.

axioms (1)
  • Domain assumption: the four external test datasets represent realistic and relevant domain shifts for PPG-based blood pressure estimation.
    This premise underpins the claim that observed differences reflect robustness under domain shift.


