Uncertainty Reliability Under Domain Shift: An Investigation for Data-Driven Blood Pressure Estimation in Photoplethysmography

Ciaran Bench; Mohammad Moulaeifard; Nils Strodthoff; Philip J. Aston

arxiv: 2605.18008 · v1 · pith:RXNCRNWRnew · submitted 2026-05-18 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-20 12:39 UTC · model grok-4.3

keywords uncertainty quantification · domain shift · photoplethysmography · blood pressure estimation · deep ensembles · Monte Carlo dropout · conformal prediction · recalibration

The pith

Deep ensembles with recalibrated Gaussian negative log-likelihood loss deliver stronger robustness and better calibrated uncertainty for PPG-based blood pressure estimates under domain shift than Monte Carlo dropout or MSE-based approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether common uncertainty quantification techniques remain reliable when deep learning models for blood pressure estimation from photoplethysmography signals encounter new data sources. It trains an XResNet1D-50 on PulseDB and evaluates performance plus uncertainty quality on four external datasets that introduce domain shifts. Deep ensembles prove more robust for predictions than Monte Carlo dropout once shifts are external, while methods built on Gaussian negative log-likelihood loss plus post-hoc recalibration produce the best uncertainty calibration. These patterns matter for safety-critical use because reliable uncertainty can indicate when a cuffless reading should be trusted or verified by other means. The work therefore stresses that both accuracy and calibration must be checked on external data before deployment.

Core claim

Deep ensembles provide stronger predictive robustness under domain shift than Monte Carlo dropout, with the advantage clearest under external shift. Recalibrated GNLL-based methods yield the best uncertainty calibration, for instance GNLL+DE+CP for systolic blood pressure and GNLL+DE+TS for diastolic blood pressure, whereas MSE-based uncertainty becomes practically useful only after recalibration. Across the tested settings, conformal prediction and temperature scaling deliver the most consistent gains.
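One common way to realize the CP step is split conformal prediction with residuals normalized by the predicted standard deviation. The sketch below assumes that variant; the nonconformity score, the function names, and the `alpha` value are illustrative choices, not details taken from the paper.

```python
import math

def conformal_scale(y_cal, mu_cal, sigma_cal, alpha=0.1):
    """Return q so that mu +/- q * sigma targets 1 - alpha coverage.

    Assumed score (not confirmed by the paper): |y - mu| / sigma.
    """
    scores = sorted(abs(y - m) / s for y, m, s in zip(y_cal, mu_cal, sigma_cal))
    n = len(scores)
    # Finite-sample corrected rank used in split conformal prediction.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]

def conformal_interval(mu, sigma, q):
    """Prediction interval for one test point."""
    return mu - q * sigma, mu + q * sigma

# Toy calibration set: nine points with residuals 1..9 and unit sigma.
q = conformal_scale(list(range(1, 10)), [0] * 9, [1] * 9, alpha=0.5)
interval = conformal_interval(120.0, 2.0, q)
```

The coverage guarantee of split conformal prediction rests on exchangeability between calibration and test data, which is exactly what domain shift can violate, so coverage on external sets still has to be checked empirically.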

What carries the argument

Deep ensembles (DE) versus Monte Carlo dropout (MCD), paired with Gaussian negative log-likelihood (GNLL) or mean squared error (MSE) training loss and followed by conformal prediction (CP), temperature scaling (TS), or isotonic regression (IR) recalibration, inside an XResNet1D-50 network for PPG-to-BP regression.
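To make the GNLL and DE ingredients concrete, here is a minimal sketch of the per-sample Gaussian negative log-likelihood and of the usual way ensemble members are merged into one predictive mean and variance (treating the ensemble as an equally weighted Gaussian mixture). The aggregation rule is the standard one from the deep ensembles literature; whether the paper uses exactly this rule is an assumption, and the function names are hypothetical.

```python
import math

def gaussian_nll(y, mu, var):
    """Negative log-likelihood of y under N(mu, var)."""
    return 0.5 * (math.log(2 * math.pi * var) + (y - mu) ** 2 / var)

def combine_ensemble(mus, variances):
    """Merge member predictions into a single mean and total variance.

    Total variance = mean predicted variance (aleatoric part)
                   + spread of the member means (epistemic part).
    """
    m = len(mus)
    mean = sum(mus) / m
    aleatoric = sum(variances) / m
    epistemic = sum((mu - mean) ** 2 for mu in mus) / m
    return mean, aleatoric + epistemic

# Two hypothetical members predicting SBP in mmHg.
mean, total_var = combine_ensemble([120.0, 124.0], [4.0, 4.0])
```

For members trained with MSE, which output no variance, one option is to pass zeros for `variances` so that only the spread term remains; that handling is also an assumption rather than something stated in the text above.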

If this is right

DE-based methods are the most robust choice for predictive performance when models face domain shift.
GNLL supplies the strongest native uncertainty quantification before any recalibration.
Recalibration is required to make MSE-derived uncertainty estimates practically usable.
CP and TS produce the most consistent calibration improvements across both ID and OOD conditions.
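For regression, temperature scaling is often implemented as a single scalar that rescales every predicted variance, fitted on held-out calibration data by minimizing the Gaussian NLL. Under that assumption the optimum has a closed form, the mean squared standardized residual. The sketch below illustrates this variant only; the paper's exact TS implementation is not described in the text above, and the function names are hypothetical.

```python
def fit_variance_temperature(y_cal, mu_cal, var_cal):
    """Scalar t minimizing the mean Gaussian NLL when var is replaced by t * var.

    Setting the derivative of 0.5*log(t*var) + (y-mu)^2 / (2*t*var) to zero
    gives t = mean((y - mu)^2 / var).
    """
    z2 = [(y - m) ** 2 / v for y, m, v in zip(y_cal, mu_cal, var_cal)]
    return sum(z2) / len(z2)

def apply_temperature(var, t):
    """Recalibrated variance for a new prediction."""
    return t * var

# Toy case: residuals of +/-2 with predicted variance 1 are overconfident,
# so the fitted temperature inflates the variance.
t = fit_variance_temperature([2.0, -2.0], [0.0, 0.0], [1.0, 1.0])
```

A fitted `t` above 1 means the raw uncertainties were too narrow on the calibration set, and a value below 1 means they were too wide.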

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the observed DE advantage persists in streaming clinical data, safety monitors could preferentially route uncertain readings to additional sensors or clinician review.
The results suggest testing whether GNLL+DE combinations also improve calibration when the model must adapt online to new patients, rather than only when it is evaluated on static external datasets.
Future extensions could measure how much of the robustness gain comes from ensemble diversity versus the specific loss and recalibration choices.

Load-bearing premise

The four external datasets used for testing adequately capture the kinds of domain shifts that would occur in real clinical deployment of PPG-based BP estimation.

What would settle it

Apply the same DE, MCD, GNLL, MSE, and recalibration pipelines to PPG recordings from a new clinical population whose distribution differs from both PulseDB and the four external test sets, then check whether DE still outperforms MCD on predictive robustness and whether GNLL+DE+CP or GNLL+DE+TS remains the top-calibrated method.
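A check of this kind needs a calibration measure that can be recomputed on the new population. One simple option, sketched below under the assumption of Gaussian predictive distributions, compares the empirical coverage of central prediction intervals with their nominal level. The metric, the chosen levels, and the function names are illustrative and not taken from the paper.

```python
from statistics import NormalDist

def empirical_coverage(y, mu, sigma, level):
    """Fraction of targets inside the central `level` Gaussian interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    hits = sum(1 for yi, m, s in zip(y, mu, sigma) if abs(yi - m) <= z * s)
    return hits / len(y)

def calibration_gap(y, mu, sigma, levels=(0.5, 0.8, 0.9, 0.95)):
    """Mean absolute difference between empirical and nominal coverage."""
    gaps = [abs(empirical_coverage(y, mu, sigma, p) - p) for p in levels]
    return sum(gaps) / len(gaps)

# Toy data: four targets, all predicted as N(0, 1).
y = [0.0, 0.5, 1.0, 3.0]
cov90 = empirical_coverage(y, [0.0] * 4, [1.0] * 4, 0.9)
```

A well-calibrated method keeps `calibration_gap` close to zero on the new data, and a method whose gap grows sharply relative to its ID value is losing reliability under the shift.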

Figures

Figures reproduced from arXiv: 2605.18008 by Ciaran Bench, Mohammad Moulaeifard, Nils Strodthoff, Philip J. Aston.

**Figure 1.** Comparison of SBP and DBP probability density distributions between ID training data (CalibFree-VitalDB) …

**Figure 2.** Bar-plot visualization of top-performing method frequencies across all slices. The left panel summarizes ID …

Original abstract

Uncertainty quantification (UQ) is critical for safety-critical domains like healthcare, yet it is rarely evaluated under realistic out-of-distribution (OOD) conditions. Here, we assessed predictive performance and uncertainty reliability for deep learning-based blood pressure (BP) estimation from photoplethysmography (PPG) signals under both in-distribution (ID) and OOD settings. Using an XResNet1D-50 trained on PulseDB and tested on four external datasets, we compared deep ensembles (DE) and Monte Carlo dropout (MCD) with Gaussian negative log-likelihood (GNLL) and mean squared error (MSE) losses, optionally followed by post-hoc recalibration via conformal prediction (CP), temperature scaling (TS), and isotonic regression (IR). The key findings of our study are as follows: (1) DE provides stronger predictive robustness under domain shift than MCD, an advantage that becomes clear primarily under external shift. (2) Recalibrated GNLL-based methods yield the best uncertainty calibration (e.g., GNLL+DE+CP for systolic blood pressure (SBP), GNLL+DE+TS for diastolic blood pressure (DBP)), while MSE-based uncertainty requires recalibration to become practically useful. (3) Across settings, CP and TS offer the most consistent gains, with IR remaining competitive in several cases. Overall, our results identify DE-based methods as most robust for predictive performance under domain shift, GNLL as strongest for native UQ, and recalibration as essential for making MSE-based uncertainty practical. These findings highlight the need to jointly assess predictive accuracy and calibration on external data for trustworthy cuffless BP estimation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Deep ensembles hold up better than MC dropout for PPG BP under external shift, and GNLL with recalibration calibrates best, but the test sets' representativeness of real clinical shifts is not quantified.

Two results matter most. Deep ensembles deliver more robust predictions than Monte Carlo dropout once the data moves to external PPG datasets, and Gaussian negative log-likelihood with post-hoc recalibration produces the most reliable uncertainty estimates in this setting. The paper trains an XResNet1D-50 on PulseDB and evaluates on four held-out external collections, comparing DE and MCD under both ID and OOD conditions while testing GNLL versus MSE losses plus conformal prediction, temperature scaling, and isotonic regression. The reported pattern, in which the ensemble advantage shows up mainly under external shift and recalibrated GNLL needs the least adjustment, lines up with the experimental choices and gives a clear empirical picture for this medical regression task. The systematic side-by-side comparison on multiple datasets is the useful part: it supplies the kind of practical check that matters when people actually want to deploy cuffless monitors.

The soft spots sit in the characterization of the shifts themselves. The abstract and key findings do not include any distance metrics, device breakdowns, or demographic splits, so it remains possible that the DE edge is tied to the particular external sets rather than being a general property of ensembles versus dropout. The absence of statistical significance tests and of full hyperparameter details also makes the size of the differences harder to judge.

This is a solid empirical extension rather than a new method, so it will mainly interest readers already working on uncertainty for physiological signals or on cuffless BP specifically. The setup is grounded enough and the question relevant enough that it should go to peer review, with the expectation that the authors add shift quantification and tighten the statistics.

Referee Report

3 major / 2 minor

Summary. The manuscript evaluates uncertainty quantification methods for deep learning-based blood pressure estimation from PPG signals under domain shift. An XResNet1D-50 model is trained on PulseDB and tested on four external datasets, comparing deep ensembles (DE) versus Monte Carlo dropout (MCD), GNLL versus MSE losses, and post-hoc recalibration via conformal prediction (CP), temperature scaling (TS), and isotonic regression (IR). Central claims are that DE exhibits greater predictive robustness than MCD primarily under external shift, that recalibrated GNLL methods achieve the best uncertainty calibration, and that recalibration is required to make MSE-based uncertainty practically useful.

Significance. If the empirical results hold, the work provides a useful systematic comparison of UQ techniques in a safety-critical medical regression task, underscoring the value of external-dataset evaluation and recalibration for calibration under shift. The multi-dataset design and direct comparison of DE/MCD with native versus recalibrated uncertainty are strengths that could inform method selection for cuffless BP monitoring.

major comments (3)

[Abstract and §4 (Results)] The headline claim that DE provides stronger robustness under domain shift than MCD rests on the assumption that the four external test sets induce shifts representative of clinical deployment. No quantification of these shifts (MMD, covariate/concept shift statistics, device/demographic breakdowns) is reported, so it is unclear whether the observed DE advantage generalizes or is specific to the chosen datasets.
[§4.2 and §5 (Discussion)] Performance differences between DE and MCD, and between GNLL and MSE, are presented without statistical significance tests, confidence intervals, or multiple-comparison corrections. This weakens the reliability of the ranking statements (e.g., “GNLL+DE+CP for SBP”).
[§3 (Methods)] Hyperparameter choices for the XResNet1D-50, ensemble size, dropout rate, and the exact implementation of CP/TS/IR recalibration are not fully specified, limiting reproducibility of the reported calibration and robustness gains.
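The shift quantification requested in the first major comment could, for example, use Maximum Mean Discrepancy between feature vectors from the training set and from each external set. Below is a minimal sketch of the biased (V-statistic) squared-MMD estimate with an RBF kernel. The kernel choice, the bandwidth, and the idea of applying it to per-segment feature vectors are assumptions made for illustration, not choices documented by the paper.

```python
import math

def rbf(x, y, bandwidth):
    """Gaussian RBF kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    kxx = sum(rbf(a, b, bandwidth) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, bandwidth) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, bandwidth) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy

# Toy one-dimensional features: a source sample and a clearly shifted one.
source = [[0.0], [1.0]]
shifted = [[5.0], [6.0]]
same = mmd_squared(source, source)
far = mmd_squared(source, shifted)
```

The double loops are quadratic in sample size, so on datasets of realistic size this would be run on a random subsample or replaced by a vectorized implementation.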

minor comments (2)

[Figures] Figure captions and axis labels in the calibration plots could more explicitly state whether the x-axis is predicted uncertainty or predicted BP value.
[§3] The manuscript would benefit from a short table summarizing the four external datasets (size, demographics, acquisition device) to aid interpretation of shift severity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully addressed each major comment below and revised the paper accordingly to improve clarity, statistical rigor, and reproducibility. We believe these changes strengthen the work without altering its core contributions.

Referee: [Abstract and §4 (Results)] The headline claim that DE provides stronger robustness under domain shift than MCD rests on the assumption that the four external test sets induce shifts representative of clinical deployment. No quantification of these shifts (MMD, covariate/concept shift statistics, device/demographic breakdowns) is reported, so it is unclear whether the observed DE advantage generalizes or is specific to the chosen datasets.

Authors: We appreciate this observation. Our external datasets were selected based on their established use in prior PPG-based BP literature to reflect real-world variations in recording devices, patient demographics, and acquisition protocols. However, we agree that explicit quantification would better substantiate the generalizability of the DE advantage. In the revised manuscript, we will add a new subsection in §4 reporting Maximum Mean Discrepancy (MMD) distances between PulseDB and each external set, along with available device and demographic breakdowns to characterize the shifts.
Revision: yes

Referee: [§4.2 and §5 (Discussion)] Performance differences between DE and MCD, and between GNLL and MSE, are presented without statistical significance tests, confidence intervals, or multiple-comparison corrections. This weakens the reliability of the ranking statements (e.g., “GNLL+DE+CP for SBP”).

Authors: We acknowledge the importance of statistical validation for comparative claims. In the revised version, we will augment §4.2 and §5 with bootstrap-derived 95% confidence intervals for all key metrics and apply paired non-parametric tests (Wilcoxon signed-rank) with Bonferroni correction for multiple comparisons to support the reported performance rankings and differences between methods.
Revision: yes

Referee: [§3 (Methods)] Hyperparameter choices for the XResNet1D-50, ensemble size, dropout rate, and the exact implementation of CP/TS/IR recalibration are not fully specified, limiting reproducibility of the reported calibration and robustness gains.

Authors: We thank the referee for highlighting this gap. We will substantially expand §3 in the revision to provide complete hyperparameter details for XResNet1D-50 (learning rate, optimizer, batch size, epochs, weight decay), ensemble size (5 members), dropout probability (0.1), and precise implementations of conformal prediction (including nonconformity score and coverage level), temperature scaling, and isotonic regression. We will also release the full codebase upon acceptance to ensure reproducibility.
Revision: yes
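The statistical additions promised in the second response can be illustrated with a short sketch: a percentile bootstrap confidence interval for the mean paired difference in absolute error between two methods, plus the Bonferroni-adjusted significance threshold. The resampling scheme, the defaults, and the function names are assumptions for illustration. The Wilcoxon signed-rank test itself is omitted to keep the example dependency-free (it is available, for instance, as `scipy.stats.wilcoxon`).

```python
import random

def bootstrap_ci(errors_a, errors_b, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean paired error difference (A - B)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    # Resample paired differences with replacement and record each mean.
    means = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((1 - level) / 2 * n_boot)]
    hi = means[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi

def bonferroni_alpha(alpha, n_comparisons):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_comparisons

# Toy absolute errors (mmHg) for two hypothetical methods on the same samples.
lo, hi = bootstrap_ci([5.0, 6.0, 7.0, 8.0], [1.0, 2.0, 3.0, 4.0])
```

If the interval excludes zero, the paired difference is unlikely to be a resampling artifact. With PPG data, resampling whole subjects rather than individual segments would respect within-subject correlation, which this toy version ignores.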

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation on external held-out datasets

The paper conducts an empirical investigation: an XResNet1D-50 is trained on PulseDB and evaluated for predictive performance and uncertainty calibration on four external datasets under ID and OOD conditions. Comparisons are made between DE and MCD, GNLL and MSE losses, and post-hoc recalibrations (CP, TS, IR). Key findings are stated as direct observations from these experiments (e.g., DE robustness under external shift, GNLL+recalibration best calibrated). No equations, derivations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatzes appear in the provided text. All claims reduce to reported metrics on independent test sets rather than any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the empirical performance differences observed across the chosen datasets and methods rather than new theoretical axioms or invented entities.

axioms (1)

domain assumption: The four external test datasets represent realistic and relevant domain shifts for PPG-based blood pressure estimation. This premise underpins the claim that observed differences reflect robustness under domain shift.

pith-pipeline@v0.9.0 · 5852 in / 1251 out tokens · 41009 ms · 2026-05-20T12:39:36.511547+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

Using an XResNet1D-50 trained on PulseDB and tested on four external datasets

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


Mohammad Moulaeifard, Peter H Charlton, and Nils Strodthoff. MIMIC-III-Ext-PPG: A PPG Benchmark Dataset for Cardiorespiratory Analysis.PhysioNet, March 2026. Version 1.1.0

work page 2026
[51]

Mimic-iii-ext- ppg, a ppg-based benchmark dataset for cardiovascular and respiratory signal analysis.Scientific Data, 13(1):668, 2026

Mohammad Moulaeifard, Marie Kutscher, Philip J Aston, Peter H Charlton, and Nils Strodthoff. Mimic-iii-ext- ppg, a ppg-based benchmark dataset for cardiovascular and respiratory signal analysis.Scientific Data, 13(1):668, 2026

work page 2026
[52]

Deriving health metrics from the photoplethysmogram: Benchmarks and insights from mimic-iii-ext-ppg.arXiv preprint arXiv:2603.21832, 2026

Appendix

This appendix provides the detailed numerical results that complement the main findings presented in the main text. Specifically,...
