Trustworthy deep domain adaptation for wearable photoplethysmography signal analysis with decision-theoretic uncertainty quantification
Pith reviewed 2026-05-10 06:55 UTC · model grok-4.3
The pith
Decision-theoretic uncertainty quantification evaluates the trustworthiness of generative outputs in domain adaptation for photoplethysmography signals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that decision-theoretic uncertainty quantification addresses the limitations of standard evaluation for generative models in domain adaptation by tying uncertainty estimates directly to downstream task performance. In the photoplethysmography denoising case study for atrial fibrillation classification, this approach formalizes the heuristic of using the discriminative classifier to assess the quality of generated outputs even in the absence of ground truths.
What carries the argument
The argument rests on decision-theoretic uncertainty quantification, which measures the reliability of generated domain-adapted outputs by their effect on the performance of a downstream classifier rather than by direct comparison to unavailable ground truths.
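One generic way to write this down is sketched here. It is not taken from the paper, which this page does not quote any equations from. All symbols are assumptions made for illustration: x is a noisy PPG input, \tilde{x} an adapted (denoised) signal drawn from the generative model q, p_f the fixed downstream classifier, and \ell the task loss over actions a.

```latex
% Assumed notation, not the paper's: x = noisy input, \tilde{x} = adapted signal,
% q = generative model, p_f = downstream classifier, \ell = task loss.
\begin{align}
  p(y \mid x) &= \mathbb{E}_{\tilde{x} \sim q(\tilde{x} \mid x)}
                 \bigl[\, p_f(y \mid \tilde{x}) \,\bigr], \\
  U(x) &= \min_{a \in \mathcal{A}} \sum_{y} p(y \mid x)\, \ell(a, y).
\end{align}
% Special cases:
%   0-1 loss, actions are labels:        U(x) = 1 - \max_y p(y \mid x)
%   log loss, actions are distributions: U(x) = -\sum_y p(y \mid x) \log p(y \mid x)
```

Under this reading, the uncertainty U(x) is the expected loss of the best available action, so it is expressed in the units of the downstream task rather than in terms of signal reconstruction error.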
If this is right
- Generative models can align input features across domains to improve a discriminative model's accuracy on new test data.
- Uncertainty estimates become directly relevant to the intended use case instead of remaining agnostic to downstream utility.
- The downstream classifier itself becomes a validator for the quality of denoised photoplethysmography time series.
- Adapted examples can be accepted or rejected based on their quantified uncertainty before they reach the final classifier.
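The accept-or-reject step in the last point could be sketched as follows. This is a minimal illustration, assuming a 0-1 loss (the page does not say which decision-theoretic score the paper uses) and assuming the classifier has already been run on several generated versions of each input. The function names, the threshold of 0.2, and the probabilities are all hypothetical.

```python
import numpy as np

def decision_uncertainty(probs: np.ndarray) -> np.ndarray:
    """Expected 0-1 loss of the best label for each input.

    probs has shape (n_inputs, n_samples, n_classes): classifier
    probabilities for several generated versions of each input signal.
    """
    mean_probs = probs.mean(axis=1)        # average over generated samples
    return 1.0 - mean_probs.max(axis=1)    # risk of the most probable label

def accept_mask(probs: np.ndarray, threshold: float = 0.2) -> np.ndarray:
    """True where an adapted example is trusted enough to pass downstream."""
    return decision_uncertainty(probs) <= threshold

# Toy, hand-written probabilities (illustrative only, not results from the paper).
probs = np.array([
    [[0.95, 0.05], [0.90, 0.10], [0.97, 0.03]],  # generated samples agree
    [[0.80, 0.20], [0.30, 0.70], [0.45, 0.55]],  # generated samples disagree
])
uncertainty = decision_uncertainty(probs)
mask = accept_mask(probs)
print(uncertainty, mask)
```

In this toy case the first input is accepted and the second is rejected, because disagreement among the generated samples pulls the averaged prediction towards chance.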
Where Pith is reading between the lines
- The same decision-theoretic lens could be applied to other wearable signals where device-to-device shifts are common.
- If the uncertainties prove well-calibrated, filtering high-uncertainty adaptations might raise end-to-end classification reliability in real deployments.
- Extending the framework to regression or multi-class downstream tasks would test whether the approach generalizes beyond binary atrial fibrillation detection.
Load-bearing premise
Performance on the downstream classification task serves as a complete and reliable proxy for the true quality of the generated signals, without missing artefacts that the generative model might introduce.
What would settle it
Obtain ground truth labels for the target domain, then check whether examples flagged as high-uncertainty by the framework actually produce lower atrial fibrillation classification accuracy than low-uncertainty examples when fed to the same downstream model.
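A minimal sketch of that check, assuming target-domain labels, classifier predictions, and per-example uncertainty scores are already available as arrays. The helper name, the median split, and the toy values are assumptions made for illustration.

```python
import numpy as np

def accuracy_by_uncertainty(y_true, y_pred, uncertainty, quantile=0.5):
    """Split examples at an uncertainty quantile; return (low_acc, high_acc).

    Assumes both groups are non-empty; an empty group would yield NaN.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    uncertainty = np.asarray(uncertainty, dtype=float)
    cut = np.quantile(uncertainty, quantile)
    high = uncertainty > cut                 # flagged as high-uncertainty
    correct = y_true == y_pred
    return correct[~high].mean(), correct[high].mean()

# Toy labels, predictions, and scores (illustrative only).
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
unc = [0.05, 0.10, 0.08, 0.12, 0.40, 0.45, 0.50, 0.35]

low_acc, high_acc = accuracy_by_uncertainty(y_true, y_pred, unc)
print(f"low-uncertainty accuracy:  {low_acc:.2f}")
print(f"high-uncertainty accuracy: {high_acc:.2f}")
```

If the uncertainty scores are informative, the high-uncertainty group should show clearly lower accuracy. Similar accuracy in both groups would count against the framework.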
Original abstract
In principle, deep generative models can be used to perform domain adaptation; i.e. align the input feature representations of test data with that of a separate discriminative model's training data. This can help improve the discriminative model's performance on the test data. However, generative models are prone to producing hallucinations and artefacts that may degrade the quality of generated data, and therefore, predictive performance when processed by the discriminative model. While uncertainty quantification can provide a means to assess the quality of adapted data, the standard framework for evaluating the quality of predicted uncertainties may not easily extend to generative models due to the common lack of ground truths (among other reasons). Even with ground truths, this evaluation is agnostic to how the generated outputs are used on the downstream task, limiting the extent to which the uncertainty reliability analysis provides insights about the utility of the uncertainties with respect to the intended use case of the adapted examples. Here, we describe how decision-theoretic uncertainty quantification can address these concerns and provide a convenient framework for evaluating the trustworthiness of generated outputs, in particular, for domain adaptation. We consider a case study in photoplethysmography time series denoising for Atrial Fibrillation classification. This formalises a well-known heuristic method of using a downstream classifier to assess the quality of generated outputs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes using decision-theoretic uncertainty quantification to assess the trustworthiness of outputs from deep generative models performing domain adaptation on photoplethysmography (PPG) time series, with a case study on denoising for atrial fibrillation (AF) classification. It formalizes the heuristic of routing generated signals through a fixed downstream classifier to evaluate quality in the absence of ground truths, addressing limitations of standard UQ methods that are task-agnostic and require ground truths.
Significance. If the framework is shown to reliably detect trustworthiness issues, it could provide a practical, task-aligned method for evaluating generative domain adaptation in medical signal processing where ground truth is unavailable, potentially improving deployment of wearable PPG analysis systems.
major comments (2)
- [Abstract] The central construction formalizes routing generated PPG outputs through a fixed AF classifier and using decision-theoretic scores for trustworthiness. However, this assumes classifier accuracy is a sufficient statistic for output quality; the manuscript does not derive or ablate whether artefacts (phase jitter, spurious harmonics) that preserve AF classification but degrade clinical signal fidelity are captured by the proxy.
- [Case study] No independent check or sensitivity analysis is provided to confirm that the decision-theoretic reframing supplies a valid proxy when the generator introduces non-stationary distortions tolerated by the AF detector but rejected by clinicians, which is the exact gap the motivation identifies in standard UQ.
minor comments (1)
- [Abstract] The description of the approach is clear at a high level but would benefit from naming the specific decision-theoretic scores (e.g., expected utility or regret) employed in the framework.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. These observations help sharpen the presentation of the assumptions underlying our decision-theoretic uncertainty quantification framework for generative domain adaptation. We respond to each major comment below, indicating where revisions will be made to the manuscript.
Point-by-point responses
- Referee: [Abstract] The central construction formalizes routing generated PPG outputs through a fixed AF classifier and using decision-theoretic scores for trustworthiness. However, this assumes classifier accuracy is a sufficient statistic for output quality; the manuscript does not derive or ablate whether artefacts (phase jitter, spurious harmonics) that preserve AF classification but degrade clinical signal fidelity are captured by the proxy.
Authors: We appreciate the referee drawing attention to this key assumption. Our framework deliberately ties trustworthiness evaluation to downstream task utility through decision theory, so that uncertainty scores reflect expected loss in AF classification performance rather than generic signal quality. Artefacts that leave classification accuracy unchanged are therefore treated as not degrading task-specific trustworthiness, which is consistent with the paper's focus on task-aligned evaluation in the absence of ground truth. We acknowledge, however, that such artefacts could still affect clinical signal fidelity outside the AF classification task. In the revised version we will update the abstract to state this scope explicitly and add a short discussion paragraph on the proxy's limitations with respect to broader clinical fidelity.
Revision: yes
- Referee: [Case study] No independent check or sensitivity analysis is provided to confirm that the decision-theoretic reframing supplies a valid proxy when the generator introduces non-stationary distortions tolerated by the AF detector but rejected by clinicians, which is the exact gap the motivation identifies in standard UQ.
Authors: The case study applies the decision-theoretic reframing to PPG denoising for AF classification and shows that the resulting uncertainty scores track classification outcomes. The motivation of the work is precisely to move beyond task-agnostic, ground-truth-dependent UQ by anchoring evaluation in downstream utility; distortions that the AF classifier tolerates therefore do not increase expected utility loss under our formulation. We agree that an explicit sensitivity analysis would provide additional reassurance. We will therefore add such an analysis to the revised case-study section, including controlled introduction of phase jitter and spurious harmonics and reporting how the uncertainty scores respond when these distortions are or are not tolerated by the classifier.
Revision: yes
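The controlled distortions mentioned in the second response could be generated along the following lines. This is only a sketch under stated assumptions: the waveform is a crude synthetic stand-in for a PPG pulse, the helper names and parameter values are hypothetical, and the step that feeds each distorted signal to the AF classifier and records the uncertainty score is left out because the page does not describe that model's interface.

```python
import numpy as np

def synthetic_pulse(fs=50.0, seconds=10.0, heart_rate_hz=1.2):
    """Crude PPG-like waveform: a fundamental plus a weaker second harmonic."""
    t = np.arange(int(fs * seconds)) / fs
    x = (np.sin(2 * np.pi * heart_rate_hz * t)
         + 0.3 * np.sin(4 * np.pi * heart_rate_hz * t))
    return t, x

def add_spurious_harmonic(t, x, freq_hz, amplitude):
    """Superimpose a sinusoid that is not part of the pulse waveform."""
    return x + amplitude * np.sin(2 * np.pi * freq_hz * t)

def add_phase_jitter(t, x, max_shift_s, rng):
    """Resample x on a randomly perturbed time axis (linear interpolation)."""
    shifts = rng.uniform(-max_shift_s, max_shift_s, size=t.shape)
    warped = np.clip(t + shifts, t[0], t[-1])
    return np.interp(warped, t, x)

rng = np.random.default_rng(0)
t, clean = synthetic_pulse()
with_harmonic = add_spurious_harmonic(t, clean, freq_hz=7.0, amplitude=0.1)
jittered = add_phase_jitter(t, clean, max_shift_s=0.02, rng=rng)

# Each distorted signal would next be passed to the downstream classifier,
# and the change in its uncertainty score compared against the clean signal.
print(np.abs(with_harmonic - clean).max(), np.abs(jittered - clean).max())
```

Sweeping `amplitude` and `max_shift_s` from zero upwards gives a dose-response curve: the range where the signal visibly degrades but the uncertainty score stays flat is where the classifier-based proxy is blind.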
Circularity Check
Conceptual framework formalizes known heuristic without reducing to inputs by construction
full rationale
The paper proposes decision-theoretic uncertainty quantification as a framework for assessing trustworthiness of generative domain adaptation outputs in PPG denoising for AF classification. It explicitly acknowledges formalizing a well-known heuristic of routing outputs through a downstream classifier, but presents no equations, fitted parameters, self-citations, or derivations that reduce the central claim to its own inputs. The contribution is a reframing of evaluation rather than a mathematical chain that is tautological or self-referential, leaving the analysis self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Deep generative models can align input feature representations of test data with training data of a discriminative model.
- domain assumption: Uncertainty quantification can assess quality of generated data despite lack of ground truths.