
arxiv: 2604.17480 · v1 · submitted 2026-04-19 · 💻 cs.LG

Trustworthy deep domain adaptation for wearable photoplethysmography signal analysis with decision-theoretic uncertainty quantification

Pith reviewed 2026-05-10 06:55 UTC · model grok-4.3

classification 💻 cs.LG
keywords domain adaptation · uncertainty quantification · photoplethysmography · generative models · atrial fibrillation · signal denoising · trustworthy machine learning

The pith

Decision-theoretic uncertainty quantification evaluates the trustworthiness of generative outputs in domain adaptation for photoplethysmography signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that decision-theoretic uncertainty quantification offers a practical way to judge the quality of data produced by deep generative models during domain adaptation, particularly when ground-truth labels for the target domain are missing. Standard uncertainty checks fall short here because they ignore how the adapted signals will actually be used in a downstream task and cannot be validated without reference data. By instead measuring uncertainty through its impact on a subsequent classifier, such as one that detects atrial fibrillation (AF) in photoplethysmography (PPG) time series, the method turns the classifier itself into an evaluator of the generated outputs. A reader would care because this supplies a concrete route to trusting or rejecting adapted wearable signals in settings where device differences create persistent domain shifts.

Core claim

The central claim is that decision-theoretic uncertainty quantification addresses the limitations of standard evaluation for generative models in domain adaptation by tying uncertainty estimates directly to downstream task performance. In the photoplethysmography denoising case study for atrial fibrillation classification, this approach formalizes the heuristic of using the discriminative classifier to assess the quality of generated outputs even in the absence of ground truths.

What carries the argument

Decision-theoretic uncertainty quantification measures the reliability of generated, domain-adapted outputs by their effect on the performance of a downstream classifier rather than by direct comparison to unavailable ground truths.
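
One common way to make this idea concrete is to define uncertainty as the smallest expected loss achievable when acting on the classifier's predictive distribution. This page does not name the paper's exact scores, so the sketch below is an illustration under that assumption rather than the author's method. The loss matrices, the class ordering `[non-AF, AF]`, and the function name `bayes_decision` are all hypothetical.

```python
from typing import List, Sequence, Tuple


def bayes_decision(
    probs: Sequence[float], loss: Sequence[Sequence[float]]
) -> Tuple[int, float]:
    """Return (best action, its expected loss).

    loss[a][y] is the cost of taking action a when the true class is y.
    The minimum expected loss is read here as the uncertainty score.
    """
    risks: List[float] = [
        sum(p * cost for p, cost in zip(probs, row)) for row in loss
    ]
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks[best]


# Hypothetical losses; classes and actions are ordered [non-AF, AF].
ZERO_ONE = [[0.0, 1.0], [1.0, 0.0]]
# Assumed asymmetric costs: a missed AF episode is five times a false alarm.
MISS_AF_COSTLY = [[0.0, 5.0], [1.0, 0.0]]

probs = [0.8, 0.2]  # classifier output for one denoised signal (made up)
print(bayes_decision(probs, ZERO_ONE))
print(bayes_decision(probs, MISS_AF_COSTLY))
```

The same predictive distribution yields different uncertainty scores, and even different preferred actions, under the two losses, which is the sense in which the score depends on the intended use of the signal.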

If this is right

  • Generative models can align input features across domains to improve a discriminative model's accuracy on new test data.
  • Uncertainty estimates become directly relevant to the intended use case instead of remaining agnostic to downstream utility.
  • The downstream classifier itself becomes a validator for the quality of denoised photoplethysmography time series.
  • Adapted examples can be accepted or rejected based on their quantified uncertainty before they reach the final classifier.
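
The accept-or-reject step in the last bullet can be sketched as a simple gate on the classifier's predictive entropy. This is a minimal illustration, not the paper's pipeline: the threshold `max_entropy`, the function `gate_adapted`, and the `toy_classifier` stand-in are assumptions introduced here, and the example signals are made up.

```python
import math
from typing import Callable, List, Sequence, Tuple

Signal = List[float]


def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (nats) of a vector of class probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def gate_adapted(
    signals: Sequence[Signal],
    classify: Callable[[Signal], Sequence[float]],
    max_entropy: float,
) -> Tuple[List[int], List[int]]:
    """Split signal indices into (accepted, rejected) by classifier entropy."""
    accepted: List[int] = []
    rejected: List[int] = []
    for i, signal in enumerate(signals):
        if predictive_entropy(classify(signal)) <= max_entropy:
            accepted.append(i)
        else:
            rejected.append(i)
    return accepted, rejected


def toy_classifier(signal: Signal) -> List[float]:
    """Hypothetical stand-in: logistic function of the signal mean."""
    score = sum(signal) / len(signal)
    p_af = 1.0 / (1.0 + math.exp(-score))
    return [p_af, 1.0 - p_af]


adapted = [[4.0, 4.0], [0.1, -0.1], [-3.0, -5.0]]
accepted, rejected = gate_adapted(adapted, toy_classifier, max_entropy=0.5)
print(accepted, rejected)
```

In practice the threshold would have to be chosen on held-out data, since a gate of this kind is only useful if the uncertainty scores actually track classification errors.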

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decision-theoretic lens could be applied to other wearable signals where device-to-device shifts are common.
  • If the uncertainties prove well-calibrated, filtering high-uncertainty adaptations might raise end-to-end classification reliability in real deployments.
  • Extending the framework to regression or multi-class downstream tasks would test whether the approach generalizes beyond binary atrial fibrillation detection.

Load-bearing premise

Performance on the downstream classification task serves as a complete and reliable proxy for the true quality of the generated signals, without missing artefacts that the generative model might introduce.

What would settle it

Obtain ground truth labels for the target domain, then check whether examples flagged as high-uncertainty by the framework actually produce lower atrial fibrillation classification accuracy than low-uncertainty examples when fed to the same downstream model.
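
Once target-domain labels are available, the check described above could be sketched as follows. The median split, the function name `accuracy_by_uncertainty`, and the example values are assumptions made for illustration; the paper may use a different grouping or metric.

```python
from typing import List, Sequence, Tuple


def accuracy_by_uncertainty(
    uncertainties: Sequence[float],
    predictions: Sequence[int],
    labels: Sequence[int],
) -> Tuple[float, float]:
    """Split examples at the median uncertainty.

    Returns (accuracy of the low-uncertainty half,
             accuracy of the high-uncertainty half).
    """
    if len(uncertainties) < 2:
        raise ValueError("need at least two examples to form two groups")
    order: List[int] = sorted(
        range(len(uncertainties)), key=lambda i: uncertainties[i]
    )
    half = len(order) // 2
    low, high = order[:half], order[half:]

    def accuracy(indices: List[int]) -> float:
        correct = sum(predictions[i] == labels[i] for i in indices)
        return correct / len(indices)

    return accuracy(low), accuracy(high)


# Made-up values for six adapted signals.
uncertainties = [0.05, 0.10, 0.60, 0.65, 0.08, 0.55]
predictions = [1, 0, 1, 0, 1, 0]
labels = [1, 0, 0, 0, 1, 1]
low_acc, high_acc = accuracy_by_uncertainty(uncertainties, predictions, labels)
print(low_acc, high_acc)
```

If the framework behaves as claimed, the low-uncertainty group should be classified more accurately than the high-uncertainty group; a gap near zero would count against it.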

Figures

Figures reproduced from arXiv: 2604.17480 by Ciaran Bench.

Figure 1. Example PPG time series (noisy, denoised, and the ground truth non-augmented). Top two rows are AF examples. Negative values of the GAN …
Figure 2. Per-class reliability diagrams for each test set. a) shows the reliability diagram for unaugmented time series, b) shows the same plot for noise augmented …
Figure 3. Scatterplot of noisy and denoised predictive entropy. Moderate …

Original abstract

In principle, deep generative models can be used to perform domain adaptation; i.e. align the input feature representations of test data with that of a separate discriminative model's training data. This can help improve the discriminative model's performance on the test data. However, generative models are prone to producing hallucinations and artefacts that may degrade the quality of generated data, and therefore, predictive performance when processed by the discriminative model. While uncertainty quantification can provide a means to assess the quality of adapted data, the standard framework for evaluating the quality of predicted uncertainties may not easily extend to generative models due to the common lack of ground truths (among other reasons). Even with ground truths, this evaluation is agnostic to how the generated outputs are used on the downstream task, limiting the extent to which the uncertainty reliability analysis provides insights about the utility of the uncertainties with respect to the intended use case of the adapted examples. Here, we describe how decision-theoretic uncertainty quantification can address these concerns and provide a convenient framework for evaluating the trustworthiness of generated outputs, in particular, for domain adaptation. We consider a case study in photoplethysmography time series denoising for Atrial Fibrillation classification. This formalises a well-known heuristic method of using a downstream classifier to assess the quality of generated outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes using decision-theoretic uncertainty quantification to assess the trustworthiness of outputs from deep generative models performing domain adaptation on photoplethysmography (PPG) time series, with a case study on denoising for atrial fibrillation (AF) classification. It formalizes the heuristic of routing generated signals through a fixed downstream classifier to evaluate quality in the absence of ground truths, addressing limitations of standard UQ methods that are task-agnostic and require ground truths.

Significance. If the framework is shown to reliably detect trustworthiness issues, it could provide a practical, task-aligned method for evaluating generative domain adaptation in medical signal processing where ground truth is unavailable, potentially improving deployment of wearable PPG analysis systems.

major comments (2)
  1. [Abstract] The central construction formalizes routing generated PPG outputs through a fixed AF classifier and using decision-theoretic scores for trustworthiness. However, this assumes classifier accuracy is a sufficient statistic for output quality; the manuscript does not derive or ablate whether artefacts (phase jitter, spurious harmonics) that preserve AF classification but degrade clinical signal fidelity are captured by the proxy.
  2. [Case study] No independent check or sensitivity analysis is provided to confirm that the decision-theoretic reframing supplies a valid proxy when the generator introduces non-stationary distortions tolerated by the AF detector but rejected by clinicians, which is the exact gap the motivation identifies in standard UQ.
minor comments (1)
  1. [Abstract] The description of the approach is clear at a high level but would benefit from naming the specific decision-theoretic scores (e.g., expected utility or regret) employed in the framework.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. These observations help sharpen the presentation of the assumptions underlying our decision-theoretic uncertainty quantification framework for generative domain adaptation. We respond to each major comment below, indicating where revisions will be made to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central construction formalizes routing generated PPG outputs through a fixed AF classifier and using decision-theoretic scores for trustworthiness. However, this assumes classifier accuracy is a sufficient statistic for output quality; the manuscript does not derive or ablate whether artefacts (phase jitter, spurious harmonics) that preserve AF classification but degrade clinical signal fidelity are captured by the proxy.

    Authors: We appreciate the referee drawing attention to this key assumption. Our framework deliberately ties trustworthiness evaluation to downstream task utility through decision theory, so that uncertainty scores reflect expected loss in AF classification performance rather than generic signal quality. Artefacts that leave classification accuracy unchanged are therefore treated as not degrading task-specific trustworthiness, which is consistent with the paper's focus on task-aligned evaluation in the absence of ground truth. We acknowledge, however, that such artefacts could still affect clinical signal fidelity outside the AF classification task. In the revised version we will update the abstract to state this scope explicitly and add a short discussion paragraph on the proxy's limitations with respect to broader clinical fidelity. revision: yes

  2. Referee: [Case study] No independent check or sensitivity analysis is provided to confirm that the decision-theoretic reframing supplies a valid proxy when the generator introduces non-stationary distortions tolerated by the AF detector but rejected by clinicians, which is the exact gap the motivation identifies in standard UQ.

    Authors: The case study applies the decision-theoretic reframing to PPG denoising for AF classification and shows that the resulting uncertainty scores track classification outcomes. The motivation of the work is precisely to move beyond task-agnostic, ground-truth-dependent UQ by anchoring evaluation in downstream utility; distortions that the AF classifier tolerates therefore do not increase expected utility loss under our formulation. We agree that an explicit sensitivity analysis would provide additional reassurance. We will therefore add such an analysis to the revised case-study section, including controlled introduction of phase jitter and spurious harmonics and reporting how the uncertainty scores respond when these distortions are or are not tolerated by the classifier. revision: yes
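
A controlled distortion sweep of the kind promised in the second response could start from something like the sketch below. It is an assumption-laden illustration: a pure sinusoid stands in for a PPG pulse, phase jitter is modelled as independent Gaussian noise on each sample's phase, the spurious harmonic is placed at three times the fundamental, and the names `distorted_pulse` and `rms_error` are hypothetical.

```python
import math
import random
from typing import List, Sequence


def distorted_pulse(
    n: int,
    fs: float,
    f0: float,
    jitter_std: float = 0.0,
    harmonic_amp: float = 0.0,
    seed: int = 0,
) -> List[float]:
    """Sinusoidal stand-in for a pulse wave with optional distortions.

    jitter_std   : std. dev. (radians) of Gaussian noise added to the phase
    harmonic_amp : amplitude of an added third harmonic
    """
    rng = random.Random(seed)  # seeded so each sweep level is reproducible
    samples: List[float] = []
    for k in range(n):
        phase = 2.0 * math.pi * f0 * (k / fs) + rng.gauss(0.0, jitter_std)
        samples.append(math.sin(phase) + harmonic_amp * math.sin(3.0 * phase))
    return samples


def rms_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Root-mean-square difference between two equal-length signals."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


clean = distorted_pulse(128, fs=64.0, f0=1.0)
for jitter in (0.0, 0.1, 0.3):
    variant = distorted_pulse(128, fs=64.0, f0=1.0, jitter_std=jitter, seed=1)
    print(jitter, round(rms_error(clean, variant), 3))
```

Each variant would then be passed through the fixed AF classifier, and the resulting uncertainty scores compared against the distortion level to see which distortions the proxy registers and which it ignores.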

Circularity Check

0 steps flagged

Conceptual framework formalizes known heuristic without reducing to inputs by construction

full rationale

The paper proposes decision-theoretic uncertainty quantification as a framework for assessing trustworthiness of generative domain adaptation outputs in PPG denoising for AF classification. It explicitly acknowledges formalizing a well-known heuristic of routing outputs through a downstream classifier, but presents no equations, fitted parameters, self-citations, or derivations that reduce the central claim to its own inputs. The contribution is a reframing of evaluation rather than a mathematical chain that is tautological or self-referential, leaving the analysis self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The abstract relies on standard machine learning assumptions about generative models performing domain alignment and uncertainty providing quality signals, without introducing new free parameters, invented entities, or ad-hoc axioms beyond domain assumptions.

axioms (2)
  • domain assumption: Deep generative models can align input feature representations of test data with training data of a discriminative model.
    Stated as the principle enabling domain adaptation in the first sentence of the abstract.
  • domain assumption: Uncertainty quantification can assess quality of generated data despite lack of ground truths.
    Invoked when discussing limitations of standard evaluation frameworks for generative models.


