PPG-Based Affect Recognition with Long-Range Deep Models: A Measurement-Driven Comparison of CNN, Transformer, and Mamba Architectures
Pith reviewed 2026-05-07 16:26 UTC · model grok-4.3
The pith
CNNs achieve the highest accuracy with the smallest model size for PPG-based recognition of arousal, valence, and relaxation, while Transformers and Mamba perform comparably but offer no consistent gains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a uniform evaluation protocol, the CNN baseline reached the top accuracy across the arousal, valence, and relaxation tasks with the smallest model size. The Transformer and Mamba models delivered comparable performance but did not consistently outperform it, although Transformers showed a better balance of F1 scores for arousal and relaxation.
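The claim separates "top accuracy" from "F1 balance", and the two can diverge when classes are imbalanced. The sketch below illustrates that on made-up labels. It assumes binary labels and uses macro-averaged F1 as one plausible reading of "balance"; the paper's exact metric definition is not given here, and all data and function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Made-up example: 8 "low" segments (0) and 2 "high" segments (1).
y_true = [0] * 8 + [1] * 2
pred_a = [0] * 10                      # always predicts the majority class
pred_b = [0] * 5 + [1] * 3 + [1] * 2   # finds both positives, 3 false alarms

for name, pred in (("A", pred_a), ("B", pred_b)):
    print(name, round(accuracy(y_true, pred), 3), round(macro_f1(y_true, pred), 3))
```

In this toy case model A has the higher accuracy while model B has the higher macro F1, which is the kind of split the core claim describes between the CNN and the Transformer.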
What carries the argument
A measurement-driven head-to-head comparison of CNN, CNN-LSTM, Transformer, and Mamba architectures on wrist PPG signals under identical preprocessing and subject-independent 5-fold cross-validation.
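A subject-independent split means every segment from a given participant lands in exactly one test fold, so no model is tested on a person it saw during training. A minimal sketch of such a split follows. The function name, the round-robin assignment of subjects to folds, and the toy data are assumptions for illustration, not the authors' implementation.

```python
import random


def subject_independent_folds(subject_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which no subject is shared.

    subject_ids[i] is the participant that segment i was recorded from.
    """
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(subjects)
    # Assign whole subjects (not individual segments) to folds, round-robin.
    fold_of = {s: i % n_folds for i, s in enumerate(subjects)}
    for k in range(n_folds):
        test = [i for i, s in enumerate(subject_ids) if fold_of[s] == k]
        train = [i for i, s in enumerate(subject_ids) if fold_of[s] != k]
        yield train, test


# Toy data: 10 hypothetical subjects with 4 segments each.
subject_ids = [s for s in range(10) for _ in range(4)]
folds = list(subject_independent_folds(subject_ids))

for train, test in folds:
    train_subjects = {subject_ids[i] for i in train}
    test_subjects = {subject_ids[i] for i in test}
    print(sorted(test_subjects), "overlap:", train_subjects & test_subjects)
```

Splitting at the segment level instead would let overlapping or adjacent windows from one person appear on both sides of the split, which is the leakage the protocol is meant to rule out.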
Load-bearing premise
That the datasets, preprocessing, and segmentation steps create a neutral testbed that does not systematically favor convolutional layers over attention or state-space mechanisms.
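Two quantities help make this premise concrete: how many training windows a recording yields at a given segment length, and how many input samples a stack of convolutions can see at once. The sketch below computes both. The 20 s and 40 s window lengths are those mentioned in the simulated rebuttal further down; the 64 Hz sampling rate, the 300 s recording, and the three-layer convolution stack are hypothetical values chosen only for illustration.

```python
def count_windows(n_samples, fs_hz, window_s, overlap=0.0):
    """Number of full windows of `window_s` seconds in a recording."""
    win = int(round(window_s * fs_hz))
    step = max(1, int(round(win * (1.0 - overlap))))
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // step


def receptive_field(layers):
    """Receptive field (in input samples) of stacked 1-D conv/pool layers.

    `layers` is a list of (kernel_size, stride) tuples, no dilation.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the view by (k - 1) * jump
        jump *= stride              # strides compound the spacing between outputs
    return rf


FS_HZ = 64                 # assumed sampling rate, not taken from the paper
N_SAMPLES = 300 * FS_HZ    # a hypothetical 5-minute recording

for window_s in (20, 40):
    print(window_s, "s windows:", count_windows(N_SAMPLES, FS_HZ, window_s))

stack = [(7, 2)] * 3       # hypothetical: three conv layers, kernel 7, stride 2
rf = receptive_field(stack)
print("receptive field:", rf, "samples =", round(rf / FS_HZ, 2), "s")
```

Doubling the window roughly halves the number of non-overlapping training segments, and whether a CNN covers the whole window depends on its depth and strides. Both effects bear on whether the testbed is neutral across architectures.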
What would settle it
A new experiment on a larger, more varied PPG dataset showing Transformer or Mamba models with reliably higher accuracy and F1 scores across all three affect tasks would falsify the claim that CNNs remain preferable.
Original abstract
Photoplethysmography (PPG) is increasingly used in wearable affective computing due to its low cost and ease of integration into consumer devices. Recent advances in deep learning have introduced long-range sequence models, such as Transformers, and state-space models, like Mamba, which have demonstrated strong performance on natural language and general time-series tasks. However, it remains unclear whether these architectures offer tangible benefits over widely used Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTMs) for PPG-based affect recognition, given that datasets are typically small and noisy. This work presents a measurement-driven comparison of four deep learning architectures, CNN, CNN-LSTM hybrid, Transformers, and Mamba, for classifying arousal, valence, and relaxation states from wrist-based PPG signals. All models are evaluated under a subject-independent 5-fold cross-validation protocol using identical preprocessing, segmentation, and training pipelines. Our results show that the Transformer and Mamba models achieve performance comparable to that of a CNN baseline, but do not consistently outperform it across all tasks. CNNs remain the most effective overall, providing the highest accuracy with the smallest model size, whereas Transformers have a better balance of F1 scores for Arousal and Relaxation. The study provides the first evaluation of Transformer and Mamba models for PPG-based affect recognition, offering practical guidance on model selection for wearable affective monitoring systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a comparative empirical study of four deep learning architectures—CNN, CNN-LSTM hybrid, Transformer, and Mamba—for classifying arousal, valence, and relaxation from wrist-based PPG signals. All models are trained and evaluated under an identical subject-independent 5-fold cross-validation protocol with shared preprocessing and segmentation pipelines. The central claim is that Transformer and Mamba achieve performance comparable to the CNN baseline but do not consistently outperform it across tasks; CNNs are reported as most effective overall due to highest accuracy and smallest model size, while Transformers show a better F1-score balance for arousal and relaxation. The work positions itself as the first evaluation of Transformer and Mamba models in PPG-based affect recognition.
Significance. If the results hold under a fair comparison, the paper provides practical guidance for model selection in resource-constrained wearable affective computing, where small noisy datasets are common. It usefully demonstrates that long-range sequence models do not automatically deliver gains over simpler CNNs in this domain, helping prioritize efficiency. The consistent multi-architecture evaluation protocol is a strength for benchmarking, though the significance is tempered by the need to confirm the pipeline does not inadvertently favor local-feature models.
major comments (2)
- [§3 (Methods)] Preprocessing and segmentation: The shared pipeline is presented as architecture-neutral, but there is no ablation or justification for the fixed segment length (typically 10–30 s for PPG affect tasks) relative to the receptive-field or state-space requirements of Transformer and Mamba. Because PPG signals are quasi-periodic and affect information often involves low-frequency trends that bandpass filtering or short windows can suppress, this setup risks systematically limiting the long-range models' advantages, directly undermining the load-bearing claim that CNNs are intrinsically most effective.
- [§4 (Results)] Performance tables: The reported accuracy and F1 scores lack statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank across the 5 folds) or confidence intervals. Without these, it is impossible to assess whether differences such as CNN's highest accuracy are reliable or attributable to fold variability on small noisy datasets, weakening the cross-task superiority conclusion.
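One practical constraint on the Wilcoxon option is worth spelling out: with only five paired folds, the exact two-sided signed-rank test cannot produce a p-value below 2/32 = 0.0625, even when one model wins every fold. The sketch below enumerates the null distribution to show this. It assumes no zero differences and no tied absolute differences, and the per-fold differences are made-up numbers, not results from the paper.

```python
from itertools import product


def wilcoxon_exact_two_sided(diffs):
    """Exact two-sided Wilcoxon signed-rank p-value.

    Assumes no zero differences and no ties among |diffs|.
    """
    n = len(diffs)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    total = n * (n + 1) // 2
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_obs = min(w_plus, total - w_plus)
    # Under H0 every one of the 2^n sign assignments is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= w_obs:
            extreme += 1
    return extreme / 2 ** n


# Hypothetical per-fold accuracy differences (model A minus model B).
all_wins = [0.021, 0.013, 0.034, 0.008, 0.027]
mixed = [0.021, -0.013, 0.034, 0.008, -0.027]

print(wilcoxon_exact_two_sided(all_wins))
print(wilcoxon_exact_two_sided(mixed))
```

Any significance analysis built on five folds alone therefore has a hard floor for this test at the conventional 0.05 level. Repeated cross-validation or per-subject comparisons would be needed to get past it.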
minor comments (2)
- [Abstract] The claim of 'highest accuracy with the smallest model size' for CNN would be more informative if accompanied by the actual numerical values for accuracy, F1, and parameter counts rather than qualitative statements.
- [§5 (Discussion)] The discussion of practical implications for wearable systems could be strengthened by explicitly addressing how the subject-independent protocol and dataset characteristics limit generalizability to real-world continuous monitoring.
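The request for parameter counts in the abstract comment is cheap to satisfy, because trainable-parameter totals follow directly from layer shapes. The sketch below counts parameters for a 1-D convolution and for a standard Transformer encoder layer (four attention projections, a two-layer feed-forward block, two LayerNorms). All layer sizes are hypothetical and are not the configurations used in the paper.

```python
def conv1d_params(c_in, c_out, kernel, bias=True):
    """Weights plus optional bias of a 1-D convolution."""
    return c_in * c_out * kernel + (c_out if bias else 0)


def linear_params(d_in, d_out, bias=True):
    """Weights plus optional bias of a fully connected layer."""
    return d_in * d_out + (d_out if bias else 0)


def encoder_layer_params(d_model, d_ff):
    """Parameters of one standard Transformer encoder layer."""
    attention = 4 * linear_params(d_model, d_model)  # Q, K, V and output projections
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    layer_norms = 2 * 2 * d_model                    # two norms, scale and shift each
    return attention + feed_forward + layer_norms


# Hypothetical sizes, chosen only to show the order of magnitude.
small_cnn = conv1d_params(1, 32, 7) + conv1d_params(32, 64, 7)
one_encoder_layer = encoder_layer_params(d_model=64, d_ff=256)

print("two conv layers:", small_cnn)
print("one encoder layer:", one_encoder_layer)
```

Note that the head count does not change the attention total, since the heads partition the same projection matrices. Reporting such totals next to accuracy and F1 would let readers judge the size claim directly.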
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our comparative study of CNN, CNN-LSTM, Transformer, and Mamba architectures for PPG-based affect recognition. The comments help clarify the presentation of our standardized evaluation protocol and strengthen the statistical interpretation of results. We address each major comment below and indicate the revisions planned for the next version of the manuscript.
Point-by-point responses
- Referee: [§3 (Methods)] Preprocessing and segmentation: The shared pipeline is presented as architecture-neutral, but there is no ablation or justification for the fixed segment length (typically 10–30 s for PPG affect tasks) relative to the receptive-field or state-space requirements of Transformer and Mamba. Because PPG signals are quasi-periodic and affect information often involves low-frequency trends that bandpass filtering or short windows can suppress, this setup risks systematically limiting the long-range models' advantages, directly undermining the load-bearing claim that CNNs are intrinsically most effective.
  Authors: We selected a fixed 20-second segment length following standard practice in the PPG-based affect recognition literature, where windows in the 10–30 s range are routinely employed to capture relevant autonomic responses while maintaining adequate sample counts for training on small, subject-independent datasets. Within each 20 s window, the Transformer self-attention and Mamba selective state-space mechanisms can still model dependencies across the entire segment, a span substantially longer than the local receptive fields of the CNN baseline, which preserves the intended architecture comparison. Nevertheless, we acknowledge that an explicit ablation across segment lengths would provide further insight into whether longer contexts could confer additional benefits to the long-range models. In the revised manuscript we will (i) add a dedicated paragraph in §3 justifying the segment length with references to prior PPG affect studies, (ii) explicitly discuss the potential limitation for long-range architectures, and (iii) report supplementary results using an alternative 40 s window on the subset of recordings that permit it without excessive sample loss. These additions will clarify that the observed performance ordering holds under the commonly adopted preprocessing regime while transparently noting the boundary condition.
  Revision: partial
- Referee: [§4 (Results)] Performance tables: The reported accuracy and F1 scores lack statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank across the 5 folds) or confidence intervals. Without these, it is impossible to assess whether differences such as CNN's highest accuracy are reliable or attributable to fold variability on small noisy datasets, weakening the cross-task superiority conclusion.
  Authors: We agree that statistical characterization is essential for interpreting differences on small, noisy physiological datasets. In the revised manuscript we will augment all performance tables with (a) 95% confidence intervals computed across the five folds and (b) paired statistical tests (Wilcoxon signed-rank or paired t-test, as appropriate after normality checks) between each model pair for every metric and task. These additions will allow readers to evaluate whether the reported advantage of the CNN in accuracy, as well as the F1 balance observed for the Transformer on arousal and relaxation, reaches statistical significance or remains within the variability expected from five-fold subject-independent splits.
  Revision: yes
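The two promised additions can be sketched in a few lines: a t-based 95% confidence interval over five fold scores, and a paired t statistic for the per-fold differences between two models. The fold accuracies below are made-up numbers and the function names are hypothetical. The critical value 2.776 is the two-sided 95% point of Student's t with 4 degrees of freedom, so the sketch is hard-wired to five folds.

```python
import math
import statistics

T_CRIT_95_DF4 = 2.776  # two-sided 95% critical value, Student's t, df = 4


def mean_ci95_five_folds(values):
    """Mean and t-based 95% confidence interval for exactly five fold scores."""
    if len(values) != 5:
        raise ValueError("critical value is only valid for five folds")
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    half_width = T_CRIT_95_DF4 * sem
    return mean, mean - half_width, mean + half_width


def paired_t_statistic(a, b):
    """t statistic of the per-fold differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    sem = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / sem


# Made-up per-fold accuracies for two models on the same five folds.
model_a = [0.71, 0.68, 0.74, 0.70, 0.72]
model_b = [0.69, 0.67, 0.71, 0.70, 0.70]

mean, lo, hi = mean_ci95_five_folds(model_a)
t_stat = paired_t_statistic(model_a, model_b)
print(round(mean, 3), round(lo, 3), round(hi, 3))
print(round(t_stat, 3), "significant at 0.05:", abs(t_stat) > T_CRIT_95_DF4)
```

Pairing by fold matters here: both models are scored on the same held-out subjects, so the test is run on the differences rather than on two independent samples. With many model pairs, metrics, and tasks, a multiple-comparison correction would also be worth stating.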
Circularity Check
No circularity: purely empirical model comparison with no derivations or self-referential predictions
full rationale
The paper conducts a direct experimental comparison of CNN, CNN-LSTM, Transformer, and Mamba architectures on PPG signals for affect classification. All claims rest on training results under a shared subject-independent 5-fold protocol with identical preprocessing and segmentation. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text or abstract. The central finding (CNNs most effective overall) is an observed outcome of the runs, not a reduction to inputs by construction. This matches the default expectation for non-circular empirical work.
Axiom & Free-Parameter Ledger
free parameters (1)
- Model-specific hyperparameters
axioms (2)
- domain assumption: Subject-independent 5-fold cross-validation prevents leakage and supports generalization claims
- domain assumption: PPG signals after standard preprocessing retain sufficient affective information