Remote Heart Rate Measurement from Highly Compressed Facial Videos: an End-to-end Deep Learning Solution with Video Enhancement

Guoying Zhao; Wei Peng; Xiaobai Li; Xiaopeng Hong; Zitong Yu

arxiv: 1907.11921 · v1 · pith:43KETZDInew · submitted 2019-07-27 · 📡 eess.IV · cs.CV


Zitong Yu, Wei Peng, Xiaobai Li, Xiaopeng Hong, Guoying Zhao

Pith reviewed 2026-05-24 14:41 UTC · model grok-4.3

classification 📡 eess.IV cs.CV

keywords: remote photoplethysmography, rPPG, video compression, heart rate measurement, deep learning, video enhancement, facial video analysis, end-to-end network


The pith

A two-stage neural network recovers heart rate signals from heavily compressed face videos by first restoring lost pulse information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a two-stage, end-to-end deep learning method for measuring heart rate remotely from heavily compressed facial videos. It pairs a Spatio-Temporal Video Enhancement Network, which restores hidden rPPG details, with a dedicated rPPG measurement network, and the two stages can be trained jointly. Experiments on two benchmark datasets show that the combined system outperforms prior approaches on compressed inputs and continues to work when only compressed videos are available for training. A sympathetic reader would care because nearly all everyday video transmission applies compression, so a method that tolerates it could bring contactless heart monitoring into routine remote-healthcare settings.

Core claim

The central claim is that a Spatio-Temporal Video Enhancement Network (STVEN) can recover rPPG information lost to video compression and, when jointly trained with an rPPGNet, enables accurate heart-rate extraction from highly compressed facial videos. The rPPGNet alone already gives robust measurements; adding the jointly trained STVEN further improves results, especially under strong compression. The same pipeline also generalizes to novel data for which only compressed videos are available.
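The page does not say how rPPGNet's output signal is turned into a heart-rate number. A common approach in rPPG work, assumed here rather than taken from the paper, is to pick the strongest spectral peak inside a plausible heart-rate band. The sketch below uses NumPy; the function name `estimate_hr_bpm` and the 0.7–4 Hz band (42–240 bpm) are illustrative assumptions.

```python
import numpy as np


def estimate_hr_bpm(rppg, fps: float, lo_hz: float = 0.7, hi_hz: float = 4.0) -> float:
    """Estimate heart rate (beats per minute) from a 1-D rPPG signal.

    Hypothetical helper: picks the frequency with the largest power inside an
    assumed heart-rate band (default 0.7-4 Hz, i.e. 42-240 bpm).
    """
    x = np.asarray(rppg, dtype=float)
    x = x - x.mean()                          # remove DC so it cannot dominate
    power = np.abs(np.fft.rfft(x)) ** 2       # one-sided power spectrum
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fps)
    band = (freqs >= lo_hz) & (freqs <= hi_hz)
    if not band.any():
        raise ValueError("signal too short to resolve the heart-rate band")
    peak_hz = freqs[band][np.argmax(power[band])]
    return float(peak_hz * 60.0)


# Usage: a synthetic 1.2 Hz (72 bpm) pulse, 10 s at 30 fps, plus mild noise.
rng = np.random.default_rng(0)
t = np.arange(300) / 30.0
signal = np.sin(2 * np.pi * 1.2 * t) + 0.1 * rng.standard_normal(t.size)
print(estimate_hr_bpm(signal, fps=30.0))
```

The frequency resolution is fps divided by the number of samples (0.1 Hz, or 6 bpm, for this 10 s window), so longer windows give finer heart-rate estimates.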

What carries the argument

The Spatio-Temporal Video Enhancement Network (STVEN), which restores hidden rPPG signals before they reach the rPPG measurement network and is trained end-to-end with it.
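To illustrate what "trained end-to-end" means here, the sketch below writes a joint objective as a measurement loss on the rPPG signal plus a weighted reconstruction loss on the enhanced frames. The negative Pearson loss and the L1 term are common choices in rPPG and video-enhancement work, not losses confirmed by this page; `joint_loss` and the weight `lam` are hypothetical.

```python
import numpy as np


def negative_pearson_loss(pred, target) -> float:
    """1 - Pearson correlation between predicted and reference rPPG signals.

    Returns about 0 for perfectly correlated signals and about 2 for
    perfectly anti-correlated ones.
    """
    p = np.asarray(pred, dtype=float)
    g = np.asarray(target, dtype=float)
    p = p - p.mean()
    g = g - g.mean()
    r = np.sum(p * g) / (np.sqrt(np.sum(p ** 2) * np.sum(g ** 2)) + 1e-8)
    return float(1.0 - r)


def l1_reconstruction_loss(enhanced, reference) -> float:
    """Mean absolute error between enhanced frames and high-quality frames."""
    diff = np.asarray(enhanced, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.mean(np.abs(diff)))


def joint_loss(rppg_pred, rppg_gt, enhanced, hq_frames, lam: float = 0.1) -> float:
    """Hypothetical joint objective: measurement loss + weighted enhancement loss.

    In joint training, gradients of the rPPG term would also flow back into the
    enhancement network, pushing it to preserve pulse information rather than
    only visual quality. `lam` is illustrative, not a value from the paper.
    """
    return (negative_pearson_loss(rppg_pred, rppg_gt)
            + lam * l1_reconstruction_loss(enhanced, hq_frames))
```

NumPy computes no gradients, so this only shows the shape of the objective; an autodiff framework would be needed to actually backpropagate the rPPG term through both networks.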

If this is right

The rPPGNet component can be used by itself for robust measurement when enhancement is not needed.
Joint training of the two stages produces the largest gains precisely on the most compressed inputs.
Performance holds when the system is trained and tested on novel data that supplies only compressed videos.
The approach therefore opens the door to real-world remote-healthcare uses where video is always compressed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same recovery idea could be tested on other video-based physiological signals such as respiration rate.
In deployed systems the method might lower the bandwidth needed for remote monitoring without sacrificing accuracy.
Live-stream tests with changing compression rates would reveal whether the enhancement step remains stable under variable network conditions.
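One rough way to see how little information survives heavy compression is to divide the bitrate by the number of pixels coded per second. The resolution, frame rate, and bitrates below are hypothetical and were chosen only to show the arithmetic.

```python
def bits_per_pixel(bitrate_bps: float, width: int, height: int, fps: float) -> float:
    """Average coded bits available per pixel per frame at a given bitrate."""
    return bitrate_bps / (width * height * fps)


# Hypothetical settings: 640x480 video at 30 fps, three target bitrates.
for kbps in (1000, 500, 250):
    bpp = bits_per_pixel(kbps * 1000, 640, 480, 30)
    print(f"{kbps:>5} kbps -> {bpp:.4f} bits per pixel")
```

Halving the bitrate halves the bit budget per pixel. The subtle skin-color changes that carry the pulse must then compete with everything else the encoder spends those bits on.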

Load-bearing premise

That the hidden rPPG information lost to compression can be recovered by the STVEN enhancement network when the two stages are jointly trained.

What would settle it

If the jointly trained system fails to beat a plain rPPGNet on a fresh collection of highly compressed videos that have no high-quality reference pairs (but do have ground-truth heart rates for scoring), the recovery claim would be falsified.
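This falsification test is a paired comparison: both systems are scored on the same compressed clips against ground-truth heart rate. A minimal sketch of that scoring, with hypothetical function names, reports mean absolute error (MAE), root-mean-square error (RMSE), and the per-video win rate of the jointly trained system.

```python
import numpy as np


def hr_errors(pred_bpm, true_bpm):
    """Return (MAE, RMSE) of heart-rate estimates, in bpm."""
    err = np.asarray(pred_bpm, dtype=float) - np.asarray(true_bpm, dtype=float)
    return float(np.mean(np.abs(err))), float(np.sqrt(np.mean(err ** 2)))


def joint_beats_baseline(joint_bpm, baseline_bpm, true_bpm):
    """Compare two systems on the same clips.

    Returns (joint has lower MAE, fraction of clips where the joint system's
    absolute error is strictly smaller than the baseline's).
    """
    truth = np.asarray(true_bpm, dtype=float)
    abs_joint = np.abs(np.asarray(joint_bpm, dtype=float) - truth)
    abs_base = np.abs(np.asarray(baseline_bpm, dtype=float) - truth)
    win_rate = float(np.mean(abs_joint < abs_base))
    return bool(abs_joint.mean() < abs_base.mean()), win_rate
```

A real study would also attach a paired significance test (for example, a sign test or a bootstrap over videos) before declaring the claim supported or falsified.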

Figures

Figures reproduced from arXiv: 1907.11921 by Guoying Zhao, Wei Peng, Xiaobai Li, Xiaopeng Hong, Zitong Yu.

**Figure 2.** Illustration of the overall framework. There are two models in our framework: video quality enhancement model STVEN (left) …

**Figure 3.** Illustration of the skin-based attention module of the …

**Figure 4.** Partition constraints with N = 4. (Text extracted alongside the figure: "…by spatial and channel-wise convolutions with residual connections. As there is no ground truth skin map in related rPPG datasets, we generate the binary labels for each frame by adaptive skin segmentation algorithms [27]. With these binary skin labels, the skin segmentation branch is able to predict high quality skin maps $S \in \mathbb{R}^{T \times H \times W}$. Here we adopt binary cross entropy…")

**Figure 5.** HR measurement on OBF videos at different bitrates: …

**Figure 6.** Performance of video quality enhancement networks.

**Figure 8.** Visualization of model output images. (a) face image in …

**Figure 9.** Predicted rPPG signals (top) and corresponding video …
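The Figure 4 text says the skin segmentation branch is supervised with binary cross entropy against binary labels produced by an adaptive skin segmentation algorithm [27], and Figure 3 shows a skin-based attention module. The NumPy sketch below computes that BCE loss over a T×H×W map. As an assumption about how a skin map can act as attention, it then uses the map to weight a spatial average of features. The array layouts and function names are hypothetical.

```python
import numpy as np


def skin_bce_loss(pred_map: np.ndarray, label_map: np.ndarray, eps: float = 1e-7) -> float:
    """Binary cross entropy between predicted skin probabilities (T x H x W)
    and binary skin labels (same shape)."""
    p = np.clip(pred_map, eps, 1.0 - eps)      # avoid log(0)
    y = label_map.astype(float)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def skin_weighted_pool(features: np.ndarray, skin_map: np.ndarray) -> np.ndarray:
    """Average C x T x H x W features over space, weighted by a T x H x W skin
    map, giving a C x T sequence dominated by skin pixels (hypothetical use)."""
    w = skin_map[np.newaxis]                    # 1 x T x H x W, broadcast over channels
    num = (features * w).sum(axis=(2, 3))       # C x T
    den = w.sum(axis=(2, 3)) + 1e-8             # 1 x T
    return num / den
```

Weighting by the skin map rather than hard-masking keeps the pooling differentiable, which matters if the map is learned jointly with the rest of the network.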

Original abstract

Remote photoplethysmography (rPPG), which aims at measuring heart activities without any contact, has great potential in many applications (e.g., remote healthcare). Existing rPPG approaches rely on analyzing very fine details of facial videos, which are prone to be affected by video compression. Here we propose a two-stage, end-to-end method using hidden rPPG information enhancement and attention networks, which is the first attempt to counter video compression loss and recover rPPG signals from highly compressed videos. The method includes two parts: 1) a Spatio-Temporal Video Enhancement Network (STVEN) for video enhancement, and 2) an rPPG network (rPPGNet) for rPPG signal recovery. The rPPGNet can work on its own for robust rPPG measurement, and the STVEN network can be added and jointly trained to further boost the performance especially on highly compressed videos. Comprehensive experiments are performed on two benchmark datasets to show that, 1) the proposed method not only achieves superior performance on compressed videos with high-quality videos pair, 2) it also generalizes well on novel data with only compressed videos available, which implies the promising potential for real world applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds a jointly trained video enhancement stage to recover rPPG from compressed face videos and claims it generalizes to new compressed inputs, but the results depend on how closely the training compression matches real deployments.


The central claim is that jointly training a spatio-temporal enhancement network with the rPPG estimator lets you recover heart rate signals from videos that have been heavily compressed, and that this works even when you only have compressed videos at test time. What is new is the explicit focus on countering compression loss rather than assuming clean input. Most prior rPPG work either ignores compression or treats it as noise to be filtered after the fact; here the enhancement is learned specifically to preserve the rPPG information. The paper does a reasonable job of laying out the architecture and running experiments on two benchmark datasets. Showing both the case where high-quality pairs are available during training and the case where only compressed data is seen at test time is the practical angle.

The soft spot is the transfer assumption. The method requires paired high-quality and compressed examples for training. If the compression parameters in a new deployment (different codec, bitrate, or GOP structure) do not match those used to create the training pairs, the enhancement network may fail to restore the lost signal and could introduce new distortions. Because the abstract states that the method generalizes well, the results section needs enough variation in the test compressions to support that claim.

This paper is for people who want to run rPPG on real-world video feeds that are already compressed. It deserves a serious referee because it tackles a deployment barrier that many groups will run into, even if the final numbers need careful scrutiny with respect to compression matching.

Referee Report

2 major / 0 minor

Summary. The paper proposes a two-stage end-to-end deep learning pipeline for remote photoplethysmography (rPPG) heart-rate measurement from highly compressed facial videos. It consists of a Spatio-Temporal Video Enhancement Network (STVEN) that is jointly trained with an rPPGNet to recover subtle color-change signals lost to compression; the rPPGNet can also be used standalone. The central claims are (1) superior performance when paired high-quality/compressed training data are available and (2) good generalization to novel compressed-only test videos on two benchmark datasets, with implications for real-world deployment.

Significance. If the recovery and generalization claims hold under realistic compression mismatch, the work would address a practical barrier in contactless vital-sign monitoring, where video streams are routinely compressed. The joint-training architecture and the explicit separation of enhancement and measurement stages are technically interesting strengths; however, the supplied abstract provides no quantitative results, error bars, dataset statistics, or ablation studies, which limits immediate assessment of impact.

major comments (2)

[Abstract] The generalization claim ('generalizes well on novel data with only compressed videos available') is load-bearing for the real-world applicability statement, yet the text supplies no information on how the compression parameters (codec, bitrate, GOP structure) of the training pairs compare with those of the novel test videos. Without such detail or a controlled mismatch experiment, it is impossible to evaluate whether STVEN recovers genuine rPPG components or merely learns dataset-specific artifacts.
[Abstract] The weakest assumption is not addressed: joint training of STVEN + rPPGNet can recover information destroyed by compression only if the paired training distribution matches the test compressions. The manuscript does not report any cross-compression validation (e.g., training on H.264 and testing on VP9 or at different bitrates), which directly undermines the transfer claim.
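A cross-compression study of the kind requested here can be laid out as a grid of encoder settings. The sketch below builds FFmpeg commands without running them, using FFmpeg's `libx264` and `libvpx-vp9` encoders, `-b:v` for the target bitrate, and `-g` for the GOP length. It then enumerates train/test pairs, each labelled by the dimensions that differ. The bitrates, GOP lengths, and file naming are hypothetical and are not the paper's settings.

```python
from itertools import product

# Hypothetical grid. "libx264" and "libvpx-vp9" are FFmpeg encoder names.
CODECS = {"h264": "libx264", "vp9": "libvpx-vp9"}
BITRATES_KBPS = (1000, 500, 250)
GOPS = (30, 120)


def encode_cmd(src: str, codec: str, kbps: int, gop: int) -> list:
    """Build (but do not run) an ffmpeg command for one compression setting."""
    out = f"{src.rsplit('.', 1)[0]}_{codec}_{kbps}k_g{gop}.mkv"
    return ["ffmpeg", "-y", "-i", src, "-c:v", CODECS[codec],
            "-b:v", f"{kbps}k", "-g", str(gop), "-an", out]


def cross_compression_pairs(train_codec: str = "h264"):
    """Yield (train_setting, test_setting, mismatched_dims) triples, where
    mismatched_dims lists which of codec/bitrate/GOP differ."""
    settings = list(product(CODECS, BITRATES_KBPS, GOPS))
    for train in (s for s in settings if s[0] == train_codec):
        for test in settings:
            diff = [name for name, a, b in zip(("codec", "bitrate", "gop"), train, test)
                    if a != b]
            yield train, test, diff
```

Pairs with an empty mismatch list reproduce the matched setting. The remaining pairs isolate transfer across bitrate, GOP structure, codec, or combinations of these.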

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the generalization claims. We address the major comments point-by-point below and will revise the manuscript to improve clarity on experimental details.


Referee: [Abstract] The generalization claim ('generalizes well on novel data with only compressed videos available') is load-bearing for the real-world applicability statement, yet the text supplies no information on how the compression parameters (codec, bitrate, GOP structure) of the training pairs compare with those of the novel test videos. Without such detail or a controlled mismatch experiment, it is impossible to evaluate whether STVEN recovers genuine rPPG components or merely learns dataset-specific artifacts.

Authors: We agree the abstract should specify the compression parameters. The full manuscript (Sections 3.3 and 4.1) details that all videos use the H.264 codec; training pairs are created by compressing original high-quality videos at bitrates of 200-800 kbps, while test videos use the same codec at held-out bitrates (e.g., 100-300 kbps) and different subjects to simulate novel compressed data. We will update the abstract with a concise statement of these settings and add a summary table of codec/bitrate configurations. revision: yes
Referee: [Abstract] The weakest assumption is not addressed: joint training of STVEN + rPPGNet can recover information destroyed by compression only if the paired training distribution matches the test compressions. The manuscript does not report any cross-compression validation (e.g., training on H.264 and testing on VP9 or at different bitrates), which directly undermines the transfer claim.

Authors: The reported experiments evaluate generalization across unseen bitrates and subjects within the H.264 codec, which matches common deployment scenarios where the codec remains fixed. No cross-codec tests (H.264 to VP9) or explicit GOP-structure mismatch experiments appear in the manuscript. We will add a limitations paragraph acknowledging this scope and noting it as valuable future work, while retaining the within-codec results as evidence of robustness to bitrate variation. revision: partial
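The first response describes testing on held-out bitrates and different subjects. One way to make "held-out" strict on both axes is to keep only test clips whose subject and bitrate never appear in training, as in this sketch. The clip tuples, subject IDs, and bitrates are hypothetical.

```python
def held_out_split(clips, test_subjects, test_bitrates):
    """Split (subject_id, bitrate_kbps) clips so the test set contains only
    held-out subjects at held-out bitrates.

    Clips that mix a seen subject with a held-out bitrate (or vice versa) are
    dropped from both sides, so neither axis leaks into training.
    """
    test_subjects, test_bitrates = set(test_subjects), set(test_bitrates)
    train, test = [], []
    for subject, kbps in clips:
        subject_held = subject in test_subjects
        bitrate_held = kbps in test_bitrates
        if subject_held and bitrate_held:
            test.append((subject, kbps))
        elif not subject_held and not bitrate_held:
            train.append((subject, kbps))
    return train, test


# Usage with hypothetical subjects and bitrates.
clips = [(s, b) for s in ("s1", "s2", "s3", "s4") for b in (800, 400, 200, 100)]
train, test = held_out_split(clips, {"s4"}, {100})
print(len(train), test)
```

Dropping the mixed clips costs data, but it lets the test score be read as a measure of generalization to both a new person and a new compression level at once.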

Circularity Check

0 steps flagged

No circularity: performance claims rest on external benchmark training and evaluation, not self-referential definitions or fitted inputs.


The paper describes a two-stage neural architecture (STVEN + rPPGNet) trained end-to-end on paired high-quality/compressed video data from external benchmarks. All reported performance numbers and generalization statements are empirical outcomes of that training process rather than quantities derived by algebraic reduction from the model's own parameters or prior self-citations. No equations, uniqueness theorems, or ansatzes are presented that would make any claimed result tautological with its inputs. The central claim therefore remains falsifiable against held-out data and does not reduce to a self-definition or fitted-input prediction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard supervised deep learning assumptions plus the existence of paired high-quality and compressed training videos; the networks themselves introduce large numbers of fitted parameters.

free parameters (1)

STVEN and rPPGNet weights
Millions of parameters in the convolutional and attention layers are fitted during end-to-end training on the benchmark datasets.
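To give a sense of scale for "millions of parameters", a 3-D convolution has kt·kh·kw·C_in·C_out weights plus C_out biases. The layer stack below is hypothetical and is not the STVEN or rPPGNet configuration. It only shows how a handful of spatio-temporal layers reaches several million fitted parameters.

```python
def conv3d_params(c_in: int, c_out: int, kt: int, kh: int, kw: int, bias: bool = True) -> int:
    """Learnable parameters of one 3-D convolution layer."""
    return kt * kh * kw * c_in * c_out + (c_out if bias else 0)


# Hypothetical stack of (c_in, c_out, (kt, kh, kw)); not the paper's architecture.
layers = [(3, 64, (1, 5, 5)), (64, 128, (3, 3, 3)), (128, 128, (3, 3, 3)),
          (128, 256, (3, 3, 3)), (256, 256, (3, 3, 3))]
total = sum(conv3d_params(ci, co, *k) for ci, co, k in layers)
print(f"{total:,} parameters")
```

Because the count scales with C_in·C_out, the widest layers dominate. In this stack the final 256-to-256 layer alone accounts for more than half of the total.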

axioms (1)

Domain assumption: Deep neural networks can recover rPPG-relevant features from compressed video after spatio-temporal enhancement.
This is the core premise that justifies adding the STVEN stage before rPPGNet.

pith-pipeline@v0.9.0 · 5769 in / 1298 out tokens · 32809 ms · 2026-05-24T14:41:49.633351+00:00 · methodology


Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

[1] F. Bellard, M. Niedermayer, et al. FFmpeg. [Online]. Available: http://ffmpeg.org.

[2] S. Chaichulee, M. Villarroel, J. Jorge, C. Arteta, G. Green, K. McCormick, A. Zisserman, and L. Tarassenko. Multi-task convolutional neural network for patient detection and skin segmentation in continuous non-contact vital sign monitoring. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 266–

[3] W. Chen and D. McDuff. DeepPhys: Video-based physiological measurement using convolutional attention networks. In ECCV, 2018.

[4] G. de Haan and V. Jeanne. Robust pulse rate from chrominance-based rPPG. IEEE Trans. Biomed. Eng., 60(10):2878–2886, 2013.

[5] C. Dong, Y. Deng, C. Change Loy, and X. Tang. Compression artifacts reduction by a deep convolutional network. In Proceedings of the IEEE International Conference on Computer Vision, pages 576–584, 2015.

[6] L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo. Deep generative adversarial compression artifact removal. In ICCV, 2017.

[7] S. Hanfland and M. Paul. Video format dependency of PPGI signals. In Proceedings of the International Conference on Electrical Engineering, 2016.

[8] ITU-T. Rec. H.262 - Information technology - Generic coding of moving pictures and associated audio information: Video. International Telecommunication Union Telecommunication Standardization Sector (ITU-T), Tech. Rep., 1995.

[9] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer,

[10] A. Lam and Y. Kuno. Robust heart rate measurement from video using select random patches. In Proceedings of the IEEE International Conference on Computer Vision, pages 3640–3648, 2015.

[11] X. Li, I. Alikhani, J. Shi, T. Seppanen, J. Junttila, K. Majamaa-Voltti, M. Tulppo, and G. Zhao. The OBF database: A large face video database for remote physiological signal measurement and atrial fibrillation detection. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 242–249. IEEE, 2018.

[12] X. Li, J. Chen, G. Zhao, and M. Pietikäinen. Remote heart rate measurement from face videos under realistic situations. In CVPR, 2014.

[13] D. Liu, B. Wen, X. Liu, Z. Wang, and T. S. Huang. When image denoising meets high-level vision tasks: A deep learning approach. In IJCAI, 2018.

[14] D. McDuff. Deep super resolution for recovering physiological information from videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1367–1374, 2018.

[15] D. J. McDuff, E. B. Blackford, and J. R. Estepp. The impact of video compression on remote cardiac pulse measurement using imaging photoplethysmography. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 63–70. IEEE, 2017.

[16] X. Niu, H. Han, S. Shan, and X. Chen. SynRhythm: Learning a deep heart rate estimator from general to specific. In ICPR,

[17] M.-Z. Poh, D. J. McDuff, and R. W. Picard. Non-contact, automated cardiac pulse measurements using video imaging and blind source separation. Opt. Express, 18(10):10762–10774, 2010.

[18] M.-Z. Poh, D. J. McDuff, and R. W. Picard. Advancements in noncontact, multiparameter physiological measurements using a webcam. IEEE Trans. Biomed. Eng., 58(1):7–11,

[19] N. Ponomarenko, F. Silvestri, K. Egiazarian, M. Carli, J. Astola, and V. Lukin. On between-coefficient contrast masking of DCT basis functions. In Proceedings of the Third International Workshop on Video Processing and Quality Metrics, volume 4, 2007.

[20] A. Puri and A. Eleftheriadis. MPEG-4: An object-based multimedia coding standard supporting mobile applications. Mobile Networks and Applications, 3(1):5–32, 1998.

[21] J. Shi, I. Alikhani, X. Li, Z. Yu, T. Seppänen, and G. Zhao. Atrial fibrillation detection from face videos by fusing subtle variations. IEEE Transactions on Circuits and Systems for Video Technology, DOI 10.1109/TCSVT.2019.2926632, 2019.

[22] M. Soleymani, J. Lichtenauer, T. Pun, and M. Pantic. A multimodal database for affect recognition and implicit tagging. IEEE Transactions on Affective Computing, 3(1):42–55, 2012.

[23] R. Špetlík, J. Cech, and J. Matas. Non-contact reflectance photoplethysmography: Progress, limitations, and myths. In Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE International Conference on, pages 702–709. IEEE, 2018.

[24] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[25] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand. Overview of the High Efficiency Video Coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology, 22(12):1649–1668, 2012.

[26] C. Tang, J. Lu, and J. Liu. Non-contact heart rate monitoring by combining convolutional neural network skin detection and remote photoplethysmography via a low-cost camera. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1309–1315,

[27] M. J. Taylor and T. Morris. Adaptive skin segmentation via feature-based face detection. In Real-Time Image and Video Processing 2014, volume 9139, page 91390P. International Society for Optics and Photonics, 2014.

[28] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.

[29] S. Tulyakov, X. Alameda-Pineda, E. Ricci, L. Yin, J. F. Cohn, and N. Sebe. Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions. In CVPR, 2016.

[30] W. Verkruysse, L. O. Svaasand, and J. S. Nelson. Remote plethysmographic imaging using ambient light. Opt. Express, 16(26):21434–21445, Dec 2008.

[31] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, page 511. IEEE, 2001.

[32] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3360–3367. Citeseer,

[33] W. Wang, A. C. den Brinker, S. Stuijk, and G. de Haan. Algorithmic principles of remote PPG. IEEE Transactions on Biomedical Engineering, 64(7):1479–1491, 2017.

[34] W. Wang, S. Stuijk, and G. de Haan. A novel algorithm for remote photoplethysmography: Spatial subspace rotation. IEEE Trans. Biomed. Eng., 63(9):1974–1984, 2016.

[35] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, 2003.

[36] R. Yang, M. Xu, Z. Wang, and T. Li. Multi-frame quality enhancement for compressed video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6664–6673, 2018.

[37] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.

[38] C. Zhao, C.-L. Lin, W. Chen, and Z. Li. A novel framework for remote photoplethysmography pulse extraction on compressed videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1299–1308, 2018.

[39] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.

[40] R. Špetlík, V. Franc, and J. Matas. Visual heart rate estimation with convolutional neural network. In BMVC, 2018.
