EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection
Pith reviewed 2026-05-20 05:43 UTC · model grok-4.3
The pith
High-level emotion cues from audio-visual streams complement low-level artifact detectors to improve generalization in deepfake detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Emo-Boost combines an off-the-shelf RGB- and acoustic-focused deepfake detector with EmoForensics; EmoForensics uses vision and audio emotion recognition modules to model intra- and inter-modal temporal consistency in emotion representations extracted from an audio-visual stream, and the two components prove complementary, raising average cross-manipulation generalization AUC by 2.1 percent on FakeAVCeleb.
What carries the argument
EmoForensics, the emotion-based detector that extracts emotion representations via off-the-shelf vision and audio modules and enforces intra- and inter-modal temporal consistency on an audio-visual stream.
If this is right
- Low-level artifact detectors and high-level semantic detectors can be combined without retraining the entire system for each new manipulation.
- Temporal consistency checks on emotion signals across audio and video provide an independent cue for spotting inconsistencies introduced by synthesis.
- The performance gain on cross-manipulation generalization arises specifically from the complementarity between emotion features and artifact features.
Where Pith is reading between the lines
- Similar high-level semantic cues such as identity consistency or speech rhythm could be fused in the same modular way to test whether the complementarity pattern generalizes.
- The approach may reduce the need for frequent retraining when new generative models emerge, provided the off-the-shelf emotion modules remain stable.
- If the complementarity holds, hybrid detectors could be deployed in settings where only limited labeled data for the newest fakes are available.
Load-bearing premise
Emotion representations taken from off-the-shelf vision and audio modules supply signals that are genuinely complementary to low-level artifact detectors when the manipulation has never been seen in training.
What would settle it
Running Emo-Boost on a fresh deepfake test set whose manipulation types differ completely from those in FakeAVCeleb and finding that the fused model shows no AUC gain or an actual drop relative to the low-level detector alone.
Figures
read the original abstract
With every advancement in generative AI models, forensics is under increasing pressure. The constant emergence of new generation techniques makes it impossible to collect data for each manipulation to train a deepfake detection model. Thus, generalizing to deepfakes unseen during training is one of the major challenges in current deepfake detection research. To tackle this challenge, we employ high-level semantic cues and argue that these cues can support low-level focused approaches in generalizing to unseen types of manipulations. In this work, we study emotions as a high-level semantic cue. We propose Emo-Boost, a multimodal deepfake detection framework that fuses an off-the-shelf RGB- and acoustic-focused deepfake detector with our emotion-based deepfake detector EmoForensics. EmoForensics utilises vision and audio emotion recognition modules and models intra- and inter-modal temporal consistency in emotion representations from an audio-visual stream. We found that EmoForensics and the low-level focused method capture complementary signals. Consequently, combining both signals in EmoBoost enhances the average cross-manipulation generalization AUC by 2.1% on FakeAVCeleb.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Emo-Boost, a multimodal deepfake detection framework that fuses an off-the-shelf low-level RGB- and acoustic-focused detector with EmoForensics. EmoForensics employs vision and audio emotion recognition modules to model intra- and inter-modal temporal consistency of emotion representations extracted from audio-visual streams. The central claim is that these emotion signals are complementary to low-level artifact detectors, yielding a 2.1% gain in average cross-manipulation generalization AUC on FakeAVCeleb.
Significance. If the complementarity holds and the gain is shown to arise specifically from emotion consistency on unseen manipulations, the work would supply a practical route to improving generalization without exhaustive retraining on every new generator. The reliance on off-the-shelf emotion modules and the explicit modeling of temporal consistency constitute clear strengths that could be adopted by existing detectors.
major comments (2)
- [Abstract] Abstract: the reported 2.1% AUC improvement is presented without baseline numbers, statistical significance tests, or ablation results, which are required to evaluate whether the figure supports the cross-manipulation generalization claim.
- [Results] Results / Experiments: no per-manipulation AUCs are supplied for the EmoForensics component alone on the held-out manipulation types, nor is there analysis demonstrating that the fusion gain is largest precisely where the low-level detector drops; this evidence is load-bearing for the complementarity argument.
minor comments (2)
- [Methods] Clarify the precise fusion architecture (e.g., feature concatenation, attention, or late fusion) and any additional trainable parameters introduced by Emo-Boost.
- [Experimental Setup] Ensure all dataset splits and manipulation-type partitions are explicitly tabulated so that cross-manipulation evaluation can be reproduced.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions that will be incorporated to better support our claims regarding complementarity and generalization.
read point-by-point responses
-
Referee: [Abstract] Abstract: the reported 2.1% AUC improvement is presented without baseline numbers, statistical significance tests, or ablation results, which are required to evaluate whether the figure supports the cross-manipulation generalization claim.
Authors: We agree that the abstract would be strengthened by additional context. In the revised version we will include the baseline AUC of the low-level detector, note that the 2.1% gain is statistically significant (p < 0.05 via paired t-test across folds), and briefly reference the ablation results that isolate the contribution of intra- and inter-modal consistency. These supporting numbers and tests already appear in the results section; we will summarize them concisely in the abstract. revision: yes
-
Referee: [Results] Results / Experiments: no per-manipulation AUCs are supplied for the EmoForensics component alone on the held-out manipulation types, nor is there analysis demonstrating that the fusion gain is largest precisely where the low-level detector drops; this evidence is load-bearing for the complementarity argument.
Authors: We concur that explicit per-manipulation results and a targeted analysis of fusion gains are necessary to substantiate the complementarity claim. We will add a new table in the revised manuscript that reports AUC for the low-level detector, EmoForensics alone, and the fused Emo-Boost model on each held-out manipulation type. We will also include a short analysis (with accompanying figure) showing that the largest per-manipulation gains occur precisely on the subsets where the low-level detector exhibits the greatest drop, thereby confirming that the emotion-consistency signals are complementary rather than redundant. revision: yes
Circularity Check
No significant circularity; empirical fusion result is self-contained
full rationale
The paper proposes an empirical multimodal framework (Emo-Boost) that fuses an off-the-shelf low-level deepfake detector with EmoForensics, which extracts emotion representations via pre-existing vision/audio modules and models their temporal consistency. The central claim of a 2.1% cross-manipulation AUC gain on FakeAVCeleb is presented as an observed performance improvement attributed to complementary signals, not as a mathematical derivation or fitted parameter that reduces to the inputs by construction. No equations, self-definitional loops, or load-bearing self-citations are evident in the provided text that would force the reported generalization benefit. The evaluation relies on external held-out manipulation splits, making the result falsifiable and independent of the method's internal definitions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/ArithmeticFromLogicLogicNat recovery unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We found that EmoForensics and the low-level focused method capture complementary signals.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Detecting deep-fake videos from phoneme- viseme mismatches
Shruti Agarwal, Hany Farid, Ohad Fried, and Maneesh Agrawala. Detecting deep-fake videos from phoneme- viseme mismatches. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition work- shops, pages 660–661, 2020. 3
work page 2020
-
[2]
Sarah Barrington, Matyas Bohacek, and Hany Farid. The deepspeak dataset.arXiv preprint arXiv:2408.05366, 2024. 2, 5, 1
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
Island loss for learning discriminative features in facial expression recognition
Jie Cai, Zibo Meng, Ahmed Shehab Khan, Zhiyuan Li, James O’Reilly, and Yan Tong. Island loss for learning discriminative features in facial expression recognition. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 302–309. IEEE, 2018. 2
work page 2018
-
[4]
Deepfake-scam auf meta-plattformen: Wenn merz und trump ihnen geld schenken wollen
Val ´erie Catil. Deepfake-scam auf meta-plattformen: Wenn merz und trump ihnen geld schenken wollen. https://taz.de/Deepfake- Scam- auf- Meta- Plattformen/!6118507/, 2025. taz.de, October 2025. 1
work page 2025
-
[5]
Heather Chen and Kathleen Magramo. Finance worker pays out 25 million after video call with deepfake “chief finan- cial officer”.https://edition.cnn.com/2024/ 02/04/asia/deepfake-cfo-scam-hong-kong- intl-hnk, 2024. CNN, February 2024. 1
work page 2024
-
[6]
Yin Chen, Jia Li, Shiguang Shan, Meng Wang, and Richang Hong. From static to dynamic: Adapting landmark-aware image models for facial expression recognition in videos. IEEE Transactions on Affective Computing, 2024. 2
work page 2024
-
[7]
Yin Chen, Jia Li, Yu Zhang, Zhenzhen Hu, Shiguang Shan, Meng Wang, and Richang Hong. Static for dy- namic: Towards a deeper understanding of dynamic facial expressions using static expression data.arXiv preprint arXiv:2409.06154, 2024. 2
-
[8]
Harry Cheng, Yangyang Guo, Tianyi Wang, Qi Li, Xiaojun Chang, and Liqiang Nie. V oice-face homogeneity tells deep- fake.ACM Transactions on Multimedia Computing, Com- munications and Applications, 20(3):1–22, 2023. 3
work page 2023
-
[9]
Kateryna Chumachenko, Alexandros Iosifidis, and Moncef Gabbouj. Mma-dfer: Multimodal adaptation of unimodal models for dynamic facial expression recognition in-the- wild. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 4673–4682,
-
[10]
Emanuele Conti, Davide Salvi, Clara Borrelli, Brian Hosler, Paolo Bestagini, Fabio Antonacci, Augusto Sarti, Matthew C. Stamm, and Stefano Tubaro. Deepfake speech detection through emotion recognition: A semantic ap- proach. InICASSP 2022 - 2022 IEEE International Confer- ence on Acoustics, Speech and Signal Processing (ICASSP), pages 8962–8966, 2022. 3
work page 2022
-
[11]
Implicit identity leakage: The stum- bling block to improving deepfake detection generalization
Shichao Dong, Jin Wang, Renhe Ji, Jiajun Liang, Haoqiang Fan, and Zheng Ge. Implicit identity leakage: The stum- bling block to improving deepfake detection generalization. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 3994–4004, 2023. 3
work page 2023
-
[12]
Self- supervised video forensics by audio-visual anomaly detec- tion
Chao Feng, Ziyang Chen, and Andrew Owens. Self- supervised video forensics by audio-visual anomaly detec- tion. Inproceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 10491–10503,
-
[13]
Chal- lenges in representation learning: A report on three machine learning contests
Ian J Goodfellow, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong-Hyun Lee, et al. Chal- lenges in representation learning: A report on three machine learning contests. InInternational conference on neural in- formation processing, pages 117–124. Springer, 2013. 2
work page 2013
-
[14]
Lips don’t lie: A generalisable and robust approach to face forgery detection
Alexandros Haliassos, Konstantinos V ougioukas, Stavros Petridis, and Maja Pantic. Lips don’t lie: A generalisable and robust approach to face forgery detection. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5039–5049, 2021. 3
work page 2021
-
[15]
Do deepfakes feel emotions? a semantic approach to detecting deepfakes via emotional inconsistencies
Brian Hosler, Davide Salvi, Anthony Murray, Fabio An- tonacci, Paolo Bestagini, Stefano Tubaro, and Matthew C Stamm. Do deepfakes feel emotions? a semantic approach to detecting deepfakes via emotional inconsistencies. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1013–1022, 2021. 2, 3
work page 2021
-
[16]
Towards paralinguistic-only speech representations for end-to-end speech emotion recognition
George Ioannides, Michael Owen, Andrew Fletcher, Viktor Rozgic, and Chao Wang. Towards paralinguistic-only speech representations for end-to-end speech emotion recognition. ISCA archive, 2023. 2
work page 2023
-
[17]
Dfew: A large-scale database for recognizing dynamic facial expres- sions in the wild
Xingxun Jiang, Yuan Zong, Wenming Zheng, Chuangao Tang, Wanchuang Xia, Cheng Lu, and Jiateng Liu. Dfew: A large-scale database for recognizing dynamic facial expres- sions in the wild. InProceedings of the 28th ACM interna- tional conference on multimedia, pages 2881–2889, 2020. 2
work page 2020
-
[18]
FakeA VCeleb: A novel audio-video multimodal deepfake dataset.arXiv preprint arXiv:2108.05080,
Hasam Khalid, Shahroz Tariq, Minha Kim, and Simon S Woo. Fakeavceleb: A novel audio-video multimodal deep- fake dataset.arXiv preprint arXiv:2108.05080, 2021. 2, 5, 1
-
[19]
Deep- fake doctor: Diagnosing and treating audio-video fake detec- tion, 2025
Marcel Klemt, Carlotta Segna, and Anna Rohrbach. Deep- fake doctor: Diagnosing and treating audio-video fake detec- tion, 2025. 1, 2, 5, 6, 3
work page 2025
-
[20]
Ternary weight networks.arXiv preprint arXiv:1605.04711, 2016
Fengfu Li, Bin Liu, Xiaoxing Wang, Bo Zhang, and Junchi Yan. Ternary weight networks.arXiv preprint arXiv:1605.04711, 2016. 2 9
-
[21]
Intensity-aware loss for dynamic facial expression recogni- tion in the wild
Hanting Li, Hongjing Niu, Zhaoqing Zhu, and Feng Zhao. Intensity-aware loss for dynamic facial expression recogni- tion in the wild. InProceedings of the AAAI conference on artificial intelligence, pages 67–75, 2023. 2
work page 2023
-
[22]
Affective behaviour analysis using pretrained model with facial prior
Yifan Li, Haomiao Sun, Zhaori Liu, Hu Han, and Shiguang Shan. Affective behaviour analysis using pretrained model with facial prior. InEuropean Conference on Computer Vi- sion, pages 19–30. Springer, 2022. 2
work page 2022
-
[23]
Daizong Liu, Xi Ouyang, Shuangjie Xu, Pan Zhou, Kun He, and Shiping Wen. Saanet: Siamese action-units attention network for improving dynamic facial expression recogni- tion.Neurocomputing, 413:145–157, 2020. 2
work page 2020
-
[24]
Weifeng Liu, Tianyi She, Jiawei Liu, Boheng Li, Dongyu Yao, Ziyou Liang, and Run Wang. Lips are lying: Spot- ting the temporal inconsistency between audio and visual in lip-syncing deepfakes. InAdvances in Neural Information Processing Systems, pages 91131–91155. Curran Associates, Inc., 2024. 1, 3
work page 2024
-
[25]
Yuanyuan Liu, Wei Dai, Chuanxu Feng, Wenbin Wang, Guanghao Yin, Jiabei Zeng, and Shiguang Shan. Mafw: A large-scale, multi-modal, compound affective database for dynamic facial expression recognition in the wild. InPro- ceedings of the 30th ACM international conference on mul- timedia, pages 24–32, 2022. 2
work page 2022
-
[26]
Yuanyuan Liu, Wenbin Wang, Chuanxu Feng, Haoyu Zhang, Zhe Chen, and Yibing Zhan. Expression snippet transformer for robust video-based facial expression recognition.Pattern Recognition, 138:109368, 2023. 2
work page 2023
-
[27]
Man fined over deepfake porn in australian first
Tobi Loftus. Man fined over deepfake porn in australian first. https://www.abc.net.au/news/2025-09-26/ qld- deepfake- pornography- federal- court- charge / 105822448, 2025. ABC News, September
work page 2025
-
[28]
Juan-Miguel L ´opez-Gil, Rosa Gil, and Roberto Garc ´ıa. Do deepfakes adequately display emotions? a study on deepfake facial emotion expression.Computational Intelligence and Neuroscience, 2022(1):1332122, 2022. 2, 3
work page 2022
-
[29]
emotion2vec: Self-supervised pre-training for speech emotion representation.arXiv preprint, 2023
Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, and Xie Chen. emotion2vec: Self- supervised pre-training for speech emotion representation. arXiv preprint arXiv:2312.15185, 2023. 2, 6
-
[30]
Emotions don’t lie: An audio- visual deepfake detection method using affective cues
Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, and Dinesh Manocha. Emotions don’t lie: An audio- visual deepfake detection method using affective cues. In Proceedings of the 28th ACM international conference on multimedia, pages 2823–2832, 2020. 2, 3
work page 2020
-
[31]
Ali Mollahosseini, Behzad Hasani, and Mohammad H Ma- hoor. Affectnet: A database for facial expression, valence, and arousal computing in the wild.IEEE Transactions on Affective Computing, 10(1):18–31, 2017. 2
work page 2017
-
[32]
Speech emotion recognition using self-supervised features
Edmilson Morais, Ron Hoory, Weizhong Zhu, Itai Gat, Matheus Damasceno, and Hagai Aronowitz. Speech emotion recognition using self-supervised features. InICASSP 2022- 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6922–6926. IEEE,
work page 2022
-
[33]
Towards uni- versal fake image detectors that generalize across genera- tive models
Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards uni- versal fake image detectors that generalize across genera- tive models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24480– 24489, 2023. 3
work page 2023
-
[34]
Avff: Audio-visual feature fusion for video deepfake detection
Trevine Oorloff, Surya Koppisetti, Nicol `o Bonettini, Di- vyaraj Solanki, Ben Colman, Yaser Yacoob, Ali Shahriyari, and Gaurav Bharaj. Avff: Audio-visual feature fusion for video deepfake detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27102–27112, 2024. 1, 3, 5, 6
work page 2024
-
[35]
Emo- tion recognition from speech using wav2vec 2.0 embeddings
Leonardo Pepino, Pablo Riera, and Luciana Ferrer. Emo- tion recognition from speech using wav2vec 2.0 embeddings. arXiv preprint arXiv:2104.03502, 2021. 2
-
[36]
Mul- timodaltrace: Deepfake detection using audiovisual repre- sentation learning
Muhammad Anas Raza and Khalid Mahmood Malik. Mul- timodaltrace: Deepfake detection using audiovisual repre- sentation learning. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 993–1000, 2023. 1, 3
work page 2023
-
[37]
Faceforen- sics++: Learning to detect manipulated facial images
Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Chris- tian Riess, Justus Thies, and Matthias Nießner. Faceforen- sics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1–11, 2019. 3
work page 2019
-
[38]
Mae- dfer: Efficient masked autoencoder for self-supervised dy- namic facial expression recognition
Licai Sun, Zheng Lian, Bin Liu, and Jianhua Tao. Mae- dfer: Efficient masked autoencoder for self-supervised dy- namic facial expression recognition. InProceedings of the 31st ACM International Conference on Multimedia, pages 6110–6121, 2023. 2
work page 2023
-
[39]
Mulin Tian, Mahyar Khayatkhoei, Joe Mathai, and Wael Ab- dAlmageed. Unsupervised multimodal deepfake detection using intra-and cross-modal inconsistencies.arXiv preprint arXiv:2311.17088, 2023. 3
-
[40]
Yingzhi Wang, Abdelmoumene Boumadane, and Abdelwa- hab Heba. A fine-tuned wav2vec 2.0/hubert benchmark for speech emotion recognition, speaker verification and spoken language understanding.arXiv preprint arXiv:2111.02735,
-
[41]
Ferv39k: A large-scale multi-scene dataset for fa- cial expression recognition in videos
Yan Wang, Yixuan Sun, Yiwen Huang, Zhongying Liu, Shuyong Gao, Wei Zhang, Weifeng Ge, and Wenqiang Zhang. Ferv39k: A large-scale multi-scene dataset for fa- cial expression recognition in videos. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20922–20931, 2022. 2
work page 2022
-
[42]
Deepfakes audio de- tection leveraging audio spectrogram and convolutional neu- ral networks
Taiba Majid Wani and Irene Amerini. Deepfakes audio de- tection leveraging audio spectrogram and convolutional neu- ral networks. InInternational Conference on Image Analysis and Processing, pages 156–167. Springer, 2023. 3
work page 2023
-
[43]
Abc- capsnet: Attention based cascaded capsule network for audio deepfake detection
Taiba Majid Wani, Reeva Gulzar, and Irene Amerini. Abc- capsnet: Attention based cascaded capsule network for audio deepfake detection. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 2464–2472, 2024. 3
work page 2024
-
[44]
Deressa Wodajo and Solomon Atnafu. Deepfake video detection using convolutional vision transformer.arXiv preprint arXiv:2102.11126, 2021. 3
-
[45]
Trans- fer: Learning relation-aware facial expression representa- tions with transformers
Fanglei Xue, Qiangchang Wang, and Guodong Guo. Trans- fer: Learning relation-aware facial expression representa- tions with transformers. InProceedings of the IEEE/CVF 10 International conference on computer vision, pages 3601– 3610, 2021. 2
work page 2021
-
[46]
Avoid-df: Audio-visual joint learning for detecting deepfake
Wenyuan Yang, Xiaoyu Zhou, Zhikai Chen, Bofei Guo, Zhongjie Ba, Zhihua Xia, Xiaochun Cao, and Kui Ren. Avoid-df: Audio-visual joint learning for detecting deepfake. IEEE Transactions on Information Forensics and Security, 18:2015–2029, 2023. 1, 3
work page 2015
-
[47]
Exposing deep fakes using inconsistent head poses
Xin Yang, Yuezun Li, and Siwei Lyu. Exposing deep fakes using inconsistent head poses. InICASSP 2019-2019 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 8261–8265. IEEE, 2019. 3
work page 2019
-
[48]
Zhenbo Yu, Guangcan Liu, Qingshan Liu, and Jiankang Deng. Spatio-temporal convolutional features with nested lstm for facial expression recognition.Neurocomputing, 317: 50–57, 2018. 2
work page 2018
-
[49]
Memory fusion network for multi-view sequential learning
Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Memory fusion network for multi-view sequential learning. InPro- ceedings of the AAAI conference on artificial intelligence,
-
[50]
Haifeng Zhang, Wen Su, Jun Yu, and Zengfu Wang. Identity–expression dual branch network for facial expres- sion recognition.IEEE transactions on cognitive and devel- opmental systems, 13(4):898–911, 2020. 2
work page 2020
-
[51]
Poster: A pyra- mid cross-fusion transformer network for facial expression recognition
Ce Zheng, Matias Mendieta, and Chen Chen. Poster: A pyra- mid cross-fusion transformer network for facial expression recognition. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3146–3155, 2023. 2, 6
work page 2023
-
[52]
Exploring temporal coherence for more gen- eral video face forgery detection
Yinglin Zheng, Jianmin Bao, Dong Chen, Ming Zeng, and Fang Wen. Exploring temporal coherence for more gen- eral video face forgery detection. InProceedings of the IEEE/CVF international conference on computer vision, pages 15044–15054, 2021. 3
work page 2021
-
[53]
Joint audio-visual deepfake detection
Yipin Zhou and Ser-Nam Lim. Joint audio-visual deepfake detection. InProceedings of the IEEE/CVF international conference on computer vision, pages 14800–14809, 2021. 3 11 EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection Supplementary Material We structure the supplementary materials as follows: first, S...
work page 2021
-
[54]
Details on Dataset In this section, we provide some details on how we split and utilise the FakeA VCeleb [18] and DeepSpeak v2 [2] datasets for benchmarking our proposed framework Emo- Boost. We create a validation split for both datasets, which is used for learning rate scheduling and early stopping. Fur- thermore, we also validate our design choices for...
-
[55]
Further Results 8.1. Leave-one-out Evaluation As described in Section 5.2, we observe motivating results from our proposed framework in the leave-one-out evalua- tion setup on FakeA VCeleb. We present the detailed perfor- mance and its comparison with other state-of-the-art multi- modal deepfake detectors in Table 6.Emo-Boosted SIMBA achieves the highest ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.