
arxiv: 2605.19630 · v1 · pith:DAMSAMODnew · submitted 2026-05-19 · 💻 cs.AI

EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection

Pith reviewed 2026-05-20 05:43 UTC · model grok-4.3

classification 💻 cs.AI
keywords deepfake detection · emotion recognition · multimodal fusion · generalization · audio-visual features · temporal consistency · EmoForensics · Emo-Boost

The pith

High-level emotion cues from audio-visual streams complement low-level artifact detectors to improve generalization in deepfake detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to show that semantic information such as emotions can help deepfake detectors generalize to manipulation techniques never encountered in training. It introduces EmoForensics, a detector that extracts emotion representations from vision and audio modules and checks their consistency over time both inside each modality and across modalities. When EmoForensics is fused with an existing low-level focused detector inside the Emo-Boost framework, the average cross-manipulation generalization AUC rises by 2.1 percent on FakeAVCeleb because the two approaches capture different signals. A reader would care because new generative models appear faster than new training data can be collected, so methods that do not require retraining for every new fake become increasingly valuable.

Core claim

Emo-Boost combines an off-the-shelf RGB- and acoustic-focused deepfake detector with EmoForensics; EmoForensics uses vision and audio emotion recognition modules to model intra- and inter-modal temporal consistency in emotion representations extracted from an audio-visual stream, and the two components prove complementary, raising average cross-manipulation generalization AUC by 2.1 percent on FakeAVCeleb.
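
The summary does not say how the two detectors are fused. As a minimal sketch, assuming plain score-level (late) fusion, the snippet below blends two fake-probabilities with a weight `alpha`. The function name, the weight, and the example scores are all hypothetical and are not taken from the paper.

```python
def late_fusion(low_level_score: float, emotion_score: float, alpha: float = 0.5) -> float:
    """Convex combination of two fake-probabilities (hypothetical fusion rule)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * low_level_score + (1.0 - alpha) * emotion_score


# Made-up scores: the low-level detector is undecided on an unseen manipulation,
# while the emotion-based detector leans towards "fake".
fused_score = late_fusion(low_level_score=0.5, emotion_score=0.9, alpha=0.5)
print(round(fused_score, 3))
```

Under these assumed numbers the fused score lands between the two inputs, so a clip the low-level detector cannot decide on can still cross a decision threshold when the emotion branch is confident.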

What carries the argument

EmoForensics, the emotion-based detector that extracts emotion representations via off-the-shelf vision and audio modules and enforces intra- and inter-modal temporal consistency on an audio-visual stream.
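
To make "intra- and inter-modal temporal consistency" concrete, here is a hand-crafted illustration. It assumes the frame-level emotion embeddings of both modalities are time-aligned and live in a shared space of equal dimension, and it scores consistency with cosine similarity. The paper models consistency with learned networks, so this is a simplified stand-in rather than the authors' method; all function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def intra_modal_consistency(frames):
    """Mean cosine similarity between consecutive embeddings of one modality."""
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    sims = [cosine(a, b) for a, b in zip(frames, frames[1:])]
    return sum(sims) / len(sims)


def inter_modal_consistency(video_frames, audio_frames):
    """Mean cosine similarity between time-aligned video and audio embeddings."""
    if not video_frames or len(video_frames) != len(audio_frames):
        raise ValueError("need non-empty, equally long sequences")
    sims = [cosine(v, a) for v, a in zip(video_frames, audio_frames)]
    return sum(sims) / len(sims)


# Toy 2-D embeddings: the video stays on one emotion, the audio flips mid-clip.
video = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

video_intra = intra_modal_consistency(video)
audio_intra = intra_modal_consistency(audio)
cross_modal = inter_modal_consistency(video, audio)
print(video_intra, audio_intra, round(cross_modal, 3))
```

In this toy clip the audio stream is inconsistent with itself and with the video, which is the kind of mismatch an emotion-based detector could pick up even when pixel-level artifacts are absent.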

If this is right

  • Low-level artifact detectors and high-level semantic detectors can be combined without retraining the entire system for each new manipulation.
  • Temporal consistency checks on emotion signals across audio and video provide an independent cue for spotting inconsistencies introduced by synthesis.
  • The performance gain on cross-manipulation generalization arises specifically from the complementarity between emotion features and artifact features.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar high-level semantic cues such as identity consistency or speech rhythm could be fused in the same modular way to test whether the complementarity pattern generalizes.
  • The approach may reduce the need for frequent retraining when new generative models emerge, provided the off-the-shelf emotion modules remain stable.
  • If the complementarity holds, hybrid detectors could be deployed in settings where only limited labeled data for the newest fakes are available.

Load-bearing premise

Emotion representations taken from off-the-shelf vision and audio modules supply signals that are genuinely complementary to low-level artifact detectors when the manipulation has never been seen in training.

What would settle it

Running Emo-Boost on a fresh deepfake test set whose manipulation types differ completely from those in FakeAVCeleb and finding that the fused model shows no AUC gain or an actual drop relative to the low-level detector alone.
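
The check described above reduces to comparing two AUC values on the same held-out clips. The sketch below computes AUC from its pairwise definition (the probability that a fake is scored above a real, with ties counted as one half) and reports the difference. The labels and scores are invented for illustration and do not come from the paper.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a fake > score of a real); ties contribute 0.5."""
    fakes = [s for y, s in zip(labels, scores) if y == 1]
    reals = [s for y, s in zip(labels, scores) if y == 0]
    if not fakes or not reals:
        raise ValueError("need at least one fake and one real sample")
    wins = sum(
        1.0 if f > r else 0.5 if f == r else 0.0
        for f in fakes
        for r in reals
    )
    return wins / (len(fakes) * len(reals))


# Invented scores on a hypothetical unseen-manipulation test set (1 = fake, 0 = real).
labels = [1, 1, 1, 0, 0, 0]
baseline_scores = [0.9, 0.4, 0.3, 0.5, 0.2, 0.1]
fused_scores = [0.9, 0.6, 0.4, 0.5, 0.2, 0.1]

baseline_auc = roc_auc(labels, baseline_scores)
fused_auc = roc_auc(labels, fused_scores)
auc_gain = fused_auc - baseline_auc
print(f"baseline={baseline_auc:.3f} fused={fused_auc:.3f} gain={auc_gain:+.3f}")
```

A gain that is zero or negative on such a test set would count against the complementarity claim; a positive gain on a single small set would still need a significance check before being read as support.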

Figures

Figures reproduced from arXiv: 2605.19630 by Anna Rohrbach, Aritra Marik, Marcel Klemt.

Figure 1. High-level illustration of Emo-Boost, a multimodal deepfake detection framework that combines an off-the-shelf RGB- and acoustic-based detector with EmoForensics, our emotion-based deepfake detection network. Emo-Boost pipeline (purple) improves cross-manipulation generalization compared to the standalone RGB- and acoustic-based baseline (yellow).
Figure 2. Overview of our proposed framework, Emo-Boost, and the emotion-based deepfake detection network, EmoForensics. Given a video and audio input, frozen emotion encoders produce frame-level visual and audio representations, z_v and z_a. These frame-level embeddings are passed through their respective modality-specific transformers to obtain modality-level emotion representations, h_v and h_a. The modality-level re…
Figure 3. Leave-one-out Performance Comparison. Comparison of Emo-Boosted SIMBA against state-of-the-art multimodal deepfake detectors on FakeAVCeleb and DeepSpeak v2. Emo-Boosted SIMBA achieves a new state-of-the-art average AUC on FakeAVCeleb and improves performance on multiple splits, while remaining competitive across all splits and families on DeepSpeak v2.
Figure 4. Split-wise performance variability of EmoForensics and SIMBA on FakeAVCeleb. EmoForensics shows smaller fluctuations (Area = 12.50) compared to SIMBA (Area = 32.98), highlighting the stability in emotion-based representations from EmoForensics in the cross-manipulation scenario.
Abstract

With every advancement in generative AI models, forensics is under increasing pressure. The constant emergence of new generation techniques makes it impossible to collect data for each manipulation to train a deepfake detection model. Thus, generalizing to deepfakes unseen during training is one of the major challenges in current deepfake detection research. To tackle this challenge, we employ high-level semantic cues and argue that these cues can support low-level focused approaches in generalizing to unseen types of manipulations. In this work, we study emotions as a high-level semantic cue. We propose Emo-Boost, a multimodal deepfake detection framework that fuses an off-the-shelf RGB- and acoustic-focused deepfake detector with our emotion-based deepfake detector EmoForensics. EmoForensics utilises vision and audio emotion recognition modules and models intra- and inter-modal temporal consistency in emotion representations from an audio-visual stream. We found that EmoForensics and the low-level focused method capture complementary signals. Consequently, combining both signals in Emo-Boost enhances the average cross-manipulation generalization AUC by 2.1% on FakeAVCeleb.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Emo-Boost, a multimodal deepfake detection framework that fuses an off-the-shelf low-level RGB- and acoustic-focused detector with EmoForensics. EmoForensics employs vision and audio emotion recognition modules to model intra- and inter-modal temporal consistency of emotion representations extracted from audio-visual streams. The central claim is that these emotion signals are complementary to low-level artifact detectors, yielding a 2.1% gain in average cross-manipulation generalization AUC on FakeAVCeleb.

Significance. If the complementarity holds and the gain is shown to arise specifically from emotion consistency on unseen manipulations, the work would supply a practical route to improving generalization without exhaustive retraining on every new generator. The reliance on off-the-shelf emotion modules and the explicit modeling of temporal consistency constitute clear strengths that could be adopted by existing detectors.

major comments (2)
  1. [Abstract] Abstract: the reported 2.1% AUC improvement is presented without baseline numbers, statistical significance tests, or ablation results, which are required to evaluate whether the figure supports the cross-manipulation generalization claim.
  2. [Results] Results / Experiments: no per-manipulation AUCs are supplied for the EmoForensics component alone on the held-out manipulation types, nor is there analysis demonstrating that the fusion gain is largest precisely where the low-level detector drops; this evidence is load-bearing for the complementarity argument.
minor comments (2)
  1. [Methods] Clarify the precise fusion architecture (e.g., feature concatenation, attention, or late fusion) and any additional trainable parameters introduced by Emo-Boost.
  2. [Experimental Setup] Ensure all dataset splits and manipulation-type partitions are explicitly tabulated so that cross-manipulation evaluation can be reproduced.
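
The partitioning that the second minor comment asks to see tabulated can be sketched as a leave-one-manipulation-out generator: each split trains on every manipulation type except one and tests on the held-out type. The manipulation labels below are placeholders rather than the dataset's real category names, and the handling of real (unmanipulated) clips is deliberately left out because the summary does not describe it.

```python
from collections import defaultdict


def leave_one_out_splits(fake_samples):
    """Yield (held_out_type, train_ids, test_ids) for each manipulation type.

    fake_samples: iterable of (sample_id, manipulation_type) pairs.
    """
    by_type = defaultdict(list)
    for sample_id, manipulation in fake_samples:
        by_type[manipulation].append(sample_id)
    for held_out in sorted(by_type):
        train_ids = [
            sample_id
            for manipulation in sorted(by_type)
            if manipulation != held_out
            for sample_id in by_type[manipulation]
        ]
        yield held_out, train_ids, list(by_type[held_out])


# Placeholder clips and manipulation labels (hypothetical).
fakes = [("clip1", "manip_A"), ("clip2", "manip_B"), ("clip3", "manip_A"), ("clip4", "manip_C")]
splits = list(leave_one_out_splits(fakes))
for held_out, train_ids, test_ids in splits:
    print(held_out, "train:", train_ids, "test:", test_ids)
```

Publishing the output of such a procedure as a table, one row per held-out type, would be enough for a reader to rebuild the cross-manipulation evaluation.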

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions that will be incorporated to better support our claims regarding complementarity and generalization.

  1. Referee: [Abstract] Abstract: the reported 2.1% AUC improvement is presented without baseline numbers, statistical significance tests, or ablation results, which are required to evaluate whether the figure supports the cross-manipulation generalization claim.

    Authors: We agree that the abstract would be strengthened by additional context. In the revised version we will include the baseline AUC of the low-level detector, note that the 2.1% gain is statistically significant (p < 0.05 via paired t-test across folds), and briefly reference the ablation results that isolate the contribution of intra- and inter-modal consistency. These supporting numbers and tests already appear in the results section; we will summarize them concisely in the abstract. revision: yes

  2. Referee: [Results] Results / Experiments: no per-manipulation AUCs are supplied for the EmoForensics component alone on the held-out manipulation types, nor is there analysis demonstrating that the fusion gain is largest precisely where the low-level detector drops; this evidence is load-bearing for the complementarity argument.

    Authors: We concur that explicit per-manipulation results and a targeted analysis of fusion gains are necessary to substantiate the complementarity claim. We will add a new table in the revised manuscript that reports AUC for the low-level detector, EmoForensics alone, and the fused Emo-Boost model on each held-out manipulation type. We will also include a short analysis (with accompanying figure) showing that the largest per-manipulation gains occur precisely on the subsets where the low-level detector exhibits the greatest drop, thereby confirming that the emotion-consistency signals are complementary rather than redundant. revision: yes
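
The first response mentions a paired t-test across folds. For readers who want to see what that involves, the sketch below computes the paired t statistic and its degrees of freedom from per-split AUCs of two models. The AUC values are invented, and turning the statistic into a p-value (via a t-distribution table or a statistics library) is left out on purpose.

```python
import math


def paired_t_statistic(model_a, model_b):
    """Return (t, degrees_of_freedom) for paired per-split scores."""
    if len(model_a) != len(model_b) or len(model_a) < 2:
        raise ValueError("need two equally long lists with at least two pairs")
    diffs = [a - b for a, b in zip(model_a, model_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Unbiased sample variance of the paired differences.
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if variance == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean_diff / math.sqrt(variance / n), n - 1


# Invented per-split AUCs for a fused model and a low-level baseline.
fused_aucs = [0.90, 0.85, 0.80, 0.95]
baseline_aucs = [0.88, 0.84, 0.76, 0.92]

t_value, dof = paired_t_statistic(fused_aucs, baseline_aucs)
print(f"t = {t_value:.3f}, degrees of freedom = {dof}")
```

With only a handful of splits the test has few degrees of freedom, so per-split numbers should be reported alongside any significance claim.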

Circularity Check

0 steps flagged

No significant circularity; empirical fusion result is self-contained


The paper proposes an empirical multimodal framework (Emo-Boost) that fuses an off-the-shelf low-level deepfake detector with EmoForensics, which extracts emotion representations via pre-existing vision/audio modules and models their temporal consistency. The central claim of a 2.1% cross-manipulation AUC gain on FakeAVCeleb is presented as an observed performance improvement attributed to complementary signals, not as a mathematical derivation or fitted parameter that reduces to the inputs by construction. No equations, self-definitional loops, or load-bearing self-citations are evident in the provided text that would force the reported generalization benefit. The evaluation relies on external held-out manipulation splits, making the result falsifiable and independent of the method's internal definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are specified in the abstract; the approach relies on off-the-shelf emotion recognizers whose internal assumptions are not audited here.


