EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection

Anna Rohrbach; Aritra Marik; Marcel Klemt

arxiv: 2605.19630 · v1 · pith:DAMSAMODnew · submitted 2026-05-19 · 💻 cs.AI

Pith reviewed 2026-05-20 05:43 UTC · model grok-4.3

keywords deepfake detection · emotion recognition · multimodal fusion · generalization · audio-visual features · temporal consistency · EmoForensics · Emo-Boost

The pith

High-level emotion cues from audio-visual streams complement low-level artifact detectors to improve generalization in deepfake detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to show that semantic information such as emotions can help deepfake detectors generalize to manipulation techniques never encountered in training. It introduces EmoForensics, a detector that extracts emotion representations from vision and audio modules and checks their consistency over time both inside each modality and across modalities. When EmoForensics is fused with an existing low-level focused detector inside the Emo-Boost framework, the average cross-manipulation generalization AUC rises by 2.1 percent on FakeAVCeleb because the two approaches capture different signals. A reader would care because new generative models appear faster than new training data can be collected, so methods that do not require retraining for every new fake become increasingly valuable.

Core claim

Emo-Boost combines an off-the-shelf RGB- and acoustic-focused deepfake detector with EmoForensics; EmoForensics uses vision and audio emotion recognition modules to model intra- and inter-modal temporal consistency in emotion representations extracted from an audio-visual stream, and the two components prove complementary, raising average cross-manipulation generalization AUC by 2.1 percent on FakeAVCeleb.

What carries the argument

EmoForensics, the emotion-based detector that extracts emotion representations via off-the-shelf vision and audio modules and enforces intra- and inter-modal temporal consistency on an audio-visual stream.
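
One way to make the idea of intra- and inter-modal temporal consistency concrete is a hand-written similarity statistic over frame-level emotion embeddings. This sketch is not the paper's method: EmoForensics passes frame-level embeddings through modality-specific transformers (Figure 2), whereas the example below only averages cosine similarities. The function names, the toy embeddings, and the assumption that visual and audio embeddings share a dimension and a frame rate are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def intra_modal_consistency(frames):
    """Mean cosine similarity between consecutive frame embeddings of one modality."""
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    sims = [cosine(a, b) for a, b in zip(frames, frames[1:])]
    return sum(sims) / len(sims)


def inter_modal_consistency(visual, audio):
    """Mean cosine similarity between time-aligned visual and audio embeddings."""
    if not visual or len(visual) != len(audio):
        raise ValueError("sequences must be non-empty and time-aligned")
    sims = [cosine(v, a) for v, a in zip(visual, audio)]
    return sum(sims) / len(sims)


# Toy (invented) embeddings: the visual emotion flips in the last frame,
# while the audio emotion stays constant.
z_v = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
z_a = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

print("intra (visual):", intra_modal_consistency(z_v))
print("intra (audio): ", intra_modal_consistency(z_a))
print("inter:         ", inter_modal_consistency(z_v, z_a))
```

In this toy case the visual stream is less self-consistent than the audio stream, and the two streams disagree in the final frame, which is the kind of cue a learned consistency model could pick up.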

If this is right

Low-level artifact detectors and high-level semantic detectors can be combined without retraining the entire system for each new manipulation.
Temporal consistency checks on emotion signals across audio and video provide an independent cue for spotting inconsistencies introduced by synthesis.
The performance gain on cross-manipulation generalization arises specifically from the complementarity between emotion features and artifact features.
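
The text above does not say how Emo-Boost fuses the two detectors, so the following is only an illustrative assumption: a score-level (late) fusion that mixes two fake-probability scores with a fixed weight. The function name `late_fusion`, the `weight` parameter, and the example scores are hypothetical and are not taken from the paper.

```python
def late_fusion(artifact_score, emotion_score, weight=0.5):
    """Convex combination of two detector scores in [0, 1].

    weight is the share given to the emotion-based score; this fusion rule
    is an assumption for illustration, not the paper's architecture.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return (1.0 - weight) * artifact_score + weight * emotion_score


# Invented scores: the artifact detector is unsure, the emotion detector is not.
fused = late_fusion(artifact_score=0.4, emotion_score=0.8, weight=0.25)
print(f"fused score: {fused:.3f}")
```

A rule like this leaves both detectors untouched, which is one way the "no retraining of the entire system" property could be realized; the paper may use a different, trainable fusion.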

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar high-level semantic cues such as identity consistency or speech rhythm could be fused in the same modular way to test whether the complementarity pattern generalizes.
The approach may reduce the need for frequent retraining when new generative models emerge, provided the off-the-shelf emotion modules remain stable.
If the complementarity holds, hybrid detectors could be deployed in settings where only limited labeled data for the newest fakes are available.

Load-bearing premise

Emotion representations taken from off-the-shelf vision and audio modules supply signals that are genuinely complementary to low-level artifact detectors when the manipulation has never been seen in training.

What would settle it

Running Emo-Boost on a fresh deepfake test set whose manipulation types differ completely from those in FakeAVCeleb and finding that the fused model shows no AUC gain or an actual drop relative to the low-level detector alone.
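
The check described above amounts to comparing two AUC values on the same held-out set. A minimal sketch, assuming label 1 means "fake" and higher scores mean "more likely fake", is given below; the pairwise AUC function and all scores are illustrative placeholders, not results from the paper.

```python
def auc(labels, scores):
    """ROC AUC as the fraction of (fake, real) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of samples,
    which is fine for a sketch but slow for large test sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one fake and one real sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented scores on a hypothetical unseen-manipulation test set.
labels = [1, 1, 1, 0, 0, 0]
baseline_scores = [0.9, 0.4, 0.3, 0.5, 0.2, 0.1]
fused_scores = [0.9, 0.6, 0.55, 0.5, 0.2, 0.1]

gain = auc(labels, fused_scores) - auc(labels, baseline_scores)
print(f"AUC gain of fused over baseline: {gain:+.3f}")
```

A gain at or below zero on such a set would count against the complementarity claim; a positive gain would need a significance check before it counts in favor.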

Figures

Figures reproduced from arXiv: 2605.19630 by Anna Rohrbach, Aritra Marik, Marcel Klemt.

**Figure 1.** High-level illustration of Emo-Boost, a multimodal deepfake detection framework that combines an off-the-shelf RGB- and acoustic-based detector with EmoForensics, our emotion-based deepfake detection network. The Emo-Boost pipeline (purple) improves cross-manipulation generalization compared to the standalone RGB- and acoustic-based baseline (yellow).

**Figure 2.** Overview of our proposed framework, Emo-Boost, and the emotion-based deepfake detection network, EmoForensics. Given a video and audio input, frozen emotion encoders produce frame-level visual and audio representations, z_v and z_a. These frame-level embeddings are passed through their respective modality-specific transformers to obtain modality-level emotion representations, h_v and h_a. The modality-level re…

**Figure 3.** Leave-one-out performance comparison. Comparison of Emo-Boosted SIMBA against state-of-the-art multimodal deepfake detectors on FakeAVCeleb and DeepSpeak v2. Emo-Boosted SIMBA achieves a new state-of-the-art average AUC on FakeAVCeleb and improves performance on multiple splits, while remaining competitive across all splits and families on DeepSpeak v2.

**Figure 4.** Split-wise performance variability of EmoForensics and SIMBA on FakeAVCeleb. EmoForensics shows smaller fluctuations (Area = 12.50) compared to SIMBA (Area = 32.98), highlighting the stability of emotion-based representations from EmoForensics in the cross-manipulation scenario.

Abstract

With every advancement in generative AI models, forensics is under increasing pressure. The constant emergence of new generation techniques makes it impossible to collect data for each manipulation to train a deepfake detection model. Thus, generalizing to deepfakes unseen during training is one of the major challenges in current deepfake detection research. To tackle this challenge, we employ high-level semantic cues and argue that these cues can support low-level focused approaches in generalizing to unseen types of manipulations. In this work, we study emotions as a high-level semantic cue. We propose Emo-Boost, a multimodal deepfake detection framework that fuses an off-the-shelf RGB- and acoustic-focused deepfake detector with our emotion-based deepfake detector EmoForensics. EmoForensics utilises vision and audio emotion recognition modules and models intra- and inter-modal temporal consistency in emotion representations from an audio-visual stream. We found that EmoForensics and the low-level focused method capture complementary signals. Consequently, combining both signals in Emo-Boost enhances the average cross-manipulation generalization AUC by 2.1% on FakeAVCeleb.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper reports a 2.1% AUC lift on cross-manipulation deepfake tests by fusing emotion consistency signals with a low-level detector, but supplies no per-manipulation results or standalone EmoForensics numbers to show the gain comes from complementarity on unseen fakes.

The main takeaway is that the authors add an emotion-based module called EmoForensics to an existing low-level deepfake detector and report a 2.1% average AUC improvement on FakeAVCeleb for cross-manipulation generalization. The specific contribution is the use of intra- and inter-modal temporal consistency in emotion representations extracted from off-the-shelf vision and audio modules as an auxiliary cue.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Emo-Boost, a multimodal deepfake detection framework that fuses an off-the-shelf low-level RGB- and acoustic-focused detector with EmoForensics. EmoForensics employs vision and audio emotion recognition modules to model intra- and inter-modal temporal consistency of emotion representations extracted from audio-visual streams. The central claim is that these emotion signals are complementary to low-level artifact detectors, yielding a 2.1% gain in average cross-manipulation generalization AUC on FakeAVCeleb.

Significance. If the complementarity holds and the gain is shown to arise specifically from emotion consistency on unseen manipulations, the work would supply a practical route to improving generalization without exhaustive retraining on every new generator. The reliance on off-the-shelf emotion modules and the explicit modeling of temporal consistency constitute clear strengths that could be adopted by existing detectors.

major comments (2)

[Abstract] Abstract: the reported 2.1% AUC improvement is presented without baseline numbers, statistical significance tests, or ablation results, which are required to evaluate whether the figure supports the cross-manipulation generalization claim.
[Results] Results / Experiments: no per-manipulation AUCs are supplied for the EmoForensics component alone on the held-out manipulation types, nor is there analysis demonstrating that the fusion gain is largest precisely where the low-level detector drops; this evidence is load-bearing for the complementarity argument.
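
The analysis requested in the second major comment could be sketched as a correlation between the baseline's per-manipulation AUC and the fusion gain: if the emotion signal is complementary, gains should be larger where the baseline is weaker, giving a negative correlation. The manipulation names and every AUC value below are invented placeholders for illustration; they are not numbers from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-manipulation AUCs on held-out manipulation types.
baseline_auc = {"manip_A": 0.90, "manip_B": 0.80, "manip_C": 0.70}
fused_auc = {"manip_A": 0.91, "manip_B": 0.83, "manip_C": 0.76}

names = sorted(baseline_auc)
gains = [fused_auc[m] - baseline_auc[m] for m in names]
r = pearson([baseline_auc[m] for m in names], gains)

for m, g in zip(names, gains):
    print(f"{m}: baseline={baseline_auc[m]:.2f} gain={g:+.2f}")
print(f"correlation(baseline AUC, gain) = {r:.3f}")
```

With only a handful of held-out manipulation types, such a correlation is suggestive at best, so a per-manipulation table would still be needed alongside it.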

minor comments (2)

[Methods] Clarify the precise fusion architecture (e.g., feature concatenation, attention, or late fusion) and any additional trainable parameters introduced by Emo-Boost.
[Experimental Setup] Ensure all dataset splits and manipulation-type partitions are explicitly tabulated so that cross-manipulation evaluation can be reproduced.
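
To illustrate what an explicitly tabulated leave-one-out partition might look like, the sketch below enumerates one split per manipulation type, training on all remaining types and testing on the held-out one. The manipulation names are hypothetical stand-ins, and the actual FakeAVCeleb and DeepSpeak v2 partitions used by the authors may differ.

```python
def leave_one_out_splits(manipulations):
    """Yield (train_types, held_out_type) for each manipulation type."""
    types = list(manipulations)
    if len(set(types)) != len(types):
        raise ValueError("manipulation types must be unique")
    for held_out in types:
        train = [m for m in types if m != held_out]
        yield train, held_out


# Hypothetical manipulation names, used only to show the table layout.
splits = list(leave_one_out_splits(["manip_A", "manip_B", "manip_C"]))
for train, held_out in splits:
    print(f"train on {train}, test on unseen {held_out}")
```

Publishing the output of such an enumeration, together with the video identifiers in each partition, would make the cross-manipulation protocol straightforward to reproduce.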

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions that will be incorporated to better support our claims regarding complementarity and generalization.

Referee: [Abstract] Abstract: the reported 2.1% AUC improvement is presented without baseline numbers, statistical significance tests, or ablation results, which are required to evaluate whether the figure supports the cross-manipulation generalization claim.

Authors: We agree that the abstract would be strengthened by additional context. In the revised version we will include the baseline AUC of the low-level detector, note that the 2.1% gain is statistically significant (p < 0.05 via paired t-test across folds), and briefly reference the ablation results that isolate the contribution of intra- and inter-modal consistency. These supporting numbers and tests already appear in the results section; we will summarize them concisely in the abstract.

revision: yes

Referee: [Results] Results / Experiments: no per-manipulation AUCs are supplied for the EmoForensics component alone on the held-out manipulation types, nor is there analysis demonstrating that the fusion gain is largest precisely where the low-level detector drops; this evidence is load-bearing for the complementarity argument.

Authors: We concur that explicit per-manipulation results and a targeted analysis of fusion gains are necessary to substantiate the complementarity claim. We will add a new table in the revised manuscript that reports AUC for the low-level detector, EmoForensics alone, and the fused Emo-Boost model on each held-out manipulation type. We will also include a short analysis (with accompanying figure) showing that the largest per-manipulation gains occur precisely on the subsets where the low-level detector exhibits the greatest drop, thereby confirming that the emotion-consistency signals are complementary rather than redundant.

revision: yes
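
The paired t-test across folds mentioned in the first response can be sketched as follows. The example computes only the test statistic and the degrees of freedom; the per-fold AUC values are invented for illustration and are not the paper's numbers, and the function name is hypothetical.

```python
import math


def paired_t_statistic(baseline, fused):
    """Return (t, degrees_of_freedom) for paired per-fold scores.

    t is the mean per-fold difference (fused minus baseline) divided by its
    standard error, using the sample variance with n - 1 in the denominator.
    """
    if len(baseline) != len(fused) or len(baseline) < 2:
        raise ValueError("need two paired sequences of length >= 2")
    diffs = [f - b for b, f in zip(baseline, fused)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean / math.sqrt(var / n), n - 1


# Invented per-fold AUCs for a baseline detector and a fused detector.
baseline_folds = [0.80, 0.82, 0.78, 0.81, 0.79]
fused_folds = [0.82, 0.85, 0.80, 0.82, 0.82]

t, df = paired_t_statistic(baseline_folds, fused_folds)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

The statistic is then compared against a Student t distribution with the given degrees of freedom; for 4 degrees of freedom the two-sided 5% critical value is approximately 2.776.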

Circularity Check

0 steps flagged

No significant circularity; empirical fusion result is self-contained

full rationale

The paper proposes an empirical multimodal framework (Emo-Boost) that fuses an off-the-shelf low-level deepfake detector with EmoForensics, which extracts emotion representations via pre-existing vision/audio modules and models their temporal consistency. The central claim of a 2.1% cross-manipulation AUC gain on FakeAVCeleb is presented as an observed performance improvement attributed to complementary signals, not as a mathematical derivation or fitted parameter that reduces to the inputs by construction. No equations, self-definitional loops, or load-bearing self-citations are evident in the provided text that would force the reported generalization benefit. The evaluation relies on external held-out manipulation splits, making the result falsifiable and independent of the method's internal definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are specified in the abstract; the approach relies on off-the-shelf emotion recognizers whose internal assumptions are not audited here.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic · LogicNat recovery · relation: unclear

Paper passage: "We found that EmoForensics and the low-level focused method capture complementary signals."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 1 internal anchor

[1]

Detecting deep-fake videos from phoneme- viseme mismatches

Shruti Agarwal, Hany Farid, Ohad Fried, and Maneesh Agrawala. Detecting deep-fake videos from phoneme- viseme mismatches. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition work- shops, pages 660–661, 2020. 3

work page 2020
[2]

The DeepSpeak Dataset

Sarah Barrington, Matyas Bohacek, and Hany Farid. The deepspeak dataset.arXiv preprint arXiv:2408.05366, 2024. 2, 5, 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Island loss for learning discriminative features in facial expression recognition

Jie Cai, Zibo Meng, Ahmed Shehab Khan, Zhiyuan Li, James O’Reilly, and Yan Tong. Island loss for learning discriminative features in facial expression recognition. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 302–309. IEEE, 2018. 2

work page 2018
[4]

Deepfake-scam auf meta-plattformen: Wenn merz und trump ihnen geld schenken wollen

Val ´erie Catil. Deepfake-scam auf meta-plattformen: Wenn merz und trump ihnen geld schenken wollen. https://taz.de/Deepfake- Scam- auf- Meta- Plattformen/!6118507/, 2025. taz.de, October 2025. 1

work page 2025
[5]

chief finan- cial officer

Heather Chen and Kathleen Magramo. Finance worker pays out 25 million after video call with deepfake “chief finan- cial officer”.https://edition.cnn.com/2024/ 02/04/asia/deepfake-cfo-scam-hong-kong- intl-hnk, 2024. CNN, February 2024. 1

work page 2024
[6]

From static to dynamic: Adapting landmark-aware image models for facial expression recognition in videos

Yin Chen, Jia Li, Shiguang Shan, Meng Wang, and Richang Hong. From static to dynamic: Adapting landmark-aware image models for facial expression recognition in videos. IEEE Transactions on Affective Computing, 2024. 2

work page 2024
[7]

Static for dy- namic: Towards a deeper understanding of dynamic facial expressions using static expression data.arXiv preprint arXiv:2409.06154, 2024

Yin Chen, Jia Li, Yu Zhang, Zhenzhen Hu, Shiguang Shan, Meng Wang, and Richang Hong. Static for dy- namic: Towards a deeper understanding of dynamic facial expressions using static expression data.arXiv preprint arXiv:2409.06154, 2024. 2

work page arXiv 2024
[8]

V oice-face homogeneity tells deep- fake.ACM Transactions on Multimedia Computing, Com- munications and Applications, 20(3):1–22, 2023

Harry Cheng, Yangyang Guo, Tianyi Wang, Qi Li, Xiaojun Chang, and Liqiang Nie. V oice-face homogeneity tells deep- fake.ACM Transactions on Multimedia Computing, Com- munications and Applications, 20(3):1–22, 2023. 3

work page 2023
[9]

Mma-dfer: Multimodal adaptation of unimodal models for dynamic facial expression recognition in-the- wild

Kateryna Chumachenko, Alexandros Iosifidis, and Moncef Gabbouj. Mma-dfer: Multimodal adaptation of unimodal models for dynamic facial expression recognition in-the- wild. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 4673–4682,

work page
[10]

Stamm, and Stefano Tubaro

Emanuele Conti, Davide Salvi, Clara Borrelli, Brian Hosler, Paolo Bestagini, Fabio Antonacci, Augusto Sarti, Matthew C. Stamm, and Stefano Tubaro. Deepfake speech detection through emotion recognition: A semantic ap- proach. InICASSP 2022 - 2022 IEEE International Confer- ence on Acoustics, Speech and Signal Processing (ICASSP), pages 8962–8966, 2022. 3

work page 2022
[11]

Implicit identity leakage: The stum- bling block to improving deepfake detection generalization

Shichao Dong, Jin Wang, Renhe Ji, Jiajun Liang, Haoqiang Fan, and Zheng Ge. Implicit identity leakage: The stum- bling block to improving deepfake detection generalization. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 3994–4004, 2023. 3

work page 2023
[12]

Self- supervised video forensics by audio-visual anomaly detec- tion

Chao Feng, Ziyang Chen, and Andrew Owens. Self- supervised video forensics by audio-visual anomaly detec- tion. Inproceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 10491–10503,

work page
[13]

Chal- lenges in representation learning: A report on three machine learning contests

Ian J Goodfellow, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong-Hyun Lee, et al. Chal- lenges in representation learning: A report on three machine learning contests. InInternational conference on neural in- formation processing, pages 117–124. Springer, 2013. 2

work page 2013
[14]

Lips don’t lie: A generalisable and robust approach to face forgery detection

Alexandros Haliassos, Konstantinos V ougioukas, Stavros Petridis, and Maja Pantic. Lips don’t lie: A generalisable and robust approach to face forgery detection. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5039–5049, 2021. 3

work page 2021
[15]

Do deepfakes feel emotions? a semantic approach to detecting deepfakes via emotional inconsistencies

Brian Hosler, Davide Salvi, Anthony Murray, Fabio An- tonacci, Paolo Bestagini, Stefano Tubaro, and Matthew C Stamm. Do deepfakes feel emotions? a semantic approach to detecting deepfakes via emotional inconsistencies. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1013–1022, 2021. 2, 3

work page 2021
[16]

Towards paralinguistic-only speech representations for end-to-end speech emotion recognition

George Ioannides, Michael Owen, Andrew Fletcher, Viktor Rozgic, and Chao Wang. Towards paralinguistic-only speech representations for end-to-end speech emotion recognition. ISCA archive, 2023. 2

work page 2023
[17]

Dfew: A large-scale database for recognizing dynamic facial expres- sions in the wild

Xingxun Jiang, Yuan Zong, Wenming Zheng, Chuangao Tang, Wanchuang Xia, Cheng Lu, and Jiateng Liu. Dfew: A large-scale database for recognizing dynamic facial expres- sions in the wild. InProceedings of the 28th ACM interna- tional conference on multimedia, pages 2881–2889, 2020. 2

work page 2020
[18]

FakeA VCeleb: A novel audio-video multimodal deepfake dataset.arXiv preprint arXiv:2108.05080,

Hasam Khalid, Shahroz Tariq, Minha Kim, and Simon S Woo. Fakeavceleb: A novel audio-video multimodal deep- fake dataset.arXiv preprint arXiv:2108.05080, 2021. 2, 5, 1

work page arXiv 2021
[19]

Deep- fake doctor: Diagnosing and treating audio-video fake detec- tion, 2025

Marcel Klemt, Carlotta Segna, and Anna Rohrbach. Deep- fake doctor: Diagnosing and treating audio-video fake detec- tion, 2025. 1, 2, 5, 6, 3

work page 2025
[20]

Ternary weight networks.arXiv preprint arXiv:1605.04711, 2016

Fengfu Li, Bin Liu, Xiaoxing Wang, Bo Zhang, and Junchi Yan. Ternary weight networks.arXiv preprint arXiv:1605.04711, 2016. 2 9

work page arXiv 2016
[21]

Intensity-aware loss for dynamic facial expression recogni- tion in the wild

Hanting Li, Hongjing Niu, Zhaoqing Zhu, and Feng Zhao. Intensity-aware loss for dynamic facial expression recogni- tion in the wild. InProceedings of the AAAI conference on artificial intelligence, pages 67–75, 2023. 2

work page 2023
[22]

Affective behaviour analysis using pretrained model with facial prior

Yifan Li, Haomiao Sun, Zhaori Liu, Hu Han, and Shiguang Shan. Affective behaviour analysis using pretrained model with facial prior. InEuropean Conference on Computer Vi- sion, pages 19–30. Springer, 2022. 2

work page 2022
[23]

Saanet: Siamese action-units attention network for improving dynamic facial expression recogni- tion.Neurocomputing, 413:145–157, 2020

Daizong Liu, Xi Ouyang, Shuangjie Xu, Pan Zhou, Kun He, and Shiping Wen. Saanet: Siamese action-units attention network for improving dynamic facial expression recogni- tion.Neurocomputing, 413:145–157, 2020. 2

work page 2020
[24]

Lips are lying: Spot- ting the temporal inconsistency between audio and visual in lip-syncing deepfakes

Weifeng Liu, Tianyi She, Jiawei Liu, Boheng Li, Dongyu Yao, Ziyou Liang, and Run Wang. Lips are lying: Spot- ting the temporal inconsistency between audio and visual in lip-syncing deepfakes. InAdvances in Neural Information Processing Systems, pages 91131–91155. Curran Associates, Inc., 2024. 1, 3

work page 2024
[25]

Mafw: A large-scale, multi-modal, compound affective database for dynamic facial expression recognition in the wild

Yuanyuan Liu, Wei Dai, Chuanxu Feng, Wenbin Wang, Guanghao Yin, Jiabei Zeng, and Shiguang Shan. Mafw: A large-scale, multi-modal, compound affective database for dynamic facial expression recognition in the wild. InPro- ceedings of the 30th ACM international conference on mul- timedia, pages 24–32, 2022. 2

work page 2022
[26]

Expression snippet transformer for robust video-based facial expression recognition.Pattern Recognition, 138:109368, 2023

Yuanyuan Liu, Wenbin Wang, Chuanxu Feng, Haoyu Zhang, Zhe Chen, and Yibing Zhan. Expression snippet transformer for robust video-based facial expression recognition.Pattern Recognition, 138:109368, 2023. 2

work page 2023
[27]

Man fined over deepfake porn in australian first

Tobi Loftus. Man fined over deepfake porn in australian first. https://www.abc.net.au/news/2025-09-26/ qld- deepfake- pornography- federal- court- charge / 105822448, 2025. ABC News, September

work page 2025
[28]

Do deepfakes adequately display emotions? a study on deepfake facial emotion expression.Computational Intelligence and Neuroscience, 2022(1):1332122, 2022

Juan-Miguel L ´opez-Gil, Rosa Gil, and Roberto Garc ´ıa. Do deepfakes adequately display emotions? a study on deepfake facial emotion expression.Computational Intelligence and Neuroscience, 2022(1):1332122, 2022. 2, 3

work page 2022
[29]

emotion2vec: Self-supervised pre-training for speech emotion representation.arXiv preprint, 2023

Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, and Xie Chen. emotion2vec: Self- supervised pre-training for speech emotion representation. arXiv preprint arXiv:2312.15185, 2023. 2, 6

work page arXiv 2023
[30]

Emotions don’t lie: An audio- visual deepfake detection method using affective cues

Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, and Dinesh Manocha. Emotions don’t lie: An audio- visual deepfake detection method using affective cues. In Proceedings of the 28th ACM international conference on multimedia, pages 2823–2832, 2020. 2, 3

work page 2020
[31]

Affectnet: A database for facial expression, valence, and arousal computing in the wild.IEEE Transactions on Affective Computing, 10(1):18–31, 2017

Ali Mollahosseini, Behzad Hasani, and Mohammad H Ma- hoor. Affectnet: A database for facial expression, valence, and arousal computing in the wild.IEEE Transactions on Affective Computing, 10(1):18–31, 2017. 2

work page 2017
[32]

Speech emotion recognition using self-supervised features

Edmilson Morais, Ron Hoory, Weizhong Zhu, Itai Gat, Matheus Damasceno, and Hagai Aronowitz. Speech emotion recognition using self-supervised features. InICASSP 2022- 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6922–6926. IEEE,

work page 2022
[33]

Towards uni- versal fake image detectors that generalize across genera- tive models

Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards uni- versal fake image detectors that generalize across genera- tive models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24480– 24489, 2023. 3

work page 2023
[34]

Avff: Audio-visual feature fusion for video deepfake detection

Trevine Oorloff, Surya Koppisetti, Nicol `o Bonettini, Di- vyaraj Solanki, Ben Colman, Yaser Yacoob, Ali Shahriyari, and Gaurav Bharaj. Avff: Audio-visual feature fusion for video deepfake detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27102–27112, 2024. 1, 3, 5, 6

work page 2024
[35]

Emo- tion recognition from speech using wav2vec 2.0 embeddings

Leonardo Pepino, Pablo Riera, and Luciana Ferrer. Emo- tion recognition from speech using wav2vec 2.0 embeddings. arXiv preprint arXiv:2104.03502, 2021. 2

work page arXiv 2021
[36]

Mul- timodaltrace: Deepfake detection using audiovisual repre- sentation learning

Muhammad Anas Raza and Khalid Mahmood Malik. Mul- timodaltrace: Deepfake detection using audiovisual repre- sentation learning. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 993–1000, 2023. 1, 3

work page 2023
[37]

Faceforen- sics++: Learning to detect manipulated facial images

Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Chris- tian Riess, Justus Thies, and Matthias Nießner. Faceforen- sics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1–11, 2019. 3

work page 2019
[38]

Mae- dfer: Efficient masked autoencoder for self-supervised dy- namic facial expression recognition

Licai Sun, Zheng Lian, Bin Liu, and Jianhua Tao. Mae- dfer: Efficient masked autoencoder for self-supervised dy- namic facial expression recognition. InProceedings of the 31st ACM International Conference on Multimedia, pages 6110–6121, 2023. 2

work page 2023
[39]

Unsupervised multimodal deepfake detection using intra-and cross-modal inconsistencies.arXiv preprint arXiv:2311.17088, 2023

Mulin Tian, Mahyar Khayatkhoei, Joe Mathai, and Wael Ab- dAlmageed. Unsupervised multimodal deepfake detection using intra-and cross-modal inconsistencies.arXiv preprint arXiv:2311.17088, 2023. 3

work page arXiv 2023
[40]

A fine-tuned wav2vec 2.0/hubert benchmark for speech emotion recognition, speaker verification and spoken language understanding,

Yingzhi Wang, Abdelmoumene Boumadane, and Abdelwa- hab Heba. A fine-tuned wav2vec 2.0/hubert benchmark for speech emotion recognition, speaker verification and spoken language understanding.arXiv preprint arXiv:2111.02735,

work page arXiv
[41]

Ferv39k: A large-scale multi-scene dataset for fa- cial expression recognition in videos

Yan Wang, Yixuan Sun, Yiwen Huang, Zhongying Liu, Shuyong Gao, Wei Zhang, Weifeng Ge, and Wenqiang Zhang. Ferv39k: A large-scale multi-scene dataset for fa- cial expression recognition in videos. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20922–20931, 2022. 2

work page 2022
[42]

Deepfakes audio de- tection leveraging audio spectrogram and convolutional neu- ral networks

Taiba Majid Wani and Irene Amerini. Deepfakes audio de- tection leveraging audio spectrogram and convolutional neu- ral networks. InInternational Conference on Image Analysis and Processing, pages 156–167. Springer, 2023. 3

work page 2023
[43]

Abc- capsnet: Attention based cascaded capsule network for audio deepfake detection

Taiba Majid Wani, Reeva Gulzar, and Irene Amerini. Abc- capsnet: Attention based cascaded capsule network for audio deepfake detection. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 2464–2472, 2024. 3

work page 2024
[44]

Deepfake video detection using convolutional vision transformer.arXiv preprint arXiv:2102.11126, 2021

Deressa Wodajo and Solomon Atnafu. Deepfake video detection using convolutional vision transformer.arXiv preprint arXiv:2102.11126, 2021. 3

work page arXiv 2021
[45]

Trans- fer: Learning relation-aware facial expression representa- tions with transformers

Fanglei Xue, Qiangchang Wang, and Guodong Guo. Trans- fer: Learning relation-aware facial expression representa- tions with transformers. InProceedings of the IEEE/CVF 10 International conference on computer vision, pages 3601– 3610, 2021. 2

work page 2021
[46]

Avoid-df: Audio-visual joint learning for detecting deepfake

Wenyuan Yang, Xiaoyu Zhou, Zhikai Chen, Bofei Guo, Zhongjie Ba, Zhihua Xia, Xiaochun Cao, and Kui Ren. Avoid-df: Audio-visual joint learning for detecting deepfake. IEEE Transactions on Information Forensics and Security, 18:2015–2029, 2023. 1, 3

work page 2015
[47]

Exposing deep fakes using inconsistent head poses

Xin Yang, Yuezun Li, and Siwei Lyu. Exposing deep fakes using inconsistent head poses. InICASSP 2019-2019 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 8261–8265. IEEE, 2019. 3

work page 2019
[48]

Spatio-temporal convolutional features with nested lstm for facial expression recognition.Neurocomputing, 317: 50–57, 2018

Zhenbo Yu, Guangcan Liu, Qingshan Liu, and Jiankang Deng. Spatio-temporal convolutional features with nested lstm for facial expression recognition.Neurocomputing, 317: 50–57, 2018. 2

work page 2018
[49]

Memory fusion network for multi-view sequential learning

Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Memory fusion network for multi-view sequential learning. InPro- ceedings of the AAAI conference on artificial intelligence,

work page
[50]

Identity–expression dual branch network for facial expres- sion recognition.IEEE transactions on cognitive and devel- opmental systems, 13(4):898–911, 2020

Haifeng Zhang, Wen Su, Jun Yu, and Zengfu Wang. Identity–expression dual branch network for facial expres- sion recognition.IEEE transactions on cognitive and devel- opmental systems, 13(4):898–911, 2020. 2

work page 2020
[51]

Ce Zheng, Matias Mendieta, and Chen Chen. Poster: A pyramid cross-fusion transformer network for facial expression recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3146–3155, 2023.
[52]

Yinglin Zheng, Jianmin Bao, Dong Chen, Ming Zeng, and Fang Wen. Exploring temporal coherence for more general video face forgery detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 15044–15054, 2021.
[53]

Yipin Zhou and Ser-Nam Lim. Joint audio-visual deepfake detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 14800–14809, 2021.

Supplementary Material

We structure the supplementary materials as follows: first, S...
Details on Dataset

In this section, we provide some details on how we split and utilise the FakeAVCeleb [18] and DeepSpeak v2 [2] datasets for benchmarking our proposed framework Emo-Boost. We create a validation split for both datasets, which is used for learning rate scheduling and early stopping. Furthermore, we also validate our design choices for...
Further Results

8.1. Leave-one-out Evaluation

As described in Section 5.2, we observe motivating results from our proposed framework in the leave-one-out evaluation setup on FakeAVCeleb. We present the detailed performance and its comparison with other state-of-the-art multimodal deepfake detectors in Table 6. Emo-Boosted SIMBA achieves the highest ...
