arxiv: 2604.28022 · v1 · submitted 2026-04-30 · 💻 cs.CV

Are DeepFakes Realistic Enough? Exploring Semantic Mismatch as a Novel Challenge

Pith reviewed 2026-05-07 07:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords DeepFake detection · semantic mismatch · audio-visual forgery · RARV-SMM · ImageBind embeddings · multimodal DeepFakes · FakeAVCeleb · LAV-DF

The pith

DeepFake detectors often rely on data source integrity rather than semantic consistency between audio and video, but adding a mismatch class with ImageBind embeddings improves performance on FakeAVCeleb and LAV-DF.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current detection methods treat DeepFakes as binary or four-class problems based on which modality is manipulated, yet they can miss cases where both audio and video come from real sources but carry inconsistent meanings. The paper introduces the RARV-SMM class to explicitly model real audio-real video pairs with semantic mismatches, revealing that state-of-the-art models often fail because they check file origins instead of content coherence. Three variants of this class expose different model weaknesses as the degree of audio-visual divergence grows. A semantic reinforcement strategy is proposed that trains on the mismatch class while using ImageBind embeddings to strengthen semantic signals, yielding better results in both the new setting and existing benchmarks. This matters for building detectors that can handle realistic manipulations where authentic sources are combined inconsistently.

Core claim

By extending four-class audio-visual DeepFake formulations with the Real Audio-Real Video with Semantic Mismatch (RARV-SMM) class and its variants, the work shows that existing models have limitations when semantic inconsistency appears without source forgery. A semantic reinforcement strategy that incorporates the RARV-SMM class together with ImageBind embeddings is introduced, which improves detection accuracy in the proposed RARV-SMM setting and in state-of-the-art evaluation settings on the FakeAVCeleb and LAV-DF datasets.
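The step from four classes to five can be made concrete with a small labeling sketch. This is only an illustration under stated assumptions: the abbreviations `RARV`, `RAFV`, `FARV`, and `FAFV` and the function names are hypothetical, formed by analogy with the paper's RARV-SMM name, and the boolean flags stand in for ground-truth annotations.

```python
from enum import Enum


class AVClass(Enum):
    # Four source-integrity classes (abbreviations are assumptions, by analogy with RARV-SMM).
    RARV = "real audio, real video"
    RAFV = "real audio, fake video"
    FARV = "fake audio, real video"
    FAFV = "fake audio, fake video"
    # Fifth class named in the paper: both sources authentic, content inconsistent.
    RARV_SMM = "real audio, real video, semantic mismatch"


def label_pair(audio_is_real: bool, video_is_real: bool,
               semantically_consistent: bool = True) -> AVClass:
    """Map ground-truth flags to the five-class scheme."""
    if audio_is_real and video_is_real:
        return AVClass.RARV if semantically_consistent else AVClass.RARV_SMM
    if audio_is_real:
        return AVClass.RAFV
    if video_is_real:
        return AVClass.FARV
    return AVClass.FAFV


def source_only_label(audio_is_real: bool, video_is_real: bool) -> AVClass:
    """A labeler that looks only at source integrity, ignoring content."""
    return label_pair(audio_is_real, video_is_real, semantically_consistent=True)
```

A labeler that inspects only source integrity assigns a mismatched real-real pair to `RARV`, which is the failure mode the fifth class is meant to expose.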

What carries the argument

The RARV-SMM class (real audio-real video pairs with explicit semantic mismatch) combined with a semantic reinforcement strategy that integrates the mismatch class and ImageBind embeddings to guide detection toward content consistency rather than source integrity.

Load-bearing premise

The RARV-SMM class and its variants accurately represent the kinds of semantic inconsistencies that occur in real-world DeepFakes, and ImageBind embeddings supply a sufficient, unbiased signal for detecting those mismatches.

What would settle it

A model trained with the proposed semantic reinforcement strategy shows no accuracy gain over baselines when tested on an independently collected set of real audio-video pairs that contain deliberate semantic mismatches not present in FakeAVCeleb or LAV-DF.

Figures

Figures reproduced from arXiv: 2604.28022 by Hugo Proença, Joana C. Costa, Kailash A. Hambarde, Sharayu Nilesh Deshmukh, Tiago Roxo.

Figure 1: Illustration of the proposed threat: a four-class DeepFake…
Figure 2: Overview of the proposed five-class audio-visual DeepFake detection framework. Stage 1 constructs RARV-SMM samples from VoxCeleb2 across three variants (V1: same identity, different context; V2: different identity, same gender; V3: different identity, different gender) and integrates them with FakeAVCeleb to form a unified five-class dataset. Stage 2 defines three experimental settings: multi-class perform…
Figure 3: Illustration of the proposed RARV-SMM variants.
Figure 4: Per-class F1 comparison between Multi-Class and Real…
Figure 5: Per-class F1 profiles across V1, V2 and V3 for FGMDF, FGI, and AVDF. Each axis represents one class; a larger polygon…
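The three variant definitions in the Figure 2 caption can be read as pairing rules over clip metadata. The sketch below is a hypothetical illustration, not the authors' generation procedure: the `Clip` fields, the use of a `context` identifier to stand for "different context", and the random choice among valid partners are all assumptions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    clip_id: str
    identity: str   # speaker identity (assumed to be available as metadata)
    gender: str     # assumed metadata field
    context: str    # assumed proxy for recording context, e.g. a source-video id


def is_valid_pair(video: Clip, audio: Clip, variant: str) -> bool:
    """Check whether pairing this video with this audio matches a variant rule."""
    if video.clip_id == audio.clip_id:
        return False  # the same clip is a matched pair, not a mismatch
    same_identity = video.identity == audio.identity
    same_gender = video.gender == audio.gender
    if variant == "V1":  # same identity, different context
        return same_identity and video.context != audio.context
    if variant == "V2":  # different identity, same gender
        return not same_identity and same_gender
    if variant == "V3":  # different identity, different gender
        return not same_identity and not same_gender
    raise ValueError(f"unknown variant: {variant}")


def build_pairs(clips, variant, seed=0):
    """For each video clip, pick one audio clip that satisfies the variant rule."""
    rng = random.Random(seed)
    pairs = []
    for video in clips:
        candidates = [a for a in clips if is_valid_pair(video, a, variant)]
        if candidates:
            pairs.append((video, rng.choice(candidates)))
    return pairs
```

Clips with no valid partner are skipped, so the number of pairs per variant can differ; a real pipeline would need to balance the variants explicitly.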
Original abstract

Current DeepFake detection scenarios are mostly binary, yet data manipulation can vary across audio, video, or both, whose variability is not captured in binary settings. Four-class audio-visual formulations address this by discriminating manipulation type, but introduce an unresolved problem: models may rely solely on data source integrity to detect DeepFakes without evaluating their semantic consistency. If the DeepFake origin is not in the data source but in its content, can semantic mismatch be assessed by the state-of-the-art? This paper proposes a new evaluation setup, extending the four-class formulation by explicitly modeling semantic-level inconsistency between authentic modalities with the introduction of a new class: Real Audio-Real Video with Semantic Mismatch (RARV-SMM). We assess the robustness of state-of-the-art models in this new realistic DeepFake setting, using the FakeAVCeleb dataset, highlighting the limitations of existing approaches when faced with semantic mismatch data. We further introduce three RARV-SMM variants that expose distinct architectural vulnerabilities as audio-visual divergence increases. We also propose a semantic reinforcement strategy that incorporates the semantic mismatch class and ImageBind embeddings to improve DeepFake detection in both our proposed and state-of-the-art settings, on FakeAVCeleb and LAV-DF, paving the way to more realistic DeepFake detectors. The source code and data are available at https://github.com/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript proposes extending the four-class audio-visual DeepFake detection formulation by introducing a new class, Real Audio-Real Video with Semantic Mismatch (RARV-SMM), to explicitly test semantic-level inconsistencies between authentic modalities. Using the FakeAVCeleb dataset, it evaluates state-of-the-art models and highlights their limitations in this setting. Three RARV-SMM variants with increasing audio-visual divergence are defined, and a semantic reinforcement strategy is introduced that incorporates the new class along with ImageBind embeddings; this strategy is claimed to improve detection performance in both the proposed RARV-SMM setting and standard four-class setups on FakeAVCeleb and LAV-DF. Code and data are released.

Significance. If the empirical claims hold after addressing construction details, the work identifies a meaningful gap in current DeepFake detectors—reliance on source integrity rather than semantic coherence—and offers a concrete path toward more realistic evaluation and training via the RARV-SMM class and ImageBind-based reinforcement. The public release of code and data is a positive factor for reproducibility. The significance is tempered by the need to confirm that RARV-SMM instances exercise genuine semantic mismatch rather than low-level cross-source cues.

major comments (3)
  1. [§3] §3 (RARV-SMM Construction): The central claim that RARV-SMM tests semantic inconsistency (rather than data-source integrity) depends on how the class is instantiated. The manuscript must explicitly state whether RARV-SMM pairs are formed from the same original recording or by cross-video pairing of unrelated real audio and real video clips. If the latter, models (including those using ImageBind) can exploit speaker identity, scene statistics, or acoustic background cues; this would make the reported gains non-diagnostic for the subtler semantic drift that occurs inside a single coherent DeepFake. Provide concrete generation procedure, statistics on divergence, and example pairs.
  2. [§5] §5 (Experimental Results): The abstract asserts that the semantic reinforcement strategy yields performance gains on FakeAVCeleb and LAV-DF, yet the provided abstract contains no numerical results, ablation tables, or error analysis. The full manuscript must include (i) accuracy/AUC numbers for baseline SOTA models versus the reinforced model in both the four-class and RARV-SMM settings, (ii) ablations isolating the contribution of the RARV-SMM class versus ImageBind embeddings, and (iii) per-variant breakdown showing that degradation scales with the intended semantic divergence rather than other distribution shifts.
  3. [§4] §4 (RARV-SMM Variants): The three variants are said to expose distinct architectural vulnerabilities “as audio-visual divergence increases.” Define the divergence metric (e.g., ImageBind cosine distance, manual semantic annotation, or other), report its distribution across variants, and demonstrate that performance drops are statistically significant and attributable to semantic mismatch rather than confounding factors such as class imbalance or low-level feature leakage.
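One of the candidate metrics the referee names, cosine distance between audio and video embeddings, can be summarized per variant as sketched below. This is a minimal illustration that assumes the embeddings have already been computed and are passed in as plain lists of floats; the function names and the input layout are hypothetical, and no ImageBind call is made here.

```python
import math
from statistics import mean, stdev


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def divergence_summary(pairs_by_variant):
    """Summarize audio-video distances for each variant.

    pairs_by_variant maps a variant name to a list of
    (audio_embedding, video_embedding) tuples (assumed precomputed).
    """
    summary = {}
    for variant, pairs in pairs_by_variant.items():
        distances = [cosine_distance(a, v) for a, v in pairs]
        summary[variant] = {
            "mean": mean(distances),
            "std": stdev(distances) if len(distances) > 1 else 0.0,
            "min": min(distances),
            "max": max(distances),
        }
    return summary
```

Reporting these statistics side by side for V1, V2, and V3 would show whether the variants are in fact ordered by divergence under this metric, which is the premise the comment asks the authors to verify.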
minor comments (3)
  1. [Abstract] Abstract: The phrase “paving the way to more realistic DeepFake detectors” is vague; replace with a concise statement of the observed accuracy or AUC improvement on the two datasets.
  2. [§2] Notation: Ensure consistent use of “RARV-SMM” versus “semantic mismatch class” throughout; define the four-class baseline explicitly before introducing the fifth class.
  3. [§5] Figure clarity: Any t-SNE or embedding visualizations of ImageBind features on RARV-SMM versus standard classes should include axis labels, legend, and quantitative separation metrics.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. The feedback highlights important aspects of clarity and rigor that we will address in the revision. We respond to each major comment below and commit to the necessary changes.

  1. Referee: [§3] §3 (RARV-SMM Construction): The central claim that RARV-SMM tests semantic inconsistency (rather than data-source integrity) depends on how the class is instantiated. The manuscript must explicitly state whether RARV-SMM pairs are formed from the same original recording or by cross-video pairing of unrelated real audio and real video clips. If the latter, models (including those using ImageBind) can exploit speaker identity, scene statistics, or acoustic background cues; this would make the reported gains non-diagnostic for the subtler semantic drift that occurs inside a single coherent DeepFake. Provide concrete generation procedure, statistics on divergence, and example pairs.

    Authors: We appreciate this critical observation. RARV-SMM pairs are formed by cross-video pairing of real audio and real video clips; as the Figure 2 caption states, the samples are constructed from VoxCeleb2 and then integrated with FakeAVCeleb. This is intentional to isolate semantic mismatch between authentic modalities without any manipulation. While cross-pairing can introduce cues such as speaker identity, the semantic reinforcement strategy with ImageBind is designed to prioritize higher-level semantic coherence. In the revision we will explicitly describe the generation procedure in §3 (including selection criteria for mismatch pairs), provide concrete example pairs with semantic descriptions, and report divergence statistics using ImageBind cosine distances. We will also add a discussion distinguishing this setup from intra-video semantic drift while arguing that cross-source mismatch is a valid and realistic test case for semantic coherence in DeepFake detection. revision: yes

  2. Referee: [§5] §5 (Experimental Results): The abstract asserts that the semantic reinforcement strategy yields performance gains on FakeAVCeleb and LAV-DF, yet the provided abstract contains no numerical results, ablation tables, or error analysis. The full manuscript must include (i) accuracy/AUC numbers for baseline SOTA models versus the reinforced model in both the four-class and RARV-SMM settings, (ii) ablations isolating the contribution of the RARV-SMM class versus ImageBind embeddings, and (iii) per-variant breakdown showing that degradation scales with the intended semantic divergence rather than other distribution shifts.

    Authors: We agree that quantitative support must be explicit. In the revised version we will update the abstract with representative accuracy and AUC figures for the baseline SOTA models versus the reinforced model on both FakeAVCeleb and LAV-DF in the four-class and RARV-SMM settings. Section 5 will be expanded with full tables, ablation studies that separately quantify the contribution of the RARV-SMM class and the ImageBind embeddings, and a per-variant breakdown for the three RARV-SMM variants. Error analysis and controls for distribution shifts will be included to substantiate the claimed gains. revision: yes

  3. Referee: [§4] §4 (RARV-SMM Variants): The three variants are said to expose distinct architectural vulnerabilities “as audio-visual divergence increases.” Define the divergence metric (e.g., ImageBind cosine distance, manual semantic annotation, or other), report its distribution across variants, and demonstrate that performance drops are statistically significant and attributable to semantic mismatch rather than confounding factors such as class imbalance or low-level feature leakage.

    Authors: We will revise §4 to define the divergence metric explicitly as the cosine distance in ImageBind embedding space between audio and video features, computed for each of the three variants described in the Figure 2 caption (V1: same identity, different context; V2: different identity, same gender; V3: different identity, different gender). We will report the distribution (means, standard deviations, ranges) across variants and include supporting visualizations. To attribute performance drops to semantic mismatch, we will add balanced-sampling controls for class imbalance, compare models with and without the semantic reinforcement component to assess low-level leakage, and include statistical significance tests (e.g., paired t-tests across multiple runs) demonstrating that degradation scales with the defined divergence metric. revision: yes
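The paired t-test mentioned in the last response compares two models run under matched conditions, for example the same seeds or data splits. The sketch below computes only the test statistic and degrees of freedom with the standard library; the function name and the idea of pairing by run index are assumptions, and turning the statistic into a p-value would need a t-distribution table or a statistics package.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for two models scored on the same runs.

    scores_a[i] and scores_b[i] must come from the same run (seed or split).
    Returns (t, degrees_of_freedom).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models need one score per run")
    n = len(scores_a)
    if n < 2:
        raise ValueError("at least two paired runs are required")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the per-run differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

Pairing by run removes run-to-run variance that both models share, so fewer repetitions are needed than with an unpaired comparison.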

Circularity Check

0 steps flagged

No circularity: empirical evaluation on external benchmarks with no self-referential derivations

The paper introduces the RARV-SMM class and three variants as an extension of four-class audio-visual DeepFake detection, then reports empirical accuracy gains from a semantic reinforcement strategy that adds the new class plus ImageBind embeddings. All reported results use the external FakeAVCeleb and LAV-DF datasets and the pre-existing ImageBind model; no equations, fitted parameters, or first-principles derivations are shown that reduce by construction to the paper's own inputs. No load-bearing self-citations appear, and the central claim remains an empirical observation rather than a tautological renaming or self-defined prediction. The construction of RARV-SMM instances is a data-generation choice whose realism is an external validity question, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that semantic mismatch constitutes a distinct and detectable phenomenon separable from source authenticity, and that ImageBind embeddings capture the relevant cross-modal semantics. No free parameters or invented physical entities are described in the abstract; the new class is a labeling convention rather than a postulated entity.

axioms (1)
  • domain assumption: ImageBind embeddings can effectively represent and detect semantic mismatches between audio and video modalities
    The reinforcement strategy depends on this capability of ImageBind to improve detection.

pith-pipeline@v0.9.0 · 5563 in / 1283 out tokens · 128607 ms · 2026-05-07T07:44:24.958821+00:00 · methodology

