pith. machine review for the scientific record.

arxiv: 2604.07763 · v1 · submitted 2026-04-09 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

Beyond Surface Artifacts: Capturing Shared Latent Forgery Knowledge Across Modalities

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:02 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI
keywords modality-agnostic forgery detection · multimodal deepfakes · universal forgery traces · dark modalities · generalization · deepfake forensics · style decoupling · MAF framework
0 comments

The pith

Decoupling modality-specific styles from content isolates shared latent forgery traces that generalize to unseen media types.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that forgery signals contain a universal component independent of the physical or stylistic differences between modalities such as images, video, and audio. Current detectors fail on new formats because they latch onto superficial artifacts tied to one modality; the proposed approach separates those style elements to keep only the cross-modal forgery core. A reader would care if this holds because it removes the need to retrain or collect data for every new manipulation technique or media format that appears. The work also supplies a benchmark that measures two levels of generalization to test whether the separation actually works on completely isolated signals.

Core claim

The paper claims that the modality-agnostic forgery detection framework extracts essential cross-modal latent forgery knowledge once modality-specific styles are explicitly decoupled, thereby proving the existence of universal forgery traces and delivering substantial gains on unknown modalities.

What carries the argument

The modality-agnostic forgery (MAF) detection framework, which decouples modality-specific styles to isolate cross-modal latent forgery knowledge.
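
To make the decoupling step concrete, here is a minimal sketch of one common way such a separation is implemented: a shared encoder feeds a forgery classifier while a gradient-reversal branch penalizes any modality information left in the features. The module names, dimensions, and loss weights are illustrative assumptions, not the paper's reported architecture; the abstract only says styles are "explicitly decoupled" without fixing this exact form.

    # Illustrative sketch only (PyTorch), assuming pre-extracted per-modality
    # embeddings of size feat_dim; this is NOT the paper's actual MAF code.
    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        """Identity on the forward pass; negates (and scales) gradients on backward."""
        @staticmethod
        def forward(ctx, x, lam):
            ctx.lam = lam
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_out):
            return -ctx.lam * grad_out, None

    class MAFSketch(nn.Module):
        def __init__(self, feat_dim=512, n_modalities=3, lam=1.0):
            super().__init__()
            self.lam = lam
            self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
            self.forgery_head = nn.Linear(256, 2)               # real vs. fake
            self.modality_head = nn.Linear(256, n_modalities)   # style adversary

        def forward(self, x):
            z = self.encoder(x)                                 # shared latent features
            forgery_logits = self.forgery_head(z)
            modality_logits = self.modality_head(GradReverse.apply(z, self.lam))
            return forgery_logits, modality_logits

Minimizing cross-entropy on both heads, with the reversal in place, trains the encoder to predict forgery while hiding which modality the input came from; that is one standard way to operationalize "decoupling modality-specific styles" so that only cross-modal forgery cues survive.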

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The decoupling step could be reused in other cross-modal tasks where superficial differences currently block knowledge transfer.
  • Real-world deployment would require checking whether the extracted traces remain stable when forgeries combine multiple modalities at once.
  • If the universal traces prove stable, security systems might shift from per-modality models to a single shared detector.

Load-bearing premise

Shared latent forgery knowledge exists independently of modality-specific appearances and can be reliably decoupled and applied to entirely new modalities without any training examples from them.

What would settle it

Train MAF on known modalities, then measure detection accuracy on a completely isolated dark modality from DeepModal-Bench; accuracy no better than random guessing or prior methods would falsify the claim.
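
Read as a procedure, that test is a leave-one-modality-out evaluation. The sketch below is one hypothetical way to run it; the function names, data layout, and use of AUC are assumptions for illustration, since DeepModal-Bench's actual interface and metrics are not given here.

    # Hypothetical leave-one-modality-out check; not DeepModal-Bench's real API.
    from sklearn.metrics import roc_auc_score

    def strong_maf_check(train_fn, score_fn, data_by_modality, dark="audio"):
        """Train on every modality except `dark`, then score the held-out one zero-shot."""
        known = {m: d for m, d in data_by_modality.items() if m != dark}
        model = train_fn(known)                    # e.g. fit MAF on video + image
        feats, labels = data_by_modality[dark]     # never seen during training
        auc = roc_auc_score(labels, score_fn(model, feats))
        # AUC near 0.5 (chance) on the dark modality would count against the
        # universal-trace claim; AUC clearly above prior methods would support it.
        return auc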

Figures

Figures reproduced from arXiv: 2604.07763 by Chuancheng Shi, Fei Shen, Jian Wang, Jingtong Dou, Tat-Seng Chua, Zhiyong Wang.

Figure 1: From modality-binding to modality-agnostic forgery (MAF). Unlike existing tasks that overfit to specific physical representations seen during training, our proposed MAF task focuses on extracting shared latent forgery knowledge. This modality-agnostic feature learning enables robust source-free domain generalization against unknown modalities (e.g., unseen audio), systematically tackling both Weak and Stro…
Figure 2: Architecture of the MAF framework. Following training on known modalities, the universal detector…
Figure 3: Comparison of feature distributions between semantic and forensics spaces.
Figure 4: Performance evaluation under the Weak MAF setting across three pre-trained perceptors (ImageBind, LanguageBind, and UniBind). Each bar reports…
Figure 5: Averaged performance comparison for differ…
Figure 7: Performance evaluation under Strong MAF across three pre-trained perceptors (ImageBind, LanguageBind, and UniBind). The bar chart illustrates the…
Figure 8: Averaged performance comparison for different…
Figure 10: Cross-modal distribution consistency analysis. The heatmaps…
Figure 11: Cross-modal neuron co-activation. Visualizing the top-64 activated…
read the original abstract

As generative artificial intelligence evolves, deepfake attacks have escalated from single-modality manipulations to complex, multimodal threats. Existing forensic techniques face a severe generalization bottleneck: by relying excessively on superficial, modality-specific artifacts, they neglect the shared latent forgery knowledge hidden beneath variable physical appearances. Consequently, these models suffer catastrophic performance degradation when confronted with unseen "dark modalities." To break this limitation, this paper introduces a paradigm shift that redefines multimodal forensics from conventional "feature fusion" to "modality generalization." We propose the first modality-agnostic forgery (MAF) detection framework. By explicitly decoupling modality-specific styles, MAF precisely extracts the essential, cross-modal latent forgery knowledge. Furthermore, we define two progressive dimensions to quantify model generalization: transferability toward semantically correlated modalities (Weak MAF), and robustness against completely isolated signals of "dark modality" (Strong MAF). To rigorously assess these generalization limits, we introduce the DeepModal-Bench benchmark, which integrates diverse multimodal forgery detection algorithms and adapts state-of-the-art generalized learning methods. This study not only empirically proves the existence of universal forgery traces but also achieves significant performance breakthroughs on unknown modalities via the MAF framework, offering a pioneering technical pathway for universal multimodal defense.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript introduces the Modality-Agnostic Forgery (MAF) detection framework as a shift from feature fusion to modality generalization in multimodal deepfake forensics. By decoupling modality-specific styles, MAF is claimed to extract shared latent forgery knowledge; the work defines Weak MAF (transfer to correlated modalities) and Strong MAF (robustness to isolated 'dark modalities'), introduces the DeepModal-Bench benchmark, and asserts that it empirically proves the existence of universal forgery traces while delivering significant performance gains on unknown modalities.

Significance. If the empirical claims and generalization results hold after proper validation, the work would offer a meaningful advance by addressing the generalization bottleneck in current forensic methods and providing a new benchmark for cross-modal robustness. The explicit framing of Weak versus Strong MAF dimensions could help standardize evaluation of modality-agnostic claims in the field.

major comments (3)
  1. [Abstract] The central claim that the study 'empirically proves the existence of universal forgery traces' and 'achieves significant performance breakthroughs' is unsupported by any description of methods, datasets, quantitative metrics, error bars, ablations, or baseline comparisons. This absence is load-bearing because the paper's contribution rests entirely on these unshown results.
  2. [Abstract] The decoupling step is asserted to 'precisely extract the essential, cross-modal latent forgery knowledge' without any stated mechanism, loss function, or validation that residual modality cues or dataset-specific correlations have been removed. This directly affects the Strong MAF claim of generalization to completely isolated signals with zero training examples from dark modalities.
  3. [Abstract] No experimental protocol is given for the DeepModal-Bench benchmark (e.g., how modalities are partitioned into 'known' vs. 'unknown', what generative pipelines are used, or how 'semantically correlated' vs. 'completely isolated' is operationalized), preventing assessment of whether reported gains reflect universal traces or implicit statistical links across training modalities.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, clarifying the content of the full paper and indicating revisions where the abstract can be strengthened for clarity.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the study 'empirically proves the existence of universal forgery traces' and 'achieves significant performance breakthroughs' is unsupported by any description of methods, datasets, quantitative metrics, error bars, ablations, or baseline comparisons. This absence is load-bearing because the paper's contribution rests entirely on these unshown results.

    Authors: The abstract serves as a high-level summary of the contributions. The full manuscript details the MAF framework in Section 3, the DeepModal-Bench construction in Section 4, and all quantitative results—including metrics, baselines, ablations, error bars, and comparisons across known/unknown modalities—in Section 5. We agree the abstract would benefit from a brief reference to key performance gains and will revise it accordingly to better anchor the claims. revision: yes

  2. Referee: [Abstract] The decoupling step is asserted to 'precisely extract the essential, cross-modal latent forgery knowledge' without any stated mechanism, loss function, or validation that residual modality cues or dataset-specific correlations have been removed. This directly affects the Strong MAF claim of generalization to completely isolated signals with zero training examples from dark modalities.

    Authors: Section 3.2 describes the decoupling mechanism in detail: a style encoder combined with an adversarial objective and contrastive loss to isolate shared forgery representations while suppressing modality-specific cues (a schematic form of this kind of combined objective is sketched after these responses). Section 5.2 includes ablations confirming reduced residual correlations. Strong MAF evaluation uses zero-shot testing on isolated modalities, as defined in Section 4. We will add a concise clause to the abstract summarizing the adversarial decoupling approach. revision: partial

  3. Referee: [Abstract] No experimental protocol is given for the DeepModal-Bench benchmark (e.g., how modalities are partitioned into 'known' vs. 'unknown', what generative pipelines are used, or how 'semantically correlated' vs. 'completely isolated' is operationalized), preventing assessment of whether reported gains reflect universal traces or implicit statistical links across training modalities.

    Authors: Section 4 fully specifies the DeepModal-Bench protocol, including modality partitioning rules, the generative pipelines (specific GAN and diffusion models per modality), and operational definitions distinguishing Weak MAF (semantically correlated modalities) from Strong MAF (completely isolated signals with no training overlap). We will revise the abstract to include a short outline of this benchmark setup and evaluation criteria. revision: yes
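
The combined objective mentioned in response 2 is not spelled out in the material above. The following is a hedged sketch of what an adversarial-plus-contrastive decoupling loss typically looks like; the weights, the InfoNCE form, and the variable names are assumptions, not the paper's stated losses.

    # Hypothetical composite loss: forgery classification + adversarial modality
    # suppression + cross-modal contrastive alignment. Weights and form are assumed.
    import torch
    import torch.nn.functional as F

    def decoupling_loss(forgery_logits, labels, modality_logits, modality_ids,
                        z_a, z_b, w_adv=0.1, w_con=0.1, tau=0.07):
        l_cls = F.cross_entropy(forgery_logits, labels)      # real vs. fake
        # With a gradient-reversal layer upstream, minimizing this term drives the
        # encoder to erase modality identity from the shared features.
        l_adv = F.cross_entropy(modality_logits, modality_ids)
        # Simple InfoNCE: paired embeddings of the same sample from two modalities
        # should match each other rather than other items in the batch.
        z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
        sim = z_a @ z_b.t() / tau
        targets = torch.arange(z_a.size(0), device=z_a.device)
        l_con = F.cross_entropy(sim, targets)
        return l_cls + w_adv * l_adv + w_con * l_con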

Circularity Check

0 steps flagged

No circularity detected in empirical framework and benchmark proposal

full rationale

The provided abstract and description contain no equations, derivations, or mathematical steps that reduce to inputs by construction. The core claims rest on an empirical MAF framework that decouples styles and a newly introduced DeepModal-Bench for testing generalization; these are presented as experimental results rather than a closed derivation chain. No self-citations, fitted parameters renamed as predictions, or ansatzes are visible that would create load-bearing circularity. The 'proof' of universal traces is explicitly empirical performance on the benchmark, which remains independent of any self-referential definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; claims rest on high-level assumptions about latent knowledge that are not formalized here.

pith-pipeline@v0.9.0 · 5533 in / 1146 out tokens · 65942 ms · 2026-05-10T18:02:02.981114+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

69 extracted references · 11 canonical work pages · 5 internal anchors

  1. [1]

    Invariance principle meets information bottleneck for out-of-distribution generalization. Advances in Neural Information Processing Systems, 34:3438–3450, 2021

    Kartik Ahuja, Ethan Caballero, Dinghuai Zhang, Jean-Christophe Gagnon-Audet, Yoshua Bengio, Ioannis Mitliagkas, and Irina Rish. Invariance principle meets information bottleneck for out-of-distribution generalization. Advances in Neural Information Processing Systems, 34:3438–3450, 2021

  2. [2]

    Invariance principle meets information bottleneck for out-of-distribution generalization, 2022

    Kartik Ahuja, Ethan Caballero, Dinghuai Zhang, Jean-Christophe Gagnon-Audet, Yoshua Bengio, Ioannis Mitliagkas, and Irina Rish. Invariance principle meets information bottleneck for out-of-distribution generalization, 2022

  3. [3]

    Invariant Risk Minimization

    Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019

  4. [4]

    Invariant risk minimization, 2020

    Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization, 2020

  5. [5]

    A theory of learning from different domains. Machine learning, 79(1):151–175, 2010

    Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1):151–175, 2010

  6. [6]

    Av-deepfake1m++: A large-scale audio-visual deepfake benchmark with real-world perturbations, 2025

    Zhixi Cai, Kartik Kuckreja, Shreya Ghosh, Akanksha Chuchra, Muhammad Haris Khan, Usman Tariq, Tom Gedeon, and Abhinav Dhall. Av-deepfake1m++: A large-scale audio-visual deepfake benchmark with real-world perturbations, 2025

  7. [7]

    Do you really mean that? content driven audio-visual deepfake dataset and multimodal method for temporal forgery localization, 2023

    Zhixi Cai, Kalin Stefanov, Abhinav Dhall, and Munawar Hayat. Do you really mean that? content driven audio-visual deepfake dataset and multimodal method for temporal forgery localization, 2023

  8. [8]

    Demamba: Ai-generated video detection on million-scale genvideo benchmark, 2024

    Haoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang, and Huaxiong Li. Demamba: Ai-generated video detection on million-scale genvideo benchmark, 2024

  9. [9]

    Simmmdg: A simple and effective framework for multi-modal domain generalization. Advances in Neural Information Processing Systems, 36:78674–78695, 2023

    Hao Dong, Ismail Nejjar, Han Sun, Eleni Chatzi, and Olga Fink. Simmmdg: A simple and effective framework for multi-modal domain generalization. Advances in Neural Information Processing Systems, 36:78674–78695, 2023

  10. [10]

    An image is worth 16x16 words: Transformers for image recognition at scale, 2021

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021

  11. [11]

    Dna: Uncovering universal latent forgery knowledge, 2026

    Jingtong Dou, Chuancheng Shi, Yemin Wang, Shiming Guo, Anqi Yi, Wenhua Wu, Li Zhang, Fei Shen, and Tat-Seng Chua. Dna: Uncovering universal latent forgery knowledge, 2026

  12. [12]

    Inference-time dynamic modality selection for incomplete multimodal classification

    Siyi Du, Xinzhe Luo, Declan P O’Regan, and Chen Qin. Inference-time dynamic modality selection for incomplete multimodal classification. arXiv preprint arXiv:2601.22853, 2026

  13. [13]

    Probable domain generalization via quantile risk minimization, 2023

    Cian Eastwood, Alexander Robey, Shashank Singh, Julius von Kügelgen, Hamed Hassani, George J. Pappas, and Bernhard Schölkopf. Probable domain generalization via quantile risk minimization, 2023

  14. [14]

    Domain-adversarial training of neural networks, 2016

    Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks, 2016

  15. [15]

    Imagebind: One embedding space to bind them all

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15180–15190, 2023

  16. [16]

    Rethinking vision-language model in face forensics: Multi-modal interpretable forged face detector, 2025

    Xiao Guo, Xiufeng Song, Yue Zhang, Xiaohong Liu, and Xiaoming Liu. Rethinking vision-language model in face forensics: Multi-modal interpretable forged face detector, 2025

  17. [17]

    Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following

    Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, et al. Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615, 2023

  18. [18]

    Lora: Low-rank adaptation of large language models, 2021

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021

  19. [19]

    Revisiting Multimodal Positional Encoding in Vision-Language Models

    Jie Huang, Xuejing Liu, Sibo Song, Ruibing Hou, Hong Chang, Junyang Lin, and Shuai Bai. Revisiting multimodal positional encoding in vision-language models. arXiv preprint arXiv:2510.23095, 2025

  20. [20]

    A rich knowledge space for scalable deepfake detection

    Inho Jung, Hyeongjun Choi, Binh M. Le, Hohyun Na, and Simon S. Woo. A rich knowledge space for scalable deepfake detection. In The Fourteenth International Conference on Learning Representations, 2026

  21. [21]

    Fakeavceleb: A novel audio-video multimodal deepfake dataset, 2022

    Hasam Khalid, Shahroz Tariq, Minha Kim, and Simon S. Woo. Fakeavceleb: A novel audio-video multimodal deepfake dataset, 2022

  22. [22]

    Pindrop it! audio and visual deepfake countermeasures for robust detection and fine grained-localization, 2025

    Nicholas Klein, Hemlata Tak, James Fullwood, Krishna Regmi, Leonidas Spinoulas, Ganesh Sivaraman, Tianxiang Chen, and Elie Khoury. Pindrop it! audio and visual deepfake countermeasures for robust detection and fine grained-localization, 2025

  23. [23]

    Uniformly distributed feature representations for fair and robust learning

    Kiran Krishnamachari, See-Kiong Ng, and Chuan-Sheng Foo. Uniformly distributed feature representations for fair and robust learning. Transactions on Machine Learning Research, 2024

  24. [24]

    Tell me habibi, is it real or fake?, 2025

    Kartik Kuckreja, Parul Gupta, Injy Hamed, Thamar Solorio, Muhammad Haris Khan, and Abhinav Dhall. Tell me habibi, is it real or fake?, 2025

  25. [25]

    Klassify to verify: Audio-visual deepfake detection using ssl-based audio and handcrafted visual features, 2025

    Ivan Kukanov and Jun Wah Ng. Klassify to verify: Audio-visual deepfake detection using ssl-based audio and handcrafted visual features, 2025

  26. [26]

    Towards a universal synthetic video detector: From face or background manipulations to fully ai-generated content, 2025

    Rohit Kundu, Hao Xiong, Vishal Mohanty, Athula Balachandran, and Amit K. Roy-Chowdhury. Towards a universal synthetic video detector: From face or background manipulations to fully ai-generated content, 2025

  27. [27]

    Clip-powered domain generalization and domain adaptation: A comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

    Jindong Li, Yongguang Li, Yali Fu, Jiahong Liu, Yixin Liu, Menglin Yang, and Irwin King. Clip-powered domain generalization and domain adaptation: A comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

  28. [28]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023

  29. [29]

    Deep domain generalization via conditional invariant adversarial networks

    Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European conference on computer vision (ECCV), pages 624–639, 2018

  30. [30]

    Deep domain generalization via conditional invariant adversarial networks

    Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018

  31. [31]

    Celeb-df: A large-scale challenging dataset for deepfake forensics

    Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-df: A large-scale challenging dataset for deepfake forensics. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3207–3216, 2020

  32. [32]

    Celeb-DF++: A large-scale challenging video deepfake benchmark for generalizable forensics. arXiv preprint arXiv:2507.18015, 2025

    Yuezun Li, Delong Zhu, Xinjie Cui, and Siwei Lyu. Celeb-df++: A large-scale challenging video deepfake benchmark for generalizable forensics. arXiv preprint arXiv:2507.18015, 2025

  33. [33]

    Towards modality generalization: A benchmark and prospective analysis

    Xiaohao Liu, Xiaobo Xia, Zhuo Huang, See-Kiong Ng, and Tat-Seng Chua. Towards modality generalization: A benchmark and prospective analysis. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 12179–12188, 2025

  34. [34]

    Continual multimodal contrastive learning. arXiv preprint arXiv:2503.14963, 2025

    Xiaohao Liu, Xiaobo Xia, See-Kiong Ng, and Tat-Seng Chua. Continual multimodal contrastive learning. arXiv preprint arXiv:2503.14963, 2025

  35. [35]

    Unibind: Llm-augmented unified and balanced representation space to bind them all

    Yuanhuiyi Lyu, Xu Zheng, Jiazhou Zhou, and Lin Wang. Unibind: Llm-augmented unified and balanced representation space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26752–26762, 2024

  36. [36]

    Detecting deepfakes and false ads through analysis of text and social engineering techniques

    Alicja Martinek and Ewelina Bartuzi-Trokielewicz. Detecting deepfakes and false ads through analysis of text and social engineering techniques. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors, Proceedings of the 31st International Conference on Computational Linguistics, pages 8432–8448, ...

  37. [37]

    Reducing domain gap by reducing style bias, 2021

    Hyeonseob Nam, HyunJae Lee, Jongchan Park, Wonjun Yoon, and Donggeun Yoo. Reducing domain gap by reducing style bias, 2021

  38. [38]

    Domain adaptation via transfer component analysis. IEEE transactions on neural networks, 22(2):199–210, 2010

    Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE transactions on neural networks, 22(2):199–210, 2010

  39. [39]

    Balanced multimodal learning via on-the-fly gradient modulation, 2022

    Xiaokang Peng, Yake Wei, Andong Deng, Dong Wang, and Di Hu. Balanced multimodal learning via on-the-fly gradient modulation, 2022

  40. [40]

    DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning

    Chengxuan Qian, Shuo Xing, Shawn Li, Yue Zhao, and Zhengzhong Tu. Decalign: Hierarchical cross-modal alignment for decoupled multimodal representation learning. arXiv preprint arXiv:2503.11892, 2025

  41. [41]

    Scaling up ai-generated image detection via generator-aware prototypes, 2025

    Ziheng Qin, Yuheng Ji, Renshuai Tao, Yuxuan Tian, Yuyang Liu, Yipu Wang, and Xiaolong Zheng. Scaling up ai-generated image detection via generator-aware prototypes, 2025

  42. [42]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  43. [43]

    Exploring the limits of transfer learning with a unified text-to-text transformer, 2023

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023

  44. [44]

    Optimal representations for covariate shift, 2022

    Yangjun Ruan, Yann Dubois, and Chris J. Maddison. Optimal representations for covariate shift, 2022

  45. [45]

    Detecting and grounding multi-modal media manipulation and beyond, 2023

    Rui Shao, Tianxing Wu, Jianlong Wu, Liqiang Nie, and Ziwei Liu. Detecting and grounding multi-modal media manipulation and beyond, 2023

  46. [46]

    How to bridge the gap between modalities: Survey on multimodal large language model

    Shezheng Song, Xiaopeng Li, Shasha Li, Shan Zhao, Jie Yu, Jun Ma, Xiaoguang Mao, Weimin Zhang, and Meng Wang. How to bridge the gap between modalities: Survey on multimodal large language model. IEEE Transactions on Knowledge and Data Engineering, 37(9):5311–5329, 2025

  47. [47]

    On learning multi-modal forgery representation for diffusion generated video detection, 2025

    Xiufeng Song, Xiao Guo, Jiache Zhang, Qirui Li, Lei Bai, Xiaoming Liu, Guangtao Zhai, and Xiaohong Liu. On learning multi-modal forgery representation for diffusion generated video detection, 2025

  48. [48]

    Omnivec2-a novel transformer based network for large scale multimodal and multitask learning

    Siddharth Srivastava and Gaurav Sharma. Omnivec2-a novel transformer based network for large scale multimodal and multitask learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 27412–27424, 2024

  49. [49]

    Erm++: An improved baseline for domain generalization, 2024

    Piotr Teterwak, Kuniaki Saito, Theodoros Tsiligkaridis, Kate Saenko, and Bryan A. Plummer. Erm++: An improved baseline for domain generalization, 2024

  50. [50]

    Principles of risk minimization for learning theory

    Vladimir Vapnik. Principles of risk minimization for learning theory. Advances in neural information processing systems, 4, 1991

  51. [51]

    GLUE: A multi-task benchmark and analysis platform for natural language understanding

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Tal Linzen, Grzegorz Chrupała, and Afra Alishahi, editors, Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, B...

  52. [52]

    Generalizing to unseen domains: A survey on domain generalization. IEEE transactions on knowledge and data engineering, 35(8):8052–8072, 2022

    Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Tao Qin, Wang Lu, Yiqiang Chen, Wenjun Zeng, and Philip S Yu. Generalizing to unseen domains: A survey on domain generalization. IEEE transactions on knowledge and data engineering, 35(8):8052–8072, 2022

  53. [53]

    ONE-PEACE: exploring one general representation model toward unlimited modalities

    Peng Wang, Shijie Wang, Junyang Lin, Shuai Bai, Xiaohuan Zhou, Jingren Zhou, Xinggang Wang, and Chang Zhou. One-peace: Exploring one general representation model toward unlimited modalities. arXiv preprint arXiv:2305.11172, 2023

  54. [54]

    Modality-balanced collaborative distillation for multi-modal domain generalization

    Xiaohan Wang, Zhangtao Cheng, Ting Zhong, Leiting Chen, and Fan Zhou. Modality-balanced collaborative distillation for multi-modal domain generalization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 26535–26543, 2026

  55. [55]

    Asvspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale, 2024

    Xin Wang, Hector Delgado, Hemlata Tak, Jee weon Jung, Hye jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, and Junichi Yamagishi. Asvspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale, 2024

  56. [56]

    Open-vocabulary segmentation with unpaired mask-text supervision. arXiv preprint arXiv:2402.08960, 2024

    Zhaoqing Wang, Xiaobo Xia, Ziye Chen, Xiao He, Yandong Guo, Mingming Gong, and Tongliang Liu. Open-vocabulary segmentation with unpaired mask-text supervision. arXiv preprint arXiv:2402.08960, 2024

  57. [57]

    Indirect alignment and relationships preservation for domain generalization

    Wei Wei, Zixiong Li, Jing Yan, Mingwen Shao, and Lin Li. Indirect alignment and relationships preservation for domain generalization. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pages 2054–2062, 2025

  58. [58]

    Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation, 2025

    Siwei Wen, Junyan Ye, Peilin Feng, Hengrui Kang, Zichen Wen, Yize Chen, Jiang Wu, Wenjun Wu, Conghui He, and Weijia Li. Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation, 2025

  59. [59]

    Mare: Multi-modal alignment and reinforcement for explainable deepfake detection via vision-language models, 2026

    Wenbo Xu, Wei Lu, Xiangyang Luo, and Jiantao Zhou. Mare: Multi-modal alignment and reinforcement for explainable deepfake detection via vision-language models, 2026

  60. [60]

    Improve unsupervised domain adaptation with mixup training, 2020

    Shen Yan, Huan Song, Nanxiang Li, Lincan Zou, and Liu Ren. Improve unsupervised domain adaptation with mixup training, 2020

  61. [61]

    Facilitating multimodal classification via dynamically learning modality gap

    Yang Yang, Fengqiang Wan, Qing-Yuan Jiang, and Yi Xu. Facilitating multimodal classification via dynamically learning modality gap. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 62108–62122. Curran Associates, Inc., 2024

  62. [62]

    Multimodal aligned semantic knowledge for unpaired image-text matching

    Laiguo Yin, Yixin Zhang, Yuqing Sun, and Lizhen Cui. Multimodal aligned semantic knowledge for unpaired image-text matching. In The Fourteenth International Conference on Learning Representations

  63. [63]

    Unlocking the capabilities of large vision-language models for generalizable and explainable deepfake detection, 2025

    Peipeng Yu, Jianwei Fei, Hui Gao, Xuan Feng, Zhihua Xia, and Chip Hong Chang. Unlocking the capabilities of large vision-language models for generalizable and explainable deepfake detection, 2025

  64. [64]

    mixup: Beyond Empirical Risk Minimization

    Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017

  65. [65]

    Deepfakebench-mm: A comprehensive benchmark for multimodal deepfake detection, 2025

    Kangran Zhao, Yupeng Chen, Xiaoyu Zhang, Yize Chen, Weinan Guan, Baicheng Chen, Chengzhe Sun, Soumyya Kanti Datta, Qingshan Liu, Siwei Lyu, and Baoyuan Wu. Deepfakebench-mm: A comprehensive benchmark for multimodal deepfake detection, 2025

  66. [66]

    Deep domain-adversarial image generation for domain generalisation

    Kaiyang Zhou, Yongxin Yang, Timothy Hospedales, and Tao Xiang. Deep domain-adversarial image generation for domain generalisation. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 13025–13032, 2020

  67. [67]

    Joint audio-visual deepfake detection

    Yipin Zhou and Ser-Nam Lim. Joint audio-visual deepfake detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14800–14809, October 2021

  68. [68]

    Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment

    Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, et al. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852, 2023
