Beyond Surface Artifacts: Capturing Shared Latent Forgery Knowledge Across Modalities
Pith reviewed 2026-05-10 18:02 UTC · model grok-4.3 · Recognition: 2 Lean theorem links
The pith
Decoupling modality-specific styles from content isolates shared latent forgery traces that generalize to unseen media types.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that the modality-agnostic forgery detection framework extracts essential cross-modal latent forgery knowledge once modality-specific styles are explicitly decoupled, thereby proving the existence of universal forgery traces and delivering substantial gains on unknown modalities.
What carries the argument
The modality-agnostic forgery (MAF) detection framework, which decouples modality-specific styles to isolate cross-modal latent forgery knowledge.
Where Pith is reading between the lines
- The decoupling step could be reused in other cross-modal tasks where superficial differences currently block knowledge transfer.
- Real-world deployment would require checking whether the extracted traces remain stable when forgeries combine multiple modalities at once.
- If the universal traces prove stable, security systems might shift from per-modality models to a single shared detector.
Load-bearing premise
Shared latent forgery knowledge exists independently of modality-specific appearances and can be reliably decoupled and applied to entirely new modalities without any training examples from them.
What would settle it
Training MAF on known modalities and then measuring detection accuracy on a completely isolated dark modality from the DeepModal-Bench; accuracy no better than random guessing or prior methods would falsify the claim.
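The settling experiment above can be sketched as a zero-shot evaluation loop. This is an illustrative protocol only, not the paper's code: the detector, the dark-modality samples, and the chance-level margin are hypothetical stand-ins.

```python
import random

def evaluate_dark_modality(detector, samples, labels, chance=0.5, margin=0.05):
    """Zero-shot check: score a detector trained only on known modalities
    against a modality it never saw; return (accuracy, falsified)."""
    correct = sum(1 for x, y in zip(samples, labels) if detector(x) == y)
    accuracy = correct / len(labels)
    # The universal-traces claim is falsified if accuracy is no better
    # than chance (or than prior methods) on the held-out dark modality.
    return accuracy, accuracy <= chance + margin

# Toy stand-in: a detector that ignores its input performs at chance,
# i.e. exactly the outcome that would falsify the claim.
random.seed(0)
labels = [random.randint(0, 1) for _ in range(1000)]
coin_flip = lambda _: random.randint(0, 1)
acc, falsified = evaluate_dark_modality(coin_flip, range(1000), labels)
```

A real run would substitute a trained MAF model for `coin_flip` and DeepModal-Bench dark-modality samples for the synthetic labels; accuracy well above the chance margin, and above prior baselines, is what the Strong MAF claim requires.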
Original abstract
As generative artificial intelligence evolves, deepfake attacks have escalated from single-modality manipulations to complex, multimodal threats. Existing forensic techniques face a severe generalization bottleneck: by relying excessively on superficial, modality-specific artifacts, they neglect the shared latent forgery knowledge hidden beneath variable physical appearances. Consequently, these models suffer catastrophic performance degradation when confronted with unseen "dark modalities." To break this limitation, this paper introduces a paradigm shift that redefines multimodal forensics from conventional "feature fusion" to "modality generalization." We propose the first modality-agnostic forgery (MAF) detection framework. By explicitly decoupling modality-specific styles, MAF precisely extracts the essential, cross-modal latent forgery knowledge. Furthermore, we define two progressive dimensions to quantify model generalization: transferability toward semantically correlated modalities (Weak MAF), and robustness against completely isolated signals of "dark modality" (Strong MAF). To rigorously assess these generalization limits, we introduce the DeepModal-Bench benchmark, which integrates diverse multimodal forgery detection algorithms and adapts state-of-the-art generalized learning methods. This study not only empirically proves the existence of universal forgery traces but also achieves significant performance breakthroughs on unknown modalities via the MAF framework, offering a pioneering technical pathway for universal multimodal defense.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Modality-Agnostic Forgery (MAF) detection framework as a shift from feature fusion to modality generalization in multimodal deepfake forensics. By decoupling modality-specific styles, MAF is claimed to extract shared latent forgery knowledge; the work defines Weak MAF (transfer to correlated modalities) and Strong MAF (robustness to isolated 'dark modalities'), introduces the DeepModal-Bench benchmark, and asserts that it empirically proves the existence of universal forgery traces while delivering significant performance gains on unknown modalities.
Significance. If the empirical claims and generalization results hold after proper validation, the work would offer a meaningful advance by addressing the generalization bottleneck in current forensic methods and providing a new benchmark for cross-modal robustness. The explicit framing of Weak versus Strong MAF dimensions could help standardize evaluation of modality-agnostic claims in the field.
major comments (3)
- [Abstract] Abstract: The central claim that the study 'empirically proves the existence of universal forgery traces' and 'achieves significant performance breakthroughs' is unsupported by any description of methods, datasets, quantitative metrics, error bars, ablations, or baseline comparisons. This absence is load-bearing because the paper's contribution rests entirely on these unshown results.
- [Abstract] Abstract: The decoupling step is asserted to 'precisely extract the essential, cross-modal latent forgery knowledge' without any stated mechanism, loss function, or validation that residual modality cues or dataset-specific correlations have been removed. This directly affects the Strong MAF claim of generalization to completely isolated signals with zero training examples from dark modalities.
- [Abstract] Abstract: No experimental protocol is given for the DeepModal-Bench benchmark (e.g., how modalities are partitioned into 'known' vs. 'unknown', what generative pipelines are used, or how 'semantically correlated' vs. 'completely isolated' is operationalized), preventing assessment of whether reported gains reflect universal traces or implicit statistical links across training modalities.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, clarifying the content of the full paper and indicating revisions where the abstract can be strengthened for clarity.
Point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that the study 'empirically proves the existence of universal forgery traces' and 'achieves significant performance breakthroughs' is unsupported by any description of methods, datasets, quantitative metrics, error bars, ablations, or baseline comparisons. This absence is load-bearing because the paper's contribution rests entirely on these unshown results.
Authors: The abstract serves as a high-level summary of the contributions. The full manuscript details the MAF framework in Section 3, the DeepModal-Bench construction in Section 4, and all quantitative results—including metrics, baselines, ablations, error bars, and comparisons across known/unknown modalities—in Section 5. We agree the abstract would benefit from a brief reference to key performance gains and will revise it accordingly to better anchor the claims. revision: yes
-
Referee: [Abstract] Abstract: The decoupling step is asserted to 'precisely extract the essential, cross-modal latent forgery knowledge' without any stated mechanism, loss function, or validation that residual modality cues or dataset-specific correlations have been removed. This directly affects the Strong MAF claim of generalization to completely isolated signals with zero training examples from dark modalities.
Authors: Section 3.2 describes the decoupling mechanism in detail: a style encoder combined with an adversarial objective and contrastive loss to isolate shared forgery representations while suppressing modality-specific cues. Section 5.2 includes ablations confirming reduced residual correlations. Strong MAF evaluation uses zero-shot testing on isolated modalities, as defined in Section 4. We will add a concise clause to the abstract summarizing the adversarial decoupling approach. revision: partial
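As a reading aid only, the three ingredients this response names (forgery classification, an adversarial modality objective, and a contrastive term) can be combined in a schematic loss. Nothing below is from the paper's Section 3.2: the embeddings, the gradient-reversal weight `lam_adv`, and the temperature `tau` are invented for illustration.

```python
import numpy as np

def softmax_xent(logits, label):
    """Cross-entropy of a single sample against an integer class label."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label] + 1e-12)

def maf_style_loss(forgery_logits, modality_logits, z_real, z_fake,
                   y_forgery, y_modality, lam_adv=1.0, tau=0.1):
    """Schematic MAF-style objective: detect forgery, penalize modality
    identifiability (negative sign, as a gradient-reversal layer would
    effect for the feature extractor), and pull cross-modal fake
    embeddings together relative to a real embedding."""
    l_det = softmax_xent(forgery_logits, y_forgery)
    l_adv = -lam_adv * softmax_xent(modality_logits, y_modality)
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = np.exp(cos(z_fake[0], z_fake[1]) / tau)  # fake-fake, cross-modal
    neg = np.exp(cos(z_fake[0], z_real) / tau)     # fake-real
    l_con = -np.log(pos / (pos + neg))
    return l_det + l_adv + l_con

# Toy call with invented shapes; real inputs would come from the encoders.
rng = np.random.default_rng(0)
loss = maf_style_loss(np.array([2.0, -1.0]), np.array([0.5, 0.2, 0.1]),
                      rng.normal(size=8),
                      [rng.normal(size=8), rng.normal(size=8)],
                      y_forgery=0, y_modality=1)
```

The negative adversarial term is the standard gradient-reversal trick: whatever the actual mechanism in the paper, some term of this shape is needed for the decoupling claim to make sense.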
-
Referee: [Abstract] Abstract: No experimental protocol is given for the DeepModal-Bench benchmark (e.g., how modalities are partitioned into 'known' vs. 'unknown', what generative pipelines are used, or how 'semantically correlated' vs. 'completely isolated' is operationalized), preventing assessment of whether reported gains reflect universal traces or implicit statistical links across training modalities.
Authors: Section 4 fully specifies the DeepModal-Bench protocol, including modality partitioning rules, the generative pipelines (specific GAN and diffusion models per modality), and operational definitions distinguishing Weak MAF (semantically correlated modalities) from Strong MAF (completely isolated signals with no training overlap). We will revise the abstract to include a short outline of this benchmark setup and evaluation criteria. revision: yes
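The leave-one-modality-out protocol this response describes can be sketched as a split generator; the modality names below are placeholders, not DeepModal-Bench's actual partition into known and dark modalities.

```python
def leave_one_modality_out(modalities):
    """Yield (train_modalities, held_out) pairs: the held-out modality
    contributes zero training examples, emulating a 'dark' modality."""
    for held_out in modalities:
        train = [m for m in modalities if m != held_out]
        yield train, held_out

# Placeholder modality set for illustration.
splits = list(leave_one_modality_out(["video", "audio", "image"]))
# Each split trains on two modalities and evaluates zero-shot on the third.
```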
Circularity Check
No circularity detected in the empirical framework or benchmark proposal.
Full rationale
The provided abstract and description contain no equations, derivations, or mathematical steps that reduce to inputs by construction. The core claims rest on an empirical MAF framework that decouples styles and a newly introduced DeepModal-Bench for testing generalization; these are presented as experimental results rather than a closed derivation chain. No self-citations, fitted parameters renamed as predictions, or ansatzes are visible that would create load-bearing circularity. The 'proof' of universal traces is explicitly empirical performance on the benchmark, which remains independent of any self-referential definition.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tagged: unclear)
The relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "By explicitly decoupling modality-specific styles, MAF precisely extracts the essential, cross-modal latent forgery knowledge."
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction (tagged: unclear)
The relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "We propose the first modality-agnostic forgery (MAF) detection framework."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Invariance principle meets information bottleneck for out-of-distribution generalization.Advances in Neural Information Processing Systems, 34:3438–3450, 2021
Kartik Ahuja, Ethan Caballero, Dinghuai Zhang, Jean-Christophe Gagnon-Audet, Yoshua Bengio, Ioannis Mitliagkas, and Irina Rish. Invariance principle meets information bottleneck for out-of-distribution generalization.Advances in Neural Information Processing Systems, 34:3438–3450, 2021
2021
-
[2]
Invariance principle meets information bottleneck for out-of-distribution generalization, 2022
Kartik Ahuja, Ethan Caballero, Dinghuai Zhang, Jean-Christophe Gagnon-Audet, Yoshua Bengio, Ioannis Mitliagkas, and Irina Rish. Invariance principle meets information bottleneck for out-of-distribution generalization, 2022
2022
-
[3]
Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019
2019
-
[4]
Invariant risk minimization, 2020
Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization, 2020
2020
-
[5]
A theory of learning from different domains.Machine learning, 79(1):151–175, 2010
Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains.Machine learning, 79(1):151–175, 2010
2010
-
[6]
AV-Deepfake1M++: A large-scale audio-visual deepfake benchmark with real-world perturbations, 2025
Zhixi Cai, Kartik Kuckreja, Shreya Ghosh, Akanksha Chuchra, Muhammad Haris Khan, Usman Tariq, Tom Gedeon, and Abhinav Dhall. AV-Deepfake1M++: A large-scale audio-visual deepfake benchmark with real-world perturbations, 2025
2025
-
[7]
Do you really mean that? content driven audio-visual deepfake dataset and multimodal method for temporal forgery localization, 2023
Zhixi Cai, Kalin Stefanov, Abhinav Dhall, and Munawar Hayat. Do you really mean that? content driven audio-visual deepfake dataset and multimodal method for temporal forgery localization, 2023
2023
-
[8]
Demamba: Ai-generated video detection on million-scale genvideo benchmark, 2024
Haoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang, and Huaxiong Li. Demamba: Ai-generated video detection on million-scale genvideo benchmark, 2024
2024
-
[9]
Simmmdg: A simple and effective framework for multi-modal domain generalization.Advances in Neural Information Processing Systems, 36:78674–78695, 2023
Hao Dong, Ismail Nejjar, Han Sun, Eleni Chatzi, and Olga Fink. Simmmdg: A simple and effective framework for multi-modal domain generalization.Advances in Neural Information Processing Systems, 36:78674–78695, 2023
2023
-
[10]
An image is worth 16x16 words: Transformers for image recognition at scale, 2021
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021
2021
-
[11]
Dna: Uncovering universal latent forgery knowledge, 2026
Jingtong Dou, Chuancheng Shi, Yemin Wang, Shiming Guo, Anqi Yi, Wenhua Wu, Li Zhang, Fei Shen, and Tat-Seng Chua. Dna: Uncovering universal latent forgery knowledge, 2026
2026
-
[12]
Inference-time dynamic modality selection for incomplete multimodal classification
Siyi Du, Xinzhe Luo, Declan P O’Regan, and Chen Qin. Inference-time dynamic modality selection for incomplete multimodal classification. arXiv preprint arXiv:2601.22853, 2026
2026
-
[13]
Probable domain generalization via quantile risk minimization, 2023
Cian Eastwood, Alexander Robey, Shashank Singh, Julius von Kügelgen, Hamed Hassani, George J. Pappas, and Bernhard Schölkopf. Probable domain generalization via quantile risk minimization, 2023
2023
-
[14]
Domain-adversarial training of neural networks, 2016
Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks, 2016
2016
-
[15]
Imagebind: One embedding space to bind them all
Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15180–15190, 2023
2023
-
[16]
Rethinking vision-language model in face forensics: Multi-modal interpretable forged face detector, 2025
Xiao Guo, Xiufeng Song, Yue Zhang, Xiaohong Liu, and Xiaoming Liu. Rethinking vision-language model in face forensics: Multi-modal interpretable forged face detector, 2025
2025
-
[17]
Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, et al. Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following.arXiv preprint arXiv:2309.00615, 2023
-
[18]
Lora: Low-rank adaptation of large language models, 2021
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021
2021
-
[19]
Revisiting Multimodal Positional Encoding in Vision-Language Models
Jie Huang, Xuejing Liu, Sibo Song, Ruibing Hou, Hong Chang, Junyang Lin, and Shuai Bai. Revisiting multimodal positional encoding in vision- language models.arXiv preprint arXiv:2510.23095, 2025
2025
-
[20]
A rich knowledge space for scalable deepfake detection, 2026
Inho Jung, Hyeongjun Choi, Binh M. Le, Hohyun Na, and Simon S. Woo. A rich knowledge space for scalable deepfake detection. In The Fourteenth International Conference on Learning Representations, 2026
2026
-
[21]
Hasam Khalid, Shahroz Tariq, Minha Kim, and Simon S. Woo. Fakeavceleb: A novel audio-video multimodal deepfake dataset, 2022
2022
-
[22]
Pindrop it! audio and visual deepfake countermeasures for robust detection and fine grained-localization, 2025
Nicholas Klein, Hemlata Tak, James Fullwood, Krishna Regmi, Leonidas Spinoulas, Ganesh Sivaraman, Tianxiang Chen, and Elie Khoury. Pindrop it! audio and visual deepfake countermeasures for robust detection and fine grained-localization, 2025
2025
-
[23]
Uniformly distributed feature representations for fair and robust learning
Kiran Krishnamachari, See-Kiong Ng, and Chuan-Sheng Foo. Uniformly distributed feature representations for fair and robust learning. Transactions on Machine Learning Research, 2024
2024
-
[24]
Tell me habibi, is it real or fake?, 2025
Kartik Kuckreja, Parul Gupta, Injy Hamed, Thamar Solorio, Muhammad Haris Khan, and Abhinav Dhall. Tell me habibi, is it real or fake?, 2025
2025
-
[25]
Klassify to verify: Audio-visual deepfake detection using ssl-based audio and handcrafted visual features, 2025
Ivan Kukanov and Jun Wah Ng. Klassify to verify: Audio-visual deepfake detection using ssl-based audio and handcrafted visual features, 2025
2025
-
[26]
Towards a universal synthetic video detector: From face or background manipulations to fully ai-generated content, 2025
Rohit Kundu, Hao Xiong, Vishal Mohanty, Athula Balachandran, and Amit K. Roy-Chowdhury. Towards a universal synthetic video detector: From face or background manipulations to fully ai-generated content, 2025
2025
-
[27]
Clip-powered domain generalization and domain adaptation: A comprehensive survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026
Jindong Li, Yongguang Li, Yali Fu, Jiahong Liu, Yixin Liu, Menglin Yang, and Irwin King. Clip-powered domain generalization and domain adaptation: A comprehensive survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026
2026
-
[28]
Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023
2023
-
[29]
Deep domain generalization via conditional invariant adversarial networks
Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European conference on computer vision (ECCV), pages 624–639, 2018
2018
-
[30]
Deep domain generalization via conditional invariant adversarial networks
Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adversarial networks. InProceedings of the European Confer- ence on Computer Vision (ECCV), September 2018
2018
-
[31]
Celeb-df: A large-scale challenging dataset for deepfake forensics
Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-df: A large-scale challenging dataset for deepfake forensics. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3207–3216, 2020
2020
-
[32]
Yuezun Li, Delong Zhu, Xinjie Cui, and Siwei Lyu. Celeb-df++: A large- scale challenging video deepfake benchmark for generalizable forensics. arXiv preprint arXiv:2507.18015, 2025
-
[33]
Towards modality generalization: A benchmark and prospective analysis
Xiaohao Liu, Xiaobo Xia, Zhuo Huang, See-Kiong Ng, and Tat-Seng Chua. Towards modality generalization: A benchmark and prospective analysis. InProceedings of the 33rd ACM International Conference on Multimedia, pages 12179–12188, 2025
2025
-
[34]
Continual multimodal contrastive learning.arXiv preprint arXiv:2503.14963, 2025
Xiaohao Liu, Xiaobo Xia, See-Kiong Ng, and Tat-Seng Chua. Continual multimodal contrastive learning.arXiv preprint arXiv:2503.14963, 2025
-
[35]
Unibind: LLM-augmented unified and balanced representation space to bind them all
Yuanhuiyi Lyu, Xu Zheng, Jiazhou Zhou, and Lin Wang. Unibind: LLM-augmented unified and balanced representation space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26752–26762, 2024
2024
-
[36]
Detecting deepfakes and false ads through analysis of text and social engineering techniques
Alicja Martinek and Ewelina Bartuzi-Trokielewicz. Detecting deepfakes and false ads through analysis of text and social engineering techniques. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors,Proceedings of the 31st International Conference on Computational Linguistics, pages 8432–8448, ...
2025
-
[37]
Reducing domain gap by reducing style bias, 2021
Hyeonseob Nam, HyunJae Lee, Jongchan Park, Wonjun Yoon, and Donggeun Yoo. Reducing domain gap by reducing style bias, 2021
2021
-
[38]
Domain adaptation via transfer component analysis.IEEE transactions on neural networks, 22(2):199–210, 2010
Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. Domain adaptation via transfer component analysis.IEEE transactions on neural networks, 22(2):199–210, 2010
2010
-
[39]
Balanced multimodal learning via on-the-fly gradient modulation, 2022
Xiaokang Peng, Yake Wei, Andong Deng, Dong Wang, and Di Hu. Balanced multimodal learning via on-the-fly gradient modulation, 2022
2022
-
[40]
DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning
Chengxuan Qian, Shuo Xing, Shawn Li, Yue Zhao, and Zhengzhong Tu. Decalign: Hierarchical cross-modal alignment for decoupled multimodal representation learning.arXiv preprint arXiv:2503.11892, 2025
2025
-
[41]
Scaling up ai-generated image detection via generator-aware prototypes, 2025
Ziheng Qin, Yuheng Ji, Renshuai Tao, Yuxuan Tian, Yuyang Liu, Yipu Wang, and Xiaolong Zheng. Scaling up ai-generated image detection via generator-aware prototypes, 2025
2025
-
[42]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021
2021
-
[43]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023
2023
-
[44]
Optimal representations for covariate shift, 2022
Yangjun Ruan, Yann Dubois, and Chris J. Maddison. Optimal representations for covariate shift, 2022
2022
-
[45]
Detecting and grounding multi-modal media manipulation and beyond, 2023
Rui Shao, Tianxing Wu, Jianlong Wu, Liqiang Nie, and Ziwei Liu. Detecting and grounding multi-modal media manipulation and beyond, 2023
2023
-
[46]
How to bridge the gap between modalities: Survey on multimodal large language model
Shezheng Song, Xiaopeng Li, Shasha Li, Shan Zhao, Jie Yu, Jun Ma, Xiaoguang Mao, Weimin Zhang, and Meng Wang. How to bridge the gap between modalities: Survey on multimodal large language model. IEEE Transactions on Knowledge and Data Engineering, 37(9):5311–5329, 2025
2025
-
[47]
On learning multi-modal forgery representation for diffusion generated video detection, 2025
Xiufeng Song, Xiao Guo, Jiache Zhang, Qirui Li, Lei Bai, Xiaoming Liu, Guangtao Zhai, and Xiaohong Liu. On learning multi-modal forgery representation for diffusion generated video detection, 2025
2025
-
[48]
Omnivec2-a novel transformer based network for large scale multimodal and multitask learning
Siddharth Srivastava and Gaurav Sharma. Omnivec2-a novel transformer based network for large scale multimodal and multitask learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 27412–27424, 2024
2024
-
[49]
Piotr Teterwak, Kuniaki Saito, Theodoros Tsiligkaridis, Kate Saenko, and Bryan A. Plummer. Erm++: An improved baseline for domain generalization, 2024
2024
-
[50]
Principles of risk minimization for learning theory
Vladimir Vapnik. Principles of risk minimization for learning theory. Advances in neural information processing systems, 4, 1991
1991
-
[51]
GLUE: A multi-task benchmark and analysis platform for natural language understanding
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Tal Linzen, Grzegorz Chrupała, and Afra Alishahi, editors, Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, B...
2018
-
[52]
Generalizing to unseen domains: A survey on domain generalization.IEEE transactions on knowledge and data engineering, 35(8):8052–8072, 2022
Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Tao Qin, Wang Lu, Yiqiang Chen, Wenjun Zeng, and Philip S Yu. Generalizing to unseen domains: A survey on domain generalization.IEEE transactions on knowledge and data engineering, 35(8):8052–8072, 2022
2022
-
[53]
ONE-PEACE: exploring one general representation model toward unlimited modalities
Peng Wang, Shijie Wang, Junyang Lin, Shuai Bai, Xiaohuan Zhou, Jingren Zhou, Xinggang Wang, and Chang Zhou. One-peace: Exploring one general representation model toward unlimited modalities.arXiv preprint arXiv:2305.11172, 2023
-
[54]
Modality-balanced collaborative distillation for multi-modal domain generalization
Xiaohan Wang, Zhangtao Cheng, Ting Zhong, Leiting Chen, and Fan Zhou. Modality-balanced collaborative distillation for multi-modal domain generalization. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 26535–26543, 2026
2026
-
[55]
Asvspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale, 2024
Xin Wang, Hector Delgado, Hemlata Tak, Jee weon Jung, Hye jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, and Junichi Yamagishi. Asvspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale, 2024
2024
-
[56]
Zhaoqing Wang, Xiaobo Xia, Ziye Chen, Xiao He, Yandong Guo, Mingming Gong, and Tongliang Liu. Open-vocabulary segmentation with unpaired mask-text supervision.arXiv preprint arXiv:2402.08960, 2024
-
[57]
Indirect alignment and relationships preservation for domain generalization
Wei Wei, Zixiong Li, Jing Yan, Mingwen Shao, and Lin Li. Indirect alignment and relationships preservation for domain generalization. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pages 2054–2062, 2025
2025
-
[58]
Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation, 2025
Siwei Wen, Junyan Ye, Peilin Feng, Hengrui Kang, Zichen Wen, Yize Chen, Jiang Wu, Wenjun Wu, Conghui He, and Weijia Li. Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation, 2025
2025
-
[59]
Mare: Multi-modal alignment and reinforcement for explainable deepfake detection via vision-language models, 2026
Wenbo Xu, Wei Lu, Xiangyang Luo, and Jiantao Zhou. Mare: Multi-modal alignment and reinforcement for explainable deepfake detection via vision-language models, 2026
2026
-
[60]
Improve unsupervised domain adaptation with mixup training, 2020
Shen Yan, Huan Song, Nanxiang Li, Lincan Zou, and Liu Ren. Improve unsupervised domain adaptation with mixup training, 2020
2020
-
[61]
Facilitating multimodal classification via dynamically learning modality gap
Yang Yang, Fengqiang Wan, Qing-Yuan Jiang, and Yi Xu. Facilitating multimodal classification via dynamically learning modality gap. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 62108–62122. Curran Associates, Inc., 2024
2024
-
[62]
Multimodal aligned semantic knowledge for unpaired image-text matching
Laiguo Yin, Yixin Zhang, Yuqing Sun, and Lizhen Cui. Multimodal aligned semantic knowledge for unpaired image-text matching. InThe Fourteenth International Conference on Learning Representations
-
[63]
Unlocking the capabilities of large vision-language models for generalizable and explainable deepfake detection, 2025
Peipeng Yu, Jianwei Fei, Hui Gao, Xuan Feng, Zhihua Xia, and Chip Hong Chang. Unlocking the capabilities of large vision-language models for generalizable and explainable deepfake detection, 2025
2025
-
[64]
mixup: Beyond Empirical Risk Minimization
Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017
2017
-
[65]
Deepfakebench-mm: A comprehensive benchmark for multimodal deepfake detection, 2025
Kangran Zhao, Yupeng Chen, Xiaoyu Zhang, Yize Chen, Weinan Guan, Baicheng Chen, Chengzhe Sun, Soumyya Kanti Datta, Qingshan Liu, Siwei Lyu, and Baoyuan Wu. Deepfakebench-mm: A comprehensive benchmark for multimodal deepfake detection, 2025
2025
-
[66]
Deep domain-adversarial image generation for domain generalisation
Kaiyang Zhou, Yongxin Yang, Timothy Hospedales, and Tao Xiang. Deep domain-adversarial image generation for domain generalisation. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 13025–13032, 2020
2020
-
[67]
Joint audio-visual deepfake detection
Yipin Zhou and Ser-Nam Lim. Joint audio-visual deepfake detection. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14800–14809, October 2021
2021
-
[68]
Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, et al. Language-bind: Extending video-language pretraining to n-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852, 2023
[Appendix residue: Table VI reported classification performance (mean±std) under the Weak MAF setting, using leave-one-modality-out cross-validation for model selection, with Vid/Aud/Img/Avg columns per perceptor (e.g. ImageBind [15]) across LAV-DF [7], FakeAVCeleb [21], and Celeb-DF++ [32] with ASVspoof 5 [55].]