pith. machine review for the scientific record.

arxiv: 2604.03558 · v1 · submitted 2026-04-04 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

LOGER: Local--Global Ensemble for Robust Deepfake Detection in the Wild

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 18:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords deepfake detection · ensemble learning · local-global modeling · multiple instance learning · patch aggregation · logit fusion · robustness · generalization

The pith

Fusing a global multi-resolution branch with a selective local patch branch improves deepfake detection robustness by exploiting decorrelated errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that deepfake detection in uncontrolled conditions needs separate handling of global semantic and statistical anomalies alongside concentrated local forgery traces. A global branch runs heterogeneous vision models at several input scales to gather holistic evidence, while a local branch applies multiple instance learning with top-k aggregation to pool only the most suspicious patches, avoiding the dilution of forgery evidence by the many normal regions. Dual supervision at image and patch levels keeps the local responses sharp. Because the two branches operate at different granularities and use different backbones, their mistakes tend to be independent, so logit-space fusion produces stronger predictions than either branch alone. This matters for real-world use, where new manipulation methods and degradations quickly defeat single-scale detectors.
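The top-k pooling and dual-level supervision described above can be sketched concretely. This is a minimal illustration of the general technique, not the paper's implementation: the patch count, k, loss weighting, and sigmoid scoring are assumptions for the sketch.

```python
import numpy as np

def topk_mil_pool(patch_logits, k=4):
    """Aggregate patch logits by averaging only the k most suspicious
    patches, so evidence from a small forged region is not washed out
    by the many normal patches (as mean pooling would do)."""
    return np.sort(patch_logits)[-k:].mean()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(p, y, eps=1e-7):
    """Binary cross-entropy on probabilities, clipped for stability."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def dual_level_loss(patch_logits, image_label, patch_labels, k=4, w=0.5):
    """Image-level BCE on the aggregated score plus patch-level BCE,
    keeping individual patch responses discriminative (hypothetical
    weighting w; the paper's exact loss is not reproduced here)."""
    image_loss = bce(sigmoid(topk_mil_pool(patch_logits, k)), image_label)
    patch_loss = bce(sigmoid(patch_logits), patch_labels).mean()
    return image_loss + w * patch_loss

# A fake image: 16 patches, 3 of them manipulated (high logits expected)
rng = np.random.default_rng(0)
patch_logits = rng.normal(-2.0, 0.5, size=16)
patch_logits[:3] = rng.normal(3.0, 0.5, size=3)   # forged patches
patch_labels = np.zeros(16)
patch_labels[:3] = 1.0

# Top-k keeps the forged-patch signal well above the diluted global mean
print(topk_mil_pool(patch_logits, k=3), patch_logits.mean())
```

With 13 of 16 patches normal, mean pooling drags the image score negative while the top-3 average stays strongly positive, which is the dilution argument in miniature.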

Core claim

LOGER combines a global branch, which employs heterogeneous vision foundation models at multiple resolutions to capture holistic anomalies, with a local branch, which performs patch-level modeling via multiple instance learning with top-k aggregation and dual-level supervision; logit-space fusion of the two branches exploits their largely decorrelated errors to deliver robust detection across diverse manipulation techniques and real-world degradation conditions.

What carries the argument

The local-global ensemble with logit-space fusion, where the global branch uses multi-resolution heterogeneous backbones and the local branch uses top-k multiple instance learning on patches.

If this is right

  • The method generalizes across unseen manipulation types because complementary cues are captured at both scales.
  • Performance holds under real-world degradations such as compression and noise that affect global statistics and local traces differently.
  • Top-k patch selection prevents normal regions from overwhelming forgery evidence in the local branch.
  • Dual-level supervision maintains discriminative responses at both aggregated image and individual patch levels.
  • Logit fusion is effective precisely because the branches differ in granularity and backbone choice.
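The last bullet can be illustrated with a toy simulation (an editorial construction, not the paper's experiment): two detectors of equal individual skill whose noise is independent, averaged in logit space.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20000
y = rng.integers(0, 2, size=n)            # ground-truth labels
signal = np.where(y == 1, 1.0, -1.0)      # shared true signal

# Two branches with equal accuracy but independent noise,
# standing in for decorrelated global/local errors
logits_global = signal + rng.normal(0, 1.5, size=n)
logits_local  = signal + rng.normal(0, 1.5, size=n)
fused = 0.5 * (logits_global + logits_local)   # logit-space fusion

def acc(z):
    return ((z > 0).astype(int) == y).mean()

print(f"global {acc(logits_global):.3f}  "
      f"local {acc(logits_local):.3f}  fused {acc(fused):.3f}")
# Averaging halves independent noise variance while preserving the
# shared signal, so the fused accuracy exceeds either branch alone.
```

If the two noise terms were instead perfectly correlated, fusion would reproduce each branch's accuracy exactly, which is why the decorrelation premise is load-bearing.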

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same complementary-branch pattern could be tested on related tasks such as localizing manipulated regions rather than just classifying whole images.
  • Adding a third branch operating at an intermediate scale might further reduce remaining correlated errors.
  • The approach implies that future detectors should prioritize error decorrelation over simply increasing model size or data volume.
  • Inference cost grows with multiple backbones, so lightweight approximations of the global branch would be a practical next step.

Load-bearing premise

Errors from the global and local branches are largely independent so that fusing their outputs produces a clear robustness gain.

What would settle it

A test set on which the global and local branches err on exactly the same images: logit fusion would then yield no accuracy lift, falsifying the decorrelation premise.

Figures

Figures reproduced from arXiv: 2604.03558 by Dagong Lu, Fei Wu, Fengjun Guo, Mufeng Yao, Xinlei Xu.

Figure 1: Overview of the proposed LOGER framework. Training data are sampled from a multi-source candidate pool with diverse [PITH_FULL_IMAGE:figures/full_fig_p003_1.png]

Figure 2: Robustness analysis under three degradation types: JPEG compression (left), spatial resizing (middle), and Gaussian blurring [PITH_FULL_IMAGE:figures/full_fig_p008_2.png]

Figure 3: Representative failure cases on the NTIRE 2026 public [PITH_FULL_IMAGE:figures/full_fig_p008_3.png]
Original abstract

Robust deepfake detection in the wild remains challenging due to the ever-growing variety of manipulation techniques and uncontrolled real-world degradations. Forensic cues for deepfake detection reside at two complementary levels: global-level anomalies in semantics and statistics that require holistic image understanding, and local-level forgery traces concentrated in manipulated regions that are easily diluted by global averaging. Since no single backbone or input scale can effectively cover both levels, we propose LOGER, a LOcal--Global Ensemble framework for Robust deepfake detection. The global branch employs heterogeneous vision foundation model backbones at multiple resolutions to capture holistic anomalies with diverse visual priors. The local branch performs patch-level modeling with a Multiple Instance Learning top-$k$ aggregation strategy that selectively pools only the most suspicious regions, mitigating evidence dilution caused by the dominance of normal patches; dual-level supervision at both the aggregated image level and individual patch level keeps local responses discriminative. Because the two branches differ in both granularity and backbone, their errors are largely decorrelated, a property that logit-space fusion exploits for more robust prediction. LOGER achieves 2nd place in the NTIRE 2026 Robust Deepfake Detection Challenge, and further evaluation on multiple public benchmarks confirms its strong robustness and generalization across diverse manipulation methods and real-world degradation conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes LOGER, a local-global ensemble framework for robust deepfake detection. The global branch employs heterogeneous vision foundation model backbones at multiple resolutions to capture holistic anomalies. The local branch uses patch-level modeling with Multiple Instance Learning top-k aggregation and dual-level supervision to focus on suspicious regions. Logit-space fusion is applied on the premise that the branches' differing granularity and backbones produce largely decorrelated errors. The work reports 2nd place in the NTIRE 2026 Robust Deepfake Detection Challenge along with strong generalization on multiple public benchmarks across manipulation methods and degradations.

Significance. If the reported ranking and benchmark results are reproducible and the fusion benefit is isolated from the individual branches, the approach could meaningfully improve robustness in real-world deepfake detection by exploiting complementary cues. The competition outcome indicates practical relevance, but the current lack of supporting measurements for the central complementarity assumption limits the strength of the contribution.

major comments (2)
  1. [Abstract] The assertion that 'their errors are largely decorrelated' (enabling logit-space fusion to outperform either branch) is unsupported by any quantitative evidence such as a correlation matrix, error-pattern overlap statistic, or ablation comparing global-only, local-only, and fused variants. Without these, the 2nd-place NTIRE result and benchmark numbers cannot be attributed to genuine complementarity rather than dominance by one branch or simple averaging.
  2. [Experimental evaluation] The abstract states competitive results and a competition ranking but supplies no baselines, error analysis, ablation studies, or implementation details. If the full manuscript likewise omits these (as the provided abstract suggests), the soundness of the generalization claims across diverse manipulations and degradations cannot be verified.
minor comments (1)
  1. [Abstract] The reference to 'NTIRE 2026' should specify whether this is a completed or ongoing challenge and provide the exact evaluation protocol or leaderboard link for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications from the full paper and indicating revisions where appropriate to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [Abstract] The assertion that 'their errors are largely decorrelated' (enabling logit-space fusion to outperform either branch) is unsupported by any quantitative evidence such as a correlation matrix, error-pattern overlap statistic, or ablation comparing global-only, local-only, and fused variants. Without these, the 2nd-place NTIRE result and benchmark numbers cannot be attributed to genuine complementarity rather than dominance by one branch or simple averaging.

    Authors: We agree that explicit quantitative evidence for the complementarity assumption strengthens the contribution. The full manuscript (Section 4.3) already contains ablation studies comparing global-branch-only, local-branch-only, and fused LOGER performance on the NTIRE 2026 test set and public benchmarks, showing consistent gains from fusion. To directly address the concern, we have added a new analysis in the revised version: a correlation matrix of logit outputs across branches (average Pearson correlation 0.28), error-pattern overlap statistics (Jaccard index of misclassified samples ~0.31), and expanded ablations isolating the fusion benefit. These confirm the branches produce largely decorrelated errors due to differences in granularity and backbones, supporting the logit-space fusion design. revision: yes
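The diagnostics this simulated response cites are straightforward to compute. A hedged sketch follows, run on synthetic data: the 0.28 Pearson and ~0.31 Jaccard figures above belong to the simulated rebuttal and are not reproduced or verified by this snippet.

```python
import numpy as np

def branch_complementarity(logits_a, logits_b, labels):
    """Two standard complementarity diagnostics: Pearson correlation
    of branch logits, and Jaccard overlap of the branches' error sets
    (misclassified samples). Low values on both support the claim
    that the branches make largely decorrelated errors."""
    pearson = np.corrcoef(logits_a, logits_b)[0, 1]
    err_a = (logits_a > 0).astype(int) != labels
    err_b = (logits_b > 0).astype(int) != labels
    union = np.logical_or(err_a, err_b).sum()
    jaccard = np.logical_and(err_a, err_b).sum() / union if union else 0.0
    return pearson, jaccard

# Synthetic check: independent branch noise yields moderate logit
# correlation (driven only by the shared signal) and low error overlap
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=5000)
s = np.where(y == 1, 1.0, -1.0)
a = s + rng.normal(0, 1.5, size=5000)
b = s + rng.normal(0, 1.5, size=5000)
r, j = branch_complementarity(a, b, y)
print(f"pearson={r:.2f}  jaccard={j:.2f}")
```

A Jaccard overlap near 1.0 would indicate the two branches fail on the same images, exactly the scenario under which fusion provides no lift.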

  2. Referee: [Experimental evaluation] The abstract states competitive results and a competition ranking but supplies no baselines, error analysis, ablation studies, or implementation details. If the full manuscript likewise omits these (as the provided abstract suggests), the soundness of the generalization claims across diverse manipulations and degradations cannot be verified.

    Authors: The full manuscript contains these elements in Sections 4 and 5. Section 4 provides implementation details (backbone architectures, training hyperparameters, patch sampling strategy, and MIL aggregation), while Section 5 reports baselines against recent deepfake detectors, component ablations (e.g., top-k vs. mean pooling, single- vs. dual-level supervision), and error analysis broken down by manipulation type and degradation level (JPEG compression, Gaussian noise, etc.). Generalization results span multiple public datasets. We have added a concise summary table of key ablations and baselines in the main text for easier verification and will move additional implementation details to the supplementary material if needed. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical ensemble without derivations or self-referential reductions

Full rationale

This is an empirical machine-learning paper proposing a local-global ensemble for deepfake detection. The abstract and description contain no equations, derivations, or predictions that reduce by construction to fitted inputs or self-citations. The decorrelation assumption between branches is presented as a design rationale for logit fusion, not as a derived result from any formula. All performance claims (NTIRE ranking, benchmark results) are supported by external experimental evaluation rather than internal redefinition. No load-bearing steps match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract describes an empirical deep learning method without explicit free parameters, mathematical axioms, or invented entities beyond standard components such as vision foundation models and multiple instance learning.

pith-pipeline@v0.9.0 · 5532 in / 1156 out tokens · 52858 ms · 2026-05-13T18:34:00.444015+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 3 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 5, 6

  2. [2]

    End-to-end reconstruction- classification learning for face forgery detection

    Junyi Cao, Chao Ma, Taiping Yao, Shen Chen, Shouhong Ding, and Xiaokang Yang. End-to-end reconstruction- classification learning for face forgery detection. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4113–4122, 2022. 6

  3. [3]

    Self-supervised learning of adversarial example: Towards good generalizations for deepfake detection

    Liang Chen, Yong Zhang, Yibing Song, Lingqiao Liu, and Jue Wang. Self-supervised learning of adversarial example: Towards good generalizations for deepfake detection. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18710–18719, 2022. 6

  4. [4]

    Dual data alignment makes AI-generated image detector easier generalizable

    Ruoxin Chen, Junwei Xi, Zhiyuan Yan, Ke-Yue Zhang, Shuang Wu, Jingyi Xie, Xu Chen, Lei Xu, Isabel Guan, Taip- ing Yao, and Shouhong Ding. Dual data alignment makes AI-generated image detector easier generalizable. InThe Thirty-ninth Annual Conference on Neural Information Pro- cessing Systems, 2025. 1, 2, 5, 6

  5. [5]

    Can we leave deepfake data behind in training deepfake detector?Advances in Neural Information Processing Systems, 37:21979–21998, 2024

    Jikang Cheng, Zhiyuan Yan, Ying Zhang, Yuhao Luo, Zhongyuan Wang, and Chen Li. Can we leave deepfake data behind in training deepfake detector?Advances in Neural Information Processing Systems, 37:21979–21998, 2024. 6

  6. [6]

    Co-spy: Combining seman- tic and pixel features to detect synthetic images by ai

    Siyuan Cheng, Lingjuan Lyu, Zhenting Wang, Xiangyu Zhang, and Vikash Sehwag. Co-spy: Combining seman- tic and pixel features to detect synthetic images by ai. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13455–13465, 2025. 8

  7. [7]

    Deep fakes: A loom- ing challenge for privacy, democracy, and national security

    Robert Chesney and Danielle Citron. Deep fakes: A loom- ing challenge for privacy, democracy, and national security. California Law Review, 107(6):1753–1820, 2019. 1

  8. [8]

    Meta clip 2: A worldwide scaling recipe.arXiv preprint arXiv:2507.22062,

    Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, et al. Meta clip 2: A world- wide scaling recipe.arXiv preprint arXiv:2507.22062, 2025. 1, 3

  9. [9]

    Real-world degradation simulation tools

    Codabench. Real-world degradation simulation tools. https://www.codabench.org/competitions/ 12761/#/pages-tab, 2024. Accessed: 2026-03-20. 5

  10. [10]

    Forensics adapter: Adapting CLIP for generaliz- able face forgery detection

    Xinjie Cui, Yuezun Li, Ao Luo, Jiaran Zhou, and Junyu Dong. Forensics adapter: Adapting CLIP for generaliz- able face forgery detection. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19207– 19217, 2025. 2, 6

  11. [11]

    The DeepFake Detection Challenge (DFDC) Dataset

    Brian Dolhansky, Russell Howes, Ben Pflaum, Netanel Baram, and Cristian Canton Ferrer. The deepfake detection challenge dataset.arXiv preprint arXiv:2006.07397, 2020. 2

  12. [12]

    Watch your up-convolution: Cnn based generative deep neural net- works are failing to reproduce spectral distributions

    Ricard Durall, Margret Keuper, and Janis Keuper. Watch your up-convolution: Cnn based generative deep neural net- works are failing to reproduce spectral distributions. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7890–7899, 2020. 1, 2

  13. [13]

    Leveraging fre- quency analysis for deep fake image recognition

    Joel Frank, Thorsten Eisenhofer, Lea Sch ¨onherr, Asja Fis- cher, Dorothea Kolossa, and Thorsten Holz. Leveraging fre- quency analysis for deep fake image recognition. InInter- national conference on machine learning, pages 3247–3258. PMLR, 2020. 1, 2

  14. [14]

    Exploring unbiased deepfake detection via token-level shuffling and mixing

    Xinghe Fu, Zhiyuan Yan, Taiping Yao, Shen Chen, and Xi Li. Exploring unbiased deepfake detection via token-level shuffling and mixing. InProceedings of the AAAI Confer- ence on Artificial Intelligence, pages 3040–3048, 2025. 6

  15. [15]

    Generative adversarial nets

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. InAdvances in Neural Information Processing Systems (NeurIPS), 2014. 1

  16. [16]

    A bias-free training paradigm for more general ai-generated image de- tection

    Fabrizio Guillaro, Giada Zingarini, Ben Usman, Avneesh Sud, Davide Cozzolino, and Luisa Verdoliva. A bias-free training paradigm for more general ai-generated image de- tection. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18685–18694, 2025. 1, 2

  17. [17]

    Lips don’t lie: A generalisable and robust approach to face forgery detection

    Alexandros Haliassos, Konstantinos V ougioukas, Stavros Petridis, and Maja Pantic. Lips don’t lie: A generalisable and robust approach to face forgery detection. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5039–5049, 2021. 6

  18. [18]

    Leveraging real talking faces via self- supervision for robust forgery detection

    Alexandros Haliassos, Rodrigo Mira, Stavros Petridis, and Maja Pantic. Leveraging real talking faces via self- supervision for robust forgery detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14950–14962, 2022. 6

  19. [19]

    Towards more general video-based deepfake detection through facial component guided adaptation for foundation model

    Yue-Hua Han, Tai-Ming Huang, Kai-Lung Hua, and Jun- Cheng Chen. Towards more general video-based deepfake detection through facial component guided adaptation for foundation model. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 22995–23005, 2025. 6

  20. [20]

    Robust deepfake de- tection, ntire 2026 challenge: Report

    Benedikt Hopf, Radu Timofte, et al. Robust deepfake de- tection, ntire 2026 challenge: Report. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026. 1, 2, 5

  21. [21]

    Implicit identity driven deepfake face swapping detection

    Baojin Huang, Zhongyuan Wang, Jifan Yang, Jiaxin Ai, Qin Zou, Qian Wang, and Dengpan Ye. Implicit identity driven deepfake face swapping detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4490–4499, 2023. 6

  22. [22]

    Sida: Social media image deepfake detection, localization and explanation with large multimodal model

    Zhenglin Huang, Jinwei Hu, Xiangtai Li, Yiwei He, Xingyu Zhao, Bei Peng, Baoyuan Wu, Xiaowei Huang, and Guan- gliang Cheng. Sida: Social media image deepfake detection, localization and explanation with large multimodal model. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28831–28841, 2025. 8

  23. [23]

    Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection

    Liming Jiang, Ren Li, Wayne Wu, Chen Qian, and Chen Change Loy. Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2889–2898, 2020. 2, 3

  24. [24]

    Legion: Learning to ground and ex- plain for synthetic image detection

    Hengrui Kang, Siwei Wen, Zichen Wen, Junyan Ye, Wei- jia Li, Peilin Feng, Baichuan Zhou, Bin Wang, Dahua Lin, Linfeng Zhang, et al. Legion: Learning to ground and ex- plain for synthetic image detection. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18937–18947, 2025. 2 9

  25. [25]

    Enhancing gen- eral face forgery detection via vision transformer with low- rank adaptation

    Chenqi Kong, Haoliang Li, and Shiqi Wang. Enhancing gen- eral face forgery detection via vision transformer with low- rank adaptation. In2023 IEEE 6th international conference on multimedia information processing and retrieval (MIPR), pages 102–107. IEEE, 2023. 2, 3

  26. [26]

    Moe-ffd: Mixture of experts for generalized and parameter-efficient face forgery detection.IEEE Transactions on Dependable and Secure Computing, 2025

    Chenqi Kong, Anwei Luo, Peijun Bao, Yi Yu, Haoliang Li, Zengwei Zheng, Shiqi Wang, and Alex C Kot. Moe-ffd: Mixture of experts for generalized and parameter-efficient face forgery detection.IEEE Transactions on Dependable and Secure Computing, 2025. 2

  27. [27]

    Face x-ray for more gen- eral face forgery detection

    Lingzhi Li, Jianmin Bao, Ting Zhang, Hao Yang, Dong Chen, Fang Wen, and Baining Guo. Face x-ray for more gen- eral face forgery detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5001–5010, 2020. 1, 2

  28. [28]

    Sharp mul- tiple instance learning for deepfake video detection

    Xiaodan Li, Yining Lang, Yuefeng Chen, Xiaofeng Mao, Yuan He, Shuhui Wang, Hui Xue, and Quan Lu. Sharp mul- tiple instance learning for deepfake video detection. InPro- ceedings of the 28th ACM international conference on mul- timedia, pages 1864–1872, 2020. 1, 4

  29. [29]

    Celeb-df: A large-scale challenging dataset for deep- fake forensics

    Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-df: A large-scale challenging dataset for deep- fake forensics. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3207– 3216, 2020. 2, 5

  30. [30]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. InPro- ceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017. 3

  31. [31]

    Fake it till you make it: Curricu- lar dynamic forgery augmentations towards general deepfake detection

    Yuzhen Lin, Wentang Song, Bin Li, Yuezun Li, Jiangqun Ni, Han Chen, and Qiushi Li. Fake it till you make it: Curricu- lar dynamic forgery augmentations towards general deepfake detection. InEuropean conference on computer vision, pages 104–122. Springer, 2024. 6

  32. [32]

    Spatial- phase shallow learning: rethinking face forgery detection in frequency domain

    Honggu Liu, Xiaodan Li, Wenbo Zhou, Yuefeng Chen, Yuan He, Hui Xue, Weiming Zhang, and Nenghai Yu. Spatial- phase shallow learning: rethinking face forgery detection in frequency domain. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 772–781, 2021. 6

  33. [33]

    arXiv preprint arXiv:2602.02222 , year=

    Ruiqi Liu, Manni Cui, Ziheng Qin, Zhiyuan Yan, Ruoxin Chen, Yi Han, Zhiheng Li, Junkai Chen, ZhiJin Chen, Kaiqing Lin, et al. Mirror: Manifold ideal reference re- constructor for generalizable ai-generated image detection. arXiv preprint arXiv:2602.02222, 2026. 5, 6, 7

  34. [34]

    A convnet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feicht- enhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 11976–11986,

  35. [35]

    Gener- alizing face forgery detection with high-frequency features

    Yuchen Luo, Yong Zhang, Junchi Yan, and Wei Liu. Gener- alizing face forgery detection with high-frequency features. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 16317–16326, 2021. 6

  36. [36]

    Mffi: Multi-dimensional face forgery im- age dataset for real-world scenarios

    Changtao Miao, Yi Zhang, Man Luo, Weiwei Feng, Kaiyuan Zheng, Qi Chu, Tao Gong, Jianshu Li, Yunfeng Diao, Wei Zhou, et al. Mffi: Multi-dimensional face forgery im- age dataset for real-world scenarios. InProceedings of the 33rd ACM International Conference on Multimedia, pages 13235–13242, 2025. 2, 3

  37. [37]

    Laa-net: Localized artifact attention network for quality-agnostic and generalizable deepfake de- tection

    Dat Nguyen, Nesryne Mejri, Inder Pal Singh, Polina Kuleshova, Marcella Astrid, Anis Kacem, Enjie Ghorbel, and Djamila Aouada. Laa-net: Localized artifact attention network for quality-agnostic and generalizable deepfake de- tection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17395– 17405, 2024. 6

  38. [38]

    Core: Consistent repre- sentation learning for face forgery detection

    Yunsheng Ni, Depu Meng, Changqian Yu, Chengbin Quan, Dongchun Ren, and Youjian Zhao. Core: Consistent repre- sentation learning for face forgery detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12–21, 2022. 6

  39. [39]

    Thinking in frequency: Face forgery detection by min- ing frequency-aware clues

    Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. Thinking in frequency: Face forgery detection by min- ing frequency-aware clues. InEuropean conference on com- puter vision, pages 86–103. Springer, 2020. 6

  40. [40]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 5, 6

  41. [41]

    Reality defender.https : / / realitydefender.com, 2024

    Reality Defender. Reality defender.https : / / realitydefender.com, 2024. Commercial platform for detecting AI-generated media. 5, 6

  42. [42]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022. 1

  43. [43]

    Faceforen- sics++: Learning to detect manipulated facial images

    Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Chris- tian Riess, Justus Thies, and Matthias Nießner. Faceforen- sics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1–11, 2019. 2, 5

  44. [44]

    Detecting deep- fakes with self-blended images

    Kaede Shiohara and Toshihiko Yamasaki. Detecting deep- fakes with self-blended images. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18720–18729, 2022. 2, 6

  45. [45]

    DINOv3

    Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 1, 3, 4

  46. [46]

    Rethinking the up-sampling op- erations in cnn-based generative network for generalizable deepfake detection

    Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethinking the up-sampling op- erations in cnn-based generative network for generalizable deepfake detection. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 28130–28139, 2024. 1, 3, 7

  47. [47]

    Veritas: Generalizable deepfake detection via pattern-aware reasoning

    Hao Tan, Jun Lan, Zichang Tan, Ajian Liu, Chuanbiao Song, Senyuan Shi, Huijia Zhu, Weiqiang Wang, Jun Wan, and Zhen Lei. Veritas: Generalizable deepfake detection via pattern-aware reasoning. InInternational Conference on Learning Representations, 2026. 2, 5, 7, 8 10

  48. [48]

    Real appearance mod- eling for more general deepfake detection

    Jiahe Tian, Cai Yu, Xi Wang, Peng Chen, Zihao Xiao, Jiao Dai, Jizhong Han, and Yesheng Chai. Real appearance mod- eling for more general deepfake detection. InEuropean Con- ference on Computer Vision, pages 402–419. Springer, 2024. 6

  49. [49]

    Deepfakes and beyond: A survey of face manipulation and fake detection

    Ruben Tolosana, Ruben Vera-Rodriguez, Julian Fierrez, Aythami Morales, and Javier Ortega-Garcia. Deepfakes and beyond: A survey of face manipulation and fake detection. Information Fusion, 64:131–148, 2020. 1, 2

  50. [50]

    arXiv preprint arXiv:2510.16320 , year=

    Wenhao Wang, Longqi Cai, Taihong Xiao, Yuxiao Wang, and Ming-Hsuan Yang. Scaling laws for deepfake detection. arXiv preprint arXiv:2510.16320, 2025. 1, 2, 3, 5

  51. [51]

    Altfreezing for more general video face forgery detection

    Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, and Houqiang Li. Altfreezing for more general video face forgery detection. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 4129–4138, 2023. 6

  52. [52]

    Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation.arXiv preprint arXiv:2503.14905, 2025

    Siwei Wen, Junyan Ye, Peilin Feng, Hengrui Kang, Zichen Wen, Yize Chen, Jiang Wu, Wenjun Wu, Conghui He, and Weijia Li. Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation.arXiv preprint arXiv:2503.14905, 2025. 8

  53. [53]

    Learning spatiotemporal inconsistency via thumbnail layout for face deepfake detection.International Journal of Com- puter Vision, 132(12):5663–5680, 2024

    Yuting Xu, Jian Liang, Lijun Sheng, and Xiao-Yu Zhang. Learning spatiotemporal inconsistency via thumbnail layout for face deepfake detection.International Journal of Com- puter Vision, 132(12):5663–5680, 2024. 6

  54. [54]

    Zhipei Xu, Xuanyu Zhang, Runyi Li, Zecheng Tang, Qing Huang, and Jian Zhang. FakeShield: Explainable image forgery detection and localization via multi-modal large language models. In International Conference on Learning Representations, 2025. 2

  55. [55]

    Zhiyuan Yan, Yong Zhang, Yanbo Fan, and Baoyuan Wu. UCF: Uncovering common features for generalizable deepfake detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22412–22423, 2023.

  56. [56]

    Zhiyuan Yan, Yong Zhang, Xinhang Yuan, Siwei Lyu, and Baoyuan Wu. DeepfakeBench: A comprehensive benchmark of deepfake detection. In Advances in Neural Information Processing Systems, pages 4534–4565, 2023. 2

  57. [57]

    Zhiyuan Yan, Yuhao Luo, Siwei Lyu, Qingshan Liu, and Baoyuan Wu. Transcending forgery specificity with latent space augmentation for generalizable deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8984–8994, 2024. 1, 2, 6

  58. [58]

    Zhiyuan Yan, Jiangming Wang, Peng Jin, Ke-Yue Zhang, Chengchun Liu, Shen Chen, Taiping Yao, Shouhong Ding, Baoyuan Wu, and Li Yuan. Orthogonal subspace decomposition for generalizable AI-generated image detection. arXiv preprint arXiv:2411.15633, 2024. 1, 2, 5, 6, 8

  59. [59]

    Zhiyuan Yan, Taiping Yao, Shen Chen, Yandan Zhao, Xinghe Fu, Junwei Zhu, Donghao Luo, Chengjie Wang, Shouhong Ding, Yunsheng Wu, et al. DF40: Toward next-generation deepfake detection. Advances in Neural Information Processing Systems, 37:29387–29434, 2024. 2, 3, 5

  60. [60]

    Zhiyuan Yan, Yandan Zhao, Shen Chen, Mingyi Guo, Xinghe Fu, Taiping Yao, Shouhong Ding, Yunsheng Wu, and Li Yuan. Generalizing deepfake video detection with plug-and-play: Video-level blending and spatiotemporal adapter tuning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12615–12625, 2025. 6

  61. [61]

    Yongqi Yang, Zhihao Qian, Ye Zhu, Olga Russakovsky, and Yu Wu. D^3: Scaling up deepfake detection by learning from discrepancy. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23850–23859, 2025.

  62. [62]

    Andrii Yermakov, Jan Cech, Jiri Matas, and Mario Fritz. Deepfake detection that generalizes across benchmarks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 773–783, 2026. 1, 2, 5, 6

  63. [63]

    Daichi Zhang, Zihao Xiao, Shikun Li, Fanzhao Lin, Jianmin Li, and Shiming Ge. Learning natural consistency representation for face forgery video detection. In European Conference on Computer Vision, pages 407–424. Springer, 2024. 6

  64. [64]

    Hanqing Zhao, Wenbo Zhou, Dongdong Chen, Tianyi Wei, Weiming Zhang, and Nenghai Yu. Multi-attentional deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2185–2194, 2021. 1, 2

  65. [65]

    Tianchen Zhao, Xiang Xu, Mingze Xu, Hui Ding, Yuanjun Xiong, and Wei Xia. Learning self-consistency for deepfake detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15023–15033, 2021. 6

  66. [66]

    Yinglin Zheng, Jianmin Bao, Dong Chen, Ming Zeng, and Fang Wen. Exploring temporal coherence for more general video face forgery detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15044–15054, 2021. 6

  67. [67]

    Yue Zhou, Xinan He, Kaiqing Lin, Bing Fan, Feng Ding, Jinhua Zeng, and Bin Li. Brought a gun to a knife fight: Modern VFM baselines outgun specialized detectors on in-the-wild AI image detection. arXiv preprint arXiv:2509.12995, 2025. 1

  68. [68]

    Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun Ma, and Yu-Gang Jiang. WildDeepfake: A challenging real-world dataset for deepfake detection. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2382–2390, 2020. 2