Towards Generalizable Deepfake Image Detection with Vision Transformers
Pith reviewed 2026-05-10 06:32 UTC · model grok-4.3
The pith
An ensemble of fine-tuned vision transformers detects deepfake images with 96.77% AUC on a diverse test set of manipulations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fine-tuning and ensembling DINOv2, AIMv2, and OpenCLIP ViT-L/14 models on the DF-Wild dataset yields 96.77 percent AUC and 9 percent EER on the held-out test set, outperforming individual transformers and strong CNN baselines, and beating the prior state-of-the-art Effort detector by 7.05 percentage points in AUC and 8 percentage points in EER.
What carries the argument
An ensemble formed by averaging predictions from three separately fine-tuned vision transformers (DINOv2, AIMv2, and OpenCLIP ViT-L/14) trained on spatial features from the DF-Wild dataset.
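Prediction averaging of this kind can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function name, the toy scores, and the uniform weights are assumptions (the review states only that predictions are averaged, not the exact weighting).

```python
import numpy as np

def ensemble_score(probs_dino, probs_aim, probs_clip):
    """Uniform average of per-model 'fake' probabilities.

    Each argument holds one backbone's probability that each image
    is fake, shape (n_images,). Uniform weights are an assumption;
    the review only says predictions are averaged.
    """
    return (np.asarray(probs_dino, dtype=float)
            + np.asarray(probs_aim, dtype=float)
            + np.asarray(probs_clip, dtype=float)) / 3.0

# Toy scores from three hypothetical fine-tuned models on four images.
fused = ensemble_score([0.9, 0.2, 0.7, 0.1],
                       [0.8, 0.3, 0.6, 0.2],
                       [0.7, 0.1, 0.8, 0.3])
# fused -> approximately [0.8, 0.2, 0.7, 0.2]
```

Averaging calibrated probabilities (rather than hard labels) lets a confident model outvote two uncertain ones, which is one plausible source of the ensemble's gain over any single backbone.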
If this is right
- The ensemble approach can be applied directly to other image-forensics tasks that require robustness across varied synthesis methods.
- Single vision transformers underperform the ensemble, indicating that diversity across pre-trained backbones contributes measurably to generalization.
- CNN baselines lag behind the transformer ensemble on the same DF-Wild split, confirming the value of transformer architectures for this task.
- The reported 9 percent EER, the operating point at which false-positive and false-negative rates are equal, suggests the method could support practical deployment where both error types matter.
Where Pith is reading between the lines
- If new deepfake generators continue to be released, periodic re-ensembling with updated transformer checkpoints may be needed to maintain the reported accuracy.
- The success of mixing DINOv2, AIMv2, and CLIP-derived features hints that self-supervised and contrastive pre-training together capture complementary cues for forgery detection.
- Because the method was the winning entry in the 2025 SP Cup, similar ensembles may become a standard baseline for future generalization benchmarks in deepfake detection.
Load-bearing premise
Performance measured on the current mix of manipulations in DF-Wild will continue to hold when entirely new deepfake generation methods appear in the future.
What would settle it
Running the trained ensemble on a fresh collection of deepfake images created by a generation technique absent from the DF-Wild training and test splits and measuring whether the AUC falls below 90 percent.
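Such a check reduces to computing AUC and EER from labels and scores. The sketch below does this in plain NumPy; the toy labels and scores are invented for illustration and are not from the paper.

```python
import numpy as np

def auc_and_eer(labels, scores):
    """AUC and equal error rate for binary labels (1 = fake) and
    detector scores, where higher means more likely fake."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    # AUC = probability a random fake outscores a random real image
    # (Mann-Whitney statistic; ties count half).
    diff = pos[:, None] - neg[None, :]
    auc = (diff > 0).mean() + 0.5 * (diff == 0).mean()
    # Sweep decision thresholds; the EER is the point where the
    # false-positive and false-negative rates cross.
    eer, gap = 0.5, np.inf
    for t in np.unique(scores):
        fpr = (neg >= t).mean()   # reals flagged as fake
        fnr = (pos < t).mean()    # fakes missed
        if abs(fpr - fnr) < gap:
            gap, eer = abs(fpr - fnr), (fpr + fnr) / 2
    return auc, eer

# Invented toy data: 3 real images (label 0), 3 fakes (label 1).
auc, eer = auc_and_eer([0, 0, 0, 1, 1, 1],
                       [0.1, 0.7, 0.35, 0.8, 0.65, 0.9])
# auc is 8/9 (one misordered pair out of nine), eer is 1/3
```

On a fresh generator's images the same two numbers would directly answer the question posed above: an AUC below 90 percent would undercut the generalization claim.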
Original abstract
In today's day and age, we face a challenge in detecting deepfake images because of the fast evolution of modern generative models and the poor generalization capability of existing methods. In this paper, we use an ensemble of fine-tuned vision transformers like DINOv2, AIMv2 and OpenCLIP's ViT-L/14 to create generalizable method to detect deepfakes. We use the DF-Wild dataset released as part of the IEEE SP Cup 2025, because it uses a challenging and diverse set of manipulations and generation techniques. We started our experiments with CNN classifiers trained on spatial features. Experimental results show that our ensemble outperforms individual models and strong CNN baselines, achieving an AUC of 96.77% and an Equal Error Rate (EER) of just 9% on the DF-Wild test set, beating the state-of-the-art deepfake detection algorithm Effort by 7.05% and 8% in AUC and EER respectively. This was the winning solution for SP Cup, presented at ICASSP 2025.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an ensemble of fine-tuned Vision Transformers (DINOv2, AIMv2, and OpenCLIP ViT-L/14) as a generalizable approach for deepfake image detection. Using the DF-Wild dataset from the IEEE SP Cup 2025, which features diverse manipulations, the ensemble is shown to outperform individual ViT models, CNN baselines, and the prior SOTA method Effort, achieving 96.77% AUC and 9% EER on the DF-Wild test set while winning the competition.
Significance. If the empirical results hold, the work provides evidence that ViT ensembles can deliver strong performance on challenging, multi-technique deepfake data, advancing the field beyond CNN-based detectors. The competition win supplies practical validation of the approach under contest conditions. Credit is due for the direct benchmark comparison to Effort and the focus on a dataset with varied generation techniques.
major comments (2)
- [Abstract] The central claim of a 'generalizable method' is load-bearing for the title and contribution, yet all quantitative results (AUC 96.77%, EER 9%) are reported exclusively on DF-Wild internal splits, with no cross-dataset evaluation on established benchmarks such as FaceForensics++, Celeb-DF, or DFDC; this leaves the representativeness assumption for future unseen generators untested and makes the generalization claim dataset-contingent.
- [Experimental results] No ablation studies, training details (hyperparameters, data splits, fine-tuning protocol), error bars, or statistical significance tests are described for the ensemble versus individual models or CNN baselines, which undermines the ability to attribute the 7.05-point AUC margin specifically to the ViT ensemble design rather than to implementation choices.
minor comments (2)
- [Abstract] The phrase 'we started our experiments with CNN classifiers' is unclear without subsequent details on how this informed the final ViT ensemble; a brief transition sentence would improve flow.
- [Results] The manuscript would benefit from a table summarizing per-model AUC/EER scores alongside the ensemble to make the 'outperforms individual models' claim immediately verifiable.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of our evaluation and experimental rigor. We address each major comment below and describe the revisions we will implement.
Point-by-point responses
- Referee: [Abstract] The central claim of a 'generalizable method' is load-bearing for the title and contribution, yet all quantitative results (AUC 96.77%, EER 9%) are reported exclusively on DF-Wild internal splits, with no cross-dataset evaluation on established benchmarks such as FaceForensics++, Celeb-DF, or DFDC; this leaves the representativeness assumption for future unseen generators untested and makes the generalization claim dataset-contingent.
Authors: We agree that explicit cross-dataset evaluation on benchmarks such as FaceForensics++ would strengthen the generalization claim. DF-Wild was selected precisely because it contains diverse manipulations across multiple generators, and the competition setting (IEEE SP Cup 2025) inherently tests performance on held-out data with unseen techniques. Nevertheless, we will revise the abstract to moderate the generalization language and add a dedicated limitations and future work section that discusses the scope of our claims and the value of DF-Wild as a challenging proxy. We will also attempt to include at least one cross-dataset experiment in the revision if feasible. revision: partial
- Referee: [Experimental results] No ablation studies, training details (hyperparameters, data splits, fine-tuning protocol), error bars, or statistical significance tests are described for the ensemble versus individual models or CNN baselines, which undermines the ability to attribute the 7.05-point AUC margin specifically to the ViT ensemble design rather than to implementation choices.
Authors: We concur that these details are essential for reproducibility and for isolating the contribution of the ensemble. In the revised manuscript we will expand the Experimental Results section to include: full hyperparameter specifications and fine-tuning protocols, explicit data split descriptions, ablation studies comparing the full ensemble against each individual ViT and against the CNN baselines, error bars computed over multiple random seeds, and statistical significance tests (e.g., paired t-tests) on the reported AUC and EER differences. revision: yes
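A paired comparison of the kind promised above could look like the following sketch, run over matched random seeds. The per-seed AUC values are invented for illustration, and the helper is a bare t-statistic (in practice scipy.stats.ttest_rel yields the same statistic together with a p-value).

```python
import math

def paired_t(a, b):
    """Paired t-statistic for two models' per-seed scores.

    a, b: equal-length sequences, one score per random seed, where
    the i-th entries of both come from the same seed.
    """
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Invented per-seed AUCs: ensemble vs. one backbone, 5 seeds.
ensemble = [0.967, 0.965, 0.969, 0.966, 0.968]
single   = [0.930, 0.920, 0.940, 0.930, 0.925]
t_stat = paired_t(ensemble, single)
# A large |t| (well above the 5%-level critical value of about
# 2.78 for 4 degrees of freedom) supports a genuine difference.
```

Pairing by seed removes run-to-run variance shared by both models, which is why it is the appropriate test here rather than an unpaired comparison.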
Circularity Check
No circularity: purely empirical benchmark on held-out test split
Full rationale
The paper reports an ensemble of fine-tuned ViTs evaluated via standard train/test splits on the DF-Wild competition dataset, with AUC/EER numbers obtained directly from inference on the held-out test set. No equations, derivations, parameter-fitting steps, or first-principles claims exist that could reduce to the inputs by construction. No self-citations are load-bearing for any uniqueness theorem or ansatz. The reported performance is a direct measurement, not a renamed fit or self-referential definition.
Axiom & Free-Parameter Ledger
free parameters (1)
- Model selection and the ensemble combination rule (prediction averaging)
axioms (1)
- Domain assumption: pre-trained vision transformers capture transferable features useful for deepfake classification after fine-tuning
Reference graph
Works this paper leans on
- [1] Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., Jitsev, J.: Reproducible scaling laws for contrastive language-image learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2818–2829 (2023)
- [2] Corvi, R., Cozzolino, D., Zingarini, G., Poggi, G., Nagano, K., Verdoliva, L.: On the detection of synthetic images generated by diffusion models. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1–5. IEEE (2023)
- [3] Dolhansky, B., Bitton, J., Pflaum, B., Lu, J., Howes, R., Wang, M., Ferrer, C.C.: The DeepFake Detection Challenge (DFDC) dataset. arXiv preprint arXiv:2006.07397 (2020)
- [4] Durall, R., Keuper, M., Keuper, J.: Watch your up-convolution: CNN based generative deep neural networks are failing to reproduce spectral distributions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7890–7899 (2020)
- [5] Dzanic, T., Shah, K., Witherden, F.: Fourier spectrum discrepancies in deep network generated images. Advances in Neural Information Processing Systems 33, 3022–3032 (2020)
- [6] Fini, E., Shukor, M., Li, X., Dufter, P., Klein, M., Haldimann, D., Aitharaju, S., Brabandere, B.D., Rybkin, O., Galuba, W., et al.: Multimodal autoregressive pre-training of large vision encoders. arXiv preprint arXiv:2411.14402 (2024)
- [7] Frank, J., Eisenhofer, T., Schönherr, L., Fischer, A., Kolossa, D., Holz, T.: Leveraging frequency analysis for deep fake image recognition. In: International Conference on Machine Learning. pp. 3247–3258. PMLR (2020)
- [8] Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J., Lehtinen, J., Aila, T.: Alias-free generative adversarial networks. In: Advances in Neural Information Processing Systems. vol. 34, pp. 852–863. Curran Associates, Inc. (2021)
- [9] Li, L., Bao, J., Yang, H., Chen, D., Wen, F.: Advancing high fidelity identity swapping for forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5074–5083 (2020)
- [10] Li, Y., Chang, M.C., Lyu, S.: In ictu oculi: Exposing AI generated fake face videos by detecting eye blinking. arXiv preprint arXiv:1806.02877 (2018)
- [11] Li, Y., Yang, X., Sun, P., Qi, H., Lyu, S.: Celeb-DF: A large-scale challenging dataset for deepfake forensics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)
- [12] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Laptev, I., Sivic, J., Neverova, N., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
- [13] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)
- [14] Rössler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M.: FaceForensics++: Learning to detect manipulated facial images. In: International Conference on Computer Vision (ICCV) (2019)
- [15] Wang, S.Y., Wang, O., Zhang, R., Owens, A., Efros, A.A.: CNN-generated images are surprisingly easy to spot... for now. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8695–8704 (2020)
- [16] Yan, Z., Wang, J., Jin, P., Zhang, K.Y., Liu, C., Chen, S., Yao, T., Ding, S., Wu, B., Yuan, L.: Orthogonal subspace decomposition for generalizable AI-generated image detection. arXiv preprint arXiv:2411.15633 (2024)
- [17] Zhang, X., Karaman, S., Chang, S.F.: Detecting and simulating artifacts in GAN fake images. In: 2019 IEEE International Workshop on Information Forensics and Security (WIFS). pp. 1–6. IEEE (2019)