pith · machine review for the scientific record

arxiv: 2604.17376 · v1 · submitted 2026-04-19 · 💻 cs.CV · cs.AI · cs.LG · eess.IV

Recognition: unknown

Towards Generalizable Deepfake Image Detection with Vision Transformers

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:32 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · eess.IV
keywords deepfake detection · vision transformers · ensemble learning · DF-Wild dataset · generalization · image forensics · AUC evaluation

The pith

An ensemble of fine-tuned vision transformers detects deepfake images with 96.77% AUC on a diverse test set of manipulations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that combining several vision transformer models, each fine-tuned on the DF-Wild dataset, produces a detector that generalizes better than single models or traditional CNNs. Existing deepfake detectors often fail when new generation techniques appear, so the authors test on DF-Wild, which mixes many manipulation types. They report that the ensemble reaches an AUC of 96.77 percent and an equal error rate of 9 percent, surpassing the prior Effort method by 7.05 points in AUC and 8 points in EER. This result is presented as evidence that transformer-based ensembles can handle the variety of real-world deepfakes more reliably than earlier approaches.

Core claim

Fine-tuning and ensembling DINOv2, AIMv2, and OpenCLIP ViT-L/14 models on the DF-Wild dataset yields 96.77 percent AUC and 9 percent EER on the held-out test set, outperforming individual transformers and strong CNN baselines, and beating the prior state-of-the-art Effort detector by 7.05 percentage points in AUC and 8 percentage points in EER.

What carries the argument

An ensemble formed by averaging predictions from three separately fine-tuned vision transformers (DINOv2, AIMv2, and OpenCLIP ViT-L/14) trained on spatial features from the DF-Wild dataset.
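
To make the fusion concrete, here is a minimal sketch of score-level averaging, assuming each backbone is wrapped to emit a single fake-vs-real logit per image. The lazy-linear stand-ins are placeholders for the actual fine-tuned DINOv2, AIMv2, and OpenCLIP ViT-L/14 detectors, and averaging probabilities rather than logits is an assumption, not something the paper specifies here.

```python
import torch

# Three placeholder "detectors": stand-ins for the fine-tuned ViT
# backbones, each mapping an image batch to one deepfake logit.
torch.manual_seed(0)
models = [
    torch.nn.Sequential(torch.nn.Flatten(), torch.nn.LazyLinear(1))
    for _ in range(3)
]

def ensemble_score(images: torch.Tensor) -> torch.Tensor:
    """Unweighted average of per-model fake probabilities."""
    with torch.no_grad():
        probs = [torch.sigmoid(m(images)).squeeze(-1) for m in models]
    return torch.stack(probs).mean(dim=0)

batch = torch.randn(4, 3, 224, 224)  # dummy image batch
print(ensemble_score(batch))         # one fused score per image
```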

If this is right

  • The ensemble approach can be applied directly to other image-forensics tasks that require robustness across varied synthesis methods.
  • Single vision transformers underperform the ensemble, indicating that diversity across pre-trained backbones contributes measurably to generalization.
  • CNN baselines lag behind the transformer ensemble on the same DF-Wild split, confirming the value of transformer architectures for this task.
  • The reported 9 percent EER suggests the method could support practical deployment where low false-positive rates matter.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If new deepfake generators continue to be released, periodic re-ensembling with updated transformer checkpoints may be needed to maintain the reported accuracy.
  • The success of mixing DINOv2, AIMv2, and CLIP-derived features hints that self-supervised and contrastive pre-training together capture complementary cues for forgery detection.
  • Because the method was the winning entry in the 2025 SP Cup, similar ensembles may become a standard baseline for future generalization benchmarks in deepfake detection.

Load-bearing premise

Performance measured on the current mix of manipulations in DF-Wild will continue to hold when entirely new deepfake generation methods appear in the future.

What would settle it

Running the trained ensemble on a fresh collection of deepfake images created by a generation technique absent from the DF-Wild training and test splits and measuring whether the AUC falls below 90 percent.
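
A minimal sketch of that test, with synthetic stand-in scores where real ensemble outputs on images from an unseen generator would go; the data below is random placeholder material, not paper results.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Placeholder labels (1 = fake) and detector scores; in the real check
# these would come from the trained ensemble on a fresh generator.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
scores = np.clip(labels * 0.3 + rng.normal(0.5, 0.25, size=1000), 0, 1)

auc = roc_auc_score(labels, scores)

# EER is the operating point where false-positive and false-negative
# rates cross; approximate it from the ROC curve.
fpr, tpr, _ = roc_curve(labels, scores)
fnr = 1 - tpr
eer = fpr[np.argmin(np.abs(fnr - fpr))]

print(f"AUC {auc:.4f}, EER {eer:.2%}, passes 90% bar: {auc >= 0.90}")
```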

Figures

Figures reproduced from arXiv: 2604.17376 by Aryan Herur, Deepu Vijayasenan, Hemanth K Mogilipalem, Jayavarapu S Abhinai, Kaliki V Srinanda, M Manvith Prabhu, Vaibhav Santhosh.

Figure 1: Our proposed ensemble of fine-tuned Vision Transformers for deepfake image detection.
Figure 2: Images from the DF-Wild Challenge test set.
read the original abstract

In today's day and age, we face a challenge in detecting deepfake images because of the fast evolution of modern generative models and the poor generalization capability of existing methods. In this paper, we use an ensemble of fine-tuned vision transformers like DINOv2, AIMv2 and OpenCLIP's ViT-L/14 to create generalizable method to detect deepfakes. We use the DF-Wild dataset released as part of the IEEE SP Cup 2025, because it uses a challenging and diverse set of manipulations and generation techniques. We started our experiments with CNN classifiers trained on spatial features. Experimental results show that our ensemble outperforms individual models and strong CNN baselines, achieving an AUC of 96.77% and an Equal Error Rate (EER) of just 9% on the DF-Wild test set, beating the state-of-the-art deepfake detection algorithm Effort by 7.05% and 8% in AUC and EER respectively. This was the winning solution for SP Cup, presented at ICASSP 2025.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an ensemble of fine-tuned Vision Transformers (DINOv2, AIMv2, and OpenCLIP ViT-L/14) as a generalizable approach for deepfake image detection. Using the DF-Wild dataset from the IEEE SP Cup 2025, which features diverse manipulations, the ensemble is shown to outperform individual ViT models, CNN baselines, and the prior SOTA method Effort, achieving 96.77% AUC and 9% EER on the DF-Wild test set while winning the competition.

Significance. If the empirical results hold, the work provides evidence that ViT ensembles can deliver strong performance on challenging, multi-technique deepfake data, advancing the field beyond CNN-based detectors. The competition win supplies practical validation of the approach under contest conditions. Credit is due for the direct benchmark comparison to Effort and the focus on a dataset with varied generation techniques.

major comments (2)
  1. [Abstract] The central claim of a 'generalizable method' is load-bearing for the title and contribution, yet all quantitative results (AUC 96.77%, EER 9%) are reported exclusively on DF-Wild internal splits, with no cross-dataset evaluation on established benchmarks such as FaceForensics++, Celeb-DF, or DFDC; this leaves the representativeness assumption for future unseen generators untested and makes the generalization claim dataset-contingent.
  2. [Experimental results] No ablation studies, training details (hyperparameters, data splits, fine-tuning protocol), error bars, or statistical significance tests are described for the ensemble versus individual models or CNN baselines, which undermines the ability to attribute the 7.05% AUC margin specifically to the ViT ensemble design rather than to implementation choices.
minor comments (2)
  1. [Abstract] The phrase 'we started our experiments with CNN classifiers' is unclear without subsequent details on how this informed the final ViT ensemble; a brief transition sentence would improve flow.
  2. [Results] The manuscript would benefit from a table summarizing per-model AUC/EER scores alongside the ensemble to make the 'outperforms individual models' claim immediately verifiable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of our evaluation and experimental rigor. We address each major comment below and describe the revisions we will implement.

read point-by-point responses
  1. Referee: [Abstract] The central claim of a 'generalizable method' is load-bearing for the title and contribution, yet all quantitative results (AUC 96.77%, EER 9%) are reported exclusively on DF-Wild internal splits, with no cross-dataset evaluation on established benchmarks such as FaceForensics++, Celeb-DF, or DFDC; this leaves the representativeness assumption for future unseen generators untested and makes the generalization claim dataset-contingent.

    Authors: We agree that explicit cross-dataset evaluation on benchmarks such as FaceForensics++ would strengthen the generalization claim. DF-Wild was selected precisely because it contains diverse manipulations across multiple generators, and the competition setting (IEEE SP Cup 2025) inherently tests performance on held-out data with unseen techniques. Nevertheless, we will revise the abstract to moderate the generalization language and add a dedicated limitations and future work section that discusses the scope of our claims and the value of DF-Wild as a challenging proxy. We will also attempt to include at least one cross-dataset experiment in the revision if feasible. revision: partial

  2. Referee: [Experimental results] No ablation studies, training details (hyperparameters, data splits, fine-tuning protocol), error bars, or statistical significance tests are described for the ensemble versus individual models or CNN baselines, which undermines the ability to attribute the 7.05% AUC margin specifically to the ViT ensemble design rather than to implementation choices.

    Authors: We concur that these details are essential for reproducibility and for isolating the contribution of the ensemble. In the revised manuscript we will expand the Experimental Results section to include: full hyperparameter specifications and fine-tuning protocols, explicit data split descriptions, ablation studies comparing the full ensemble against each individual ViT and against the CNN baselines, error bars computed over multiple random seeds, and statistical significance tests (e.g., paired t-tests) on the reported AUC and EER differences. revision: yes
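
One way the promised seed-level comparison could look; the AUC arrays below are illustrative placeholders, not values from the paper.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-seed test AUCs for the ensemble and the best single
# ViT, paired by training seed (placeholder numbers for illustration).
ensemble_auc = np.array([0.967, 0.965, 0.969, 0.966, 0.968])
single_auc   = np.array([0.941, 0.938, 0.944, 0.940, 0.943])

t, p = ttest_rel(ensemble_auc, single_auc)
print(f"mean gain {np.mean(ensemble_auc - single_auc):.3f}, "
      f"t = {t:.2f}, p = {p:.4f}")
```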

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark on held-out test split

full rationale

The paper reports an ensemble of fine-tuned ViTs evaluated via standard train/test splits on the DF-Wild competition dataset, with AUC/EER numbers obtained directly from inference on the held-out test set. No equations, derivations, parameter-fitting steps, or first-principles claims exist that could reduce to the inputs by construction. No self-citations are load-bearing for any uniqueness theorem or ansatz. The reported performance is a direct measurement, not a renamed fit or self-referential definition.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard transfer-learning assumptions in computer vision plus the representativeness of the competition dataset. No new entities or parameters are invented beyond model selection.

free parameters (1)
  • Model selection and ensemble combination
    Choice of DINOv2, AIMv2, and OpenCLIP ViT-L/14 plus any weighting or fusion rule is selected and likely tuned on validation data; see the sketch after this ledger.
axioms (1)
  • domain assumption Pre-trained vision transformers capture transferable features useful for deepfake classification after fine-tuning
    Invoked implicitly when the authors start with these models and report improved performance.
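
The sketch referenced in the ledger: a purely illustrative grid search over convex fusion weights on validation AUC, showing how the fusion rule becomes a tuned free parameter. All data and weights here are hypothetical; the paper itself reports simple averaging.

```python
import itertools
import numpy as np
from sklearn.metrics import roc_auc_score

# Placeholder validation labels and per-model scores, shape
# (n_models, n_images); real values would come from the three ViTs.
rng = np.random.default_rng(1)
val_labels = rng.integers(0, 2, size=500)
val_scores = np.clip(val_labels + rng.normal(0, 0.6, size=(3, 500)), 0, 1)

# Search weight triples on a 0.1 grid that sum to one (convex fusion),
# keeping the combination with the best validation AUC.
grid = np.linspace(0, 1, 11)
best = max(
    (w for w in itertools.product(grid, repeat=3)
     if abs(sum(w) - 1.0) < 1e-9),
    key=lambda w: roc_auc_score(val_labels, np.tensordot(w, val_scores, 1)),
)
print("validation-tuned weights:", best)
```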

pith-pipeline@v0.9.0 · 5525 in / 1337 out tokens · 35812 ms · 2026-05-10T06:32:15.669893+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

17 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1] Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., Jitsev, J.: Reproducible scaling laws for contrastive language-image learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2818–2829 (2023)

  2. [2] Corvi, R., Cozzolino, D., Zingarini, G., Poggi, G., Nagano, K., Verdoliva, L.: On the detection of synthetic images generated by diffusion models. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1–5. IEEE (2023)

  3. [3] Dolhansky, B., Bitton, J., Pflaum, B., Lu, J., Howes, R., Wang, M., Ferrer, C.C.: The DeepFake Detection Challenge (DFDC) dataset. arXiv preprint arXiv:2006.07397 (2020)

  4. [4] Durall, R., Keuper, M., Keuper, J.: Watch your up-convolution: CNN based generative deep neural networks are failing to reproduce spectral distributions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7890–7899 (2020)

  5. [5] Dzanic, T., Shah, K., Witherden, F.: Fourier spectrum discrepancies in deep network generated images. Advances in Neural Information Processing Systems 33, 3022–3032 (2020)

  6. [6] Fini, E., Shukor, M., Li, X., Dufter, P., Klein, M., Haldimann, D., Aitharaju, S., Brabandere, B.D., Rybkin, O., Galuba, W., et al.: Multimodal autoregressive pre-training of large vision encoders. arXiv preprint arXiv:2411.14402 (2024)

  7. [7] Frank, J., Eisenhofer, T., Schönherr, L., Fischer, A., Kolossa, D., Holz, T.: Leveraging frequency analysis for deep fake image recognition. In: International Conference on Machine Learning. pp. 3247–3258. PMLR (2020)

  8. [8] Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J., Lehtinen, J., Aila, T.: Alias-free generative adversarial networks. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems. vol. 34, pp. 852–863. Curran Associates, Inc. (2021)

  9. [9] Li, L., Bao, J., Yang, H., Chen, D., Wen, F.: Advancing high fidelity identity swapping for forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5074–5083 (2020)

  10. [10] Li, Y., Chang, M.C., Lyu, S.: In ictu oculi: Exposing AI generated fake face videos by detecting eye blinking. arXiv preprint arXiv:1806.02877 (2018)

  11. [11] Li, Y., Yang, X., Sun, P., Qi, H., Lyu, S.: Celeb-DF: A large-scale challenging dataset for deepfake forensics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)

  12. [12] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Laptev, I., Sivic, J., Neverova, N., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  13. [13] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)

  14. [14] Rössler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M.: FaceForensics++: Learning to detect manipulated facial images. In: International Conference on Computer Vision (ICCV) (2019)

  15. [15] Wang, S.Y., Wang, O., Zhang, R., Owens, A., Efros, A.A.: CNN-generated images are surprisingly easy to spot... for now. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8695–8704 (2020)

  16. [16] Yan, Z., Wang, J., Jin, P., Zhang, K.Y., Liu, C., Chen, S., Yao, T., Ding, S., Wu, B., Yuan, L.: Orthogonal subspace decomposition for generalizable AI-generated image detection. arXiv preprint arXiv:2411.15633 (2024)

  17. [17] Zhang, X., Karaman, S., Chang, S.F.: Detecting and simulating artifacts in GAN fake images. In: 2019 IEEE International Workshop on Information Forensics and Security (WIFS). pp. 1–6. IEEE (2019)