Comparative Evaluation of Deep Learning Models for Fake Image Detection
Pith reviewed 2026-05-21 04:55 UTC · model grok-4.3
The pith
VGG16 achieves 91% accuracy detecting GAN-manipulated images, ahead of three other CNNs at 90%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a unified preprocessing and training pipeline on a dataset of real and manipulated images, the evaluation finds that VGG16 reaches 91% accuracy while XceptionNet, ResNet50, and EfficientNetB0 each reach 90%. EfficientNetB0 exhibits stronger sensitivity to fake images but reduced reliability on real samples, reflecting imbalance-driven bias. The work supplies a reproducible baseline and calls attention to the requirements for balanced datasets, advanced augmentation, and fairness-aware training to build more reliable detection systems.
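The "imbalance-driven bias" in this claim can be made concrete. Two detectors can share the same overall accuracy while splitting their errors very differently between real and fake images. The sketch below uses hypothetical confusion-matrix counts on a hypothetical 700-fake / 300-real test set. These are not the paper's numbers. It also assumes "fake" is the positive class.

```python
def per_class_recall(tp_fake, fn_fake, tn_real, fp_real):
    """Return (accuracy, recall on fakes, recall on reals) for a binary detector.

    Assumed convention: positive class = fake. tp_fake: fakes flagged as fake;
    fn_fake: fakes passed as real; tn_real: reals kept as real;
    fp_real: reals wrongly flagged as fake.
    """
    total = tp_fake + fn_fake + tn_real + fp_real
    accuracy = (tp_fake + tn_real) / total
    fake_recall = tp_fake / (tp_fake + fn_fake)   # sensitivity to fakes
    real_recall = tn_real / (tn_real + fp_real)   # reliability on real images
    return accuracy, fake_recall, real_recall


# Hypothetical counts (NOT from the paper) on a 700 fake / 300 real test set.
even_errors = per_class_recall(tp_fake=630, fn_fake=70, tn_real=270, fp_real=30)
fake_leaning = per_class_recall(tp_fake=680, fn_fake=20, tn_real=220, fp_real=80)

print(even_errors)    # accuracy 0.90, fake recall 0.90, real recall 0.90
print(fake_leaning)   # accuracy 0.90, fake recall ~0.97, real recall ~0.73
```

Both models report 90% accuracy. Only the per-class view reveals the pattern described for EfficientNetB0: stronger sensitivity to fakes at the cost of reliability on real samples.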
What carries the argument
A unified preprocessing and training pipeline applied to four pretrained CNN architectures (VGG16, ResNet50, EfficientNetB0, XceptionNet) for binary classification of real versus GAN-manipulated images.
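The abstract does not list the exact preprocessing settings. The following is a minimal sketch of what a "unified" resize / normalize / augment step could look like. The 224×224 size, ImageNet channel statistics, and random horizontal flip are assumptions for illustration, as are the helper names `resize_nearest` and `preprocess`.

```python
import numpy as np

# Assumed settings (not specified in the abstract): 224x224 inputs, ImageNet
# channel statistics for normalization, random horizontal flip as augmentation.
TARGET_SIZE = (224, 224)
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def resize_nearest(image, size):
    """Nearest-neighbour resize of an (H, W, C) array to size = (height, width)."""
    h, w = image.shape[:2]
    rows = np.arange(size[0]) * h // size[0]   # source row for each output row
    cols = np.arange(size[1]) * w // size[1]   # source column for each output column
    return image[rows[:, None], cols[None, :]]


def preprocess(image, rng=None):
    """Resize, scale to [0, 1], optionally flip, then standardize each channel.

    Pass an rng only for training images; evaluation images stay deterministic.
    """
    x = resize_nearest(image, TARGET_SIZE).astype(np.float32) / 255.0
    if rng is not None and rng.random() < 0.5:
        x = x[:, ::-1, :]  # horizontal flip (training-time augmentation)
    return (x - IMAGENET_MEAN) / IMAGENET_STD


rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)  # stand-in image
train_x = preprocess(raw, rng)   # may be flipped
test_x = preprocess(raw)         # never flipped
```

One design caveat applies to a shared pipeline. The Keras reference models each ship their own `preprocess_input`: caffe-style mean subtraction for VGG16 and ResNet50, scaling to [-1, 1] for Xception, and a pass-through for EfficientNet. Xception's default input is also 299×299. A truly unified pipeline therefore has to choose between one shared normalization and each backbone's native one.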
If this is right
- The reported numbers establish a reproducible baseline for comparing CNN performance on fake image detection.
- Dataset imbalance produces measurable bias, with some models more reliable on fake images than on real ones.
- Better generalization would require balanced datasets and advanced augmentation to reduce overfitting.
- Fairness-aware training methods could improve reliability for applications in digital forensics.
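A common, simple remedy for the imbalance named above is to reweight the training loss by inverse class frequency. The paper is not reported to have used this; it is shown only as one standard option. The formula matches scikit-learn's documented `class_weight="balanced"` heuristic, `n_samples / (n_classes * count)`. The 700/300 split below is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rarer classes get larger weights, so each class contributes the same
    total weight to the loss.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical training labels: 700 fake (1) and 300 real (0) images.
labels = [1] * 700 + [0] * 300
weights = balanced_class_weights(labels)
print(weights)   # real images weighted ~1.67, fake images ~0.71
```

In Keras, a dict like this can be passed as `model.fit(..., class_weight=weights)`. Reweighting only rebalances the loss. It does not add the diversity that a genuinely balanced dataset would.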
Where Pith is reading between the lines
- If the performance ranking holds on other collections, VGG16 could be used as an initial model in content verification pipelines.
- The comparative approach could be repeated with additional manipulation types or with hybrid systems that combine CNNs and other detection signals.
- Applying the identical pipeline to video frames or higher-resolution images would test whether the accuracy ordering remains stable.
Load-bearing premise
The single dataset after resizing, normalization, and augmentation is sufficiently representative and balanced for the performance numbers and observed biases to generalize beyond this specific collection.
What would settle it
Evaluating the same four models on a new, independently collected dataset with equal numbers of real and manipulated images would show whether the 91% accuracy and the imbalance bias persist or shift.
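As a stopgap before new data is collected, a balanced evaluation set can be carved out of an existing pool by undersampling every class to the size of the smallest one. This sketch assumes a hypothetical pool of `(path, label)` pairs with label 1 = fake. It equalizes class counts only. It cannot supply the independence from the training distribution that a newly collected dataset would.

```python
import random


def balanced_subsample(items, label_of, seed=0):
    """Undersample every class to the size of the smallest class."""
    by_class = {}
    for item in items:
        by_class.setdefault(label_of(item), []).append(item)
    n = min(len(group) for group in by_class.values())
    rng = random.Random(seed)            # fixed seed keeps the subset reproducible
    subset = []
    for group in by_class.values():
        subset.extend(rng.sample(group, n))
    rng.shuffle(subset)
    return subset


# Hypothetical pool: 700 fake and 300 real images.
pool = [(f"img_{i}.png", 1 if i < 700 else 0) for i in range(1000)]
eval_set = balanced_subsample(pool, label_of=lambda pair: pair[1])
print(len(eval_set))   # 600 images: 300 real, 300 fake
```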
Original abstract
The growing sophistication of GAN-based image manipulation presents significant challenges for digital forensics. This study compares the performance of four pretrained CNN architectures including VGG16, ResNet50, EfficientNetB0, and XceptionNet for fake image detection using a unified preprocessing and training pipeline. A dataset of real and manipulated images was processed through resizing, normalization, and augmentation to address class imbalance and improve generalization. Models were evaluated using Accuracy, Precision, Recall, F1-score, and ROC-AUC. VGG16 achieved the highest accuracy at 91%, with XceptionNet, ResNet50, and EfficientNetB0 each reaching 90%. EfficientNetB0 showed stronger sensitivity to fake images but reduced reliability on real samples, reflecting imbalance-driven bias. Limitations include dataset imbalance, overfitting, and limited interpretability, which affect cross-domain robustness. The study provides a reproducible baseline and underscores the need for balanced datasets, advanced augmentation, and fairness-aware training to develop reliable fake image detection systems.
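The five metrics named in the abstract can be computed from labels and scores as sketched below. Two points are assumptions here: that "fake" is the positive class and that 0.5 is the decision threshold. The abstract states neither. ROC-AUC is computed in its rank form, as the probability that a random fake scores above a random real. It is the only one of the five that does not depend on the threshold. The pairwise loop is O(n²) and meant only for illustration.

```python
def binary_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, precision, recall, F1 (positive = fake = 1) and ROC-AUC."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # ROC-AUC: fraction of (fake, real) pairs ranked correctly; ties count half.
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "roc_auc": auc}


# Toy example with made-up scores.
metrics = binary_metrics([1, 1, 1, 0, 0], [0.9, 0.8, 0.4, 0.6, 0.1])
print(metrics)   # accuracy 0.6, precision/recall/F1 2/3, ROC-AUC 5/6
```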
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates four pretrained CNN architectures (VGG16, ResNet50, EfficientNetB0, XceptionNet) for fake image detection on a dataset of real and manipulated images. It applies a unified pipeline of resizing, normalization, and augmentation to address class imbalance, then reports performance using Accuracy, Precision, Recall, F1-score, and ROC-AUC. VGG16 is stated to reach the highest accuracy of 91%, with the other three models each at 90%; the abstract notes imbalance-driven bias, overfitting risk, and limited cross-domain robustness while positioning the work as a reproducible baseline.
Significance. If the reported ordering is shown to be statistically stable, the study supplies a practical baseline comparison of standard CNN backbones under a shared preprocessing regime for digital forensics. The explicit discussion of imbalance bias and the call for fairness-aware training add modest practical value, though the absence of variance controls restricts broader claims about model superiority.
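Part of why variance controls matter is test-set sampling noise alone. The abstract does not give the test-set size, so the sketch below assumes a hypothetical 1,000 test images. It puts approximate 95% Wilson score intervals on 91% and 90% accuracy.

```python
import math


def wilson_interval(correct, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


n_test = 1000  # hypothetical test-set size, not taken from the paper
vgg_lo, vgg_hi = wilson_interval(910, n_test)       # ~(0.891, 0.926)
other_lo, other_hi = wilson_interval(900, n_test)   # ~(0.880, 0.917)
print(vgg_lo < other_hi)   # True: the intervals overlap substantially
```

Overlapping intervals are only a rough signal. Because all four models are scored on the same images, a paired test on their disagreements is the more direct check.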
major comments (1)
- [Abstract] Abstract (performance claims): VGG16 is reported at 91% accuracy and XceptionNet/ResNet50/EfficientNetB0 at 90% on a single train/test split, yet no standard deviations, multiple random seeds, k-fold results, or paired significance tests are supplied. Given the abstract's own acknowledgment of imbalance-driven bias, the 1% gap cannot be treated as a reliable ranking without these controls; the central comparative claim therefore rests on unquantified sampling variation.
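One standard paired significance test for two classifiers evaluated on the same test set is McNemar's test. It uses only the images on which exactly one model is correct. Below is a minimal sketch of the exact (binomial) version. The discordant counts are hypothetical. They are chosen to be consistent with a 10-image accuracy gap on a 1,000-image test set.

```python
from math import comb


def mcnemar_exact(a_right_b_wrong, a_wrong_b_right):
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    Under H0 (equal error rates), each discordant image is equally likely
    to favour either model, so the smaller count follows Binomial(n, 0.5).
    """
    n = a_right_b_wrong + a_wrong_b_right
    k = min(a_right_b_wrong, a_wrong_b_right)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n   # P(X <= k)
    return min(1.0, 2 * tail)


# Hypothetical: VGG16 alone is right on 40 images, the other model alone on 30.
p_value = mcnemar_exact(40, 30)
print(round(p_value, 3))   # roughly 0.28, far from conventional significance
```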
minor comments (1)
- The abstract and methods description would benefit from explicit dataset statistics (total images, real/fake split) and the precise augmentation operations applied.
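Reporting per-split real/fake counts is simplest when the train/test split is stratified and seeded, so the proportions and the exact split can be reproduced. This is a minimal stdlib sketch. The 80/20 ratio, the seed, and the 700-fake / 300-real pool are assumptions for illustration.

```python
import random
from collections import Counter


def stratified_split(items, label_of, test_fraction=0.2, seed=42):
    """Split items so train and test keep the same per-class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for item in items:
        by_class.setdefault(label_of(item), []).append(item)
    train, test = [], []
    for group in by_class.values():
        group = group[:]                      # copy before shuffling
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def describe(split_name, items, label_of):
    """One-line dataset statistics (label 0 = real, 1 = fake)."""
    counts = Counter(label_of(item) for item in items)
    return f"{split_name}: total={len(items)}, real={counts[0]}, fake={counts[1]}"


label = lambda pair: pair[1]
pool = [(f"img_{i}.png", 1 if i < 700 else 0) for i in range(1000)]  # hypothetical
train, test = stratified_split(pool, label_of=label)
print(describe("train", train, label))   # train: total=800, real=240, fake=560
print(describe("test", test, label))     # test: total=200, real=60, fake=140
```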
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The concern regarding the lack of statistical controls for the reported performance differences is valid, and we address it directly below while committing to revisions that improve the robustness of our comparative claims without overstating the current results.
Point-by-point responses
- Referee: [Abstract] Abstract (performance claims): VGG16 is reported at 91% accuracy and XceptionNet/ResNet50/EfficientNetB0 at 90% on a single train/test split, yet no standard deviations, multiple random seeds, k-fold results, or paired significance tests are supplied. Given the abstract's own acknowledgment of imbalance-driven bias, the 1% gap cannot be treated as a reliable ranking without these controls; the central comparative claim therefore rests on unquantified sampling variation.
  Authors: We agree that the 1% accuracy difference observed on a single train/test split cannot be interpreted as a statistically reliable ranking, particularly in light of the class imbalance and potential sampling variation we already note in the manuscript. In the revised version, we will conduct additional experiments using five different random seeds for each model, reporting mean accuracy, precision, recall, F1-score, and ROC-AUC along with standard deviations. This will allow us to quantify variability and evaluate the stability of the observed ordering. We will also update the abstract and results section to reflect these new statistics and strengthen the discussion of limitations. Full k-fold cross-validation remains computationally prohibitive for the current dataset size and model training regime, but the multiple-seed approach provides a practical improvement. We will not claim statistical superiority in the revision but will position the work more explicitly as an initial baseline comparison.
  Revision: partial.
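The five-seed plan in this response amounts to aggregating each metric across runs as mean ± sample standard deviation. Here is a minimal sketch with hypothetical per-seed results, not the authors' numbers. One caveat: if the train/test split is held fixed, seed variation captures initialization and data-order noise but not test-set sampling noise. Re-drawing the split per seed would capture both.

```python
import statistics


def summarize_seeds(results):
    """Mean and sample standard deviation of each metric across seed runs."""
    metrics = results[0].keys()
    return {m: (statistics.mean([r[m] for r in results]),
                statistics.stdev([r[m] for r in results]))
            for m in metrics}


# Hypothetical results for one model over five seeds (NOT from the paper).
seed_results = [
    {"accuracy": 0.91, "f1": 0.90},
    {"accuracy": 0.89, "f1": 0.88},
    {"accuracy": 0.90, "f1": 0.89},
    {"accuracy": 0.92, "f1": 0.91},
    {"accuracy": 0.90, "f1": 0.89},
]
summary = summarize_seeds(seed_results)
for metric, (mean, sd) in summary.items():
    print(f"{metric}: {mean:.3f} ± {sd:.3f}")   # accuracy: 0.904 ± 0.011
```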
Circularity Check
No significant circularity in empirical model comparison
full rationale
The paper reports direct empirical accuracies from training and evaluating four pretrained CNN architectures (VGG16, ResNet50, EfficientNetB0, XceptionNet) on a preprocessed dataset of real and fake images. Results such as VGG16 reaching 91% accuracy are measured outcomes on a held-out test set using standard metrics. There are no equations, derivations, fitted parameters, or mathematical chains that could reduce to their inputs by construction. The abstract and description contain no self-citations, ansatzes, or uniqueness claims that bear load on the central results. This is a standard, self-contained experimental ML evaluation whose results can be checked against external benchmarks, consistent with the reader's assessment of minimal circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The collected real and manipulated images form a representative sample of the distribution encountered in real-world digital forensics.
Reference graph
Works this paper leans on
- [1] F. Guarnera, O. Giudice, and S. Battiato, “Deepfake detection by analyzing convolutional traces,” IEEE Transactions on Information Forensics and Security, vol. 19, pp. 1–12, 2024.
- [2] Alzubaidi, J. Zhang, A. Humaidi, et al., “Review of deep learning: concepts, CNN architectures, challenges, applications, future directions,” Journal of Big Data, vol. 8, no. 1, pp. 1–74, 2021.
- [3] S. Raza, M. Munir, and A. Almutairi, “Deep learning for image forgery detection: CNN architectures and optimization strategies,” Multimedia Tools and Applications, vol. 81, no. 23, pp. 33421–33445, 2022.
- [4] S. Mukta, A. Rahman, and M. S. Hossain, “Adversarial attacks and defenses in deep learning: A survey,” IEEE Access, vol. 11, pp. 12345–12367, 2023.
- [5] M. Chauhan, S. Ghanshala, and R. Joshi, “Convolutional neural network (CNN) for image detection and recognition,” International Journal of Computer Applications, vol. 180, no. 5, pp. 1–5, 2018.
- [6] Mansoor and R. Iliev, “Explainable AI for deepfake detection: Challenges and opportunities,” in Proc. IEEE Int. Conf. Artificial Intelligence and Ethics, 2025, pp. 1–8.
- [7] Y. Liu and W. Deng, “Deep learning in image classification: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 12, pp. 2485–2500, 2015.
- [8] Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, “FaceForensics++: Learning to detect manipulated facial images,” in Proc. IEEE Int. Conf. Computer Vision (ICCV), 2019, pp. 1–11.
- [9] Amin, S. Khan, and M. Hussain, “Limitations of shallow CNNs in detecting adversarial manipulations,” Multimedia Tools and Applications, vol. 83, no. 4, pp. 11245–11260, 2024.
- [10] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2015, pp. 770–778.
- [11] H. Dang, F. Liu, and J. Stehouwer, “Deepfake detection with residual networks,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshops (CVPRW), 2019, pp. 1–7.
- [12] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proc. Int. Conf. Machine Learning (ICML), 2019, pp. 6105–6114.
- [13] S. Rana and W. Sung, “Ensemble learning for deepfake detection using ResNet and EfficientNet,” Applied Sciences, vol. 10, no. 23, pp. 1–15, 2020.
- [14] D. Pokroy and A. Egorov, “EfficientNet for mobile deepfake detection,” Journal of Real Time Image Processing, vol. 18, no. 2, pp. 145–156, 2021.
- [15] G. Petmezas, A. Tefas, and I. Pitas, “Challenges in CNN based deepfake detection,” Pattern Recognition Letters, vol. 165, pp. 1–10, 2025.
- [16] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1251–1258.
- [17] S. Ganguly, R. Singh, and M. Vatsa, “Hybrid deepfake detection using CNNs and transformers,” IEEE Transactions on Biometrics, Behavior, and Identity Science, vol. 4, no. 2, pp. 1–12, 2022.
- [18] M. Ritter, J. Kim, and T. Nguyen, “Comparative study of CNN architectures for video deepfake detection,” Multimedia Tools and Applications, vol. 82, no. 5, pp. 6543–6561, 2023.
- [19] L. Verdoliva, “Media forensics and deepfakes: An overview,” IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 5, pp. 910–932, 2020.
- [20] J. Wang, Y. Zhang, and H. Li, “Hybrid CNN transformer architectures for robust deepfake detection,” Neural Networks, vol. 165, pp. 1–15, 2024.
- [21] R. Babu, K. Sharma, and P. Gupta, “Generalization challenges in deepfake detection: Emerging synthesis methods,” IEEE Access, vol. 13, pp. 1–12, 2025.
- [22] T. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” Journal of Big Data, vol. 6, no. 1, pp. 1–48, 2019.