AI Powered Image Analysis for Phishing Detection
Pith reviewed 2026-05-10 13:59 UTC · model grok-4.3
The pith
ConvNeXt-Tiny classifies phishing webpage screenshots more accurately and efficiently than ViT-Base when decision thresholds are tuned.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ConvNeXt-Tiny performs the best overall by achieving the highest F1-score at the optimised threshold while running more efficiently than ViT-Base on the task of distinguishing phishing from legitimate webpage screenshots using ImageNet transfer learning.
What carries the argument
ConvNeXt-Tiny and ViT-Base models with transfer learning from ImageNet, applied to binary classification of webpage screenshots for phishing detection.
If this is right
- Threshold tuning produces operating points that trade off detection rate against false-positive rate for realistic deployment.
- Convolutional networks show better suitability than transformer models for capturing visual imitation patterns under the tested conditions.
- Releasing the screenshot dataset enables direct reproduction and extensions by other researchers.
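One way to read the first point above is as a constrained choice: among candidate thresholds, keep those whose false-positive rate stays within a budget and take the one with the highest detection rate. The sketch below illustrates that idea only. The function names, the threshold grid, the 25% budget, and the toy scores are all assumptions made for illustration and do not come from the paper.

```python
from typing import List, Optional, Sequence, Tuple


def rates_at(scores: Sequence[float], labels: Sequence[int], threshold: float) -> Tuple[float, float]:
    """Return (detection rate, false-positive rate); score >= threshold means 'phishing'."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr


def pick_operating_point(
    scores: Sequence[float],
    labels: Sequence[int],
    thresholds: Sequence[float],
    max_fpr: float,
) -> Optional[Tuple[float, float, float]]:
    """Among thresholds within the FPR budget, return (threshold, tpr, fpr) with the highest tpr."""
    best = None
    for t in thresholds:
        tpr, fpr = rates_at(scores, labels, t)
        if fpr <= max_fpr and (best is None or tpr > best[1]):
            best = (t, tpr, fpr)
    return best


# Toy, made-up model scores (1 = phishing, 0 = legitimate); not data from the paper.
scores: List[float] = [0.95, 0.80, 0.65, 0.40, 0.70, 0.30, 0.20, 0.10]
labels: List[int] = [1, 1, 1, 1, 0, 0, 0, 0]
point = pick_operating_point(scores, labels, [0.25, 0.5, 0.75], max_fpr=0.25)
print(point)
```

On this toy data the lowest threshold catches every phishing page but exceeds the false-positive budget, so the middle threshold is selected. A real deployment would set the budget from its own tolerance for false alarms.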
Where Pith is reading between the lines
- The same screenshot-based approach could be paired with existing URL or content filters to create layered detection systems.
- The efficiency edge of ConvNeXt-Tiny suggests possible use in browser extensions or mobile scanners where compute is limited.
- Visual classification techniques may transfer to spotting other copied-appearance attacks such as cloned app stores or fake social media pages.
Load-bearing premise
The collected phishing screenshot dataset and ImageNet-pretrained weights will match the visual tactics used in future real-world phishing pages without large performance drops.
What would settle it
Measure F1-score and runtime on a new collection of phishing and legitimate website screenshots assembled after the original dataset curation, checking for clear drops relative to the reported numbers.
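A minimal sketch of that check, assuming hard 0/1 predictions are already available for both the original test set and a later-collected set. It compares the two F1-scores and attaches a percentile bootstrap interval to the later one so that a drop can be judged against sampling noise. The helper names, the resample count, and the toy predictions are hypothetical and not taken from the paper.

```python
import random
from typing import List, Sequence, Tuple


def f1_score(preds: Sequence[int], labels: Sequence[int]) -> float:
    """F1 for the positive (phishing) class, using 2TP / (2TP + FP + FN)."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_interval(
    preds: Sequence[int],
    labels: Sequence[int],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for F1, resampling examples with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(f1_score([preds[i] for i in idx], [labels[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Toy, made-up predictions; real sets would be far larger.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
original_preds = [1, 1, 1, 1, 0, 0, 0, 1]
later_preds = [1, 1, 0, 0, 0, 0, 1, 1]

f1_original = f1_score(original_preds, labels)
f1_later = f1_score(later_preds, labels)
later_lo, later_hi = bootstrap_f1_interval(later_preds, labels)
print(f"original F1={f1_original:.3f}, later F1={f1_later:.3f}, later 95% CI=({later_lo:.3f}, {later_hi:.3f})")
```

With only eight examples the interval is very wide, which is the point of computing it: a drop only counts as clear when the original score falls outside the interval on a realistically sized later set.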
Original abstract
Phishing websites now rely heavily on visual imitation (copied logos, similar layouts, and matching colours) to avoid detection by text- and URL-based systems. This paper presents a deep learning approach that uses webpage screenshots for image-based phishing detection. Two vision models, ConvNeXt-Tiny and Vision Transformer (ViT-Base), were tested to see how well they handle visually deceptive phishing pages. The framework covers dataset creation, preprocessing, transfer learning with ImageNet weights, and evaluation using different decision thresholds. The results show that ConvNeXt-Tiny performs the best overall, achieving the highest F1-score at the optimised threshold and running more efficiently than ViT-Base. This highlights the strength of convolutional models for visual phishing detection and shows why threshold tuning is important for real-world deployment. As future work, the curated dataset used in this study will be released to support reproducibility and encourage further research in this area. Unlike many existing studies that primarily report accuracy, this work places greater emphasis on threshold-aware evaluation to better reflect real-world deployment conditions. By examining precision, recall, and F1-score across different decision thresholds, the study identifies operating points that balance detection performance and false-alarm control. In addition, the side-by-side comparison of ConvNeXt-Tiny and ViT-Base under the same experimental setup offers practical insights into how convolutional and transformer-based architectures differ in robustness and computational efficiency for visual phishing detection.
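For reference, the threshold-dependent metrics named in the abstract have the standard definitions below. The notation is an assumption introduced here, not the paper's: $s_i$ is the model's phishing score for screenshot $i$, $y_i \in \{0, 1\}$ is its label with 1 meaning phishing, and $\tau$ is the decision threshold.

```latex
\begin{align*}
\hat{y}_i(\tau) &= \mathbb{1}\,[\, s_i \ge \tau \,] \\
\mathrm{TP}(\tau) &= \sum_i \hat{y}_i(\tau)\, y_i, \quad
\mathrm{FP}(\tau) = \sum_i \hat{y}_i(\tau)\,(1 - y_i), \quad
\mathrm{FN}(\tau) = \sum_i \bigl(1 - \hat{y}_i(\tau)\bigr)\, y_i \\
\mathrm{Precision}(\tau) &= \frac{\mathrm{TP}(\tau)}{\mathrm{TP}(\tau) + \mathrm{FP}(\tau)}, \qquad
\mathrm{Recall}(\tau) = \frac{\mathrm{TP}(\tau)}{\mathrm{TP}(\tau) + \mathrm{FN}(\tau)} \\
F_1(\tau) &= \frac{2\,\mathrm{Precision}(\tau)\,\mathrm{Recall}(\tau)}{\mathrm{Precision}(\tau) + \mathrm{Recall}(\tau)}
            = \frac{2\,\mathrm{TP}(\tau)}{2\,\mathrm{TP}(\tau) + \mathrm{FP}(\tau) + \mathrm{FN}(\tau)}
\end{align*}
```

Raising $\tau$ can only shrink the set of pages flagged as phishing, so recall never increases with $\tau$, while precision may move in either direction. This is why a single reported threshold can hide much of the trade-off.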
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates two vision models—ConvNeXt-Tiny and ViT-Base—for phishing detection from webpage screenshots. It describes dataset curation, ImageNet-pretrained transfer learning, preprocessing, and threshold-based evaluation of precision, recall, and F1-score, concluding that ConvNeXt-Tiny achieves the highest F1 at the optimized threshold while being more efficient than ViT-Base. The work emphasizes threshold tuning for real-world use and plans to release the dataset.
Significance. If the performance claims are substantiated with complete metrics and controls, the paper would provide practical, deployment-oriented guidance on convolutional versus transformer architectures for visual phishing detection and underscore the value of threshold-aware metrics over accuracy alone. The dataset release would support further reproducibility in this area.
major comments (3)
- [Abstract] Abstract and Results: The central claim that ConvNeXt-Tiny 'performs the best overall, achieving the highest F1-score at the optimised threshold' supplies no numerical F1 values, dataset size, class balance, train/test split ratios, or error bars. Without these quantities the performance comparison to ViT-Base cannot be verified or reproduced.
- [Methods] Methods / Experiments: The transfer-learning pipeline relies on ImageNet-pretrained weights, yet no ablation compares fine-tuning against training from scratch, no feature-space analysis (e.g., t-SNE or domain-adversarial metrics) quantifies the domain shift between natural ImageNet images and structured phishing screenshots, and no temporal hold-out or cross-year test set is reported. These omissions leave the generalization assumption untested.
- [Evaluation] Evaluation: The paper states that 'threshold tuning is important' and reports results 'across different decision thresholds,' but provides neither the threshold values tested, the optimization criterion used to select the operating point, nor the full precision-recall curves. This prevents assessment of whether the reported F1 advantage is robust or an artifact of a single chosen threshold.
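The kind of tabulation the last comment asks for can be sketched as follows: precision, recall, and F1 computed at every threshold on a grid, so that the F1 advantage can be inspected across operating points and not only at one. This is an illustrative sketch, not the authors' evaluation code. The function names, the threshold grid, and the toy scores are assumptions.

```python
from typing import List, Sequence, Tuple


def prf_at(scores: Sequence[float], labels: Sequence[int], threshold: float) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) when score >= threshold is predicted as phishing."""
    tp = fp = fn = 0
    for s, y in zip(scores, labels):
        predicted_phishing = s >= threshold
        if predicted_phishing and y == 1:
            tp += 1
        elif predicted_phishing and y == 0:
            fp += 1
        elif not predicted_phishing and y == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def threshold_table(
    scores: Sequence[float], labels: Sequence[int], thresholds: Sequence[float]
) -> List[Tuple[float, float, float, float]]:
    """One row per threshold: (threshold, precision, recall, F1)."""
    return [(t, *prf_at(scores, labels, t)) for t in thresholds]


# Toy, made-up scores (1 = phishing, 0 = legitimate); not results from the paper.
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
table = threshold_table(scores, labels, [0.25, 0.5, 0.75])

print("threshold  precision  recall  f1")
for t, p, r, f in table:
    print(f"{t:9.2f}  {p:9.3f}  {r:6.3f}  {f:.3f}")
```

Producing one such table per model under the same grid would let a reader see whether one architecture leads at most thresholds or only near its own optimum.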
minor comments (2)
- [Abstract] The abstract mentions 'dataset creation and preprocessing' but the manuscript does not specify the exact number of phishing versus benign samples or the curation criteria used to capture current visual imitation tactics.
- [Results] Computational-efficiency claims ('running more efficiently') lack concrete metrics such as inference latency, FLOPs, or memory footprint on a stated hardware platform.
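For the latency part of that request, a small timing harness is often enough. The sketch below times repeated calls to a prediction function after a few warm-up calls and reports median, mean, and maximum latency in milliseconds. `dummy_predict` is a hypothetical stand-in for a real model's forward pass, and the warm-up and repeat counts are arbitrary assumptions. FLOPs and memory footprint would need separate tooling.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence


def benchmark_latency(
    predict: Callable[[Sequence[Sequence[float]]], List[float]],
    batch: Sequence[Sequence[float]],
    warmup: int = 3,
    repeats: int = 20,
) -> Dict[str, float]:
    """Time `predict(batch)` and summarise per-call latency in milliseconds."""
    for _ in range(warmup):
        predict(batch)  # warm-up calls are executed but not timed
    timings_ms: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(batch)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(timings_ms),
        "mean_ms": statistics.mean(timings_ms),
        "max_ms": max(timings_ms),
    }


def dummy_predict(batch: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical placeholder for a model: returns the mean of each input row."""
    return [sum(row) / len(row) for row in batch]


batch = [[0.1 * i for i in range(256)] for _ in range(32)]
stats = benchmark_latency(dummy_predict, batch)
print(stats)
```

When timing a GPU model, the device would need to be synchronised before each clock read; that step is omitted here because the placeholder runs on the CPU. Reported numbers are only comparable when the hardware, batch size, and input resolution are stated alongside them.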
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and indicate the revisions planned for the next version of the manuscript.
- Referee: [Abstract] Abstract and Results: The central claim that ConvNeXt-Tiny 'performs the best overall, achieving the highest F1-score at the optimised threshold' supplies no numerical F1 values, dataset size, class balance, train/test split ratios, or error bars. Without these quantities the performance comparison to ViT-Base cannot be verified or reproduced.
  Authors: We agree that the abstract lacks the specific numerical details needed for immediate verification. In the revised manuscript we will update the abstract (and the corresponding Results section) to report the exact F1-scores achieved by ConvNeXt-Tiny and ViT-Base at the optimised threshold, the total number of screenshots in the dataset, the class balance, the train/test split ratios, and error bars obtained from repeated runs where applicable.
  revision: yes
- Referee: [Methods] Methods / Experiments: The transfer-learning pipeline relies on ImageNet-pretrained weights, yet no ablation compares fine-tuning against training from scratch, no feature-space analysis (e.g., t-SNE or domain-adversarial metrics) quantifies the domain shift between natural ImageNet images and structured phishing screenshots, and no temporal hold-out or cross-year test set is reported. These omissions leave the generalization assumption untested.
  Authors: The study was scoped to a direct comparison of the two architectures under standard ImageNet-pretrained transfer learning. We will add a short discussion in the Methods section on the rationale for using pretrained weights and the expected domain shift. However, performing the requested ablations from scratch, feature-space analyses, and temporal hold-out splits would require substantial additional experiments and data curation that exceed the current scope and resources; these will be noted explicitly as limitations and suggested for future work.
  revision: partial
- Referee: [Evaluation] Evaluation: The paper states that 'threshold tuning is important' and reports results 'across different decision thresholds,' but provides neither the threshold values tested, the optimization criterion used to select the operating point, nor the full precision-recall curves. This prevents assessment of whether the reported F1 advantage is robust or an artifact of a single chosen threshold.
  Authors: We agree that the threshold selection process must be fully documented. The revised manuscript will list the specific threshold values evaluated, state the optimisation criterion (maximising F1 on the validation set), and include the precision-recall curves (or tabulated metrics at multiple operating points) for both models so that readers can judge the robustness of the reported F1 advantage.
  revision: yes
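The criterion stated in the last response, maximising F1 on the validation set, can be sketched as below. The threshold is chosen using validation scores only and then applied unchanged to the test set, so the reported test F1 is not tuned on test data. The candidate grid, function names, and toy scores are assumptions for illustration and are not the authors' values.

```python
from typing import Sequence


def f1_at(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 for the phishing class when score >= threshold is predicted as phishing."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def select_threshold(
    val_scores: Sequence[float], val_labels: Sequence[int], candidates: Sequence[float]
) -> float:
    """Return the candidate threshold with the highest validation F1 (first one wins ties)."""
    return max(candidates, key=lambda t: f1_at(val_scores, val_labels, t))


# Toy, made-up validation and test scores (1 = phishing, 0 = legitimate).
val_scores = [0.9, 0.7, 0.55, 0.35, 0.6, 0.3, 0.2, 0.1]
val_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_scores = [0.85, 0.75, 0.45, 0.25, 0.65, 0.4, 0.15, 0.05]
test_labels = [1, 1, 1, 1, 0, 0, 0, 0]

chosen = select_threshold(val_scores, val_labels, [0.3, 0.5, 0.7])
test_f1 = f1_at(test_scores, test_labels, chosen)
print(f"chosen threshold={chosen}, validation F1={f1_at(val_scores, val_labels, chosen):.3f}, test F1={test_f1:.3f}")
```

In this toy run the test F1 comes out lower than the validation F1 at the chosen threshold. That gap is what readers would want to see reported for both models.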
Circularity Check
No circularity: purely empirical evaluation of pretrained vision models on held-out phishing screenshot data
The paper conducts standard transfer learning from ImageNet weights, trains ConvNeXt-Tiny and ViT-Base on a curated phishing dataset, and evaluates F1, precision, recall, and efficiency metrics at multiple decision thresholds on held-out test data. No equations, derivations, or self-referential definitions appear; performance claims are direct outputs of training and threshold search rather than quantities defined in terms of themselves. No self-citations are load-bearing for the central comparison, and no fitted parameters are relabeled as predictions. The evaluation is self-contained against external benchmarks (standard ML practice) with no reduction of results to inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (2)
- decision threshold
- ImageNet pretraining weights
axioms (1)
- domain assumption: Transfer learning from ImageNet weights transfers effectively to webpage screenshot classification for phishing detection