arxiv: 2604.13555 · v1 · submitted 2026-04-15 · 💻 cs.CV · cs.NI

AI Powered Image Analysis for Phishing Detection

K. Acharya, S. Ale, R. Kadel

Pith reviewed 2026-05-10 13:59 UTC · model grok-4.3

classification 💻 cs.CV cs.NI

keywords phishing detection · webpage screenshots · ConvNeXt-Tiny · Vision Transformer · transfer learning · threshold optimization · image classification · deep learning

The pith

ConvNeXt-Tiny classifies phishing webpage screenshots more accurately and efficiently than ViT-Base when decision thresholds are tuned.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Phishing sites increasingly copy visual elements like logos, layouts, and colors to bypass text and URL filters. The paper tests two pretrained vision models on screenshots to classify pages as phishing or legitimate. ConvNeXt-Tiny, a convolutional network, reaches the highest F1-score at its best threshold and uses less compute than the ViT-Base transformer. The evaluation across multiple thresholds shows why operating-point choice matters for keeping false alarms manageable in practice. The authors also plan to release their screenshot dataset to support further work.

Core claim

ConvNeXt-Tiny performs the best overall by achieving the highest F1-score at the optimised threshold while running more efficiently than ViT-Base on the task of distinguishing phishing from legitimate webpage screenshots using ImageNet transfer learning.

What carries the argument

ConvNeXt-Tiny and ViT-Base models with transfer learning from ImageNet, applied to binary classification of webpage screenshots for phishing detection.

If this is right

Threshold tuning produces operating points that trade off detection rate against false-positive rate for realistic deployment.
Convolutional networks appear better suited than transformer models to capturing visual imitation patterns under the tested conditions.
Releasing the screenshot dataset enables direct reproduction and extensions by other researchers.
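The first point above, trading detection rate against false-positive rate, can be sketched as a small threshold search. This is a minimal illustration, not the authors' procedure: the function names, the false-positive budget `max_fpr`, and the toy scores and labels are all assumptions made up for the example.

```python
from typing import List, Optional, Sequence, Tuple


def rates_at_threshold(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Return (detection_rate, false_positive_rate).

    A page is flagged as phishing when its score is >= threshold.
    Labels: 1 = phishing, 0 = legitimate.
    """
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return detection_rate, false_positive_rate


def pick_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    thresholds: Sequence[float],
    max_fpr: float,
) -> Optional[Tuple[float, float, float]]:
    """Among thresholds with FPR <= max_fpr, keep the highest detection rate.

    Returns (threshold, detection_rate, false_positive_rate), or None if
    no candidate satisfies the false-positive budget.
    """
    best = None
    for t in thresholds:
        tpr, fpr = rates_at_threshold(scores, labels, t)
        if fpr <= max_fpr and (best is None or tpr > best[1]):
            best = (t, tpr, fpr)
    return best


# Toy data (hypothetical, not from the paper).
scores: List[float] = [0.95, 0.90, 0.80, 0.65, 0.40, 0.70, 0.30, 0.20, 0.10, 0.05]
labels: List[int] = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
thresholds = [k / 10 for k in range(1, 10)]  # 0.1 ... 0.9

loose = pick_threshold(scores, labels, thresholds, max_fpr=0.2)
strict = pick_threshold(scores, labels, thresholds, max_fpr=0.0)
print("FPR budget 0.2:", loose)
print("FPR budget 0.0:", strict)
```

On this toy data, tightening the false-positive budget pushes the selected threshold upward and lowers the detection rate, which is the deployment trade-off the bullet describes.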

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same screenshot-based approach could be paired with existing URL or content filters to create layered detection systems.
The efficiency edge of ConvNeXt-Tiny suggests possible use in browser extensions or mobile scanners where compute is limited.
Visual classification techniques may transfer to spotting other copied-appearance attacks such as cloned app stores or fake social media pages.

Load-bearing premise

The collected phishing screenshot dataset and ImageNet-pretrained weights will match the visual tactics used in future real-world phishing pages without large performance drops.

What would settle it

Measure F1-score and runtime on a new collection of phishing and legitimate website screenshots assembled after the original dataset curation, checking for clear drops relative to the reported numbers.
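One way to set up such a check is a temporal hold-out, where everything captured on or after a cutoff date is kept out of model development. The sketch below is illustrative only: the `Sample` record, its fields, the file names, and the dates are hypothetical, and the paper does not state that capture dates are available for its dataset.

```python
from datetime import date
from typing import List, NamedTuple, Sequence, Tuple


class Sample(NamedTuple):
    path: str        # hypothetical screenshot file name
    label: int       # 1 = phishing, 0 = legitimate
    collected: date  # hypothetical capture date


def temporal_split(
    samples: Sequence[Sample], cutoff: date
) -> Tuple[List[Sample], List[Sample]]:
    """Split by capture date.

    Samples captured before `cutoff` go to the development set (training,
    validation, threshold selection); samples captured on or after `cutoff`
    are held out and only used for the final measurement.
    """
    development = [s for s in samples if s.collected < cutoff]
    held_out = [s for s in samples if s.collected >= cutoff]
    return development, held_out


# Placeholder records, invented for the example.
samples = [
    Sample("a.png", 1, date(2025, 3, 1)),
    Sample("b.png", 0, date(2025, 6, 15)),
    Sample("c.png", 1, date(2025, 11, 4)),
    Sample("d.png", 0, date(2026, 1, 20)),
]
cutoff = date(2025, 9, 1)

development, held_out = temporal_split(samples, cutoff)
print("development:", [s.path for s in development])
print("held out:   ", [s.path for s in held_out])
```

Comparing F1 and runtime on `held_out` with the figures obtained on `development` would then show whether performance drops on newer pages.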

Figures

Figures reproduced from arXiv: 2604.13555 by K. Acharya, R. Kadel, S. Ale.

**Figure 1.** Methodology applied during the study.

… optimiser was used to update model parameters during training. Hyperparameters, including the learning rate, batch size and dropout, were tuned during the training phase. For the classification of these binary images, the following choice configurations follow the standard practice [9]. Validation was performed on each epoch while observing model overfitting and genera…

**Figure 2.** Threshold analysis of ConvNeXt-Tiny.

TABLE III: Threshold-based performance of ViT-Base

| Threshold | Precision | Recall | F1-score |
|---|---|---|---|
| 0.1 | 0.595 | 1.000 | 0.746 |
| 0.2 | 0.635 | 0.998 | 0.776 |
| 0.3 | 0.675 | 0.996 | 0.804 |
| 0.4 | 0.715 | 0.993 | 0.832 |
| 0.5 | 0.755 | 0.988 | 0.857 |
| 0.6 | 0.795 | 0.975 | 0.877 |
| 0.7 | 0.835 | 0.950 | 0.889 |
| 0.8 | 0.920 | 0.880 | 0.900 |
| 0.9 | 0.985 | 0.665 | 0.860 |

… stable and reliable operating point for real phishing-detection scenarios. …
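For reference, the F1-score is conventionally the harmonic mean of precision $P$ and recall $R$; this is the standard definition and is assumed here to be the one the paper uses. As a worked check, the threshold 0.8 row of Table III can be recomputed from its own precision and recall:

```latex
F_1 = \frac{2PR}{P + R},
\qquad
F_1\big|_{t=0.8}
  = \frac{2 \times 0.920 \times 0.880}{0.920 + 0.880}
  = \frac{1.6192}{1.800}
  \approx 0.900
```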

**Figure 3.** Threshold analysis of ViT-Base.

… 0.992 reflects a strong balance between precision and recall, meaning the model is able to detect phishing pages effectively while keeping false alarms very low. In practical terms, this is important because blocking legitimate websites can disrupt users, while missing phishing pages poses security risks. The reported precision (0.997) and recall (0.984) indicate that the mo…

Original abstract

Phishing websites now rely heavily on visual imitation (copied logos, similar layouts, and matching colours) to avoid detection by text- and URL-based systems. This paper presents a deep learning approach that uses webpage screenshots for image-based phishing detection. Two vision models, ConvNeXt-Tiny and Vision Transformer (ViT-Base), were tested to see how well they handle visually deceptive phishing pages. The framework covers dataset creation, preprocessing, transfer learning with ImageNet weights, and evaluation using different decision thresholds. The results show that ConvNeXt-Tiny performs the best overall, achieving the highest F1-score at the optimised threshold and running more efficiently than ViT-Base. This highlights the strength of convolutional models for visual phishing detection and shows why threshold tuning is important for real-world deployment. As future work, the curated dataset used in this study will be released to support reproducibility and encourage further research in this area. Unlike many existing studies that primarily report accuracy, this work places greater emphasis on threshold-aware evaluation to better reflect real-world deployment conditions. By examining precision, recall, and F1-score across different decision thresholds, the study identifies operating points that balance detection performance and false-alarm control. In addition, the side-by-side comparison of ConvNeXt-Tiny and ViT-Base under the same experimental setup offers practical insights into how convolutional and transformer-based architectures differ in robustness and computational efficiency for visual phishing detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript evaluates two vision models—ConvNeXt-Tiny and ViT-Base—for phishing detection from webpage screenshots. It describes dataset curation, ImageNet-pretrained transfer learning, preprocessing, and threshold-based evaluation of precision, recall, and F1-score, concluding that ConvNeXt-Tiny achieves the highest F1 at the optimized threshold while being more efficient than ViT-Base. The work emphasizes threshold tuning for real-world use and plans to release the dataset.

Significance. If the performance claims are substantiated with complete metrics and controls, the paper would provide practical, deployment-oriented guidance on convolutional versus transformer architectures for visual phishing detection and underscore the value of threshold-aware metrics over accuracy alone. The dataset release would support further reproducibility in this area.

major comments (3)

[Abstract] Abstract and Results: The central claim that ConvNeXt-Tiny 'performs the best overall, achieving the highest F1-score at the optimised threshold' supplies no numerical F1 values, dataset size, class balance, train/test split ratios, or error bars. Without these quantities the performance comparison to ViT-Base cannot be verified or reproduced.
[Methods] Methods / Experiments: The transfer-learning pipeline relies on ImageNet-pretrained weights, yet no ablation compares fine-tuning against training from scratch, no feature-space analysis (e.g., t-SNE or domain-adversarial metrics) quantifies the domain shift between natural ImageNet images and structured phishing screenshots, and no temporal hold-out or cross-year test set is reported. These omissions leave the generalization assumption untested.
[Evaluation] Evaluation: The paper states that 'threshold tuning is important' and reports results 'across different decision thresholds,' but provides neither the threshold values tested, the optimization criterion used to select the operating point, nor the full precision-recall curves. This prevents assessment of whether the reported F1 advantage is robust or an artifact of a single chosen threshold.

minor comments (2)

[Abstract] The abstract mentions 'dataset creation and preprocessing' but the manuscript does not specify the exact number of phishing versus benign samples or the curation criteria used to capture current visual imitation tactics.
[Results] Computational-efficiency claims ('running more efficiently') lack concrete metrics such as inference latency, FLOPs, or memory footprint on a stated hardware platform.
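The kind of latency figure the last comment asks for could be gathered with a small timing harness. The sketch below is a generic pattern and not the authors' measurement setup: `measure_latency` and `fake_forward_pass` are hypothetical names, and the stand-in workload would be replaced by a single-image forward pass of each model on the stated hardware.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(
    run_once: Callable[[], object], warmup: int = 5, repeats: int = 30
) -> Dict[str, float]:
    """Time `run_once` and summarise per-call latency in milliseconds.

    Warm-up calls are executed first and discarded so that one-off costs
    (caches, lazy initialisation) do not distort the statistics.
    """
    for _ in range(warmup):
        run_once()

    timings_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "median_ms": statistics.median(timings_ms),
        "mean_ms": statistics.fmean(timings_ms),
        "stdev_ms": statistics.stdev(timings_ms),
    }


def fake_forward_pass() -> int:
    """Stand-in workload (assumption); swap in a real model inference call."""
    return sum(i * i for i in range(10_000))


report = measure_latency(fake_forward_pass)
print(report)
```

Reporting the median alongside the spread, together with the hardware and batch size, would make an efficiency comparison between the two models checkable.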

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and indicate the revisions planned for the next version of the manuscript.

Referee: [Abstract] Abstract and Results: The central claim that ConvNeXt-Tiny 'performs the best overall, achieving the highest F1-score at the optimised threshold' supplies no numerical F1 values, dataset size, class balance, train/test split ratios, or error bars. Without these quantities the performance comparison to ViT-Base cannot be verified or reproduced.

Authors: We agree that the abstract lacks the specific numerical details needed for immediate verification. In the revised manuscript we will update the abstract (and the corresponding Results section) to report the exact F1-scores achieved by ConvNeXt-Tiny and ViT-Base at the optimised threshold, the total number of screenshots in the dataset, the class balance, the train/test split ratios, and error bars obtained from repeated runs where applicable. revision: yes
Referee: [Methods] Methods / Experiments: The transfer-learning pipeline relies on ImageNet-pretrained weights, yet no ablation compares fine-tuning against training from scratch, no feature-space analysis (e.g., t-SNE or domain-adversarial metrics) quantifies the domain shift between natural ImageNet images and structured phishing screenshots, and no temporal hold-out or cross-year test set is reported. These omissions leave the generalization assumption untested.

Authors: The study was scoped to a direct comparison of the two architectures under standard ImageNet-pretrained transfer learning. We will add a short discussion in the Methods section on the rationale for using pretrained weights and the expected domain shift. However, performing the requested ablations from scratch, feature-space analyses, and temporal hold-out splits would require substantial additional experiments and data curation that exceed the current scope and resources; these will be noted explicitly as limitations and suggested for future work. revision: partial
Referee: [Evaluation] Evaluation: The paper states that 'threshold tuning is important' and reports results 'across different decision thresholds,' but provides neither the threshold values tested, the optimization criterion used to select the operating point, nor the full precision-recall curves. This prevents assessment of whether the reported F1 advantage is robust or an artifact of a single chosen threshold.

Authors: We agree that the threshold selection process must be fully documented. The revised manuscript will list the specific threshold values evaluated, state the optimisation criterion (maximising F1 on the validation set), and include the precision-recall curves (or tabulated metrics at multiple operating points) for both models so that readers can judge the robustness of the reported F1 advantage. revision: yes
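The criterion named in the last response, maximising F1 on the validation set, can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names and the toy validation and test scores are hypothetical, and the key design point is that the threshold is frozen on validation data before the test set is touched.

```python
from typing import List, Sequence


def f1_at(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1-score when a score >= threshold is treated as a phishing prediction."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def select_threshold(
    val_scores: Sequence[float], val_labels: Sequence[int], candidates: Sequence[float]
) -> float:
    """Return the candidate with the highest validation F1 (first one on ties)."""
    return max(candidates, key=lambda t: f1_at(val_scores, val_labels, t))


# Toy scores and labels (hypothetical, not from the paper).
val_scores: List[float] = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
val_labels: List[int] = [1, 1, 1, 1, 0, 0, 0, 0]
test_scores: List[float] = [0.95, 0.5, 0.35, 0.45, 0.2, 0.1]
test_labels: List[int] = [1, 1, 1, 0, 0, 0]

candidates = [k / 10 for k in range(1, 10)]  # 0.1 ... 0.9

# Choose the operating point on validation data only, then reuse it unchanged.
chosen = select_threshold(val_scores, val_labels, candidates)
test_f1 = f1_at(test_scores, test_labels, chosen)
print("chosen threshold:", chosen)
print("test F1 at chosen threshold:", round(test_f1, 3))
```

Selecting the threshold on the test set instead would tend to overstate the reported F1, which is why the validation/test separation matters for the robustness question the referee raises.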

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation of pretrained vision models on held-out phishing screenshot data

The paper conducts standard transfer learning from ImageNet weights, trains ConvNeXt-Tiny and ViT-Base on a curated phishing dataset, and evaluates F1, precision, recall, and efficiency metrics at multiple decision thresholds on held-out test data. No equations, derivations, or self-referential definitions appear; performance claims are direct outputs of training and threshold search rather than quantities defined in terms of themselves. No self-citations are load-bearing for the central comparison, and no fitted parameters are relabeled as predictions. The evaluation is self-contained against external benchmarks (standard ML practice) with no reduction of results to inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard transfer-learning assumptions and empirical benchmarking; no new entities postulated.

free parameters (2)

decision threshold
Optimized per model for highest F1-score on validation data; directly affects reported performance.
ImageNet pretraining weights
Fixed external initialization; no domain-specific pretraining performed.

axioms (1)

Domain assumption: ImageNet-pretrained weights transfer effectively to webpage screenshot classification for phishing detection.
Invoked in the preprocessing and training description without domain-adaptation experiments or justification.

Reference graph

Works this paper leans on

20 extracted references

[1] T. Wangchuk and T. Gonsalves, “Multimodal phishing detection on social networking sites: A systematic review,” IEEE Access, 2025.
[2] E. Kritika, “A comprehensive literature review on phishing url detection using deep learning techniques,” Journal of Cyber Security Technology, vol. 9, no. 4, pp. 315–343, 2025.
[3] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 11 976–11 986.
[4] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations.
[5] E. S. Gualberto, R. T. De Sousa, T. P. D. B. Vieira, J. P. C. L. Da Costa, and C. G. Duque, “The answer is in the text: Multi-stage methods for phishing detection based on feature engineering,” IEEE Access, vol. 8, pp. 223 529–223 547, 2020.
[6] S. Salloum, T. Gaber, S. Vadera, and K. Shaalan, “A systematic literature review on phishing email detection using natural language processing techniques,” IEEE Access, vol. 10, pp. 65 703–65 727, 2022.
[7] P. Verma, A. Goyal, and Y. Gigras, “Email phishing: Text classification using natural language processing,” Computer Science and Information Technologies, vol. 1, no. 1, pp. 1–12, 2020.
[8] R. Zieni, L. Massari, and M. C. Calzarossa, “Phishing or not phishing? a survey on the detection of phishing websites,” IEEE Access, vol. 11, pp. 18 499–18 519, 2023.
[9] Z. Alshingiti, R. Alaqel, J. Al-Muhtadi, Q. E. U. Haq, K. Saleem, and M. H. Faheem, “A deep learning-based phishing detection system using CNN, LSTM, and LSTM-CNN,” Electronics, vol. 12, no. 1, p. 232, 2023.
[10] Y. Lin, R. Liu, D. M. Divakaran, J. Y. Ng, Q. Z. Chan, Y. Lu, Y. Si, F. Zhang, and J. S. Dong, “Phishpedia: A hybrid deep learning based approach to visually identify phishing webpages,” in 30th USENIX Security Symposium, 2021, pp. 3793–3810.
[11] U. Saeed, “Visual similarity-based phishing detection using deep learning,” Journal of Electronic Imaging, vol. 31, no. 5, pp. 051 607–051 607, 2022.
[12] J. Lindamulage, L. MandiraPabasari, S. Yapa, I. Perera, and J. Krishara, “Vision gnn based phishing website detection,” in 2023 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES). IEEE, 2023, pp. 1–7.
[13] F. C. Dalgic, A. S. Bozkir, and M. Aydos, “Phish-iris: A new approach for vision based brand prediction of phishing web pages via compact visual descriptors,” in 2018 2nd international symposium on multidisciplinary studies and innovative technologies (ISMSIT). IEEE, 2018, pp. 1–8.
[14] A. Karim, M. Shahroz, K. Mustofa, S. B. Belhaouari, and S. R. K. Joga, “Phishing detection system through hybrid machine learning based on url,” IEEE Access, vol. 11, pp. 36 805–36 822, 2023.
[15] A. Le, A. Markopoulou, and M. Faloutsos, “Phishdef: Url names say it all,” in 2011 Proceedings IEEE INFOCOM. IEEE, 2011, pp. 191–195.
[16] M. Sánchez-Paniagua, E. F. Fernández, E. Alegre, W. Al-Nabki, and V. González-Castro, “Phishing url detection: A real-case scenario through login urls,” IEEE Access, vol. 10, pp. 42 949–42 960, 2022.
[17] OpenPhish, “Openphish database: Continuously updated archive of phishing urls,” https://openphish.com/phishing_database.html, 2025, accessed: 2025-11-04.
[18] S. Shahane, “Phish-iris dataset: A small scale multi-class phishing web page screenshots dataset,” https://www.kaggle.com/datasets/saurabhshahane/phishiris, 2025, accessed: 2025-11-04.
[19] D. Powers, “Evaluation: From precision, recall and f-measure to roc, informedness, markedness & correlation,” Journal of Machine Learning Technologies, vol. 2, no. 1, pp. 37–63, 2011.
[20] T. Saito and M. Rehmsmeier, “The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets,” PloS one, vol. 10, no. 3, p. e0118432, 2015.