
arXiv: 2602.10750 · v2 · submitted 2026-02-11 · cs.CR · cs.AI · cs.CV · cs.LG

SecureScan: An AI-Driven Multi-Layer Framework for Malware and Phishing Detection Using Logistic Regression and Threat Intelligence Integration

Pith reviewed 2026-05-16 05:45 UTC · model grok-4.3

classification: cs.CR, cs.AI, cs.CV, cs.LG
keywords: malware detection, phishing detection, logistic regression, threat intelligence, multi-layer framework, VirusTotal, false positive reduction, AI security

The pith

SecureScan detects malware and phishing at 93.1 percent accuracy by layering logistic regression with heuristics and external threat checks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SecureScan as a three-layer system that first applies quick heuristic filters to known threats, then uses logistic regression to classify uncertain samples, and finally consults VirusTotal intelligence for borderline cases. This setup targets URLs, file hashes, and binaries while introducing a calibrated threshold and gray-zone logic between 0.45 and 0.55 to cut false positives. The work shows that a lightweight statistical model can reach balanced precision of 0.87 and recall of 0.92 on benchmark data, suggesting it generalizes without the overfitting common in heavier approaches. If the results hold, the framework offers a practical path to reliable detection without relying on complex deep learning.

Core claim

SecureScan is a triple-layer detection framework that integrates logistic regression-based classification, heuristic analysis, and external threat intelligence via the VirusTotal API for comprehensive triage of URLs, file hashes, and binaries. On benchmark datasets it reaches 93.1 percent accuracy with precision 0.87 and recall 0.92, using threshold-based decision calibration and gray-zone logic to minimize false positives and demonstrate strong generalization with reduced overfitting.

What carries the argument

The triple-layer architecture that filters known threats through heuristics, classifies uncertain samples with logistic regression, and validates borderline cases with VirusTotal intelligence.
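The layering described above can be sketched as a short-circuiting function. This is a minimal illustration, not the authors' implementation: the blocklist stand-in for the heuristic layer, the injected `score` and `lookup` callables, and the fallback when the lookup is unavailable are all assumptions. Only the 0.45 and 0.55 bounds come from the paper.

```python
from typing import Callable, Optional, Set

# Gray-zone bounds quoted in the abstract; everything else here is hypothetical.
GRAY_LOW, GRAY_HIGH = 0.45, 0.55


def triage(sample: str,
           known_bad: Set[str],
           score: Callable[[str], float],
           lookup: Callable[[str], Optional[bool]]) -> str:
    """Return 'malicious' or 'safe', consulting each layer only when needed."""
    # Layer 1: heuristic filter, reduced here to a blocklist membership test.
    if sample in known_bad:
        return "malicious"
    # Layer 2: model probability that the sample is malicious.
    p = score(sample)
    if p < GRAY_LOW:
        return "safe"
    if p > GRAY_HIGH:
        return "malicious"
    # Layer 3: only gray-zone samples reach the external lookup.
    verdict = lookup(sample)
    if verdict is None:
        # Assumed fallback when the lookup is unavailable: plain 0.5 cut.
        return "malicious" if p >= 0.5 else "safe"
    return "malicious" if verdict else "safe"
```

Passing the scorer and the lookup as arguments keeps the sketch free of any real API call, and makes it easy to check that confident predictions never trigger the external layer.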

Load-bearing premise

The chosen benchmark datasets represent real-world, evolving malware and phishing threats and the VirusTotal API supplies reliable, unbiased labels for borderline cases.

What would settle it

Running SecureScan on a fresh collection of recently emerged malware and phishing samples absent from the original benchmarks and from VirusTotal at training time, then observing accuracy fall substantially below 90 percent with degraded precision-recall balance.

Figures

Figures reproduced from arXiv: 2602.10750 by Aman Dangi, Rumman Firdos.

Figure 1. SecureScan architecture showing the three-layer detection pipeline: heuristic filtering, logistic regression classification, and external verification through the VirusTotal API. This layered design enables SecureScan to minimize false positives, enhance interpretability, and maintain real-time performance suitable for deployment in enterprise or SOC environments.
Figure 2. Workflow diagram illustrating the end-to-end detection process, from input preprocessing to classification and threat intelligence validation.
3.2 Data Processing Pipeline
Raw inputs, either URLs or file hashes, are first normalized through preprocessing steps such as lowercasing, token cleaning, and removal of protocol and tracking parameters. Feature extraction differs based on input type:
• URL samples: c…
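The URL normalization steps named in the excerpt (lowercasing, removal of protocol and tracking parameters) might look like the following. This is a sketch under stated assumptions: the paper does not list which query parameters count as tracking, so `TRACKING_KEYS` and `TRACKING_PREFIXES` are illustrative guesses, and `normalize_url` is a hypothetical name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumed tracking parameters; the paper does not enumerate them.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}


def normalize_url(raw: str) -> str:
    """Lowercase a URL, drop its scheme, and strip tracking query parameters."""
    text = raw.strip().lower()
    # urlsplit only recognizes the host when a scheme or '//' is present.
    if "://" not in text:
        text = "//" + text
    parts = urlsplit(text)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in TRACKING_KEYS and not key.startswith(TRACKING_PREFIXES)]
    query = urlencode(kept)
    result = parts.netloc + parts.path
    return result + ("?" + query if query else "")
```

Lowercasing the whole string follows the excerpt literally. It also folds case-sensitive paths, which may or may not be what the authors intended.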
Figure 3. Feature extraction pipeline highlighting lexical, metadata, and structural features derived from both file and URL samples.
For the machine learning model, URL strings were tokenized at the character level and transformed using a TF-IDF vectorizer over 3–7 character n-grams, capped at 50,000 features. This high-resolution lexical representation enables the model to detect subtle anomalies in domain compos…
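To make the character-level representation concrete, here is a standard-library sketch of TF-IDF over 3 to 7 character n-grams with a 50,000-feature cap. It is a stand-in for whatever vectorizer the authors used. The smoothed IDF formula, the L2 normalization, and the choice to cap the vocabulary by document frequency are assumptions, and all function names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=3, n_max=7):
    """All contiguous character n-grams of length n_min..n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def fit_idf(docs, max_features=50_000):
    """Learn smoothed IDF weights for the most widespread n-grams."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(char_ngrams(doc)))
    vocab = [gram for gram, _ in doc_freq.most_common(max_features)]
    n_docs = len(docs)
    # Smoothed variant: ln((1 + N) / (1 + df)) + 1, so no weight is zero.
    return {g: math.log((1 + n_docs) / (1 + doc_freq[g])) + 1.0 for g in vocab}


def tfidf(doc, idf):
    """Sparse, L2-normalized TF-IDF vector for one document."""
    counts = Counter(g for g in char_ngrams(doc) if g in idf)
    weights = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {g: w / norm for g, w in weights.items()} if norm else {}
```

On a toy pair such as `paypal.com` and `paypa1.com`, the n-grams containing the substituted character appear in only one document and therefore receive a higher IDF weight than shared n-grams like `pay`. That is the kind of lexical signal the excerpt describes.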
Figure 4. VirusTotal correlation layer where ambiguous predictions are re-evaluated via API response to confirm or override model output. This hybrid decision layer ensures real-world robustness by combining statistical inference with global threat intelligence.
Figure 5. Probability calibration and gray-zone threshold visualization (0.45–0.55) showing safe, suspicious, and malicious decision regions.
Design Summary
SecureScan's architecture balances speed, interpretability, and reliability
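The three decision regions in the caption can be written as a small mapping from a calibrated probability to a label. A minimal sketch: the caption gives the 0.45–0.55 interval but not whether its endpoints are inclusive, so treating both bounds as part of the suspicious region is an assumption, as is the function name.

```python
def decision_region(p: float, low: float = 0.45, high: float = 0.55) -> str:
    """Map a malicious-class probability to safe / suspicious / malicious."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if p < low:
        return "safe"
    if p > high:
        return "malicious"
    # Assumed: both endpoints belong to the gray zone.
    return "suspicious"
```

Under this reading, only scores that fall in the suspicious band would need external verification.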
Figure 6. Confusion matrix of the calibrated logistic regression model showing balanced classification performance and reduced false positives.
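For reference, the metrics quoted throughout follow from the four confusion-matrix cells in the usual way. The sketch below uses the standard definitions. The cell counts themselves are not given on this page, so the only paper-derived number used is the F1 implied by the reported precision and recall.

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_from_pr(precision, recall),
    }


def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# F1 implied by the reported precision 0.87 and recall 0.92.
print(round(f1_from_pr(0.87, 0.92), 3))  # 0.894
```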
Figure 8. Performance comparison of different detection approaches based on accuracy, precision, recall, and F1-score metrics.
8. DISCUSSION
SecureScan shows that combining a simple and interpretable classifier like logistic regression with layered validation and threat intelligence yields high-performance detection. Key benefits include:
• Transparency: Coefficients can be inspected, aiding trust and auditability.
• …
Figure 9. Latency versus accuracy trade-off illustrating how SecureScan maintains efficiency while improving detection reliability.
Mitigations may include caching results, fallback logic, or layering ensemble approaches.
9. CONCLUSION
In this paper, we presented SecureScan, a multi-layered, AI-assisted detection framework that combines logistic regression classification, heuristic preprocessing, and external threat …
Original abstract

The growing sophistication of modern malware and phishing campaigns has diminished the effectiveness of traditional signature-based intrusion detection systems. This work presents SecureScan, an AI-driven, triple-layer detection framework that integrates logistic regression-based classification, heuristic analysis, and external threat intelligence via the VirusTotal API for comprehensive triage of URLs, file hashes, and binaries. The proposed architecture prioritizes efficiency by filtering known threats through heuristics, classifying uncertain samples using machine learning, and validating borderline cases with third-party intelligence. On benchmark datasets, SecureScan achieves 93.1 percent accuracy with balanced precision (0.87) and recall (0.92), demonstrating strong generalization and reduced overfitting through threshold-based decision calibration. A calibrated threshold and gray-zone logic (0.45-0.55) were introduced to minimize false positives and enhance real-world stability. Experimental results indicate that a lightweight statistical model, when augmented with calibrated verification and external intelligence, can achieve reliability and performance comparable to more complex deep learning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript presents SecureScan, a triple-layer detection framework that applies heuristic filtering for known threats, logistic regression for classifying uncertain samples, and VirusTotal API integration for validating borderline cases involving URLs, file hashes, and binaries. It claims 93.1% accuracy with 0.87 precision and 0.92 recall on benchmark datasets, attributing the results to calibrated gray-zone thresholds (0.45-0.55) that reduce false positives and overfitting.

Significance. If the performance claims can be substantiated with proper experimental controls, the work would show that a lightweight logistic regression model augmented by external threat intelligence can reach reliability comparable to deep learning systems, offering a practical efficiency advantage for real-time malware and phishing triage.

major comments (3)
  1. [Abstract] Abstract: the reported 93.1% accuracy, 0.87 precision, and 0.92 recall are presented without any information on benchmark dataset identities, sizes, temporal coverage, train-test splits, feature engineering, or statistical significance tests, leaving the central generalization claim unsupported.
  2. [Abstract] Abstract and architecture description: the logistic regression coefficients and gray-zone thresholds (0.45-0.55) appear to be fitted and calibrated on the same benchmark data used for final evaluation, creating circularity that prevents assessment of true generalization to unseen or evolving threats.
  3. [Abstract] Abstract: the claim that the framework achieves 'reliability and performance comparable to more complex deep learning systems' is not supported by any direct comparison, baseline results, or ablation study showing the contribution of each layer.
minor comments (1)
  1. [Abstract] The abstract refers to 'strong generalization and reduced overfitting' without specifying how overfitting was quantified (e.g., via cross-validation scores or learning curves).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions planned for the next version.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the reported 93.1% accuracy, 0.87 precision, and 0.92 recall are presented without any information on benchmark dataset identities, sizes, temporal coverage, train-test splits, feature engineering, or statistical significance tests, leaving the central generalization claim unsupported.

    Authors: We agree that the abstract and results section lack sufficient experimental details. The revised manuscript will expand the abstract and add a dedicated experimental setup subsection specifying the benchmark dataset identities, sizes, temporal coverage, train-test splits (including ratios and stratification), feature engineering process, and statistical significance tests (e.g., McNemar's test or bootstrap confidence intervals) to substantiate the generalization claims.
    Revision: yes

  2. Referee: [Abstract] Abstract and architecture description: the logistic regression coefficients and gray-zone thresholds (0.45-0.55) appear to be fitted and calibrated on the same benchmark data used for final evaluation, creating circularity that prevents assessment of true generalization to unseen or evolving threats.

    Authors: We acknowledge the risk of circularity in the current description. The revised manuscript will explicitly state that coefficients were learned on a training partition, thresholds were tuned on a held-out validation set, and final metrics were computed on a disjoint test set. We will also add discussion of temporal splits or cross-validation to address generalization to evolving threats.
    Revision: yes

  3. Referee: [Abstract] Abstract: the claim that the framework achieves 'reliability and performance comparable to more complex deep learning systems' is not supported by any direct comparison, baseline results, or ablation study showing the contribution of each layer.

    Authors: We accept that the comparability claim requires supporting evidence. The revised version will include baseline comparisons against standard ML models (e.g., random forest, SVM), an ablation analysis quantifying each layer's contribution, and references to published deep learning results on similar malware/phishing benchmarks to contextualize the performance.
    Revision: yes
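One of the significance checks named in the first response, a bootstrap confidence interval, can be sketched with the standard library. This is an illustrative percentile bootstrap on accuracy, not the authors' procedure. The resample count, the seed, and the function name are assumptions.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for classification accuracy."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    rng = random.Random(seed)
    # 1 where the prediction matches the label, 0 otherwise.
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)
    # Resample test cases with replacement and recompute accuracy each time.
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[max(int((1 - alpha / 2) * n_resamples) - 1, 0)]
    return lower, upper
```

Reporting an interval like this alongside the 93.1 percent figure would show how much of the headline accuracy could be sampling noise on the test set.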

Circularity Check

1 step flagged

Performance metrics obtained by fitting logistic regression and calibrating thresholds on the same benchmark data used for reporting

specific steps
  1. fitted input called prediction [Abstract]
    "On benchmark datasets, SecureScan achieves 93.1 percent accuracy with balanced precision (0.87) and recall (0.92), demonstrating strong generalization and reduced overfitting through threshold-based decision calibration. A calibrated threshold and gray-zone logic (0.45-0.55) were introduced to minimize false positives and enhance real-world stability."

    The logistic regression parameters are fitted to the benchmark datasets and the decision thresholds are calibrated on the identical data; therefore the quoted accuracy, precision and recall numbers are direct results of that fitting step rather than predictions on independent held-out samples.

full rationale

The paper's central claim of 93.1% accuracy, 0.87 precision and 0.92 recall rests on a logistic regression classifier whose parameters are fitted directly to the benchmark datasets, with the gray-zone thresholds (0.45-0.55) also tuned on those same data. No train/test split, temporal hold-out, or external corpus is described, so the reported figures are outputs of the fitting process rather than independent predictions. This matches the fitted-input-called-prediction pattern and produces a moderate circularity score; the architecture description itself contains no further self-referential equations or self-citations that would raise the score higher.
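The protocol that would remove this circularity is easy to state in code: fit on one partition, choose the threshold on a second, and report on a third that neither step has seen. The sketch below is an assumption-laden illustration. The 60/20/20 ratios and the accuracy-based threshold choice are not from the paper, and a random shuffle stands in for the temporal split that evolving threats would call for.

```python
import random


def three_way_split(items, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle once, then cut into disjoint train / validation / test parts."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n = len(shuffled)
    first = round(fractions[0] * n)
    second = first + round(fractions[1] * n)
    return shuffled[:first], shuffled[first:second], shuffled[second:]


def pick_threshold(val_scores, val_labels, candidates):
    """Choose the cut with the best validation accuracy; test data is never used."""
    def accuracy(threshold):
        hits = sum((s >= threshold) == bool(y)
                   for s, y in zip(val_scores, val_labels))
        return hits / len(val_labels)
    return max(candidates, key=accuracy)
```

With this separation, metrics computed on the third partition are predictions on held-out samples rather than outputs of the fitting step.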

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The performance numbers rest on fitted logistic-regression coefficients and hand-chosen thresholds whose values are determined by the benchmark data; the claim of generalization further rests on the untested assumption that those benchmarks match future threats.

free parameters (2)
  • logistic_regression_coefficients
    Parameters fitted to benchmark datasets to produce the reported accuracy.
  • gray_zone_thresholds
    0.45-0.55 interval chosen to reduce false positives; calibrated on the same data.
axioms (1)
  • domain assumption: Benchmark datasets are representative of real-world threats
    Invoked to support the generalization claim in the abstract.

pith-pipeline@v0.9.0 · 5481 in / 1316 out tokens · 54510 ms · 2026-05-16T05:45:43.907543+00:00 · methodology

