Enhancing Malware Detection with Generative AI: Using Variational Autoencoders to Boost Machine Learning Classifiers' Performance

Jeremy Straub; Mohammad Alharbi

arxiv: 2606.06501 · v1 · pith:HP2JBWWUnew · submitted 2026-05-17 · 💻 cs.CR

Pith reviewed 2026-06-30 19:37 UTC · model grok-4.3

classification 💻 cs.CR

keywords: malware detection, variational autoencoders, generative AI, machine learning classifiers, dataset augmentation, synthetic data, cybersecurity

The pith

Variational autoencoders generate synthetic malware samples that improve machine learning classifier performance on detection tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes using variational autoencoders to create synthetic malware data for augmenting training sets. The goal is to overcome data scarcity for new or rare malware types by generating high-quality samples that resemble real malware. Classifiers such as random forest, XGBoost, and sequential models show gains in accuracy, precision, recall, and F1-score when trained on the augmented data. A sympathetic reader would care because evolving malware threats require adaptive detection methods that can handle limited data for emerging variants.

Core claim

The paper claims that training machine learning classifiers on datasets augmented with VAE-generated synthetic malware samples leads to notable improvements in accuracy, precision, recall, and F1-scores compared to training on original data alone. This approach addresses data scarcity for less common malware types and enhances robustness against evolving threats.
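The four metrics named in the claim can be computed from true and predicted labels as in the minimal sketch below. It assumes macro-averaging across malware classes, which the abstract does not specify, and the function name `macro_scores` is illustrative rather than taken from the paper.

```python
from collections import Counter

def macro_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro-averaging is an assumption; the paper's averaging scheme is not stated.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    prec, rec, f1 = [], [], []
    for c in labels:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        prec.append(p)
        rec.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)
    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(prec) / n,
        "recall": sum(rec) / n,
        "f1": sum(f1) / n,
    }
```

Reporting these values for both the original and the augmented training condition, on the same real test set, is what would let a reader check the claimed gains.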

What carries the argument

Variational autoencoders (VAEs) used to produce synthetic malware samples for dataset augmentation.
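For orientation, the standard VAE training objective from the general literature is given below. This page does not state the paper's exact loss, so treat it as background rather than the authors' formulation. Here x is a feature vector, z the latent code, q_φ the encoder, p_θ the decoder, and μ_j, σ_j the encoder outputs for each of the d latent dimensions, assuming a diagonal Gaussian encoder and a standard normal prior.

```latex
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]
  - D_{\mathrm{KL}}\bigl(q_\phi(z \mid x) \,\|\, p(z)\bigr),
\qquad
D_{\mathrm{KL}}
  = -\frac{1}{2} \sum_{j=1}^{d} \bigl(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\bigr)
```

Once trained, synthetic samples are typically obtained by drawing z from the prior N(0, I) and passing it through the decoder.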

If this is right

Augmented datasets lead to better classifier performance metrics.
The method facilitates adaptation to new malware threats.
Generative AI demonstrates utility in cybersecurity applications.
It provides a foundation for developing more resilient malware detection systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This technique could extend to other cybersecurity tasks with data imbalance such as intrusion detection.
If the synthetic samples maintain quality across different malware families, collection of large real datasets becomes less critical.
Combining VAEs with additional generative approaches might further increase sample diversity for rare threats.

Load-bearing premise

The synthetic malware samples generated by the variational autoencoder must be high-quality, diverse, and closely mimic real-world malware without introducing misleading artifacts to the classifiers.
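One way to probe this premise is a two-sample statistic such as maximum mean discrepancy (MMD), which the referee report also names. The sketch below is a plain-Python, biased estimate of squared MMD with an RBF kernel. The kernel width `gamma` and the function names are assumptions for illustration, not values or code from the paper.

```python
import math

def rbf(x, y, gamma):
    """RBF kernel between two feature vectors of equal length."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(real, synth, gamma=1.0):
    """Biased estimate of squared MMD between two lists of feature vectors.

    Values near zero suggest the two samples are hard to tell apart
    under this kernel; larger values suggest a distribution mismatch.
    """
    def mean_kernel(a, b):
        return sum(rbf(x, y, gamma) for x in a for y in b) / (len(a) * len(b))

    return (mean_kernel(real, real)
            + mean_kernel(synth, synth)
            - 2 * mean_kernel(real, synth))
```

To be informative, `real` should be held-out real malware features that the generator never saw, not the samples it was trained on.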

What would settle it

The claim would be falsified by a test in which classifiers trained on the augmented data perform no better than, or worse than, classifiers trained on the original data alone, with both evaluated on a held-out set of real, previously unseen malware samples.
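Such a test could be organized as in the following sketch. The hooks `fit_generator`, `fit_classifier`, and `score` are hypothetical placeholders for a VAE, a classifier, and a metric; none of them come from the paper. The point of the structure is that the generator is fit on the training split only and that both classifiers are scored on the same real-only test split.

```python
import random

def split_real(samples, test_fraction=0.2, seed=0):
    """Shuffle real samples and split them into disjoint train and test lists."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def compare_augmentation(samples, fit_generator, fit_classifier, score,
                         n_synth, seed=0):
    """Score a baseline and an augmented classifier on the same real test set.

    samples: list of (features, label) pairs, all real.
    fit_generator(train) must return a function mapping a count to that
    many synthetic (features, label) pairs (hypothetical interface).
    """
    train, test = split_real(samples, seed=seed)
    generate = fit_generator(train)      # generator never sees the test split
    synthetic = generate(n_synth)
    baseline = fit_classifier(train)
    augmented = fit_classifier(train + synthetic)
    return {
        "baseline": score(baseline, test),    # test split is real data only
        "augmented": score(augmented, test),
    }
```

Repeating the comparison over several seeds would give a spread for each score rather than a single point estimate.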

Figures

Figures reproduced from arXiv: 2606.06501 by Jeremy Straub, Mohammad Alharbi.

**Figure 1.** Study Process Workflow. The process starts with the original data, before applying the generative AI model. Then, VAEs are used to generate synthetic samples. This process creates the data that will be used to augment the original dataset with artificially created data points which are likely to improve the model’s training and, thus, performance. Third, the newly generated synthetic samples from the VAE m…

Original abstract

The advancement of malware poses obstacles for cybersecurity, necessitating the development of advanced detection techniques. This paper proposes an approach to enhance malware detection through the use of a generative artificial intelligence model. Specifically, variational autoencoders (VAEs) are used with the random forest, XGBoost and sequential model machine learning classifiers. Generated synthetic malware samples are used to address the critical issue of data scarcity for new or less common malware types. This approach can be used to augment datasets to improve classifier robustness. The proposed methodology uses VAEs to create high-quality diverse synthetic datasets that closely mimic real-world malware data. The effectiveness of these augmented datasets is evaluated by comparing the performance of the machine learning classifiers when they are trained with the original data and when they are trained with the synthetic data-augmented datasets. The results demonstrate a notable improvement in the accuracy, precision, recall and F1-scores of the classifiers, when they are trained using the augmented datasets. The enhanced performance for detecting various malware classes shows the potential of this approach to facilitate adaptation to evolving malware threats effectively. This work demonstrates the utility of generative AI for cybersecurity. It also provides a foundation for future research aimed at developing more resilient and adaptive malware detection systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper applies standard VAE augmentation to malware detection but the abstract supplies no numbers, baselines, or fidelity checks, so the claimed gains cannot be assessed.

The core idea is to train a variational autoencoder on malware data and then use the generated samples to augment training sets for random forest, XGBoost, and a sequential classifier. The stated goal is to improve detection of rare malware classes where real samples are scarce.

The work correctly flags a genuine operational problem: malware datasets are often heavily imbalanced, and collecting fresh labeled examples is slow and costly. Suggesting a generative model to fill gaps is a direct response to that constraint.
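To make the imbalance concrete, the sketch below computes how many synthetic samples each class would need to reach a target count. Balancing every class up to the majority class is an assumption made for illustration; this page does not report how many samples the paper generated, and the class names used in the usage note are hypothetical.

```python
from collections import Counter

def synthetic_quota(labels, target=None):
    """Return the number of synthetic samples needed per class.

    target defaults to the size of the largest class (an assumption;
    other targets are equally defensible).
    """
    counts = Counter(labels)
    if target is None:
        target = max(counts.values())
    return {c: max(0, target - n) for c, n in counts.items()}
```

For example, with five "adware", two "banking", and one "sms" sample, the quota is zero, three, and four respectively.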

The abstract asserts notable gains in accuracy, precision, recall, and F1 after augmentation, yet it contains none of the supporting figures, no dataset sizes, no baseline numbers, no error bars, and no description of how synthetic-sample quality was verified. The stress-test note is accurate on this point: without evidence that the VAE was fit only on training splits, that synthetic samples were checked for distributional match to held-out real malware, and that the final test set remained untouched, any reported lift could come from the classifiers latching onto VAE artifacts rather than malware structure.

The technique itself is not new. VAE-based oversampling for imbalanced classification is routine in the broader ML literature and has already appeared in cybersecurity papers.

A reader working on practical security ML pipelines might still find the setup worth examining for implementation details if the full paper supplies the missing controls. The paper is not ready for citation in its current form because the central result rests on an unverified assumption. It should go to peer review rather than desk rejection, provided the manuscript includes proper experimental reporting; the underlying data-scarcity issue is worth addressing even if this particular execution needs tightening.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes using variational autoencoders (VAEs) to generate synthetic malware samples for augmenting training datasets, then evaluates the resulting performance gains of random forest, XGBoost, and a sequential model classifier on malware detection tasks. The central claim is that training on the VAE-augmented data produces notable improvements in accuracy, precision, recall, and F1-score relative to the original data, thereby addressing scarcity for rare malware classes.

Significance. If the claimed gains were shown to arise from faithful synthetic samples rather than artifacts or leakage, the work would offer a practical demonstration of generative models for mitigating data imbalance in cybersecurity. The topic is relevant, but the manuscript supplies none of the quantitative controls or fidelity checks needed to establish that the result holds.

major comments (3)

[Abstract] Abstract: the assertion of 'notable improvement in the accuracy, precision, recall and F1-scores' supplies no numerical deltas, baseline values, dataset sizes, number of synthetic samples, error bars, or statistical significance tests, so the central performance claim cannot be assessed.
[Abstract] Abstract/Methodology: the claim that the VAE produces 'high-quality diverse synthetic datasets that closely mimic real-world malware data' is stated without any reported fidelity metrics (reconstruction loss on held-out real samples, MMD, feature histograms, or label-integrity checks) or confirmation that the VAE was trained exclusively on training splits with no test leakage.
[Results] Results: no tables or figures compare classifier metrics before versus after augmentation, and no details are given on class balance, volume of generated data, or whether the test set remained entirely real and disjoint.
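The error bars requested in the first comment could be produced with a paired bootstrap over the real test set, as sketched below. The function names, the default of 1000 resamples, and the percentile interval are illustrative choices, not the authors' procedure.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_delta_ci(y_true, pred_base, pred_aug, metric,
                       n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for metric(augmented) - metric(baseline).

    Resamples test indices with replacement and applies the same indices
    to both prediction lists, so the comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(y_true)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        truth = [y_true[i] for i in idx]
        aug = metric(truth, [pred_aug[i] for i in idx])
        base = metric(truth, [pred_base[i] for i in idx])
        deltas.append(aug - base)
    deltas.sort()
    lower = deltas[int((alpha / 2) * n_boot)]
    upper = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

An interval that excludes zero would support a real difference between the two training conditions; an interval that straddles zero would not.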

minor comments (1)

[Abstract] Abstract: the term 'sequential model' is used without specifying the network architecture (LSTM, CNN, or MLP).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the manuscript requires additional quantitative details and controls to support its claims, and we will revise accordingly to strengthen the presentation of results.

Referee: [Abstract] Abstract: the assertion of 'notable improvement in the accuracy, precision, recall and F1-scores' supplies no numerical deltas, baseline values, dataset sizes, number of synthetic samples, error bars, or statistical significance tests, so the central performance claim cannot be assessed.
Authors: We agree that the abstract lacks the necessary quantitative support. In the revised version we will update the abstract to report specific numerical deltas, baseline values, dataset sizes, number of synthetic samples generated, error bars, and the outcomes of statistical significance tests. revision: yes

Referee: [Abstract] Abstract/Methodology: the claim that the VAE produces 'high-quality diverse synthetic datasets that closely mimic real-world malware data' is stated without any reported fidelity metrics (reconstruction loss on held-out real samples, MMD, feature histograms, or label-integrity checks) or confirmation that the VAE was trained exclusively on training splits with no test leakage.
Authors: We acknowledge the absence of these controls. The revised methodology section will include reported fidelity metrics such as reconstruction loss on held-out samples, MMD, feature histograms, and label-integrity checks, together with an explicit statement that the VAE was trained exclusively on the training split with no test leakage. revision: yes

Referee: [Results] Results: no tables or figures compare classifier metrics before versus after augmentation, and no details are given on class balance, volume of generated data, or whether the test set remained entirely real and disjoint.
Authors: We agree that direct comparisons and supporting details are missing. The revised results section will add tables and figures comparing metrics before and after augmentation, plus explicit information on class balance, volume of generated data, and confirmation that the test set consists solely of real, disjoint samples. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is self-contained

full rationale

The paper presents a standard empirical pipeline: train a VAE on malware data, generate synthetic samples, augment the training set, and measure downstream classifier metrics (accuracy, precision, recall, F1) on held-out real data. No mathematical derivation chain exists that reduces any claimed result to its inputs by construction. The abstract asserts that the synthetic data 'closely mimic real-world malware data' but does not use that assertion as a premise whose truth is established only by the classifier gains; the gains are reported as direct experimental outcomes. No self-citations, uniqueness theorems, or ansatzes are invoked. This is a typical applied ML augmentation study whose validity rests on experimental controls (train/test splits, fidelity checks) rather than definitional or self-referential logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unverified premise that VAE outputs are faithful to real malware distributions; this is an untested domain assumption rather than a derived result. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: VAE-generated samples are high-quality, diverse, and closely mimic real-world malware data.
Invoked in the abstract to justify the augmentation step; no validation procedure is described.
