pith. sign in

arxiv: 2606.03432 · v1 · pith:UD4H3YP3new · submitted 2026-06-02 · 💻 cs.CR · cs.AI· cs.LG

A Hybrid Approach For Malware Classification Using Secondary Features Fusion

Pith reviewed 2026-06-28 09:48 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.LG
keywords malware classificationfeature fusionAPI callsn-gramsensemble votingMicrosoft malware datasetbinary and multi-class
0
0 comments X

The pith

A hybrid method fuses API calls with fixed and variable n-grams, applies customized selection, and votes among algorithms to classify malware families.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes automating both malware detection and assignment to specific families by first pulling API calls along with fixed-length and variable-length n-grams from samples. These features are fused after a customized selection step, then fed to a voting ensemble that combines multiple learning algorithms. Experiments on the Microsoft malware dataset show the method works for both binary detection and multi-class family labeling, with results compared against earlier techniques. A reader would care because most existing detectors stop at finding malware and leave family identification to manual effort, while rapid family grouping can guide faster response to new variants.

Core claim

The paper claims that feature fusion of API calls together with fixed and variable length n-grams, performed after a customized selection procedure, followed by a voting-based fusion of multiple algorithms, produces effective malware family classification. On the Microsoft dataset this yields an AUC of 0.989, accuracy of 99.72 percent, and log loss of 0.01 in both binary and multi-class settings, outperforming prior reported results.

What carries the argument

Secondary feature fusion: extraction of API calls and n-grams, customized selection, then voting ensemble across algorithms

If this is right

  • Both binary detection and multi-class family labeling become feasible within one pipeline on the same feature set.
  • The voting ensemble produces lower log loss than single algorithms, indicating more confident family assignments.
  • The reported numbers exceed those listed for earlier methods on the identical Microsoft dataset.
  • The approach remains practical because it relies on static features that can be extracted without running the sample.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the fusion truly captures complementary signals, similar secondary-feature combinations might improve classification of other file types such as documents or scripts.
  • The customized selection procedure could be re-applied periodically on new samples to keep the feature set current without full retraining.
  • A drop in performance on zero-day families would point to the need for an online update mechanism rather than a static model.

Load-bearing premise

The customized feature selection step and the voting ensemble will not overfit to the Microsoft dataset and will continue to work on malware samples never seen during training.

What would settle it

Running the trained model on a fresh collection of malware samples drawn from a different source or collected after the Microsoft dataset and observing accuracy fall below 90 percent would falsify the generalization claim.

read the original abstract

The number of malware (either variant or novel) is rapidly increasing, making malware detection and mitigation a complex problem. One approach to improving malware mitigation is automatic detection and malware family classification. However, traditional malware detection methods cannot classify detected malware into their respective families, hindering effective malware mitigation. Consequently, this paper proposes a method to automate malware detection and classification of the detected malware into respective malware families. The proposed method uses feature fusion after extracting relevant malware features such as API calls and fixed and variable length n-grams with a customized feature selection method. Moreover, for the predictive model, a voting based approach is proposed for algorithm fusion. For the experimental evaluation of the proposed method, both binary and multi-class classification approaches are applied to the data set provided by Microsoft. Finally, the experimental results are compared with the state of the art. The experimental results indicate the effectiveness and efficiency of the proposed approach with an AUC of 0.989, accuracy of 99.72%, and a log loss of 0.01.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to present a hybrid malware classification method that extracts API calls along with fixed- and variable-length n-grams, applies a customized feature selection step, fuses the resulting secondary features, and employs a voting-based ensemble for both binary and multi-class family classification on the Microsoft malware dataset, reporting an AUC of 0.989, accuracy of 99.72 %, and log loss of 0.01 while outperforming prior state-of-the-art approaches.

Significance. If the performance figures are obtained under leakage-free validation and generalize beyond the single Microsoft corpus, the fusion of API-call and n-gram features with a voting ensemble would constitute a useful incremental advance in automated malware family classification. The reported low log-loss and high AUC indicate potentially strong discriminative power; however, the absence of methodological safeguards in the reported pipeline leaves the central empirical claim unverified.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experimental Evaluation): the reported AUC of 0.989, accuracy of 99.72 %, and log loss of 0.01 are presented without any description of the cross-validation scheme, train-test split ratio, or hyper-parameter tuning protocol. Because the central claim rests on these metrics, the lack of these details prevents assessment of whether the results support the effectiveness assertion.
  2. [§3] §3 (Proposed Method): the customized feature selection procedure is described as operating on the extracted features without any statement that selection is performed exclusively inside training folds or via nested cross-validation. When selection occurs on the full Microsoft dataset before splitting, the chosen features can encode information from the held-out test samples, directly inflating the reported performance and undermining the generalization claim.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'data set' appears inconsistently; standardize to 'dataset'. Ensure all acronyms (AUC, API) are defined at first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment below and clarify the experimental protocol and feature selection process. Where details were insufficiently explicit, we will revise the manuscript to improve clarity and reproducibility.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experimental Evaluation): the reported AUC of 0.989, accuracy of 99.72 %, and log loss of 0.01 are presented without any description of the cross-validation scheme, train-test split ratio, or hyper-parameter tuning protocol. Because the central claim rests on these metrics, the lack of these details prevents assessment of whether the results support the effectiveness assertion.

    Authors: We agree that the cross-validation and tuning details should have been stated explicitly. The evaluation used stratified 5-fold cross-validation on the Microsoft dataset with an 80/20 train-test split per fold. Hyper-parameter tuning for the base classifiers and voting ensemble was performed via grid search nested inside each training fold. We will expand §4 (and the abstract if space permits) to document the full validation protocol, including the split ratios, fold count, and nested tuning procedure. revision: yes

  2. Referee: [§3] §3 (Proposed Method): the customized feature selection procedure is described as operating on the extracted features without any statement that selection is performed exclusively inside training folds or via nested cross-validation. When selection occurs on the full Microsoft dataset before splitting, the chosen features can encode information from the held-out test samples, directly inflating the reported performance and undermining the generalization claim.

    Authors: We acknowledge that the description in §3 did not explicitly address the placement of feature selection relative to the data split. Feature selection was in fact performed inside each training fold using a nested cross-validation loop, ensuring that no test-fold information influenced the selected secondary features. We will revise §3 to state this explicitly and confirm that the customized selection step was confined to training data only. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is an empirical ML classification study that extracts features (API calls, n-grams), applies a customized feature selection step, fuses them, and evaluates an ensemble on the Microsoft dataset, reporting accuracy/AUC/log-loss. No derivation chain, equations, or first-principles claims exist that reduce a claimed prediction or result to its own inputs by construction. Feature selection is described as part of the method but is not shown (via any quote or equation) to be fitted to the full evaluation set in a manner that makes the reported metrics tautological. The work is self-contained against the external Microsoft benchmark and SOTA comparisons; no self-citation load-bearing, ansatz smuggling, or renaming of known results is present. Standard ML pipeline risks (e.g., split timing) are correctness concerns, not circularity per the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no extractable information on free parameters, axioms, or invented entities; full manuscript would be required for a complete ledger.

pith-pipeline@v0.9.1-grok · 5713 in / 1189 out tokens · 30824 ms · 2026-06-28T09:48:24.928872+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

66 extracted references · 15 canonical work pages · 1 internal anchor

  1. [1]

    Campbell, S. L. and Gear, C. W. The index of general nonlinear D A E S. Numer. M ath. 1995

  2. [2]

    Slifka, M. K. and Whitton, J. L. Clinical implications of dysregulated cytokine production. J. M ol. M ed. 2000. doi:10.1007/s001090000086

  3. [3]

    Quasimonotonicity, regularity and duality for nonlinear systems of partial differential equations

    Hamburger, C. Quasimonotonicity, regularity and duality for nonlinear systems of partial differential equations. Ann. Mat. Pura. Appl. 1995

  4. [4]

    Geddes, K. O. and Czapor, S. R. and Labahn, G. Algorithms for C omputer A lgebra. 1992

  5. [5]

    Software engineering---from auxiliary to key technologies

    Broy, M. Software engineering---from auxiliary to key technologies. Software Pioneers. 1992

  6. [6]

    Conductive P olymers. 1981

  7. [7]

    Smith, S. E. Neuromuscular blocking drugs in man. Neuromuscular junction. H andbook of experimental pharmacology. 1976

  8. [8]

    Chung, S. T. and Morris, R. L. Isolation and characterization of plasmid deoxyribonucleic acid from Streptomyces fradiae. 1978

  9. [9]

    and AghaKouchak, A

    Hao, Z. and AghaKouchak, A. and Nakhjiri, N. and Farahmand, A. Global integrated drought monitoring and prediction system (GIDMaPS) data sets. 2014

  10. [10]

    Babichev, S. A. and Ries, J. and Lvovsky, A. I. Quantum scissors: teleportation of single-mode optical states by means of a nonlocal single photon. 2002

  11. [11]

    Wormholes in Maximal Supergravity

    Beneke, M. and Buchalla, G. and Dunietz, I. Mixing induced CP asymmetries in inclusive B decays. Phys. L ett. 1997. arXiv:0707.3168

  12. [12]

    deep SIP : deep learning of S upernova I a P arameters

    Stahl, B. deep SIP : deep learning of S upernova I a P arameters. 2020. ascl:2006.023

  13. [13]

    Feature ranking methods based on information entropy with Parzen windows , author=

  14. [14]

    Embedded

    Lal, Thomas Navin and Chapelle, Olivier and Weston, Jason and Elisseeff, André , editor =. Embedded. Feature. 2006 , doi =

  15. [15]

    Filter Methods for Feature Selection -- A Comparative Study , booktitle=

    S. Filter Methods for Feature Selection -- A Comparative Study , booktitle=. 2007 , publisher=

  16. [16]

    Detection of Spyware by Mining Executable Files , isbn =

    Shahzad, Raja Khurram and Haider, Syed Imran and Lavesson, Niklas , year =. Detection of Spyware by Mining Executable Files , isbn =. Proceedings of the 5th International Conference on Availability, Reliability, and Security , publisher =

  17. [17]

    2012 , isbn =

    Sikorski, Michael and Honig, Andrew , title =. 2012 , isbn =

  18. [18]

    A Wrapper Method for Feature Selection in Multiple Classes Datasets , booktitle=

    S. A Wrapper Method for Feature Selection in Multiple Classes Datasets , booktitle=. 2009 , publisher=

  19. [19]

    Accurate Adware Detection Using Opcode Sequence Extraction , isbn =

    Shahzad, Raja Khurram and Lavesson, Niklas and Johnson, Henric , year =. Accurate Adware Detection Using Opcode Sequence Extraction , isbn =. Proceedings of the Sixth International Conference on Availability, Reliability and Security , publisher =

  20. [20]

    2013 , publisher=

    Applied Predictive Modeling , author=. 2013 , publisher=

  21. [21]

    38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) , title=

    A. 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) , title=. 2015 , pages=

  22. [22]

    2016 , booktitle=

    Novel Feature Extraction, Selection and Fusion for Effective Malware Family Classification , author=. 2016 , booktitle=

  23. [23]

    The Fundamental Nature of the Log Loss Function , bookTitle=

    Vovk, Vladimir , editor=. The Fundamental Nature of the Log Loss Function , bookTitle=. 2015 , publisher=

  24. [24]

    Consensus Decision Making in Random Forests , booktitle=

    Shahzad, Raja Khurram and Fatima, Mehwish and Lavesson, Niklas and Boldt, Martin , editor=. Consensus Decision Making in Random Forests , booktitle=. 2015 , publisher=

  25. [25]

    Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pages =

    Chen, Tianqi and Guestrin, Carlos , title =. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pages =. 2016 , isbn =

  26. [26]

    and Frank, Eibe and Hall, Mark A

    Witten, Ian H. and Frank, Eibe and Hall, Mark A. and Pal, Christopher J. , title =. 2016 , isbn =

  27. [27]

    Adversarial Machine Learning in Malware Detection: Arms Race between Evasion Attack and Defense , year=

    Chen, Lingwei and Ye, Yanfang and Bourlai, Thirimachos , booktitle=. Adversarial Machine Learning in Malware Detection: Arms Race between Evasion Attack and Defense , year=

  28. [28]

    Microsoft Malware Classification Challenge , journal =

    Royi Ronen and Marian Radu and Corina Feuerstein and Elad Yom. Microsoft Malware Classification Challenge , journal =

  29. [29]

    Machine Learning Techniques for Classifying Malicious API Calls and N-Grams in Kaggle Data-set , year=

    Hu, Yen-Hung Frank and Ali, Abdinur and Hsieh, Chung-Chu George and Williams, Aurelia , booktitle=. Machine Learning Techniques for Classifying Malicious API Calls and N-Grams in Kaggle Data-set , year=

  30. [30]

    2019 , booktitle =

    Unal, Ugur and Yenido. 2019 , booktitle =

  31. [31]

    Orthrus: A Bimodal Learning Architecture for Malware Classification , year=

    Gibert, Daniel and Mateu, Carles and Planes, Jordi , booktitle=. Orthrus: A Bimodal Learning Architecture for Malware Classification , year=

  32. [32]

    2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom) , year=

    Malware Classification on Imbalanced Data through Self-Attention , author=. 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom) , year=

  33. [33]

    Malware Family Classification using Active Learning by Learning , year=

    Chen, Chin-Wei and Su, Ching-Hung and Lee, Kun-Wei and Bair, Ping-Hao , booktitle=. Malware Family Classification using Active Learning by Learning , year=

  34. [34]

    Proceedings of the 10th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, , year=

    Mark Sokolov and Nic Herndon , title=. Proceedings of the 10th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, , year=. doi:10.5220/0010264902950301 , isbn=

  35. [35]

    2020 , author =

    Maximizing accuracy in multi-scanner malware detection systems , journal =. 2020 , author =

  36. [36]

    2020 , issn =

    Similarity hash based scoring of portable executable files for efficient malware detection in IoT , journal =. 2020 , issn =. doi:https://doi.org/10.1016/j.future.2019.04.044 , author =

  37. [37]

    Evelyn Fix and J. L. Hodges , journal =. Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties , volume =

  38. [38]

    Altman , title =

    Naomi S. Altman , title =. The American Statistician , volume =. 1992 , publisher =

  39. [39]

    Machine Learning , pages =

    Breiman, Leo , title =. Machine Learning , pages =. 1996 , publisher =

  40. [40]

    Machine learning , volume=

    Random forests , author=. Machine learning , volume=. 2001 , publisher=

  41. [41]

    Journal in Computer Virology , year=

    Attaluri, Srilatha and McGhee, Scott and Stamp, Mark , title=. Journal in Computer Virology , year=

  42. [42]

    Information Security Technical Report , author =

    Detection of Malicious Code by applying Machine Learning Classifiers on Static Features: A. Information Security Technical Report , author =. 2009 , pages =

  43. [43]

    2014 , Journal =

    Learning Nonlinear Functions Using Regularized Greedy Forest , Author =. 2014 , Journal =

  44. [44]

    Frontiers of Information Technology

    Liu, Liu and Wang, Bao-sheng and Yu, Bo and Zhong, Qiu-xi , title=. Frontiers of Information Technology. 2017 , month=

  45. [45]

    WIREs Data Mining and Knowledge Discovery , volume =

    Sagi, Omer and Rokach, Lior , title =. WIREs Data Mining and Knowledge Discovery , volume =

  46. [46]

    ArXiv , year=

    Optimizing Ensemble Weights and Hyperparameters of Machine Learning Models for Regression Problems , author=. ArXiv , year=

  47. [47]

    Frontiers of Computer Science , year=

    Dong, Xibin and Yu, Zhiwen and Cao, Wenming and Shi, Yifan and Ma, Qianli , title=. Frontiers of Computer Science , year=

  48. [48]

    Family medicine and community health , author =

    Variable selection strategies and its importance in clinical prediction modelling , volume =. Family medicine and community health , author =. 2020 , note =. doi:10.1136/fmch-2019-000262 , number =

  49. [49]

    2020 , issn =

    The rise of machine learning for detection and classification of malware: Research developments, trends and challenges , journal =. 2020 , issn =

  50. [50]

    Comparative Analysis of Low-Dimensional Features and Tree-Based Ensembles for Malware Detection Systems , year=

    Euh, Seoungyul and Lee, Hyunjong and Kim, Donghoon and Hwang, Doosung , journal=. Comparative Analysis of Low-Dimensional Features and Tree-Based Ensembles for Malware Detection Systems , year=

  51. [51]

    Journal of Physics Conference Series , year =

    The Application of LightGBM in Microsoft Malware Detection. Journal of Physics Conference Series , year =

  52. [52]

    Journal of Ambient Intelligence and Humanized Computing , year=

    Ding, Yuxin and Zhang, Xiao and Hu, Jieke and Xu, Wenting , title=. Journal of Ambient Intelligence and Humanized Computing , year=

  53. [53]

    2021 , month=

    A novel Android malware detection system: adaption of filter-based feature selection methods , journal=. 2021 , month=

  54. [54]

    Madeh Piryonesi and Tamer E

    S. Madeh Piryonesi and Tamer E. El-Diraby , title =. Journal of Infrastructure Systems , volume =

  55. [55]

    Ensemble-Based Classification Using Neural Networks and Machine Learning Models for Windows PE Malware Detection , JOURNAL =

    Dama. Ensemble-Based Classification Using Neural Networks and Machine Learning Models for Windows PE Malware Detection , JOURNAL =. 2021 , NUMBER =

  56. [56]

    Journal of Information Security and Applications , volume =

    Malicious code classification based on opcode sequences and textCNN network , author =. Journal of Information Security and Applications , volume =. 2022 , issn =

  57. [57]

    2022 , issn =

    N-gram MalGAN: Evading machine learning detection via feature n-gram , journal =. 2022 , issn =. doi:https://doi.org/10.1016/j.dcan.2021.11.007 , author =

  58. [58]

    2022 , issn =

    Malware classification based on double byte feature encoding , journal =. 2022 , issn =. doi:https://doi.org/10.1016/j.aej.2021.04.076 , url =

  59. [59]

    2022 , issn =

    Fusing feature engineering and deep learning: A case study for malware classification , journal =. 2022 , issn =. doi:https://doi.org/10.1016/j.eswa.2022.117957 , author =

  60. [60]

    2023 , issn =

    A review of Machine Learning-based zero-day attack detection: Challenges and future directions , journal =. 2023 , issn =. doi:https://doi.org/10.1016/j.comcom.2022.11.001 , url =

  61. [61]

    2023 , issn =

    BHMDC: A byte and hex n-gram based malware detection and classification method , journal=. 2023 , issn =. doi:https://doi.org/10.1016/j.cose.2023.103118 , author =

  62. [62]

    2023 , issn =

    Development of a deep stacked ensemble with process based volatile memory forensics for platform independent malware detection and classification , journal =. 2023 , issn =. doi:https://doi.org/10.1016/j.eswa.2023.119952 , author =

  63. [63]

    2023 , issn =

    XMal: A lightweight memory-based explainable obfuscated-malware detector , journal =. 2023 , issn =. doi:https://doi.org/10.1016/j.cose.2023.103409 , author =

  64. [64]

    2023 , issn =

    Malware detection using image representation of malware data and transfer learning , journal =. 2023 , issn =. doi:https://doi.org/10.1016/j.jpdc.2022.10.001 , author =

  65. [65]

    2023 , issn =

    MOTIF: A Malware Reference Dataset with Ground Truth Family Labels , journal =. 2023 , issn =. doi:https://doi.org/10.1016/j.cose.2022.102921 , author =

  66. [66]

    2023 , issn =

    Impact of benign sample size on binary classification accuracy , journal =. 2023 , issn =. doi:https://doi.org/10.1016/j.eswa.2022.118630 , author =