pith. machine review for the scientific record.

arxiv: 2604.21153 · v1 · submitted 2026-04-22 · 💻 cs.CR


Image-Based Malware Type Classification on MalNet-Image Tiny: Effects of Multi-Scale Fusion, Transfer Learning, Data Augmentation, and Schedule-Free Optimization


Pith reviewed 2026-05-09 23:36 UTC · model grok-4.3

classification 💻 cs.CR
keywords malware classification · image-based detection · transfer learning · data augmentation · feature pyramid network · ResNet18 · Android malware · MalNet-Image

The pith

Pretraining and data augmentation lift macro-F1 to 0.6927 on 43-class malware image classification with ResNet18 and an FPN.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests four components in controlled ablations for classifying 43 malware types from resized binary images on the fixed MalNet-Image Tiny split. A ResNet18 backbone receives ImageNet pretraining, Mixup plus TrivialAugment, a feature pyramid network for scale handling, and schedule-free AdamW. Pretraining and augmentation deliver the main macro-F1 gains, while the feature pyramid network further raises macro-precision and macro-AUC and lowers test loss. The strongest combination reaches macro-F1 0.6927, macro-precision 0.7707, macro-AUC 0.9556, and test loss 0.8536 after 10 epochs.
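
Read as pseudocode, the ablation design is a grid of four independent toggles over a fixed backbone and split. The sketch below is an editorial illustration in Python, not the authors' code; the paper reports a chosen subset of configurations rather than necessarily the full grid, and train_and_evaluate is hypothetical.

    from itertools import product

    # Editorial rendering of the ablation space: four components toggled
    # independently on a fixed ResNet18 backbone and the official split.
    COMPONENTS = ("imagenet_pretraining", "mixup_trivialaugment",
                  "fpn", "schedule_free_adamw")

    for flags in product((False, True), repeat=len(COMPONENTS)):
        config = dict(zip(COMPONENTS, flags))
        # train_and_evaluate(config)  # hypothetical: returns macro-F1,
        #                             # macro-P, macro-AUC, and test loss
        print(config)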

Core claim

The authors show that the configuration using ImageNet pretraining, Mixup, TrivialAugment and a feature pyramid network on ResNet18 produces the highest metrics on the MalNet-Image Tiny test set, with pretraining and augmentation responsible for most of the macro-F1 improvement over the reproduced baseline of 0.6510.

What carries the argument

Feature pyramid network (FPN) for multi-scale fusion attached to ResNet18 to address scale variation from resizing binaries of unequal lengths.
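
A minimal sketch of what such an attachment can look like, using torchvision's generic FeaturePyramidNetwork rather than the authors' implementation. The channel widths (64/128/256/512) are ResNet18's defaults; the 43-class head, 256-channel FPN width, and three-channel input (grayscale malware images replicated across channels to reuse ImageNet weights) are illustrative assumptions.

    import torch
    from torchvision.models import resnet18
    from torchvision.models.feature_extraction import create_feature_extractor
    from torchvision.ops import FeaturePyramidNetwork

    class ResNet18FPNClassifier(torch.nn.Module):
        def __init__(self, num_classes: int = 43, pretrained: bool = True):
            super().__init__()
            backbone = resnet18(weights="IMAGENET1K_V1" if pretrained else None)
            # Tap the four residual stages of ResNet18.
            self.body = create_feature_extractor(
                backbone,
                return_nodes={"layer1": "p2", "layer2": "p3",
                              "layer3": "p4", "layer4": "p5"})
            self.fpn = FeaturePyramidNetwork([64, 128, 256, 512], out_channels=256)
            self.pool = torch.nn.AdaptiveAvgPool2d(1)
            self.head = torch.nn.Linear(256 * 4, num_classes)

        def forward(self, x):
            feats = self.body(x)                 # dict of 4 multi-scale feature maps
            fused = self.fpn(feats)              # each map now has 256 channels
            pooled = [self.pool(f).flatten(1) for f in fused.values()]
            return self.head(torch.cat(pooled, dim=1))

    model = ResNet18FPNClassifier()
    logits = model(torch.randn(2, 3, 224, 224))  # -> shape (2, 43)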

If this is right

  • ImageNet pretraining supplies the largest single lift in macro-F1 for this 43-class task.
  • Mixup and TrivialAugment improve robustness with low overhead on binary-derived images.
  • Schedule-free AdamW reaches near-baseline performance in 10 epochs instead of 96 (a minimal usage sketch follows this list).
  • Once pretraining and augmentation are present, adding the FPN mainly raises macro-precision and macro-AUC and lowers test loss.
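
A minimal usage sketch of schedule-free AdamW, assuming the open-source schedulefree package released with [17]; the stand-in model, synthetic data, and learning rate are placeholders, not the paper's settings.

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    import schedulefree  # pip install schedulefree

    model = torch.nn.Linear(512, 43)             # stand-in for the real classifier
    optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3)
    criterion = torch.nn.CrossEntropyLoss()      # unweighted, as in the paper

    # Synthetic placeholder data; the real loader serves MalNet-Image Tiny batches.
    loader = DataLoader(TensorDataset(torch.randn(64, 512),
                                      torch.randint(0, 43, (64,))), batch_size=16)

    for epoch in range(10):                      # 10 epochs vs. 96 for the baseline
        optimizer.train()   # schedule-free optimizers require explicit mode switches
        for x, y in loader:
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
        optimizer.eval()    # evaluate/checkpoint with the averaged iterate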

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same four components could be tested on larger binary-image malware collections to check whether the observed ranking persists.
  • Faster convergence from schedule-free optimization may cut compute costs when scanning new Android APK collections.
  • Prioritizing pretraining and simple augmentations first could serve as a practical recipe for adapting image classifiers to new malware families.

Load-bearing premise

The performance gains from pretraining, Mixup, TrivialAugment and FPN will hold on other malware image datasets or under real-world distribution shifts beyond the fixed MalNet-Image Tiny split.

What would settle it

Re-running the exact same four configurations on a different public malware image dataset and finding that the reported best setup no longer exceeds the plain baseline would show the gains do not generalize.

Figures

Figures reproduced from arXiv:2604.21153 by Ahmed A. Abouelkhaire, Issa Traoré, and Waleed A. Yousef.

Figure 1. APK-to-image conversion pipeline. The DEX byte stream is reshaped …
Figure 2. MalNet-Image Tiny class distribution. The Tiny subset is less …
Figure 3. F1_macro versus L_test across the ablation runs. Lower loss is associated with, but does not determine, higher F1_macro. From the surrounding text: Table III reports the complete experiment set; Experiment 5 reproduces the benchmark-style baseline with F1_macro = 0.6510, consistent with the value reported for ResNet18 on MalNet-Image Tiny [2]; Experiment 6 shows that schedule-free AdamW with unweighted …
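
The conversion in Figure 1 follows the byte-stream-to-grayscale recipe of Nataraj et al. [1]. A minimal sketch, with the row width and output size as illustrative assumptions rather than the paper's exact values:

    import numpy as np
    from PIL import Image

    def dex_to_image(dex_bytes: bytes, width: int = 256, size: int = 224) -> Image.Image:
        """Byte stream -> grayscale image, after Nataraj et al. [1]; width and
        size are illustrative, not the paper's exact choices."""
        arr = np.frombuffer(dex_bytes, dtype=np.uint8)
        rows = len(arr) // width                   # drop the ragged tail
        img = Image.fromarray(arr[: rows * width].reshape(rows, width), mode="L")
        return img.resize((size, size))            # unequal lengths -> scale variation
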
Original abstract

This paper studies 43-class malware type classification on MalNet-Image Tiny, a public benchmark derived from Android APK files. The goal is to assess whether a compact image classifier benefits from four components evaluated in a controlled ablation: a feature pyramid network (FPN) for scale variation induced by resizing binaries of different lengths, ImageNet pretraining, lightweight augmentation through Mixup and TrivialAugment, and schedule-free AdamW optimization. All experiments use a ResNet18 backbone and the provided train/validation/test split. Reproducing the benchmark-style configuration yields macro-F1 (F1_macro) of 0.6510, consistent with the reported baseline of approximately 0.65. Replacing the optimizer with schedule-free AdamW and using unweighted cross-entropy increases F1_macro to 0.6535 in 10 epochs, compared with 96 epochs for the reproduced baseline. The best configuration combines pretraining, Mixup, TrivialAugment, and FPN, reaching F1_macro=0.6927, P_macro=0.7707, AUC_macro=0.9556, and L_test=0.8536. The ablation indicates that the largest gains in F1_macro arise from pretraining and augmentation, whereas FPN mainly improves P_macro, AUC_macro, and L_test in the strongest configuration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. This paper conducts a controlled ablation study on 43-class malware type classification on the MalNet-Image Tiny benchmark using a ResNet18 backbone. It evaluates four components: feature pyramid network (FPN) for multi-scale fusion to handle variable binary lengths, ImageNet pretraining, lightweight data augmentation (Mixup + TrivialAugment), and schedule-free AdamW optimization. The reproduced baseline reaches F1_macro=0.6510; the best configuration (pretraining + augmentations + FPN) reaches F1_macro=0.6927, P_macro=0.7707, AUC_macro=0.9556, with pretraining and augmentation identified as the primary sources of F1 gains.

Significance. If the results prove robust, the work supplies a useful, reproducible empirical reference for applying standard computer-vision techniques to malware image classification. The precise baseline reproduction, the focus on schedule-free optimization for faster convergence, and the explicit attribution of gains to individual components are positive contributions that could guide efficient training on similar fixed-split benchmarks.

major comments (1)
  1. The ablation results and headline claims rest on single training runs on the fixed train/val/test split, with no reported standard deviations across random seeds, no error bars, and no statistical significance tests on the observed deltas (e.g., F1_macro improvement of 0.0417). Given that ResNet18 training with cross-entropy, Mixup, and TrivialAugment exhibits run-to-run variance that can easily exceed 0.03–0.05 in F1_macro, the attribution of largest gains to pretraining and augmentation cannot be considered reliably demonstrated.
minor comments (1)
  1. The abstract states that the schedule-free configuration reaches its result in 10 epochs versus 96 for the baseline; the manuscript should explicitly confirm that total wall-clock compute or convergence criteria are comparable before claiming efficiency gains.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting an important aspect of our experimental design. We respond to the major comment as follows.

Point-by-point responses
  1. Referee: The ablation results and headline claims rest on single training runs on the fixed train/val/test split, with no reported standard deviations across random seeds, no error bars, and no statistical significance tests on the observed deltas (e.g., F1_macro improvement of 0.0417). Given that ResNet18 training with cross-entropy, Mixup, and TrivialAugment exhibits run-to-run variance that can easily exceed 0.03–0.05 in F1_macro, the attribution of largest gains to pretraining and augmentation cannot be considered reliably demonstrated.

    Authors: We agree that this is a valid criticism and that reporting results from single runs limits the strength of our conclusions regarding the source of performance gains. Although the use of a fixed train/val/test split is standard for this benchmark to ensure direct comparability, the potential for run-to-run variance means that the observed improvements, particularly the 0.0417 increase in F1_macro, should be interpreted with caution. In the revised manuscript, we will rerun the key experiments (baseline, best configuration, and main ablations) using at least five different random seeds, report the mean and standard deviation for all metrics, and include error bars in the relevant tables and figures. This will enable a more robust assessment of the contributions from pretraining and data augmentation. revision: yes
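
Concretely, the promised protocol amounts to something like the sketch below, where train_and_evaluate is a hypothetical stand-in for one complete training and evaluation run.

    import numpy as np
    import torch

    def train_and_evaluate(config: dict) -> float:
        """Hypothetical stand-in for one full training run; returns test macro-F1."""
        raise NotImplementedError  # the paper's training loop goes here

    def seeded_runs(config: dict, seeds=(0, 1, 2, 3, 4)):
        """Rerun one configuration under several seeds; report mean and sample std."""
        scores = []
        for seed in seeds:
            torch.manual_seed(seed)
            np.random.seed(seed)
            scores.append(train_and_evaluate(config))
        return float(np.mean(scores)), float(np.std(scores, ddof=1))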

Circularity Check

0 steps flagged

No circularity in empirical ablation study on fixed dataset split

Full rationale

The paper performs controlled ablation experiments training ResNet18 classifiers on the MalNet-Image Tiny benchmark and reports direct empirical metrics (F1_macro, P_macro, AUC_macro, L_test) measured on the provided held-out test set. All headline numbers (e.g., baseline 0.6510 vs. best configuration 0.6927) are obtained by running the models and computing standard classification scores; no equations, derivations, or fitted parameters are presented as predictions that reduce to their own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the central claims. The work is therefore self-contained as a set of reproducible empirical measurements rather than a deductive chain.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard supervised classification assumptions and the fixed public train/validation/test split of MalNet-Image Tiny; no new entities are postulated.

free parameters (2)
  • initial learning rate and other optimizer hyperparameters
    Schedule-free AdamW still requires an initial learning rate and weight decay that are not stated in the abstract and are presumably tuned on the validation set.
  • Mixup alpha and TrivialAugment magnitude
    The strength of the two augmentation methods is a free choice that affects the reported gains (a minimal sketch follows this list).
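
A minimal sketch of both augmentations, assuming torchvision's TrivialAugmentWide as the TrivialAugment implementation; alpha = 0.2 is a common Mixup default, not a value taken from the paper.

    import numpy as np
    import torch
    from torchvision import transforms

    # Illustrative pipeline; the paper's exact settings are not given in the abstract.
    augment = transforms.Compose([
        transforms.TrivialAugmentWide(),  # tuning-free policy of Müller & Hutter [19]
        transforms.ToTensor(),
    ])

    def mixup(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2):
        """Mixup of Zhang et al. [18]: convex combinations of inputs and labels."""
        lam = float(np.random.beta(alpha, alpha))
        idx = torch.randperm(x.size(0))
        return lam * x + (1 - lam) * x[idx], y, y[idx], lam

    # The training step then uses:
    # loss = lam * CE(logits, y_a) + (1 - lam) * CE(logits, y_b)
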
axioms (2)
  • domain assumption The provided train/validation/test split is representative and fixed for all experiments.
    All comparisons rely on the benchmark's official split without re-sampling or cross-validation.
  • domain assumption Macro-averaged metrics are the appropriate summary for a 43-class imbalanced problem.
    The paper reports macro-F1, macro-P, and macro-AUC without justifying why macro-averaging is preferred over micro or weighted alternatives (a toy comparison follows this list).
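
A toy scikit-learn example of why the choice matters: with one common and one rare class, a classifier that misses half the rare class keeps a high micro-F1 while macro-F1 drops noticeably.

    from sklearn.metrics import f1_score

    # One common class (90 samples) and one rare class (10 samples); the
    # classifier misses half of the rare class.
    y_true = [0] * 90 + [1] * 10
    y_pred = [0] * 95 + [1] * 5

    print(f1_score(y_true, y_pred, average="macro"))     # ~0.82: rare class counts equally
    print(f1_score(y_true, y_pred, average="micro"))     # 0.95: dominated by the common class
    print(f1_score(y_true, y_pred, average="weighted"))  # ~0.94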

pith-pipeline@v0.9.0 · 5566 in / 1638 out tokens · 92902 ms · 2026-05-09T23:36:13.338672+00:00 · methodology


Reference graph

Works this paper leans on

19 extracted references · 9 canonical work pages · 1 internal anchor

  1. [1]

    Malware images: visualization and automatic classification

    L. Nataraj, S. Karthikeyan, G. Jacob, and B. S. Manjunath, “Malware images: visualization and automatic classification,” in Proceedings of the 8th International Symposium on Visualization for Cyber Security (VizSec ’11), New York, NY, USA: Association for Computing Machinery, Jul. 2011, pp. 1–7. [Online]. Available: https://doi.org/10.1145/2016904.201...

  2. [2]

    MalNet: A Large-Scale Image Database of Malicious Software

    S. Freitas, R. Duggal, and D. H. Chau, “MalNet: A Large-Scale Image Database of Malicious Software,” Sep. 2022, arXiv:2102.01072 [cs]. [Online]. Available: http://arxiv.org/abs/2102.01072

  3. [3]

    An ensemble of pre-trained transformer models for imbalanced multiclass malware classification

    F. Demirkıran, A. Çayır, U. Ünal, and H. Dağ, “An ensemble of pre-trained transformer models for imbalanced multiclass malware classification,” Computers & Security, vol. 121, p. 102846, Oct. 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167404822002401

  4. [4]

    Malware Detection and Classification Using fastText and BERT

    S. Yesir and I. Sogukpinar, “Malware Detection and Classification Using fastText and BERT,” in 2021 9th International Symposium on Digital Forensics and Security (ISDFS), Jun. 2021, pp. 1–6. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9486377

  5. [5]

    Hybrid sequence-based Android malware detection using natural language processing

    N. Zhang, J. Xue, Y. Ma, R. Zhang, T. Liang, and Y.-a. Tan, “Hybrid sequence-based Android malware detection using natural language processing,” International Journal of Intelligent Systems, vol. 36, no. 10, pp. 5770–5784, 2021. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/int.22529

  6. [6]

    DLGraph: Malware Detection Using Deep Learning and Graph Embedding

    H. Jiang, T. Turki, and J. T. L. Wang, “DLGraph: Malware Detection Using Deep Learning and Graph Embedding,” in 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Dec. 2018, pp. 1029–1033. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8614193

  7. [7]

    AMalNet: A deep learning framework based on graph convolutional networks for malware detection

    X. Pei, L. Yu, and S. Tian, “AMalNet: A deep learning framework based on graph convolutional networks for malware detection,” Computers & Security, vol. 93, p. 101792, Jun. 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167404820300778

  8. [8]

    GDroid: Android malware detection and classification with graph convolutional network

    H. Gao, S. Cheng, and W. Zhang, “GDroid: Android malware detection and classification with graph convolutional network,” Computers & Security, vol. 106, p. 102264, Jul. 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167404821000882

  9. [9]

    Radon transform based malware classification in cyber-physical system using deep learning

    R. Alguliyev, R. Aliguliyev, and L. Sukhostat, “Radon transform based malware classification in cyber-physical system using deep learning,” Results in Control and Optimization, vol. 14, p. 100382, Mar. 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2666720724000122

  10. [10]

    STAMINA: Scalable deep learning approach for malware classification

    L. Chen, R. Sahita, J. Parikh, and M. Marino, “STAMINA: Scalable deep learning approach for malware classification,” Intel and Microsoft, Tech. Rep., 2020. [Online]. Available: https://www.microsoft.com/en-us/research/uploads/prod/2020/05/stamina.pdf

  11. [11]

    SDIF-CNN: Stacking deep image features using fine-tuned convolution neural network models for real-world malware detection and classification

    S. Kumar and K. Panda, “SDIF-CNN: Stacking deep image features using fine-tuned convolution neural network models for real-world malware detection and classification,” Applied Soft Computing, vol. 146, p. 110676, Oct. 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1568494623006944

  12. [12]

    Virus-MNIST: A benchmark malware dataset

    D. Noever and S. E. M. Noever, “Virus-MNIST: A benchmark malware dataset,” arXiv preprint arXiv:2103.00602, 2021. [Online]. Available: https://arxiv.org/abs/2103.00602

  13. [13]

    AndroDex: Android Dex images of obfuscated malware

    A. Khan, M. Usama, B. B. Kamal, A. Ahmad, H. Malik, and S. Lee, “AndroDex: Android Dex images of obfuscated malware,” Scientific Data, 2024. [Online]. Available: https://www.nature.com/articles/s41597-024-03027-3

  14. [14]

    McAfee dataset for malware detection

    McAfee Labs, “McAfee dataset for malware detection,” 2020. [Online]. Available: https://www.mcafee.com/enterprise/en-us/assets/white-papers/wp-machine-learning-malware-detection.pdf

  15. [15]

    Deep residual learning for image recognition

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.

  16. [16]

    How transferable are features in deep neural networks?

    J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” Advances in Neural Information Processing Systems, vol. 27, pp. 3320–3328, 2014.

  17. [17]

    The Road Less Scheduled

    A. Defazio, X. Yang, H. Mehta, K. Mishchenko, A. Khaled, and A. Cutkosky, “The Road Less Scheduled,” May 2024, arXiv:2405.15682 [cs, math, stat]. [Online]. Available: http://arxiv.org/abs/2405.15682

  18. [18]

    mixup: Beyond Empirical Risk Minimization

    H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond Empirical Risk Minimization,” Apr. 2018, arXiv:1710.09412 [cs, stat]. [Online]. Available: http://arxiv.org/abs/1710.09412

  19. [19]

    TrivialAugment: Tuning-free yet state-of-the-art data augmentation

    S. G. Müller and F. Hutter, “TrivialAugment: Tuning-free yet state-of-the-art data augmentation,” arXiv preprint arXiv:2103.10158, 2021.