pith. machine review for the scientific record.

arxiv: 2604.21153 · v1 · submitted 2026-04-22 · 💻 cs.CR


Image-Based Malware Type Classification on MalNet-Image Tiny: Effects of Multi-Scale Fusion, Transfer Learning, Data Augmentation, and Schedule-Free Optimization


Pith reviewed 2026-05-09 23:36 UTC · model grok-4.3

classification 💻 cs.CR
keywords malware classification · image-based detection · transfer learning · data augmentation · feature pyramid network · ResNet18 · Android malware · MalNet-Image

The pith

Pretraining and data augmentation lift macro-F1 to 0.6927 on 43-class malware image classification with ResNet18 and an FPN.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests four components in controlled ablations for classifying 43 malware types from resized binary images on the fixed MalNet-Image Tiny split. A ResNet18 backbone receives ImageNet pretraining, Mixup plus TrivialAugment, a feature pyramid network for scale handling, and schedule-free AdamW. Pretraining and augmentation deliver the main macro-F1 gains, while the feature pyramid network further raises macro-precision and macro-AUC and lowers test loss. The strongest combination reaches macro-F1 0.6927, macro-precision 0.7707, macro-AUC 0.9556, and test loss 0.8536 after 10 epochs.
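
Read as pseudocode, the ablation design is a grid of four independent toggles over a fixed backbone and split. The sketch below is an editorial illustration in Python, not the authors' code; the paper reports a chosen subset of configurations rather than necessarily the full grid, and train_and_evaluate is hypothetical.

    from itertools import product

    # Editorial rendering of the ablation space: four components toggled
    # independently on a fixed ResNet18 backbone and the official split.
    COMPONENTS = ("imagenet_pretraining", "mixup_trivialaugment",
                  "fpn", "schedule_free_adamw")

    for flags in product((False, True), repeat=len(COMPONENTS)):
        config = dict(zip(COMPONENTS, flags))
        # train_and_evaluate(config)  # hypothetical: returns macro-F1,
        #                             # macro-P, macro-AUC, and test loss
        print(config)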

Core claim

The authors show that the configuration using ImageNet pretraining, Mixup, TrivialAugment and a feature pyramid network on ResNet18 produces the highest metrics on the MalNet-Image Tiny test set, with pretraining and augmentation responsible for most of the macro-F1 improvement over the reproduced baseline of 0.6510.

What carries the argument

Feature pyramid network (FPN) for multi-scale fusion attached to ResNet18 to address scale variation from resizing binaries of unequal lengths.
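
A minimal sketch of what such an attachment can look like, using torchvision's generic FeaturePyramidNetwork rather than the authors' implementation. The channel widths (64/128/256/512) are ResNet18's defaults; the 43-class head, 256-channel FPN width, and three-channel input (grayscale malware images replicated across channels to reuse ImageNet weights) are illustrative assumptions.

    import torch
    from torchvision.models import resnet18
    from torchvision.models.feature_extraction import create_feature_extractor
    from torchvision.ops import FeaturePyramidNetwork

    class ResNet18FPNClassifier(torch.nn.Module):
        def __init__(self, num_classes: int = 43, pretrained: bool = True):
            super().__init__()
            backbone = resnet18(weights="IMAGENET1K_V1" if pretrained else None)
            # Tap the four residual stages of ResNet18.
            self.body = create_feature_extractor(
                backbone,
                return_nodes={"layer1": "p2", "layer2": "p3",
                              "layer3": "p4", "layer4": "p5"})
            self.fpn = FeaturePyramidNetwork([64, 128, 256, 512], out_channels=256)
            self.pool = torch.nn.AdaptiveAvgPool2d(1)
            self.head = torch.nn.Linear(256 * 4, num_classes)

        def forward(self, x):
            feats = self.body(x)                 # dict of 4 multi-scale feature maps
            fused = self.fpn(feats)              # each map now has 256 channels
            pooled = [self.pool(f).flatten(1) for f in fused.values()]
            return self.head(torch.cat(pooled, dim=1))

    model = ResNet18FPNClassifier()
    logits = model(torch.randn(2, 3, 224, 224))  # -> shape (2, 43)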

If this is right

  • ImageNet pretraining supplies the largest single lift in macro-F1 for this 43-class task.
  • Mixup and TrivialAugment improve robustness with low overhead on binary-derived images.
  • Schedule-free AdamW reaches near-baseline performance in 10 epochs instead of 96 (a minimal usage sketch follows this list).
  • Once pretraining and augmentation are present, adding the FPN mainly raises macro-precision and macro-AUC and lowers test loss.
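
A minimal usage sketch of schedule-free AdamW, assuming the open-source schedulefree package released with [17]; the stand-in model, synthetic data, and learning rate are placeholders, not the paper's settings.

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    import schedulefree  # pip install schedulefree

    model = torch.nn.Linear(512, 43)             # stand-in for the real classifier
    optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3)
    criterion = torch.nn.CrossEntropyLoss()      # unweighted, as in the paper

    # Synthetic placeholder data; the real loader serves MalNet-Image Tiny batches.
    loader = DataLoader(TensorDataset(torch.randn(64, 512),
                                      torch.randint(0, 43, (64,))), batch_size=16)

    for epoch in range(10):                      # 10 epochs vs. 96 for the baseline
        optimizer.train()   # schedule-free optimizers require explicit mode switches
        for x, y in loader:
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
        optimizer.eval()    # evaluate/checkpoint with the averaged iterate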

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same four components could be tested on larger binary-image malware collections to check whether the observed ranking persists.
  • Faster convergence from schedule-free optimization may cut compute costs when scanning new Android APK collections.
  • Prioritizing pretraining and simple augmentations first could serve as a practical recipe for adapting image classifiers to new malware families.

Load-bearing premise

The performance gains from pretraining, Mixup, TrivialAugment and FPN will hold on other malware image datasets or under real-world distribution shifts beyond the fixed MalNet-Image Tiny split.

What would settle it

Re-running the exact same four configurations on a different public malware image dataset and finding that the reported best setup no longer exceeds the plain baseline would show the gains do not generalize.

Figures

Figures reproduced from arXiv:2604.21153 by Ahmed A. Abouelkhaire, Issa Traoré, and Waleed A. Yousef.

Figure 1. APK-to-image conversion pipeline. The DEX byte stream is reshaped …
Figure 2. MalNet-Image Tiny class distribution. The Tiny subset is less …
Figure 3. F1_macro versus L_test across the ablation runs. Lower loss is associated with, but does not determine, higher F1_macro. From the surrounding text: Table III reports the complete experiment set; Experiment 5 reproduces the benchmark-style baseline with F1_macro = 0.6510, consistent with the value reported for ResNet18 on MalNet-Image Tiny [2]; Experiment 6 shows that schedule-free AdamW with unweighted …
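
The conversion in Figure 1 follows the byte-stream-to-grayscale recipe of Nataraj et al. [1]. A minimal sketch, with the row width and output size as illustrative assumptions rather than the paper's exact values:

    import numpy as np
    from PIL import Image

    def dex_to_image(dex_bytes: bytes, width: int = 256, size: int = 224) -> Image.Image:
        """Byte stream -> grayscale image, after Nataraj et al. [1]; width and
        size are illustrative, not the paper's exact choices."""
        arr = np.frombuffer(dex_bytes, dtype=np.uint8)
        rows = len(arr) // width                   # drop the ragged tail
        img = Image.fromarray(arr[: rows * width].reshape(rows, width), mode="L")
        return img.resize((size, size))            # unequal lengths -> scale variation
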
Original abstract

This paper studies 43-class malware type classification on MalNet-Image Tiny, a public benchmark derived from Android APK files. The goal is to assess whether a compact image classifier benefits from four components evaluated in a controlled ablation: a feature pyramid network (FPN) for scale variation induced by resizing binaries of different lengths, ImageNet pretraining, lightweight augmentation through Mixup and TrivialAugment, and schedule-free AdamW optimization. All experiments use a ResNet18 backbone and the provided train/validation/test split. Reproducing the benchmark-style configuration yields macro-F1 (F1_macro) of 0.6510, consistent with the reported baseline of approximately 0.65. Replacing the optimizer with schedule-free AdamW and using unweighted cross-entropy increases F1_macro to 0.6535 in 10 epochs, compared with 96 epochs for the reproduced baseline. The best configuration combines pretraining, Mixup, TrivialAugment, and FPN, reaching F1_macro=0.6927, P_macro=0.7707, AUC_macro=0.9556, and L_test=0.8536. The ablation indicates that the largest gains in F1_macro arise from pretraining and augmentation, whereas FPN mainly improves P_macro, AUC_macro, and L_test in the strongest configuration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. This paper conducts a controlled ablation study on 43-class malware type classification on the MalNet-Image Tiny benchmark using a ResNet18 backbone. It evaluates four components: feature pyramid network (FPN) for multi-scale fusion to handle variable binary lengths, ImageNet pretraining, lightweight data augmentation (Mixup + TrivialAugment), and schedule-free AdamW optimization. The reproduced baseline reaches F1_macro=0.6510; the best configuration (pretraining + augmentations + FPN) reaches F1_macro=0.6927, P_macro=0.7707, AUC_macro=0.9556, with pretraining and augmentation identified as the primary sources of F1 gains.

Significance. If the results prove robust, the work supplies a useful, reproducible empirical reference for applying standard computer-vision techniques to malware image classification. The precise baseline reproduction, the focus on schedule-free optimization for faster convergence, and the explicit attribution of gains to individual components are positive contributions that could guide efficient training on similar fixed-split benchmarks.

major comments (1)
  1. The ablation results and headline claims rest on single training runs on the fixed train/val/test split, with no reported standard deviations across random seeds, no error bars, and no statistical significance tests on the observed deltas (e.g., F1_macro improvement of 0.0417). Given that ResNet18 training with cross-entropy, Mixup, and TrivialAugment exhibits run-to-run variance that can easily exceed 0.03–0.05 in F1_macro, the attribution of largest gains to pretraining and augmentation cannot be considered reliably demonstrated.
minor comments (1)
  1. The abstract states that the schedule-free configuration reaches its result in 10 epochs versus 96 for the baseline; the manuscript should explicitly confirm that total wall-clock compute or convergence criteria are comparable before claiming efficiency gains.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting an important aspect of our experimental design. We respond to the major comment as follows.

Point-by-point responses
  1. Referee: The ablation results and headline claims rest on single training runs on the fixed train/val/test split, with no reported standard deviations across random seeds, no error bars, and no statistical significance tests on the observed deltas (e.g., F1_macro improvement of 0.0417). Given that ResNet18 training with cross-entropy, Mixup, and TrivialAugment exhibits run-to-run variance that can easily exceed 0.03–0.05 in F1_macro, the attribution of largest gains to pretraining and augmentation cannot be considered reliably demonstrated.

    Authors: We agree that this is a valid criticism and that reporting results from single runs limits the strength of our conclusions regarding the source of performance gains. Although the use of a fixed train/val/test split is standard for this benchmark to ensure direct comparability, the potential for run-to-run variance means that the observed improvements, particularly the 0.0417 increase in F1_macro, should be interpreted with caution. In the revised manuscript, we will rerun the key experiments (baseline, best configuration, and main ablations) using at least five different random seeds, report the mean and standard deviation for all metrics, and include error bars in the relevant tables and figures. This will enable a more robust assessment of the contributions from pretraining and data augmentation. revision: yes
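
Concretely, the promised protocol amounts to something like the sketch below, where train_and_evaluate is a hypothetical stand-in for one complete training and evaluation run.

    import numpy as np
    import torch

    def train_and_evaluate(config: dict) -> float:
        """Hypothetical stand-in for one full training run; returns test macro-F1."""
        raise NotImplementedError  # the paper's training loop goes here

    def seeded_runs(config: dict, seeds=(0, 1, 2, 3, 4)):
        """Rerun one configuration under several seeds; report mean and sample std."""
        scores = []
        for seed in seeds:
            torch.manual_seed(seed)
            np.random.seed(seed)
            scores.append(train_and_evaluate(config))
        return float(np.mean(scores)), float(np.std(scores, ddof=1))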

Circularity Check

0 steps flagged

No circularity in empirical ablation study on fixed dataset split

Full rationale

The paper performs controlled ablation experiments training ResNet18 classifiers on the MalNet-Image Tiny benchmark and reports direct empirical metrics (F1_macro, P_macro, AUC_macro, L_test) measured on the provided held-out test set. All headline numbers (e.g., baseline 0.6510 vs. best configuration 0.6927) are obtained by running the models and computing standard classification scores; no equations, derivations, or fitted parameters are presented as predictions that reduce to their own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the central claims. The work is therefore self-contained as a set of reproducible empirical measurements rather than a deductive chain.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard supervised classification assumptions and the fixed public train/validation/test split of MalNet-Image Tiny; no new entities are postulated.

free parameters (2)
  • initial learning rate and other optimizer hyperparameters
    Schedule-free AdamW still requires an initial learning rate and weight decay that are not stated in the abstract and are presumably tuned on the validation set.
  • Mixup alpha and TrivialAugment magnitude
    The strength of the two augmentation methods is a free choice that affects the reported gains (a minimal sketch follows this list).
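
A minimal sketch of both augmentations, assuming torchvision's TrivialAugmentWide as the TrivialAugment implementation; alpha = 0.2 is a common Mixup default, not a value taken from the paper.

    import numpy as np
    import torch
    from torchvision import transforms

    # Illustrative pipeline; the paper's exact settings are not given in the abstract.
    augment = transforms.Compose([
        transforms.TrivialAugmentWide(),  # tuning-free policy of Müller & Hutter [19]
        transforms.ToTensor(),
    ])

    def mixup(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2):
        """Mixup of Zhang et al. [18]: convex combinations of inputs and labels."""
        lam = float(np.random.beta(alpha, alpha))
        idx = torch.randperm(x.size(0))
        return lam * x + (1 - lam) * x[idx], y, y[idx], lam

    # The training step then uses:
    # loss = lam * CE(logits, y_a) + (1 - lam) * CE(logits, y_b)
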
axioms (2)
  • domain assumption The provided train/validation/test split is representative and fixed for all experiments.
    All comparisons rely on the benchmark's official split without re-sampling or cross-validation.
  • domain assumption Macro-averaged metrics are the appropriate summary for a 43-class imbalanced problem.
    The paper reports macro-F1, macro-P, and macro-AUC without justifying why macro-averaging is preferred over micro or weighted alternatives (a toy comparison follows this list).
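
A toy scikit-learn example of why the choice matters: with one common and one rare class, a classifier that misses half the rare class keeps a high micro-F1 while macro-F1 drops noticeably.

    from sklearn.metrics import f1_score

    # One common class (90 samples) and one rare class (10 samples); the
    # classifier misses half of the rare class.
    y_true = [0] * 90 + [1] * 10
    y_pred = [0] * 95 + [1] * 5

    print(f1_score(y_true, y_pred, average="macro"))     # ~0.82: rare class counts equally
    print(f1_score(y_true, y_pred, average="micro"))     # 0.95: dominated by the common class
    print(f1_score(y_true, y_pred, average="weighted"))  # ~0.94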

pith-pipeline@v0.9.0 · 5566 in / 1638 out tokens · 92902 ms · 2026-05-09T23:36:13.338672+00:00 · methodology


Reference graph

Works this paper leans on

19 extracted references · 9 canonical work pages · 1 internal anchor

  1. [1]

    Malware images: visualization and automatic classification

    L. Nataraj, S. Karthikeyan, G. Jacob, and B. S. Manjunath, “Malware images: visualization and automatic classification,” in Proceedings of the 8th International Symposium on Visualization for Cyber Security (VizSec ’11), New York, NY, USA: Association for Computing Machinery, Jul. 2011, pp. 1–7. [Online]. Available: https://doi.org/10.1145/2016904.201...

  2. [2]

    MalNet: A Large-Scale Image Database of Malicious Software

    S. Freitas, R. Duggal, and D. H. Chau, “MalNet: A Large-Scale Image Database of Malicious Software,” Sep. 2022, arXiv:2102.01072 [cs]. [Online]. Available: http://arxiv.org/abs/2102.01072

  3. [3]

    An ensemble of pre-trained transformer models for imbalanced multiclass malware classification

    F. Demirkıran, A. Çayır, U. Ünal, and H. Dağ, “An ensemble of pre-trained transformer models for imbalanced multiclass malware classification,” Computers & Security, vol. 121, p. 102846, Oct. 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167404822002401

  4. [4]

    Malware Detection and Classification Using fastText and BERT

    S. Yesir and I. Sogukpinar, “Malware Detection and Classification Using fastText and BERT,” in 2021 9th International Symposium on Digital Forensics and Security (ISDFS), Jun. 2021, pp. 1–6. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9486377

  5. [5]

    Hybrid sequence-based Android malware detection using natural language processing

    N. Zhang, J. Xue, Y. Ma, R. Zhang, T. Liang, and Y.-a. Tan, “Hybrid sequence-based Android malware detection using natural language processing,” International Journal of Intelligent Systems, vol. 36, no. 10, pp. 5770–5784, 2021. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/int.22529

  6. [6]

    DLGraph: Malware Detection Using Deep Learning and Graph Embedding

    H. Jiang, T. Turki, and J. T. L. Wang, “DLGraph: Malware Detection Using Deep Learning and Graph Embedding,” in 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Dec. 2018, pp. 1029–1033. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8614193

  7. [7]

    AMalNet: A deep learning framework based on graph convolutional networks for malware detection

    X. Pei, L. Yu, and S. Tian, “AMalNet: A deep learning framework based on graph convolutional networks for malware detection,” Computers & Security, vol. 93, p. 101792, Jun. 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167404820300778

  8. [8]

    GDroid: Android malware detection and classification with graph convolutional network

    H. Gao, S. Cheng, and W. Zhang, “GDroid: Android malware detection and classification with graph convolutional network,” Computers & Security, vol. 106, p. 102264, Jul. 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167404821000882

  9. [9]

    Radon transform based malware classification in cyber-physical system using deep learning

    R. Alguliyev, R. Aliguliyev, and L. Sukhostat, “Radon transform based malware classification in cyber-physical system using deep learning,” Results in Control and Optimization, vol. 14, p. 100382, Mar. 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2666720724000122

  10. [10]

    STAMINA: Scalable deep learning approach for malware classification

    L. Chen, R. Sahita, J. Parikh, and M. Marino, “STAMINA: Scalable deep learning approach for malware classification,” Intel and Microsoft, Tech. Rep., 2020. [Online]. Available: https://www.microsoft.com/en-us/research/uploads/prod/2020/05/stamina.pdf

  11. [11]

    SDIF-CNN: Stacking deep image features using fine-tuned convolution neural network models for real-world malware detection and classification

    S. Kumar and K. Panda, “SDIF-CNN: Stacking deep image features using fine-tuned convolution neural network models for real-world malware detection and classification,” Applied Soft Computing, vol. 146, p. 110676, Oct. 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1568494623006944

  12. [12]

    Virus-MNIST: A benchmark malware dataset

    D. Noever and S. E. M. Noever, “Virus-MNIST: A benchmark malware dataset,” arXiv preprint arXiv:2103.00602, 2021. [Online]. Available: https://arxiv.org/abs/2103.00602

  13. [13]

    AndroDex: Android Dex images of obfuscated malware

    A. Khan, M. Usama, B. B. Kamal, A. Ahmad, H. Malik, and S. Lee, “AndroDex: Android Dex images of obfuscated malware,” Scientific Data, 2024. [Online]. Available: https://www.nature.com/articles/s41597-024-03027-3

  14. [14]

    McAfee dataset for malware detection

    McAfee Labs, “McAfee dataset for malware detection,” 2020. [Online]. Available: https://www.mcafee.com/enterprise/en-us/assets/white-papers/wp-machine-learning-malware-detection.pdf

  15. [15]

    Deep residual learning for image recognition

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.

  16. [16]

    How transferable are features in deep neural networks?

    J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” Advances in Neural Information Processing Systems, vol. 27, pp. 3320–3328, 2014.

  17. [17]

    The Road Less Scheduled

    A. Defazio, X. Yang, H. Mehta, K. Mishchenko, A. Khaled, and A. Cutkosky, “The Road Less Scheduled,” May 2024, arXiv:2405.15682 [cs, math, stat]. [Online]. Available: http://arxiv.org/abs/2405.15682

  18. [18]

    mixup: Beyond Empirical Risk Minimization

    H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond Empirical Risk Minimization,” Apr. 2018, arXiv:1710.09412 [cs, stat]. [Online]. Available: http://arxiv.org/abs/1710.09412

  19. [19]

    TrivialAugment: Tuning-free yet state-of-the-art data augmentation

    S. G. Müller and F. Hutter, “TrivialAugment: Tuning-free yet state-of-the-art data augmentation,” arXiv preprint arXiv:2103.10158, 2021.