Unraveling the Key of Machine Learning-based Android Malware Detection
Authors: Jiahao Liu, Jun Zeng, Fabio Pierazzi, Lorenzo Cavallaro, and Zhenkai Liang
Pith reviewed 2026-05-24 03:29 UTC · model grok-4.3
The pith
ML-based Android malware detectors remain vulnerable to evolving threats and adversarial attacks because they fail to capture the semantic information, derived from APK features, that characterizes malicious behaviors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By building a taxonomy and re-implementing 12 approaches, the paper shows that existing ML-based Android malware detectors achieve encouraging results in standard settings yet remain vulnerable to malware evolution and adversarial attacks. It traces these limitations to insufficient capture and use of malware semantics, defined as semantic information that characterizes malicious behaviors and is derived from APK features.
What carries the argument
A general-purpose framework that unifies Android app representations and the ML modeling pipeline to enable consistent re-implementation and cross-dimensional evaluation of detection approaches.
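To make the idea of one pipeline with swappable representations concrete, here is a minimal sketch. It is not the paper's framework: `App`, `permission_features`, `api_call_features`, and `VoteDetector` are hypothetical names, the model is a toy vote counter, and the four apps are made up. A real pipeline would fill the `App` fields with a static-analysis tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass(frozen=True)
class App:
    """Hypothetical stand-in for a parsed APK (not the paper's data model)."""
    permissions: Tuple[str, ...]
    api_calls: Tuple[str, ...]
    is_malware: bool

# A representation turns an app into a sparse feature dictionary.
Representation = Callable[[App], Dict[str, float]]

def permission_features(app: App) -> Dict[str, float]:
    return {f"perm:{name}": 1.0 for name in app.permissions}

def api_call_features(app: App) -> Dict[str, float]:
    return {f"api:{name}": 1.0 for name in app.api_calls}

class VoteDetector:
    """Toy model: each feature votes by how often it co-occurred with malware."""

    def __init__(self, representation: Representation) -> None:
        self.representation = representation
        self.weights: Dict[str, float] = {}

    def fit(self, apps: List[App]) -> "VoteDetector":
        self.weights = {}
        for app in apps:
            sign = 1.0 if app.is_malware else -1.0
            for name, value in self.representation(app).items():
                self.weights[name] = self.weights.get(name, 0.0) + sign * value
        return self

    def predict(self, app: App) -> bool:
        features = self.representation(app)
        score = sum(self.weights.get(n, 0.0) * v for n, v in features.items())
        return score > 0.0

def accuracy(detector: VoteDetector, apps: List[App]) -> float:
    return sum(detector.predict(a) == a.is_malware for a in apps) / len(apps)

# Made-up apps, used only to exercise the pipeline (train and test coincide).
APPS = [
    App(("SEND_SMS", "INTERNET"), ("sendTextMessage",), True),
    App(("SEND_SMS",), ("sendTextMessage", "getDeviceId"), True),
    App(("INTERNET",), ("openConnection",), False),
    App(("INTERNET", "CAMERA"), ("openConnection", "takePicture"), False),
]

# The same model and metric run unchanged under each representation.
for representation in (permission_features, api_call_features):
    detector = VoteDetector(representation).fit(APPS)
    print(representation.__name__, accuracy(detector, APPS))
```

Keeping the representation behind a single callable is what lets the model and the metric stay fixed while the features change, which is the property a cross-approach comparison depends on.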
If this is right
- Improving the capture of malware semantics should directly increase robustness to evolution and attacks.
- Current detectors trade off effectiveness for efficiency in ways that limit semantic depth.
- A taxonomy organized by representations and pipelines allows systematic identification of gaps across research communities.
- Recommendations for future work center on designing features and models that better encode malicious behavior semantics.
Where Pith is reading between the lines
- The evaluation setup could be extended to test whether newer representation learning methods overcome the identified semantic shortfall.
- If semantics are the key missing element, then hybrid systems combining static and dynamic behavioral traces may close the robustness gap faster than pure ML refinements.
- The taxonomy provides a reusable structure for classifying and comparing any future Android malware detector without redoing the full re-implementation effort.
Load-bearing premise
The twelve re-implemented approaches accurately reproduce the original published methods, and the chosen datasets, metrics, and attack models reflect real-world Android malware detection conditions.
What would settle it
A single detector that maintains high accuracy against both unseen malware families over time and adversarial perturbations while explicitly deriving and using semantic behavioral information from APK features.
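The "over time" half of that test can be sketched as a time-aware evaluation: train only on samples up to a cutoff year, then report F1 separately for each later year. This is an illustration under stated assumptions, not the paper's protocol. The `Sample` layout, `fit_signature_set`, and the six samples are hypothetical, and the adversarial half of the test is not covered here.

```python
from collections import defaultdict
from typing import Callable, Dict, FrozenSet, List, Tuple

Sample = Tuple[int, FrozenSet[str], bool]  # (year, features, is_malware)
Predictor = Callable[[FrozenSet[str]], bool]

def f1_score(pairs: List[Tuple[bool, bool]]) -> float:
    """F1 over (truth, predicted) pairs; defined as 0.0 when there are no true positives."""
    tp = sum(t and p for t, p in pairs)
    fp = sum((not t) and p for t, p in pairs)
    fn = sum(t and (not p) for t, p in pairs)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def fit_signature_set(train: List[Sample]) -> Predictor:
    """Toy detector: flag any feature seen in training malware but never in benign apps."""
    bad = set().union(*(f for _, f, m in train if m))
    good = set().union(*(f for _, f, m in train if not m))
    signatures = bad - good
    return lambda feats: bool(feats & signatures)

def f1_by_period(samples: List[Sample], train_until: int,
                 fit: Callable[[List[Sample]], Predictor]) -> Dict[int, float]:
    """Train on years <= train_until, then score each later year on its own."""
    predict = fit([s for s in samples if s[0] <= train_until])
    by_year: Dict[int, List[Tuple[bool, bool]]] = defaultdict(list)
    for year, feats, label in samples:
        if year > train_until:
            by_year[year].append((label, predict(feats)))
    return {year: f1_score(pairs) for year, pairs in sorted(by_year.items())}

# Made-up samples: the 2021 malware uses a behavior absent from the training years.
SAMPLES: List[Sample] = [
    (2019, frozenset({"sendTextMessage"}), True),
    (2019, frozenset({"openConnection"}), False),
    (2020, frozenset({"sendTextMessage", "openConnection"}), True),
    (2020, frozenset({"openConnection"}), False),
    (2021, frozenset({"loadDexFile"}), True),
    (2021, frozenset({"openConnection"}), False),
]

print(f1_by_period(SAMPLES, train_until=2019, fit=fit_signature_set))
```

A detector that meets the bar above would keep the per-year scores roughly flat; the toy detector here drops to zero as soon as the malware changes the single feature it memorized.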
Original abstract
With the rapid advancement of machine learning (ML), ML-based Android malware detection has gained significant popularity due to its ability to automatically learn malicious patterns from Android apps. However, the lack of an in-depth and systematic analysis of existing research makes it difficult to obtain a holistic understanding of the state of the art in this field. In this work, we present the most comprehensive investigation to date of ML-based Android malware detection systems, combining both empirical and quantitative analyses. We first organize prior work into a unified taxonomy based on Android app representations and the ML modeling pipeline. Building on this taxonomy, we design a general-purpose framework for ML-based Android malware detection and re-implement 12 representative approaches from three research communities: software engineering, security, and machine learning. Using this framework, we conduct a large-scale evaluation across three key dimensions: detection effectiveness, robustness to real-world challenges, and efficiency. Despite extensive research efforts and encouraging results, our findings reveal that existing learning-based Android malware detectors still face significant challenges, including vulnerability to malware evolution and susceptibility to adversarial attacks. We attribute these limitations to the detectors' limited ability to capture and leverage malware semantics, defined as semantic information that characterizes malicious behaviors derived from APK features. Finally, we summarize our key insights and provide actionable recommendations to guide future research in this domain.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper organizes prior ML-based Android malware detection work into a taxonomy based on app representations and the ML modeling pipeline, designs a general-purpose framework, re-implements 12 representative detectors from SE, security, and ML communities, and evaluates them at scale on detection effectiveness, robustness to malware evolution and adversarial attacks, and efficiency. It concludes that existing detectors remain vulnerable to evolution and adversarial examples because they fail to capture and leverage malware semantics (defined as semantic information characterizing malicious behaviors from APK features), and offers insights plus recommendations for future work.
Significance. If the central empirical claims hold after verification of reproduction fidelity, the work would be a significant contribution as the largest-scale comparative study in this area, providing a reusable framework and concrete evidence of persistent limitations that could steer the community toward semantics-aware approaches. The explicit taxonomy and unified re-implementation effort are strengths that enable direct comparability across communities.
major comments (2)
- [§4] Re-implementation section: The manuscript does not report quantitative fidelity metrics (e.g., side-by-side F1 or accuracy on the exact dataset splits used in the original publications) for any of the 12 re-implemented detectors. Because the central attribution of failure modes to lack of semantic capture rests entirely on these reproductions, absence of such checks leaves open the possibility that observed vulnerabilities are artifacts of implementation differences rather than intrinsic properties of the original methods.
- [§5.2–5.3] Evolution and adversarial evaluation: The paper attributes poor performance on evolved malware and adversarial examples to insufficient semantic capture, yet provides no ablation or feature-importance analysis showing that the detectors' learned representations indeed lack the semantic properties defined in the introduction. Without such evidence, the causal link between the observed failures and the semantics hypothesis remains correlational.
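The fidelity check requested in the first major comment amounts to a small side-by-side table. The sketch below shows one possible shape for it. The detector names, the F1 values, and the 0.02 tolerance are placeholders chosen for illustration; none of them come from the paper or from the original publications.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class FidelityRow:
    detector: str
    reported_f1: float    # value stated in the original publication
    reproduced_f1: float  # value measured on the re-implementation

    @property
    def gap(self) -> float:
        return self.reproduced_f1 - self.reported_f1

def fidelity_table(reported: Dict[str, float],
                   reproduced: Dict[str, float]) -> List[FidelityRow]:
    """Pair up detectors that have both a reported and a reproduced score."""
    shared = sorted(set(reported) & set(reproduced))
    return [FidelityRow(name, reported[name], reproduced[name]) for name in shared]

def outside_tolerance(rows: List[FidelityRow], tolerance: float = 0.02) -> List[str]:
    """Detectors whose reproduction differs from the reported score by more than tolerance."""
    return [row.detector for row in rows if abs(row.gap) > tolerance]

# Placeholder numbers for illustration only; they are NOT results from the paper.
REPORTED = {"detector_a": 0.95, "detector_b": 0.90}
REPRODUCED = {"detector_a": 0.94, "detector_b": 0.80}

rows = fidelity_table(REPORTED, REPRODUCED)
for row in rows:
    print(f"{row.detector}: reported={row.reported_f1:.2f} "
          f"reproduced={row.reproduced_f1:.2f} gap={row.gap:+.2f}")
print("needs investigation:", outside_tolerance(rows))
```

A row flagged here would not by itself show that a re-implementation is wrong, since dataset splits may differ, but it marks where the robustness findings should be read with more caution.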
minor comments (2)
- [Abstract / §1] The abstract and introduction repeatedly use the phrase 'most comprehensive investigation to date' without a supporting citation or explicit comparison table against prior surveys; a brief related-work paragraph quantifying coverage would strengthen this claim.
- [§3] Notation for the unified framework (e.g., how APK features are mapped to the taxonomy categories) is introduced in §3 but not summarized in a single table; adding such a table would improve readability for readers comparing the 12 approaches.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment point-by-point below.
- Referee: [§4] Re-implementation section: The manuscript does not report quantitative fidelity metrics (e.g., side-by-side F1 or accuracy on the exact dataset splits used in the original publications) for any of the 12 re-implemented detectors. Because the central attribution of failure modes to lack of semantic capture rests entirely on these reproductions, absence of such checks leaves open the possibility that observed vulnerabilities are artifacts of implementation differences rather than intrinsic properties of the original methods.
  Authors: We agree this is a valid concern for strengthening the reproduction claims. Our re-implementations followed the original papers as closely as possible within the unified framework, and overall trends align with published results. We will add a table in the revised §4 reporting side-by-side F1/accuracy comparisons against original publications on their reported dataset splits where those splits and data are available and reproducible.
  Revision: yes
- Referee: [§5.2–5.3] Evolution and adversarial evaluation: The paper attributes poor performance on evolved malware and adversarial examples to insufficient semantic capture, yet provides no ablation or feature-importance analysis showing that the detectors' learned representations indeed lack the semantic properties defined in the introduction. Without such evidence, the causal link between the observed failures and the semantics hypothesis remains correlational.
  Authors: The attribution rests on the taxonomy in §3, which classifies each detector by its feature representations and explicitly identifies which rely on syntactic rather than semantic properties (as defined in the introduction). The uniform vulnerability pattern across non-semantic detectors provides supporting evidence. We acknowledge the absence of explicit ablation or feature-importance studies. We will expand the discussion in §5.2–5.3 to more directly connect results to the taxonomy classifications; adding full ablations would require new experiments beyond the current scope.
  Revision: partial
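For readers unsure what the requested ablation would look like, here is a minimal sketch of a feature-group ablation: retrain with one group of features removed and record how much test accuracy falls. It is an illustration only, since the authors state that full ablations are outside the current scope. The `str:` and `api:` feature groups, the toy `fit_vote` model, and all samples are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Sample = Tuple[FrozenSet[str], bool]  # (features, is_malware)
Predictor = Callable[[FrozenSet[str]], bool]

def fit_vote(train: List[Sample]) -> Predictor:
    """Toy model: each feature votes +1 per malware sample and -1 per benign sample."""
    weights: Dict[str, int] = {}
    for feats, is_malware in train:
        for f in feats:
            weights[f] = weights.get(f, 0) + (1 if is_malware else -1)
    return lambda feats: sum(weights.get(f, 0) for f in feats) > 0

def accuracy(predict: Predictor, data: List[Sample]) -> float:
    return sum(predict(feats) == label for feats, label in data) / len(data)

def without(samples: List[Sample], removed: FrozenSet[str]) -> List[Sample]:
    """Copy of the samples with every feature in `removed` stripped out."""
    return [(feats - removed, label) for feats, label in samples]

def group_ablation(train: List[Sample], test: List[Sample],
                   groups: Dict[str, FrozenSet[str]]) -> Dict[str, float]:
    """Accuracy lost on the test set when each feature group is removed in turn."""
    baseline = accuracy(fit_vote(train), test)
    drops: Dict[str, float] = {}
    for name, members in groups.items():
        predict = fit_vote(without(train, members))
        drops[name] = baseline - accuracy(predict, without(test, members))
    return drops

# Made-up data: the test malware ships under a package name never seen in training.
TRAIN: List[Sample] = [
    (frozenset({"str:com.evil", "api:sendTextMessage"}), True),
    (frozenset({"str:com.evil", "api:sendTextMessage"}), True),
    (frozenset({"str:com.shop", "api:openConnection"}), False),
    (frozenset({"str:com.shop", "api:openConnection"}), False),
]
TEST: List[Sample] = [
    (frozenset({"str:com.fresh", "api:sendTextMessage"}), True),
    (frozenset({"str:com.shop", "api:openConnection"}), False),
]
GROUPS = {
    "string": frozenset({"str:com.evil", "str:com.shop", "str:com.fresh"}),
    "api": frozenset({"api:sendTextMessage", "api:openConnection"}),
}

print(group_ablation(TRAIN, TEST, GROUPS))
```

In this toy setting, removing the API-call group costs accuracy while removing the string group costs nothing, which is the kind of evidence that would tie a detector's behavior to a specific feature family rather than leaving the link correlational.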
Circularity Check
No circularity; empirical re-implementations and evaluations are independent of the paper's own inputs.
full rationale
The paper organizes prior work into a taxonomy, re-implements 12 detectors in a general framework, and evaluates them empirically on detection effectiveness, robustness to evolution/adversarial attacks, and efficiency. Claims about limitations and attribution to 'malware semantics' (defined as semantic information characterizing malicious behaviors from APK features) follow from these new comparisons rather than reducing by construction to fitted parameters, self-definitions, or self-citation chains. No equations, predictions, or uniqueness theorems are present that equate outputs to inputs. The work is self-contained against external benchmarks via the re-implementations and large-scale evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard ML evaluation assumptions hold, including that benchmark datasets are representative of real-world Android malware distributions.