MARD: A Multi-Agent Framework for Robust Android Malware Detection

Bo Li; Lei Cui; Sihao Liu; Xudong Mou; Xueying Zeng; Yanze Li; Youquan Xian

arxiv: 2604.25264 · v1 · submitted 2026-04-28 · 💻 cs.CR · cs.SE

MARD: A Multi-Agent Framework for Robust Android Malware Detection

Xueying Zeng , Youquan Xian , Sihao Liu , Xudong Mou , Yanze Li , Lei Cui , Bo Li This is my paper

Pith reviewed 2026-05-07 15:55 UTC · model grok-4.3

classification 💻 cs.CR cs.SE

keywords Android malware detectionmulti-agent frameworklarge language modelsstatic analysisconcept driftReAct paradigminterpretability

0 comments

The pith

MARD lets LLMs orchestrate static analysis tools via multi-agents for robust Android malware detection without fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MARD, a multi-agent system designed to let large language models manage traditional static analysis tools for finding malware in Android applications. It addresses the issue of machine learning models becoming outdated as malware changes and the inefficiency of having LLMs process entire code files directly. Through agent-based interactions following the ReAct approach, the system builds clear chains of evidence to support its decisions. This would matter to readers interested in security because it promises detection that stays reliable over long periods and across different sources without extra training costs.

Core claim

MARD is a multi-agent framework that treats deterministic static analysis engines as on-demand tools orchestrated by LLMs using the ReAct paradigm. This allows the construction of interpretable evidentiary chains for malware detection in APKs, achieving an F1 score of 93.46% without fine-tuning, with robustness to concept drift and cross-domain generalization over up to five years of data, at a cost under $0.10 per APK.

What carries the argument

The autonomous multi-agent interaction mechanism based on the ReAct paradigm, which enables LLMs to orchestrate static analysis engines as tools and build evidence chains.

Load-bearing premise

Large language models are capable of reliably selecting and interpreting outputs from static analysis tools to make correct malware determinations in complex Android application packages without any specialized training.

What would settle it

Testing MARD on a new collection of Android apps released after the evaluation period that shows a significant drop in F1 score or frequent failures in building consistent evidence chains.

Figures

Figures reproduced from arXiv: 2604.25264 by Bo Li, Lei Cui, Sihao Liu, Xudong Mou, Xueying Zeng, Yanze Li, Youquan Xian.

**Figure 1.** Figure 1: Comparison of different scheme architectures. view at source ↗

**Figure 2.** Figure 2: Overview of the MARD Architecture. Multi-dimensional Evidence Fusion. The Verdict Agent, situated at the apex of the architecture, is responsible for completing the global logical closed loop. It deeply fuses the macro-level intentional risks extracted in Tier 1 with the structured, micro-level definitive evidence vectors from Tier 2. Based on this fusion, it executes high-order logical deduction and align… view at source ↗

**Figure 3.** Figure 3: Dimension Reduction of API Context per APK. view at source ↗

**Figure 4.** Figure 4: Distribution and cumulative distribution of function-level lines of code (LOC). The left subfigure show the histogram and kernel density estimation of view at source ↗

**Figure 5.** Figure 5: 2017-2021 results on AndroZoo dataset. Answer for RQ2: In stark contrast to traditional and continual learning models that suffer from a precipitous performance decline over time, MARD demonstrates exceptional temporal generalization capabilities. By substituting the fitting of shallow statistical features with deep semantic reasoning of attack intent, the framework maintains stable detection efficacy … view at source ↗

**Figure 6.** Figure 6: Token consumption analysis results. over 85% to 90% of the system’s total token consumption. This phenomenon not only aligns with expectations but also strongly validates the core design philosophy of MARD. The macro-level pre-screening in Tier 1 successfully intercepts a massive volume of risk-free code, consuming only a minimal fraction of tokens (< 10%); meanwhile, the system skews the vast majority of … view at source ↗

**Figure 7.** Figure 7: Average consumption cost per APK. total input cost for processing a single APK across the entire pipeline is strictly controlled between $0.0675 and $0.0825 ( view at source ↗

read the original abstract

With the rapid evolution of Android applications, traditional machine learning-based detection models suffer from concept drift. Additionally, they are constrained by shallow features, lacking deep semantic understanding and interpretability of decisions. Although Large Language Models (LLMs) demonstrate remarkable semantic reasoning capabilities, directly processing massive raw code incurs prohibitive token overhead. Moreover, this approach fails to fully unleash the deep logical reasoning potential of LLMs within complex contexts. To address these limitations, we propose MARD, a multi-agent framework for robust Android malware detection. This framework effectively bridges the gap between the semantic understanding of LLMs and traditional static analysis. It treats underlying deterministic analysis engines as on-demand execution tools, while utilizing the LLM to orchestrate the entire decision-making process. By designing an autonomous multi-agent interaction mechanism based on the ReAct paradigm, MARD constructs a highly interpretable evidentiary chain for conviction. Furthermore, we radically reduce the total cost of conducting a deep analysis of a single complex APK to under $0.10. Evaluations demonstrate that, without any domain-specific fine-tuning, MARD achieves an F1 score of 93.46%. It not only outperforms continual learning baselines but also exhibits robustness against concept drift and strong cross-domain generalization capabilities in evaluations spanning up to five years.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MARD's multi-agent ReAct setup lets LLMs call static analysis tools on APKs instead of eating raw code, but the 93.46 F1 and drift-robustness numbers sit on unshown tool-call reliability.

read the letter

The main point is that this paper puts together a multi-agent ReAct loop where an LLM directs existing deterministic static analyzers as on-demand tools, builds an evidence chain, and claims it detects Android malware at 93.46 F1 without any domain-specific fine-tuning while staying under ten cents per APK and holding up across five years of data. That framing directly targets the token-cost and concept-drift problems that come with feeding whole APKs to LLMs or retraining models constantly. Treating the analyzers as callable tools rather than hoping the model parses decompiled code on its own is a straightforward engineering move that keeps the deterministic parts deterministic and gives the output some traceability. The cost figure and the no-tuning claim are the practical hooks if they check out. The evaluation section is the clear weak spot. The abstract states the F1 score, the outperformance over continual-learning baselines, and the cross-domain results, yet supplies no dataset sizes, no breakdown of which static tools were used, no tool-call success rates, and no count of how often the agents had to retry or fell back. Without those numbers it is impossible to tell whether the LLM actually orchestrates the tools reliably on messy real-world APKs or whether the reported performance depends on easy cases and heavy prompt engineering. The stress-test concern lands: if tool invocation or output interpretation fails even 5-10 percent of the time, the headline results would need post-filtering that contradicts the no-fine-tuning story. This is aimed at people already working on LLM agents for security or mobile malware pipelines. A reader who wants concrete ideas for tool-use patterns will find something to think about, but anyone planning to cite the numbers or replicate the system will need the full experimental details first. I would send it to peer review. The architecture is worth a proper check even if the current write-up leaves the central performance claims hanging.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MARD, a multi-agent framework that employs large language models (LLMs) under the ReAct paradigm to orchestrate deterministic static-analysis engines as on-demand tools for Android malware detection. It claims an F1 score of 93.46% without domain-specific fine-tuning, superiority over continual-learning baselines, robustness to concept drift, and strong cross-domain generalization across evaluations spanning up to five years, while reducing per-APK analysis cost below $0.10 and constructing interpretable evidentiary chains.

Significance. If the performance and robustness claims are substantiated with complete experimental details, the work could meaningfully advance Android malware detection by bridging LLM semantic reasoning with traditional static analysis, yielding interpretable decisions at low cost without fine-tuning. The reported ability to handle concept drift over multi-year spans and the emphasis on evidentiary chains represent potentially valuable contributions to robust, explainable security systems.

major comments (2)

[Abstract and Evaluation section] Abstract and Evaluation section: the central performance claim of 93.46% F1, outperformance of continual-learning baselines, and robustness to concept drift are stated without any description of the datasets (size, sources, temporal distribution), evaluation protocol (train/test splits, drift simulation method, cross-domain setup), baseline implementations, or statistical significance tests. This absence prevents assessment of whether the reported results support the no-fine-tuning and generalization assertions.
[Methodology section] Methodology section: the no-domain-specific-fine-tuning claim rests on the unverified premise that an unmodified LLM can reliably invoke static-analysis tools, correctly interpret their outputs (e.g., decompiled code, permission graphs), and assemble evidentiary chains with negligible error. The manuscript should report quantitative metrics on tool-call success rate, hallucination frequency during feature extraction, and fallback behavior for edge-case APKs; without these, the headline F1 and drift-robustness results cannot be confidently attributed to the framework rather than post-filtering or dataset-specific prompting.

minor comments (2)

[Abstract] The abstract refers to 'evaluations spanning up to five years' without naming the exact time windows or APK corpora used; adding these concrete details would improve clarity.
[Methodology] Notation for the multi-agent interaction mechanism and ReAct loop could be formalized with a diagram or pseudocode to make the orchestration process easier to follow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract and Evaluation section] Abstract and Evaluation section: the central performance claim of 93.46% F1, outperformance of continual-learning baselines, and robustness to concept drift are stated without any description of the datasets (size, sources, temporal distribution), evaluation protocol (train/test splits, drift simulation method, cross-domain setup), baseline implementations, or statistical significance tests. This absence prevents assessment of whether the reported results support the no-fine-tuning and generalization assertions.

Authors: We agree that the current presentation of results lacks sufficient detail on datasets and protocols, which limits independent assessment of the claims. In the revised manuscript, we will substantially expand the Evaluation section to include dataset sizes and sources, temporal distributions, exact train/test splits, the procedure for simulating concept drift, cross-domain evaluation setups, descriptions of baseline implementations, and statistical significance testing. These additions will directly support evaluation of the no-fine-tuning and generalization assertions. revision: yes
Referee: [Methodology section] Methodology section: the no-domain-specific-fine-tuning claim rests on the unverified premise that an unmodified LLM can reliably invoke static-analysis tools, correctly interpret their outputs (e.g., decompiled code, permission graphs), and assemble evidentiary chains with negligible error. The manuscript should report quantitative metrics on tool-call success rate, hallucination frequency during feature extraction, and fallback behavior for edge-case APKs; without these, the headline F1 and drift-robustness results cannot be confidently attributed to the framework rather than post-filtering or dataset-specific prompting.

Authors: We recognize that explicit quantitative evidence of tool-use reliability is required to substantiate the no-fine-tuning claim. Although the ReAct-based multi-agent design and deterministic tool interfaces were intended to limit errors, the original manuscript did not include aggregated metrics on tool-call success rates or hallucination frequency. In the revision we will add these metrics, derived from our existing experimental logs for tool-call success and supplemented by targeted additional runs to quantify hallucination rates during feature extraction. We will also document fallback behaviors for edge-case APKs. These changes will allow readers to attribute performance more confidently to the framework itself. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework claims rest on external tools and evaluations

full rationale

The paper describes MARD as an LLM-orchestrated multi-agent system using the external ReAct paradigm to invoke deterministic static-analysis engines on APKs. No equations, fitted parameters, or derivations appear that reduce predictions to inputs by construction. Performance metrics (F1 93.46%, drift robustness) are presented as results of cross-year empirical evaluations rather than self-referential fits or self-citation chains. The core premise relies on independent tool outputs and standard LLM capabilities, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated in the text. The framework implicitly relies on the general capabilities of LLMs and existing static analysis engines.

pith-pipeline@v0.9.0 · 5535 in / 1128 out tokens · 76165 ms · 2026-05-07T15:55:44.879500+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

49 extracted references

[1]

Android security: a survey of issues, malware penetration, and defenses.IEEE communications surveys & tutorials, 17(2):998–1022, 2014

Parvez Faruki, Ammar Bharmal, Vijay Laxmi, Vijay Ganmoor, Manoj Singh Gaur, Mauro Conti, and Muttukrishnan Rajarajan. Android security: a survey of issues, malware penetration, and defenses.IEEE communications surveys & tutorials, 17(2):998–1022, 2014

2014
[2]

A survey on various threats and current state of security in android platform.ACM Computing Surveys (CSUR), 52(1):1–35, 2019

Parnika Bhat and Kamlesh Dutta. A survey on various threats and current state of security in android platform.ACM Computing Surveys (CSUR), 52(1):1–35, 2019

2019
[3]

Crowdroid: behavior-based malware detection system for android

Iker Burguera, Urko Zurutuza, and Simin Nadjm-Tehrani. Crowdroid: behavior-based malware detection system for android. InProceedings of the 1st ACM workshop on Security and privacy in smartphones and mobile devices, pages 15–26, 2011

2011
[4]

Droid- sec: deep learning in android malware detection

Zhenlong Yuan, Yongqiang Lu, Zhaoguo Wang, and Yibo Xue. Droid- sec: deep learning in android malware detection. InProceedings of the 2014 ACM conference on SIGCOMM, pages 371–372, 2014

2014
[5]

Drebin: Effective and explainable detection of android malware in your pocket

Daniel Arp, Michael Spreitzenbarth, Malte Hubner, Hugo Gascon, Konrad Rieck, and CERT Siemens. Drebin: Effective and explainable detection of android malware in your pocket. InNdss, volume 14, pages 23–26. San Diego, CA, 2014

2014
[6]

Transcend: Detecting concept drift in malware classification models

Roberto Jordaney, Kumar Sharad, Santanu K Dash, Zhi Wang, Davide Papini, Ilia Nouretdinov, and Lorenzo Cavallaro. Transcend: Detecting concept drift in malware classification models. In26th USENIX security symposium (USENIX security 17), pages 625–642, 2017

2017
[7]

Transcending transcend: Revisiting malware classification in the presence of concept drift

Federico Barbero, Feargus Pendlebury, Fabio Pierazzi, and Lorenzo Cavallaro. Transcending transcend: Revisiting malware classification in the presence of concept drift. In2022 IEEE Symposium on Security and Privacy (SP), pages 805–823. IEEE, 2022

2022
[8]

Cyber code intelligence for android malware detection.IEEE Transactions on Cybernetics, 53(1):617–627, 2022

Junyang Qiu, Qing-Long Han, Wei Luo, Lei Pan, Surya Nepal, Jun Zhang, and Yang Xiang. Cyber code intelligence for android malware detection.IEEE Transactions on Cybernetics, 53(1):617–627, 2022

2022
[9]

Droidapiminer: Mining api- level features for robust malware detection in android

Yousra Aafer, Wenliang Du, and Heng Yin. Droidapiminer: Mining api- level features for robust malware detection in android. InInternational conference on security and privacy in communication systems, pages 86–103. Springer, 2013

2013
[10]

Intelligent mobile malware detection using permission requests and api calls.Future Generation Computer Systems, 107:509–521, 2020

Moutaz Alazab, Mamoun Alazab, Andrii Shalaginov, Abdelwadood Mesleh, and Albara Awajan. Intelligent mobile malware detection using permission requests and api calls.Future Generation Computer Systems, 107:509–521, 2020

2020
[11]

Cruparamer: Learning on parameter-augmented api sequences for malware detection.IEEE Transactions on Information Forensics and Security, 17:788–803, 2022

Xiaohui Chen, Zhiyu Hao, Lun Li, Lei Cui, Yiran Zhu, Zhenquan Ding, and Yongji Liu. Cruparamer: Learning on parameter-augmented api sequences for malware detection.IEEE Transactions on Information Forensics and Security, 17:788–803, 2022

2022
[12]

Significant permission identification for machine-learning- based android malware detection.IEEE Transactions on Industrial Informatics, 14(7):3216–3225, 2018

Jin Li, Lichao Sun, Qiben Yan, Zhiqiang Li, Witawas Srisa-An, and Heng Ye. Significant permission identification for machine-learning- based android malware detection.IEEE Transactions on Industrial Informatics, 14(7):3216–3225, 2018

2018
[13]

Improved real- time permission based malware detection and clustering approach using model independent pruning.IET Information Security, 14(5):531–541, 2020

Janani Thiyagarajan, A Akash, and Brindha Murugan. Improved real- time permission based malware detection and clustering approach using model independent pruning.IET Information Security, 14(5):531–541, 2020

2020
[14]

Malscan: Fast market-wide mobile malware scanning by social-network centrality analysis

Yueming Wu, Xiaodi Li, Deqing Zou, Wei Yang, Xin Zhang, and Hai Jin. Malscan: Fast market-wide mobile malware scanning by social-network centrality analysis. In2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 139–150. IEEE, 2019

2019
[15]

Learn- ing features from enhanced function call graphs for android malware detection.Neurocomputing, 423:301–307, 2021

Minghui Cai, Yuan Jiang, Cuiying Gao, Heng Li, and Wei Yuan. Learn- ing features from enhanced function call graphs for android malware detection.Neurocomputing, 423:301–307, 2021

2021
[16]

Dmalnet: Dynamic malware analysis based on api feature engineering and graph learning.Computers & Security, 122:102872, 2022

Ce Li, Zijun Cheng, He Zhu, Leiqi Wang, Qiujian Lv, Yan Wang, Ning Li, and Degang Sun. Dmalnet: Dynamic malware analysis based on api feature engineering and graph learning.Computers & Security, 122:102872, 2022

2022
[17]

Demystifying the evolution of android malware variants.IEEE Transactions on Dependable and Secure Computing, 21(4):3324–3341, 2023

Lihong Tang, Xiao Chen, Sheng Wen, Li Li, Marthie Grobler, and Yang Xiang. Demystifying the evolution of android malware variants.IEEE Transactions on Dependable and Secure Computing, 21(4):3324–3341, 2023

2023
[18]

Machine learning for android malware detection: mission accomplished? a comprehensive review of open challenges and future perspectives.Computers & Security, 138:103654, 2024

Alejandro Guerra-Manzanares. Machine learning for android malware detection: mission accomplished? a comprehensive review of open challenges and future perspectives.Computers & Security, 138:103654, 2024

2024
[19]

Mamadroid: Detect- ing android malware by building markov chains of behavioral models (extended version).ACM Transactions on Privacy and Security (TOPS), 22(2):1–34, 2019

Lucky Onwuzurike, Enrico Mariconti, Panagiotis Andriotis, Emiliano De Cristofaro, Gordon Ross, and Gianluca Stringhini. Mamadroid: Detect- ing android malware by building markov chains of behavioral models (extended version).ACM Transactions on Privacy and Security (TOPS), 22(2):1–34, 2019

2019
[20]

Sdac: A slow-aging solution for android malware detection using semantic distance based api clustering.IEEE transactions on dependable and secure computing, 19(2):1149–1163, 2020

Jiayun Xu, Yingjiu Li, Robert H Deng, and Ke Xu. Sdac: A slow-aging solution for android malware detection using semantic distance based api clustering.IEEE transactions on dependable and secure computing, 19(2):1149–1163, 2020

2020
[21]

Slowing down the aging of learning- based malware detectors with api knowledge.IEEE Transactions on Dependable and Secure Computing, 20(2):902–916, 2022

Xiaohan Zhang, Mi Zhang, Yuan Zhang, Ming Zhong, Xin Zhang, Yinzhi Cao, and Min Yang. Slowing down the aging of learning- based malware detectors with api knowledge.IEEE Transactions on Dependable and Secure Computing, 20(2):902–916, 2022

2022
[22]

A novel android malware detection method with api semantics extraction

Hongyu Yang, Youwei Wang, Liang Zhang, Xiang Cheng, and Ze Hu. A novel android malware detection method with api semantics extraction. Computers & Security, 137:103651, 2024

2024
[23]

Ldcdroid: Learning data drift characteristics for handling the model aging problem in android malware detection

Zhen Liu, Ruoyu Wang, Bitao Peng, Lingyu Qiu, Qingqing Gan, Changji Wang, and Wenbin Zhang. Ldcdroid: Learning data drift characteristics for handling the model aging problem in android malware detection. Computers & Security, 150:104294, 2025

2025
[24]

In30th USENIX Security Symposium (USENIX Security 21), pages 2327–2344, 2021

Limin Yang, Wenbo Guo, Qingying Hao, Arridhana Ciptadi, Ali Ah- madzadeh, Xinyu Xing, and Gang Wang.{CADE}: Detecting and explaining concept drift samples for security applications. In30th USENIX Security Symposium (USENIX Security 21), pages 2327–2344, 2021

2021
[25]

Fesad ransomware detection framework with machine learning using adaption to concept drift.Computers & Security, 137:103629, 2024

Damien Warren Fernando and Nikos Komninos. Fesad ransomware detection framework with machine learning using adaption to concept drift.Computers & Security, 137:103629, 2024

2024
[26]

Droide- volver: Self-evolving android malware detection system

Ke Xu, Yingjiu Li, Robert Deng, Kai Chen, and Jiayun Xu. Droide- volver: Self-evolving android malware detection system. In2019 IEEE European Symposium on Security and Privacy (EuroS&P), pages 47–62. IEEE, 2019

2019
[27]

Strengthening llm ecosystem security: Preventing mobile malware from manipulating llm-based applications.Information Sciences, 681:120923, 2024

Lu Huang, Jingfeng Xue, Yong Wang, Junbao Chen, and Tianwei Lei. Strengthening llm ecosystem security: Preventing mobile malware from manipulating llm-based applications.Information Sciences, 681:120923, 2024

2024
[28]

Continuous learning for android malware detection

Yizheng Chen, Zhoujie Ding, and David Wagner. Continuous learning for android malware detection. In32nd USENIX Security Symposium (USENIX Security 23), pages 1127–1144, 2023

2023
[29]

Plangenllms: A modern survey of llm planning capabilities

Hui Wei, Zihao Zhang, Shenghua He, Tian Xia, Shijia Pan, and Fei Liu. Plangenllms: A modern survey of llm planning capabilities. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19497– 19521, 2025

2025
[30]

Srdc: Semantics- based ransomware detection and classification with llm-assisted pre- training

Ce Zhou, Yilun Liu, Weibin Meng, Shimin Tao, Weinan Tian, Feiyu Yao, Xiaochun Li, Tao Han, Boxing Chen, and Hao Yang. Srdc: Semantics- based ransomware detection and classification with llm-assisted pre- training. InProceedings of the AAAI Conference on Artificial Intel- ligence, volume 39, pages 28566–28574, 2025

2025
[31]

Apppoet: Large language model based android malware detection via multi-view prompt engineering.Expert Systems with Applications, 262:125546, 2025

Wenxiang Zhao, Juntao Wu, and Zhaoyi Meng. Apppoet: Large language model based android malware detection via multi-view prompt engineering.Expert Systems with Applications, 262:125546, 2025

2025
[32]

Foredroid: Scenario-aware analysis for android malware detection and explanation

Jiaming Li, Sen Chen, Chunlian Wu, Yuxin Zhang, and Lingling Fan. Foredroid: Scenario-aware analysis for android malware detection and explanation. InProceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, pages 1379–1393, 2025

2025
[33]

Prompt engineering-assisted malware dynamic analysis using gpt-4.IEEE Transactions on Dependable and Secure Computing, 2025

Pei Yan, Shunquan Tan, Miaohui Wang, and Jiwu Huang. Prompt engineering-assisted malware dynamic analysis using gpt-4.IEEE Transactions on Dependable and Secure Computing, 2025

2025
[34]

Av-agent: A bottom- up interpretable malware classifier based on large language models

Rui Zheng, Zhibo Wang, Kui Ren, and Chun Chen. Av-agent: A bottom- up interpretable malware classifier based on large language models. IEEE Transactions on Information Forensics and Security, 2025

2025
[35]

On benchmarking code llms for android malware analysis

Yiling He, Hongyu She, Xingzhi Qian, Xinran Zheng, Zhuo Chen, Zhan Qin, and Lorenzo Cavallaro. On benchmarking code llms for android malware analysis. InProceedings of the 34th ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 153– 160, 2025

2025
[36]

Lamd: Context-driven android malware detection and classification with llms

Xingzhi Qian, Xinran Zheng, Yiling He, Shuo Yang, and Lorenzo Caval- laro. Lamd: Context-driven android malware detection and classification with llms. In2025 IEEE Security and Privacy Workshops (SPW), pages 126–136. IEEE, 2025

2025
[37]

Api2vec: Learning representations of api sequences for malware detection

Lei Cui, Jiancong Cui, Yuede Ji, Zhiyu Hao, Lun Li, and Zhenquan Ding. Api2vec: Learning representations of api sequences for malware detection. InProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 261–273, 2023

2023
[38]

Apibeh: Learning behavior inclination of apis for malware classification

Lei Cui, Yiran Zhu, Junnan Yin, Zhiyu Hao, Wei Wang, Peng Liu, Ziqi Yang, and Xiaochun Yun. Apibeh: Learning behavior inclination of apis for malware classification. In2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE), pages 1–12. IEEE, 2024

2024
[39]

Appcontext: Differentiating malicious and benign mobile app behaviors using context

Wei Yang, Xusheng Xiao, Benjamin Andow, Sihan Li, Tao Xie, and William Enck. Appcontext: Differentiating malicious and benign mobile app behaviors using context. In2015 IEEE/ACM 37th IEEE interna- tional conference on software engineering, volume 1, pages 303–313. IEEE, 2015

2015
[40]

Android malware classification and optimisation based on bm25 score of android api

Rahul Yumlembam, Biju Issac, Longzhi Yang, and Seibu Mary Jacob. Android malware classification and optimisation based on bm25 score of android api. InIEEE INFOCOM 2023-IEEE conference on computer communications workshops (INFOCOM WKSHPS), pages 1–6. IEEE, 2023

2023
[41]

A multimodal deep learning method for android malware detection using various features.IEEE Transactions on Information Forensics and Security, 14(3):773–788, 2018

TaeGuen Kim, BooJoong Kang, Mina Rho, Sakir Sezer, and Eul Gyu Im. A multimodal deep learning method for android malware detection using various features.IEEE Transactions on Information Forensics and Security, 14(3):773–788, 2018

2018
[42]

Jowmdroid: Android malware detection based on feature weighting with joint optimization of weight- mapping and classifier parameters.Computers & Security, 100:102086, 2021

Lingru Cai, Yao Li, and Zhi Xiong. Jowmdroid: Android malware detection based on feature weighting with joint optimization of weight- mapping and classifier parameters.Computers & Security, 100:102086, 2021

2021
[43]

Mobipcr: Efficient, accurate, and strict ml-based mobile malware detection.Future Generation Computer Systems, 144:140–150, 2023

Chuanchang Liu, Jianyun Lu, Wendi Feng, Enbo Du, Luyang Di, and Zhen Song. Mobipcr: Efficient, accurate, and strict ml-based mobile malware detection.Future Generation Computer Systems, 144:140–150, 2023

2023
[44]

Fesa: Feature selection architecture for ransomware detection under concept drift.Computers & Security, 116:102659, 2022

Damien Warren Fernando and Nikos Komninos. Fesa: Feature selection architecture for ransomware detection under concept drift.Computers & Security, 116:102659, 2022

2022
[45]

Qineng Wang, Zihao Wang, Ying Su, Hanghang Tong, and Yangqiu Song. Rethinking the bounds of llm reasoning: Are multi-agent dis- cussions the key? InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6106–6131, 2024

2024
[46]

Poster: Llmalware: An llm-powered robust and efficient android malware detec- tion framework

Zijing Ma, Leming Shen, Xinyu Huang, and Yuanqing Zheng. Poster: Llmalware: An llm-powered robust and efficient android malware detec- tion framework. InProceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, pages 4737–4739, 2025

2025
[47]

Dynamic android malware category classification using semi-supervised deep learning

Samaneh Mahdavifar, Andi Fitriah Abdul Kadir, Rasool Fatemi, Dima Alhadidi, and Ali A Ghorbani. Dynamic android malware category classification using semi-supervised deep learning. In2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on ...

2020
[48]

Bissyand ´e, Jacques Klein, and Yves Le Traon

Kevin Allix, Tegawend ´e F. Bissyand ´e, Jacques Klein, and Yves Le Traon. Androzoo: Collecting millions of android apps for the research community. InProceedings of the 13th International Conference on Mining Software Repositories, MSR ’16, pages 468–471, New York, NY , USA, 2016. ACM

2016
[49]

Toward developing a systematic approach to generate bench- mark android malware datasets and classification

Arash Habibi Lashkari, Andi Fitriah A Kadir, Laya Taheri, and Ali A Ghorbani. Toward developing a systematic approach to generate bench- mark android malware datasets and classification. In2018 International Carnahan conference on security technology (ICCST), pages 1–7. ieee, 2018

2018

[1] [1]

Android security: a survey of issues, malware penetration, and defenses.IEEE communications surveys & tutorials, 17(2):998–1022, 2014

Parvez Faruki, Ammar Bharmal, Vijay Laxmi, Vijay Ganmoor, Manoj Singh Gaur, Mauro Conti, and Muttukrishnan Rajarajan. Android security: a survey of issues, malware penetration, and defenses.IEEE communications surveys & tutorials, 17(2):998–1022, 2014

2014

[2] [2]

A survey on various threats and current state of security in android platform.ACM Computing Surveys (CSUR), 52(1):1–35, 2019

Parnika Bhat and Kamlesh Dutta. A survey on various threats and current state of security in android platform.ACM Computing Surveys (CSUR), 52(1):1–35, 2019

2019

[3] [3]

Crowdroid: behavior-based malware detection system for android

Iker Burguera, Urko Zurutuza, and Simin Nadjm-Tehrani. Crowdroid: behavior-based malware detection system for android. InProceedings of the 1st ACM workshop on Security and privacy in smartphones and mobile devices, pages 15–26, 2011

2011

[4] [4]

Droid- sec: deep learning in android malware detection

Zhenlong Yuan, Yongqiang Lu, Zhaoguo Wang, and Yibo Xue. Droid- sec: deep learning in android malware detection. InProceedings of the 2014 ACM conference on SIGCOMM, pages 371–372, 2014

2014

[5] [5]

Drebin: Effective and explainable detection of android malware in your pocket

Daniel Arp, Michael Spreitzenbarth, Malte Hubner, Hugo Gascon, Konrad Rieck, and CERT Siemens. Drebin: Effective and explainable detection of android malware in your pocket. InNdss, volume 14, pages 23–26. San Diego, CA, 2014

2014

[6] [6]

Transcend: Detecting concept drift in malware classification models

Roberto Jordaney, Kumar Sharad, Santanu K Dash, Zhi Wang, Davide Papini, Ilia Nouretdinov, and Lorenzo Cavallaro. Transcend: Detecting concept drift in malware classification models. In26th USENIX security symposium (USENIX security 17), pages 625–642, 2017

2017

[7] [7]

Transcending transcend: Revisiting malware classification in the presence of concept drift

Federico Barbero, Feargus Pendlebury, Fabio Pierazzi, and Lorenzo Cavallaro. Transcending transcend: Revisiting malware classification in the presence of concept drift. In2022 IEEE Symposium on Security and Privacy (SP), pages 805–823. IEEE, 2022

2022

[8] [8]

Cyber code intelligence for android malware detection.IEEE Transactions on Cybernetics, 53(1):617–627, 2022

Junyang Qiu, Qing-Long Han, Wei Luo, Lei Pan, Surya Nepal, Jun Zhang, and Yang Xiang. Cyber code intelligence for android malware detection.IEEE Transactions on Cybernetics, 53(1):617–627, 2022

2022

[9] [9]

Droidapiminer: Mining api- level features for robust malware detection in android

Yousra Aafer, Wenliang Du, and Heng Yin. Droidapiminer: Mining api- level features for robust malware detection in android. InInternational conference on security and privacy in communication systems, pages 86–103. Springer, 2013

2013

[10] [10]

Intelligent mobile malware detection using permission requests and api calls.Future Generation Computer Systems, 107:509–521, 2020

Moutaz Alazab, Mamoun Alazab, Andrii Shalaginov, Abdelwadood Mesleh, and Albara Awajan. Intelligent mobile malware detection using permission requests and api calls.Future Generation Computer Systems, 107:509–521, 2020

2020

[11] [11]

Cruparamer: Learning on parameter-augmented api sequences for malware detection.IEEE Transactions on Information Forensics and Security, 17:788–803, 2022

Xiaohui Chen, Zhiyu Hao, Lun Li, Lei Cui, Yiran Zhu, Zhenquan Ding, and Yongji Liu. Cruparamer: Learning on parameter-augmented api sequences for malware detection.IEEE Transactions on Information Forensics and Security, 17:788–803, 2022

2022

[12] [12]

Significant permission identification for machine-learning- based android malware detection.IEEE Transactions on Industrial Informatics, 14(7):3216–3225, 2018

Jin Li, Lichao Sun, Qiben Yan, Zhiqiang Li, Witawas Srisa-An, and Heng Ye. Significant permission identification for machine-learning- based android malware detection.IEEE Transactions on Industrial Informatics, 14(7):3216–3225, 2018

2018

[13] [13]

Improved real- time permission based malware detection and clustering approach using model independent pruning.IET Information Security, 14(5):531–541, 2020

Janani Thiyagarajan, A Akash, and Brindha Murugan. Improved real- time permission based malware detection and clustering approach using model independent pruning.IET Information Security, 14(5):531–541, 2020

2020

[14] [14]

Malscan: Fast market-wide mobile malware scanning by social-network centrality analysis

Yueming Wu, Xiaodi Li, Deqing Zou, Wei Yang, Xin Zhang, and Hai Jin. Malscan: Fast market-wide mobile malware scanning by social-network centrality analysis. In2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 139–150. IEEE, 2019

2019

[15] [15]

Learn- ing features from enhanced function call graphs for android malware detection.Neurocomputing, 423:301–307, 2021

Minghui Cai, Yuan Jiang, Cuiying Gao, Heng Li, and Wei Yuan. Learn- ing features from enhanced function call graphs for android malware detection.Neurocomputing, 423:301–307, 2021

2021

[16] [16]

Dmalnet: Dynamic malware analysis based on api feature engineering and graph learning.Computers & Security, 122:102872, 2022

Ce Li, Zijun Cheng, He Zhu, Leiqi Wang, Qiujian Lv, Yan Wang, Ning Li, and Degang Sun. Dmalnet: Dynamic malware analysis based on api feature engineering and graph learning.Computers & Security, 122:102872, 2022

2022

[17] [17]

Demystifying the evolution of android malware variants.IEEE Transactions on Dependable and Secure Computing, 21(4):3324–3341, 2023

Lihong Tang, Xiao Chen, Sheng Wen, Li Li, Marthie Grobler, and Yang Xiang. Demystifying the evolution of android malware variants.IEEE Transactions on Dependable and Secure Computing, 21(4):3324–3341, 2023

2023

[18] [18]

Machine learning for android malware detection: mission accomplished? a comprehensive review of open challenges and future perspectives.Computers & Security, 138:103654, 2024

Alejandro Guerra-Manzanares. Machine learning for android malware detection: mission accomplished? a comprehensive review of open challenges and future perspectives.Computers & Security, 138:103654, 2024

2024

[19] [19]

Mamadroid: Detect- ing android malware by building markov chains of behavioral models (extended version).ACM Transactions on Privacy and Security (TOPS), 22(2):1–34, 2019

Lucky Onwuzurike, Enrico Mariconti, Panagiotis Andriotis, Emiliano De Cristofaro, Gordon Ross, and Gianluca Stringhini. Mamadroid: Detect- ing android malware by building markov chains of behavioral models (extended version).ACM Transactions on Privacy and Security (TOPS), 22(2):1–34, 2019

2019

[20] [20]

Sdac: A slow-aging solution for android malware detection using semantic distance based api clustering.IEEE transactions on dependable and secure computing, 19(2):1149–1163, 2020

Jiayun Xu, Yingjiu Li, Robert H Deng, and Ke Xu. Sdac: A slow-aging solution for android malware detection using semantic distance based api clustering.IEEE transactions on dependable and secure computing, 19(2):1149–1163, 2020

2020

[21] [21]

Slowing down the aging of learning- based malware detectors with api knowledge.IEEE Transactions on Dependable and Secure Computing, 20(2):902–916, 2022

Xiaohan Zhang, Mi Zhang, Yuan Zhang, Ming Zhong, Xin Zhang, Yinzhi Cao, and Min Yang. Slowing down the aging of learning- based malware detectors with api knowledge.IEEE Transactions on Dependable and Secure Computing, 20(2):902–916, 2022

2022

[22] [22]

A novel android malware detection method with api semantics extraction

Hongyu Yang, Youwei Wang, Liang Zhang, Xiang Cheng, and Ze Hu. A novel android malware detection method with api semantics extraction. Computers & Security, 137:103651, 2024

2024

[23] [23]

Ldcdroid: Learning data drift characteristics for handling the model aging problem in android malware detection

Zhen Liu, Ruoyu Wang, Bitao Peng, Lingyu Qiu, Qingqing Gan, Changji Wang, and Wenbin Zhang. Ldcdroid: Learning data drift characteristics for handling the model aging problem in android malware detection. Computers & Security, 150:104294, 2025

2025

[24] [24]

In30th USENIX Security Symposium (USENIX Security 21), pages 2327–2344, 2021

Limin Yang, Wenbo Guo, Qingying Hao, Arridhana Ciptadi, Ali Ah- madzadeh, Xinyu Xing, and Gang Wang.{CADE}: Detecting and explaining concept drift samples for security applications. In30th USENIX Security Symposium (USENIX Security 21), pages 2327–2344, 2021

2021

[25] [25]

Fesad ransomware detection framework with machine learning using adaption to concept drift.Computers & Security, 137:103629, 2024

Damien Warren Fernando and Nikos Komninos. Fesad ransomware detection framework with machine learning using adaption to concept drift.Computers & Security, 137:103629, 2024

2024

[26] [26]

Droide- volver: Self-evolving android malware detection system

Ke Xu, Yingjiu Li, Robert Deng, Kai Chen, and Jiayun Xu. Droide- volver: Self-evolving android malware detection system. In2019 IEEE European Symposium on Security and Privacy (EuroS&P), pages 47–62. IEEE, 2019

2019

[27] [27]

Strengthening llm ecosystem security: Preventing mobile malware from manipulating llm-based applications.Information Sciences, 681:120923, 2024

Lu Huang, Jingfeng Xue, Yong Wang, Junbao Chen, and Tianwei Lei. Strengthening llm ecosystem security: Preventing mobile malware from manipulating llm-based applications.Information Sciences, 681:120923, 2024

2024

[28] [28]

Continuous learning for android malware detection

Yizheng Chen, Zhoujie Ding, and David Wagner. Continuous learning for android malware detection. In32nd USENIX Security Symposium (USENIX Security 23), pages 1127–1144, 2023

2023

[29] [29]

Plangenllms: A modern survey of llm planning capabilities

Hui Wei, Zihao Zhang, Shenghua He, Tian Xia, Shijia Pan, and Fei Liu. Plangenllms: A modern survey of llm planning capabilities. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19497– 19521, 2025

2025

[30] [30]

Srdc: Semantics- based ransomware detection and classification with llm-assisted pre- training

Ce Zhou, Yilun Liu, Weibin Meng, Shimin Tao, Weinan Tian, Feiyu Yao, Xiaochun Li, Tao Han, Boxing Chen, and Hao Yang. Srdc: Semantics- based ransomware detection and classification with llm-assisted pre- training. InProceedings of the AAAI Conference on Artificial Intel- ligence, volume 39, pages 28566–28574, 2025

2025

[31] [31]

Apppoet: Large language model based android malware detection via multi-view prompt engineering.Expert Systems with Applications, 262:125546, 2025

Wenxiang Zhao, Juntao Wu, and Zhaoyi Meng. Apppoet: Large language model based android malware detection via multi-view prompt engineering.Expert Systems with Applications, 262:125546, 2025

2025

[32] [32]

Foredroid: Scenario-aware analysis for android malware detection and explanation

Jiaming Li, Sen Chen, Chunlian Wu, Yuxin Zhang, and Lingling Fan. Foredroid: Scenario-aware analysis for android malware detection and explanation. InProceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, pages 1379–1393, 2025

2025

[33] [33]

Prompt engineering-assisted malware dynamic analysis using gpt-4.IEEE Transactions on Dependable and Secure Computing, 2025

Pei Yan, Shunquan Tan, Miaohui Wang, and Jiwu Huang. Prompt engineering-assisted malware dynamic analysis using gpt-4.IEEE Transactions on Dependable and Secure Computing, 2025

2025

[34] [34]

Av-agent: A bottom- up interpretable malware classifier based on large language models

Rui Zheng, Zhibo Wang, Kui Ren, and Chun Chen. Av-agent: A bottom- up interpretable malware classifier based on large language models. IEEE Transactions on Information Forensics and Security, 2025

2025

[35] [35]

On benchmarking code llms for android malware analysis

Yiling He, Hongyu She, Xingzhi Qian, Xinran Zheng, Zhuo Chen, Zhan Qin, and Lorenzo Cavallaro. On benchmarking code llms for android malware analysis. InProceedings of the 34th ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 153– 160, 2025

2025

[36] [36]

Lamd: Context-driven android malware detection and classification with llms

Xingzhi Qian, Xinran Zheng, Yiling He, Shuo Yang, and Lorenzo Caval- laro. Lamd: Context-driven android malware detection and classification with llms. In2025 IEEE Security and Privacy Workshops (SPW), pages 126–136. IEEE, 2025

2025

[37] [37]

Api2vec: Learning representations of api sequences for malware detection

Lei Cui, Jiancong Cui, Yuede Ji, Zhiyu Hao, Lun Li, and Zhenquan Ding. Api2vec: Learning representations of api sequences for malware detection. InProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 261–273, 2023

2023

[38] [38]

Apibeh: Learning behavior inclination of apis for malware classification

Lei Cui, Yiran Zhu, Junnan Yin, Zhiyu Hao, Wei Wang, Peng Liu, Ziqi Yang, and Xiaochun Yun. Apibeh: Learning behavior inclination of apis for malware classification. In2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE), pages 1–12. IEEE, 2024

2024

[39] [39]

Appcontext: Differentiating malicious and benign mobile app behaviors using context

Wei Yang, Xusheng Xiao, Benjamin Andow, Sihan Li, Tao Xie, and William Enck. Appcontext: Differentiating malicious and benign mobile app behaviors using context. In2015 IEEE/ACM 37th IEEE interna- tional conference on software engineering, volume 1, pages 303–313. IEEE, 2015

2015

[40] [40]

Android malware classification and optimisation based on bm25 score of android api

Rahul Yumlembam, Biju Issac, Longzhi Yang, and Seibu Mary Jacob. Android malware classification and optimisation based on bm25 score of android api. InIEEE INFOCOM 2023-IEEE conference on computer communications workshops (INFOCOM WKSHPS), pages 1–6. IEEE, 2023

2023

[41] [41]

A multimodal deep learning method for android malware detection using various features.IEEE Transactions on Information Forensics and Security, 14(3):773–788, 2018

TaeGuen Kim, BooJoong Kang, Mina Rho, Sakir Sezer, and Eul Gyu Im. A multimodal deep learning method for android malware detection using various features.IEEE Transactions on Information Forensics and Security, 14(3):773–788, 2018

2018

[42] [42]

Jowmdroid: Android malware detection based on feature weighting with joint optimization of weight- mapping and classifier parameters.Computers & Security, 100:102086, 2021

Lingru Cai, Yao Li, and Zhi Xiong. Jowmdroid: Android malware detection based on feature weighting with joint optimization of weight- mapping and classifier parameters.Computers & Security, 100:102086, 2021

2021

[43] [43]

Mobipcr: Efficient, accurate, and strict ml-based mobile malware detection.Future Generation Computer Systems, 144:140–150, 2023

Chuanchang Liu, Jianyun Lu, Wendi Feng, Enbo Du, Luyang Di, and Zhen Song. Mobipcr: Efficient, accurate, and strict ml-based mobile malware detection.Future Generation Computer Systems, 144:140–150, 2023

2023

[44] [44]

Fesa: Feature selection architecture for ransomware detection under concept drift.Computers & Security, 116:102659, 2022

Damien Warren Fernando and Nikos Komninos. Fesa: Feature selection architecture for ransomware detection under concept drift.Computers & Security, 116:102659, 2022

2022

[45] [45]

Qineng Wang, Zihao Wang, Ying Su, Hanghang Tong, and Yangqiu Song. Rethinking the bounds of llm reasoning: Are multi-agent dis- cussions the key? InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6106–6131, 2024

2024

[46] [46]

Poster: Llmalware: An llm-powered robust and efficient android malware detec- tion framework

Zijing Ma, Leming Shen, Xinyu Huang, and Yuanqing Zheng. Poster: Llmalware: An llm-powered robust and efficient android malware detec- tion framework. InProceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, pages 4737–4739, 2025

2025

[47] [47]

Dynamic android malware category classification using semi-supervised deep learning

Samaneh Mahdavifar, Andi Fitriah Abdul Kadir, Rasool Fatemi, Dima Alhadidi, and Ali A Ghorbani. Dynamic android malware category classification using semi-supervised deep learning. In2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on ...

2020

[48] [48]

Bissyand ´e, Jacques Klein, and Yves Le Traon

Kevin Allix, Tegawend ´e F. Bissyand ´e, Jacques Klein, and Yves Le Traon. Androzoo: Collecting millions of android apps for the research community. InProceedings of the 13th International Conference on Mining Software Repositories, MSR ’16, pages 468–471, New York, NY , USA, 2016. ACM

2016

[49] [49]

Toward developing a systematic approach to generate bench- mark android malware datasets and classification

Arash Habibi Lashkari, Andi Fitriah A Kadir, Laya Taheri, and Ali A Ghorbani. Toward developing a systematic approach to generate bench- mark android malware datasets and classification. In2018 International Carnahan conference on security technology (ICCST), pages 1–7. ieee, 2018

2018