Adversarial Vulnerability Under Temporal Concept Drift: A Longitudinal Study of Android Malware Detection
Pith reviewed 2026-05-25 04:15 UTC · model grok-4.3
The pith
As the time gap between training and testing data grows, Android malware detectors lose both accuracy and resistance to adversarial attacks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that temporal separation between training and test data is associated with reduced adversarial robustness in Android malware detection under transfer-based feature-space attacks. As the train-test gap increases, both clean accuracy and adversarial accuracy decline while attack success rates show configuration-dependent increases, especially with FGSM perturbations on static features. Expanding-window retraining mitigates but does not eliminate the robustness loss under ongoing distributional evolution.
What carries the argument
The three deployment protocols—same-year training/testing, cross-year deployment without updates, and expanding-window retraining—combined with temporal linkage metrics (RobustDrop, ΔASR, and Adversarial Amplification Factor) to link distribution shift to robustness degradation.
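As a rough illustration, the three protocols can be read as rules for pairing training years with a test year. The sketch below is a minimal reading under stated assumptions: the function name `protocol_splits` is hypothetical, and the exact pairing rules (for example, that expanding-window retraining tests on the year immediately after the cumulative training window) are guesses, not the paper's stated procedure.

```python
from typing import Dict, List, Tuple

Split = Tuple[List[int], int]  # (training years, test year)

def protocol_splits(years: List[int]) -> Dict[str, List[Split]]:
    """Enumerate hypothetical (train_years, test_year) pairs per protocol."""
    years = sorted(years)
    # (1) Same-year: train and test on disjoint samples of the same slice.
    same_year = [([y], y) for y in years]
    # (2) Cross-year: train once on year y, test on every later year, no updates.
    cross_year = [([y], t) for i, y in enumerate(years) for t in years[i + 1:]]
    # (3) Expanding window: train on all years up to y, test on the next year.
    expanding = [(years[: i + 1], years[i + 1]) for i in range(len(years) - 1)]
    return {"same_year": same_year, "cross_year": cross_year, "expanding": expanding}

splits = protocol_splits([2019, 2020, 2021])
```

The train-test gap for any pair is then the test year minus the latest training year, which is the quantity the linkage metrics are meant to track.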
Load-bearing premise
The three deployment protocols accurately emulate realistic learning scenarios in Android malware detection.
What would settle it
Finding that adversarial accuracy stays stable or rises as the year gap between training and test data widens would falsify the reported association between temporal separation and robustness loss.
Original abstract
We present a longitudinal, drift-aware evaluation of adversarial robustness across more than a decade of Android applications using static and dynamic feature representations extracted from emulator and real-device executions. The dataset is organized into yearly slices and evaluated under three deployment protocols that emulate realistic learning scenarios: (1) same-year training and testing, (2) cross-year deployment without model updates, and (3) expanding-window retraining with cumulative historical data. Across multiple classifier families, adversarial examples are generated using FGSM and SPSA under feasibility constraints. We measure clean performance, Adversarial Accuracy (AA), Attack Success Rate (ASR), and introduce temporal linkage metrics -- RobustDrop, $\Delta$ASR, and Adversarial Amplification Factor (AAF) -- to quantify the relationship between distribution shift and robustness degradation.nResults show that temporal separation is associated with reduced adversarial robustness under the evaluated transfer-based feature-space setting. As the train-test gap increases, clean accuracy and adversarial accuracy decline, while attack success exhibits configuration-dependent increases, particularly under FGSM perturbations and static features. Expanding-window retraining mitigates, but does not eliminate, robustness loss under continued distributional evolution. These findings indicate that temporal drift should be considered when assessing the long-term robustness of intelligent detection systems under evolving data distributions and highlight the need for drift-aware robustness assessment frameworks in long-lived adversarial environments.
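To illustrate what a transfer-based feature-space FGSM attack looks like, the sketch below crafts a one-step perturbation on a surrogate logistic-regression model and evaluates it on a separate target model. Everything here is an assumption for illustration: the weights, the sample, the step size, and the use of simple clipping to [0, 1] as a stand-in for the paper's feasibility constraints, which the abstract does not specify.

```python
import math
from typing import List, Sequence

def score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Linear decision score w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """1 = malware, 0 = benign."""
    return int(score(w, b, x) > 0.0)

def fgsm(w: Sequence[float], b: float, x: Sequence[float],
         y: int, eps: float) -> List[float]:
    """One-step FGSM against a logistic-regression surrogate (w, b)."""
    p = 1.0 / (1.0 + math.exp(-score(w, b, x)))
    grad = [(p - y) * wi for wi in w]          # d(cross-entropy)/dx
    sign = [(g > 0) - (g < 0) for g in grad]
    # Assumed feasibility constraint: features stay inside the [0, 1] box.
    return [min(1.0, max(0.0, xi + eps * s)) for xi, s in zip(x, sign)]

surrogate = ([2.0, -1.0], -0.5)   # hypothetical attacker-side model
target = ([1.5, -0.5], -0.4)      # hypothetical deployed detector
x, y = [0.6, 0.2], 1              # toy malware sample

x_adv = fgsm(*surrogate, x, y, eps=0.3)
clean_pred = predict(*target, x)
adv_pred = predict(*target, x_adv)
```

With these toy numbers the perturbation crafted on the surrogate also flips the target's prediction from 1 to 0, which is the transfer effect the evaluation setting relies on.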
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that temporal concept drift in Android malware detection over >10 years leads to reduced adversarial robustness: as the train-test temporal gap increases under three protocols (same-year, cross-year without updates, expanding-window retraining), clean accuracy and adversarial accuracy decline while ASR rises in a configuration-dependent manner (especially FGSM on static features), quantified via new metrics RobustDrop, ΔASR, and AAF. Expanding-window retraining mitigates but does not eliminate the effect. The evaluation uses static/dynamic features from emulator/real-device runs and transfer-based FGSM/SPSA attacks.
Significance. If the central attribution to temporal drift holds, the work is significant as a rare longitudinal empirical study spanning a decade of real-world data with multiple protocols and feature types. It provides concrete evidence that drift-aware robustness assessment is needed for long-lived adversarial ML systems in security. Credit is due for the scale of the dataset and the attempt to emulate realistic deployment via the three protocols; however, the ad-hoc nature of the invented metrics (RobustDrop, ΔASR, AAF) limits immediate impact without further validation.
major comments (2)
- [Dataset and feature extraction] Dataset and feature extraction description (abstract and methods): the paper does not specify whether emulator/OS versions, API levels, or instrumentation are held constant across the >10-year span or updated yearly to match app vintages. If the latter (common for realism), observed drops in accuracy/robustness and rises in ASR could be partly artifacts of a drifting measurement pipeline rather than malware distribution shift alone; this directly undermines the central claim that temporal separation causes reduced robustness, as static features are also reported to show configuration-dependent ASR increases.
- [Abstract and deployment protocols] Abstract and results on the three protocols: the claim that expanding-window retraining 'mitigates, but does not eliminate, robustness loss' and that the protocols 'emulate realistic learning scenarios' is load-bearing for the practical implications, yet no validation or comparison to actual industry deployment practices is provided. Without this, the mitigation findings cannot be confidently generalized beyond the specific experimental setup.
minor comments (2)
- [Abstract] Abstract: 'nResults show' is a typographical error and should read 'Results show'.
- [Metrics definition] The new metrics (RobustDrop, ΔASR, AAF) are introduced without explicit mathematical definitions or comparison to standard measures; this should be added for reproducibility even if they remain ad-hoc.
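To show the kind of explicit definition the second minor comment asks for, the sketch below writes out Adversarial Accuracy and Attack Success Rate in their common forms, plus one plausible reading of RobustDrop and ΔASR as differences against a zero-gap reference. The difference forms and all function names are assumptions, since the text reviewed here gives no formulas. AAF is left out because its form cannot be recovered from the text.

```python
from typing import Sequence

def adversarial_accuracy(y_true: Sequence[int], y_adv: Sequence[int]) -> float:
    """Fraction of samples still classified correctly after the attack."""
    return sum(t == a for t, a in zip(y_true, y_adv)) / len(y_true)

def attack_success_rate(y_true: Sequence[int], y_clean: Sequence[int],
                        y_adv: Sequence[int]) -> float:
    """Fraction of initially correct predictions that the attack makes wrong."""
    attacked = [(t, a) for t, c, a in zip(y_true, y_clean, y_adv) if c == t]
    if not attacked:
        return 0.0
    return sum(a != t for t, a in attacked) / len(attacked)

def robust_drop(aa_ref: float, aa_gap: float) -> float:
    """ASSUMED form: loss in adversarial accuracy relative to the zero-gap case."""
    return aa_ref - aa_gap

def delta_asr(asr_ref: float, asr_gap: float) -> float:
    """ASSUMED form: rise in attack success relative to the zero-gap case."""
    return asr_gap - asr_ref

y_true = [1, 1, 0, 0, 1]
y_clean = [1, 1, 0, 1, 1]   # predictions on unperturbed inputs
y_adv = [0, 1, 0, 1, 0]     # predictions on adversarial inputs
aa = adversarial_accuracy(y_true, y_adv)
asr = attack_success_rate(y_true, y_clean, y_adv)
```

On this toy example two of five samples survive the attack and two of the four initially correct predictions are flipped.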
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our manuscript. We provide detailed responses to each major comment below.
-
Referee: [Dataset and feature extraction] Dataset and feature extraction description (abstract and methods): the paper does not specify whether emulator/OS versions, API levels, or instrumentation are held constant across the >10-year span or updated yearly to match app vintages. If the latter (common for realism), observed drops in accuracy/robustness and rises in ASR could be partly artifacts of a drifting measurement pipeline rather than malware distribution shift alone; this directly undermines the central claim that temporal separation causes reduced robustness, as static features are also reported to show configuration-dependent ASR increases.
Authors: The referee correctly notes that the paper does not specify the details of the feature extraction pipeline across years. To achieve a realistic longitudinal study, the extraction process was updated yearly to align with contemporary Android API levels and emulator versions for each slice. This is standard practice for such studies to avoid artificial constraints. While this introduces a potential confounding factor, the central claim focuses on the impact of temporal separation in data, which includes both app evolution and the necessary adaptation of the detection environment. We will add a detailed description of the pipeline in the methods section and discuss the implications for interpreting the results.
Revision: partial
-
Referee: [Abstract and deployment protocols] Abstract and results on the three protocols: the claim that expanding-window retraining 'mitigates, but does not eliminate, robustness loss' and that the protocols 'emulate realistic learning scenarios' is load-bearing for the practical implications, yet no validation or comparison to actual industry deployment practices is provided. Without this, the mitigation findings cannot be confidently generalized beyond the specific experimental setup.
Authors: We acknowledge that the manuscript lacks explicit validation or comparison to industry deployment practices. The protocols are motivated by standard approaches in handling temporal drift in machine learning for security. We will revise the abstract to qualify the claims about emulation of realistic scenarios and add a section in the discussion addressing the limitations regarding generalization to industry settings.
Revision: yes
Circularity Check
No circularity: empirical metrics defined directly from observed performance differences
The paper is a longitudinal empirical study that reports clean accuracy, adversarial accuracy, ASR, and newly introduced metrics (RobustDrop, ΔASR, AAF) computed from measured performance under three explicit deployment protocols on yearly data slices. No equations, derivations, or fitted parameters are presented whose outputs reduce to the inputs by construction. The central claim is an observed association between temporal gap and robustness degradation; the protocols are defined operationally rather than derived. Self-citations are absent from the provided text, and no uniqueness theorems or ansatzes are invoked. This matches the default expectation for non-circular empirical work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The three deployment protocols emulate realistic learning scenarios
invented entities (1)
- RobustDrop, ΔASR, and Adversarial Amplification Factor (AAF): no independent evidence