Diagnosing and Mitigating Domain Shift in Permission-Based Android Malware Detection
Pith reviewed 2026-05-15 04:57 UTC · model grok-4.3
The pith
Training on the intersection of common permissions across datasets recovers cross-domain accuracy for Android malware detectors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the predictive permission sets are fundamentally mismatched across domains, so domain-specific artifacts, rather than missing signals, are the dominant cause of poor generalization. Two lines of evidence support this. First, feature importances vary widely across domains under explainable-AI analysis. Second, in the ablation study the noisy full feature sets degrade cross-domain accuracy, whereas restricting training to the intersection of common features recovers 88 percent accuracy on PerMalDroid and preserves 97 percent on NATICUSdroid.
What carries the argument
The intersection of common permission features across datasets, used as the basis for a hybrid training strategy that stabilizes the predictive set.
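The intersection step can be sketched as follows. This is a minimal illustration, not the paper's code: the record layout (a dict of 0/1 permission flags plus a label), the helper names `shared_permissions` and `project`, the toy permission values, and the choice to pool both domains after projection are all assumptions made for the example.

```python
def shared_permissions(*datasets):
    """Return the sorted permission names that occur in every dataset."""
    column_sets = []
    for rows in datasets:
        cols = set()
        for features, _label in rows:
            cols.update(features)
        column_sets.append(cols)
    return sorted(set.intersection(*column_sets))


def project(rows, columns):
    """Re-encode rows as fixed-order 0/1 vectors over `columns` only."""
    return [([features.get(c, 0) for c in columns], label)
            for features, label in rows]


# Hypothetical toy records: (permission flags, label), with 1 = malware.
domain_a = [
    ({"INTERNET": 1, "READ_SMS": 1, "VENDOR_X": 1}, 1),
    ({"INTERNET": 1, "READ_SMS": 0, "VENDOR_X": 0}, 0),
]
domain_b = [
    ({"INTERNET": 1, "READ_SMS": 1, "CAMERA": 0}, 1),
    ({"INTERNET": 0, "READ_SMS": 0, "CAMERA": 1}, 0),
]

# Keep only permissions that both domains report, then pool the rows.
common = shared_permissions(domain_a, domain_b)
hybrid_train = project(domain_a, common) + project(domain_b, common)

print(common)
print(hybrid_train)
```

Any classifier can then be fit on `hybrid_train`. The permissions dropped here (`VENDOR_X`, `CAMERA`) stand in for the dataset-specific features that the review identifies as the main source of cross-domain error.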
If this is right
- For most ensemble classifiers, models trained on the full, noisy feature set will generalize poorly across domains.
- Feature importance rankings shift substantially between domains, so explanations from one dataset do not transfer.
- Restricting to shared permissions improves cross-domain robustness without new data collection.
- Intra-domain accuracy above 92 percent does not ensure reliable deployment on new sources.
- Ablation on feature noise can diagnose whether artifacts or incompleteness drives the shift.
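One simple way to quantify the ranking shift named in the list above is the overlap between the top-k features of two importance rankings. The sketch below assumes importance scores are already available as name-to-score dicts from some attribution method. The scores, the permission names, and the helpers `top_k` and `top_k_overlap` are hypothetical and are not taken from the paper.

```python
def top_k(importances, k):
    """Names of the k features with the largest importance scores."""
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(importances, key=lambda name: (-importances[name], name))
    return set(ranked[:k])


def top_k_overlap(imp_a, imp_b, k):
    """Fraction of the top-k features shared by two importance rankings."""
    return len(top_k(imp_a, k) & top_k(imp_b, k)) / k


# Hypothetical importance scores for the same model family in two domains.
imp_domain_a = {"READ_SMS": 0.40, "SEND_SMS": 0.30, "INTERNET": 0.20, "CAMERA": 0.10}
imp_domain_b = {"READ_SMS": 0.05, "SEND_SMS": 0.35, "INTERNET": 0.15, "CAMERA": 0.45}

overlap = top_k_overlap(imp_domain_a, imp_domain_b, k=2)
print(overlap)
```

An overlap near 1 would indicate explanations that transfer between domains. Values well below 1, as in this toy case, correspond to the instability the review describes.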
Where Pith is reading between the lines
- Practical detectors should filter to permission features that remain stable across multiple independent sources.
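- The observed asymmetry (86 percent when training on PerMalDroid and testing on NATICUSdroid, 73 percent in the reverse direction) suggests that PerMalDroid may contain more broadly representative permission patterns.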
- The intersection method could apply to other security tasks where feature distributions shift over time.
- Automatic discovery of stable feature intersections might further reduce the need for manual dataset curation.
Load-bearing premise
The performance gaps between the two datasets arise chiefly from genuine differences in permission distributions across real-world app sources rather than from collection methods or labeling differences.
What would settle it
Train the intersection-based hybrid model on a third independent Android dataset collected separately and check whether it sustains at least 85 percent accuracy while single-domain models continue to show large drops.
Original abstract
Machine learning-based Android malware detectors often fail in real-world deployment due to domain shift, where models trained on one data source perform poorly on applications from another. This paper presents a comprehensive study on the generalizability and interpretability of permission-based detectors under cross-domain conditions. Using two complementary datasets (PerMalDroid and NATICUSdroid) and five ensemble classifiers, we first establish an intra-domain baseline, where models achieve over 92% accuracy, and then quantify a severe asymmetric performance drop. While models trained on PerMalDroid generalize well to NATICUSdroid (86% accuracy), the reverse direction sees a drastic drop to 73% accuracy. Explainable AI analysis reveals bimodal feature distributions and shows that feature importance is highly unstable, with key permissions losing or gaining influence across domains. The predictive feature sets for different domains are fundamentally mismatched, as models rely on different, dataset-specific permissions. Most importantly, an ablation study demonstrates that for most models, training on a noisy feature set leads to poor generalization, confirming that domain-specific artifacts are a greater obstacle than missing features. To mitigate this, we validate a hybrid training strategy based on the intersection of common features and successfully recover cross-domain performance, achieving 88% accuracy on PerMalDroid and maintaining 97% on NATICUSdroid. These findings highlight the importance of explainable, cross-domain-robust malware detection systems and provide a practical pathway toward improving real-world deployment of permission-based Android malware detectors.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that domain shift causes asymmetric performance degradation in permission-based Android malware detectors across two datasets (PerMalDroid and NATICUSdroid). Using five ensemble classifiers, it reports intra-domain accuracies above 92%, cross-domain accuracies of 86% (trained on PerMalDroid, tested on NATICUSdroid) and 73% (the reverse direction), and feature importances that are unstable under XAI analysis. A hybrid model trained on the intersection of common features mitigates the shift, reaching 88% accuracy on PerMalDroid and 97% on NATICUSdroid. An ablation indicates that noisy features are more problematic than missing ones.
Significance. If the results hold, this work is significant for demonstrating the practical impact of domain shift on malware detection systems and validating a simple yet effective mitigation strategy. The XAI-driven diagnosis of feature instability and the ablation results provide valuable insights into why cross-domain generalization fails. The concrete accuracy figures on named datasets and the hybrid intersection approach are strengths that could guide more robust, interpretable Android malware detectors for real-world deployment.
major comments (3)
- [§3] The central claim that the observed performance differences stem from domain shift rather than dataset construction artifacts requires stronger support. Details on how PerMalDroid and NATICUSdroid were collected, including time windows, app sources, and labeling processes, are essential to validate that the asymmetric drop (86% vs 73%) and the ablation results isolate domain-specific permission artifacts.
- [Experimental Setup] Exact specifications of the five ensemble classifiers, including their types, hyperparameters, and the cross-validation procedure, are missing. The accuracy figures are reported without error bars or statistical tests, weakening the evidence for the hybrid strategy's superiority (88% on PerMalDroid, 97% on NATICUSdroid).
- [Ablation Study] In the ablation study, the construction of the 'noisy feature set' and its distinction from missing features should be explicitly defined, perhaps with reference to specific permission lists or overlap metrics, to substantiate that domain-specific artifacts are the greater obstacle.
minor comments (2)
- [Abstract] Specify the exact intra-domain accuracy values instead of 'over 92%' for precision.
- [Tables] Ensure all tables reporting accuracies include standard deviations or confidence intervals for better assessment of result reliability.
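The reporting requested in these comments (a mean with a standard deviation and a confidence interval per table cell) can be sketched with the standard library alone. The fold accuracies below are placeholder numbers, not results from the paper, and the helper name `summarize` is hypothetical. The default critical value 2.776 is the two-sided 95% Student t quantile for 4 degrees of freedom, which assumes exactly five folds.

```python
import math
import statistics


def summarize(fold_accuracies, critical=2.776):
    """Return (mean, sample std, 95% CI) for per-fold accuracies.

    `critical` defaults to the t quantile for 5 folds (4 degrees of
    freedom); pass a different value for a different number of folds.
    """
    mean = statistics.mean(fold_accuracies)
    std = statistics.stdev(fold_accuracies)  # sample std (n - 1 denominator)
    half_width = critical * std / math.sqrt(len(fold_accuracies))
    return mean, std, (mean - half_width, mean + half_width)


# Placeholder accuracies from a hypothetical 5-fold run.
folds = [0.90, 0.92, 0.91, 0.93, 0.89]
mean, std, (low, high) = summarize(folds)
print(f"{mean:.3f} +/- {std:.3f} (95% CI {low:.3f} to {high:.3f})")
```

Intervals computed from cross-validation folds are only approximate, because the folds share training data. They are still more informative than a bare point estimate.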
Simulated Author's Rebuttal
We thank the referee for the constructive comments that highlight areas for improving clarity and rigor. We address each major point below and will revise the manuscript to incorporate additional details and specifications where feasible.
- Referee: [§3] The central claim that the observed performance differences stem from domain shift rather than dataset construction artifacts requires stronger support. Details on how PerMalDroid and NATICUSdroid were collected, including time windows, app sources, and labeling processes, are essential to validate that the asymmetric drop (86% vs 73%) and the ablation results isolate domain-specific permission artifacts.
Authors: We agree that expanded details on dataset provenance would strengthen the isolation of domain shift effects. In the revised manuscript, we will augment Section 3 with summaries of collection time windows, app sources (e.g., official and third-party markets), and labeling procedures (e.g., multi-scanner thresholds), drawing from the original dataset papers while adding comparative statistics on permission distributions to better support that the asymmetric degradation arises from domain-specific artifacts rather than construction biases. revision: yes
- Referee: [Experimental Setup] Exact specifications of the five ensemble classifiers, including their types, hyperparameters, and the cross-validation procedure, are missing. The accuracy figures are reported without error bars or statistical tests, weakening the evidence for the hybrid strategy's superiority (88% on PerMalDroid, 97% on NATICUSdroid).
Authors: We acknowledge the need for precise experimental details. The revised version will specify the five ensemble classifiers (Random Forest, Gradient Boosting, AdaBoost, XGBoost, LightGBM), list key hyperparameters (e.g., n_estimators, max_depth), describe the 5-fold stratified cross-validation, and report accuracies with standard deviations plus paired statistical tests (e.g., t-tests) to quantify the hybrid model's superiority over baselines. revision: yes
- Referee: [Ablation Study] In the ablation study, the construction of the 'noisy feature set' and its distinction from missing features should be explicitly defined, perhaps with reference to specific permission lists or overlap metrics, to substantiate that domain-specific artifacts are the greater obstacle.
Authors: We will explicitly define the noisy feature set in the revision as the symmetric difference of permission sets across domains (permissions unique to one dataset), contrasted with missing features (those absent in the source but present in the target). We will include Jaccard overlap metrics between feature sets, list example domain-specific permissions (e.g., READ_SMS variants), and reference the ablation results showing greater performance degradation from noisy features to substantiate the claim. revision: yes
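The set definitions in this response map directly onto Python set operations. The sketch below is illustrative only: the permission names are made up, and `jaccard` is a hypothetical helper written for the example, not a library call.

```python
def jaccard(a, b):
    """Jaccard overlap |A & B| / |A | B| between two feature sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical permission vocabularies of a source and a target dataset.
source = {"INTERNET", "READ_SMS", "VENDOR_X"}
target = {"INTERNET", "READ_SMS", "CAMERA", "RECORD_AUDIO"}

shared = source & target             # intersection used for hybrid training
domain_specific = source ^ target    # symmetric difference: the "noisy" set
missing_in_source = target - source  # absent in the source, present in the target
overlap = jaccard(source, target)

print(sorted(shared))
print(sorted(domain_specific))
print(sorted(missing_in_source))
print(overlap)
```

Under these definitions the missing features are a subset of the domain-specific (noisy) set, so an ablation that contrasts the two needs to state which side of the symmetric difference each condition removes.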
Circularity Check
No circularity: purely empirical measurements with no derivations or self-referential predictions
full rationale
The paper reports direct experimental results from training five ensemble classifiers on two fixed datasets (PerMalDroid and NATICUSdroid): intra-domain accuracies above 92%, asymmetric cross-domain accuracies of 86% and 73%, an ablation study on noisy versus missing features, and an intersection-based hybrid feature set that recovers 88% and 97% accuracy. No equations, fitted parameters, uniqueness theorems, or self-citations appear as load-bearing steps; all claims reduce to explicit accuracy and feature-importance observations on the chosen data. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
free parameters (1)
- choice of five ensemble classifiers
axioms (1)
- domain assumption: Permission features are sufficient and representative inputs for malware detection