arxiv: 2605.09028 · v2 · submitted 2026-05-09 · 💻 cs.LG

Diagnosing and Mitigating Domain Shift in Permission-Based Android Malware Detection

Pith reviewed 2026-05-15 04:57 UTC · model grok-4.3

classification 💻 cs.LG
keywords Android malware detection · domain shift · permission features · cross-domain generalization · explainable AI · hybrid training · ensemble classifiers

The pith

Training on the intersection of common permissions across datasets recovers cross-domain accuracy for Android malware detectors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that permission-based machine learning detectors for Android malware achieve strong results inside their training domain but degrade under domain shift between different data sources. Models exhibit asymmetric failure, with one cross-domain direction dropping sharply while the reverse holds up better, and explainable analysis traces this to unstable, dataset-specific permission importances. Ablation experiments establish that domain-specific artifacts in the full feature set obstruct generalization more than simple feature absence. A hybrid training approach restricted to shared permissions between the two datasets restores usable performance in both directions.

Core claim

The central claim is that predictive permission sets are fundamentally mismatched across domains, so domain-specific artifacts, rather than missing signals, are the dominant cause of poor generalization. Two lines of evidence support this: feature importances vary widely between domains under explainable AI analysis, and in the ablation study, training on the noisy full feature set degrades cross-domain accuracy. Restricting training to the intersection of common features recovers 88 percent accuracy on PerMalDroid and preserves 97 percent on NATICUSdroid.

What carries the argument

The intersection of common permission features across datasets, used as the basis for a hybrid training strategy that stabilizes the predictive set.
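
The intersection step can be illustrated with a minimal sketch. It assumes each app is a dictionary of binary permission flags plus a `label` key, and that "hybrid training" means pooling both datasets after restricting them to the shared columns; the paper's exact procedure is not given on this page, and the permission names `VENDOR_PERM_A` and `CUSTOM_PERM_B` are hypothetical placeholders.

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]  # permission name -> 0/1, plus a "label" key (1 = malware)


def shared_permissions(rows_a: List[Row], rows_b: List[Row]) -> List[str]:
    """Return the permission columns present in both datasets, sorted."""
    cols_a = set().union(*(r.keys() for r in rows_a)) - {"label"}
    cols_b = set().union(*(r.keys() for r in rows_b)) - {"label"}
    return sorted(cols_a & cols_b)


def project(rows: List[Row], features: List[str]) -> Tuple[List[List[int]], List[int]]:
    """Keep only the given features; return (X, y)."""
    X = [[r.get(f, 0) for f in features] for r in rows]
    y = [r["label"] for r in rows]
    return X, y


def hybrid_training_set(rows_a: List[Row], rows_b: List[Row]):
    """Pool both domains after restricting them to the shared feature set.

    Assumption: pooling is one plausible reading of "hybrid training".
    """
    features = shared_permissions(rows_a, rows_b)
    X_a, y_a = project(rows_a, features)
    X_b, y_b = project(rows_b, features)
    return features, X_a + X_b, y_a + y_b


# Toy rows with hypothetical dataset-specific permissions (not real data).
domain_a = [
    {"INTERNET": 1, "READ_SMS": 1, "VENDOR_PERM_A": 1, "label": 1},
    {"INTERNET": 1, "READ_SMS": 0, "VENDOR_PERM_A": 0, "label": 0},
]
domain_b = [
    {"INTERNET": 1, "READ_SMS": 1, "CUSTOM_PERM_B": 0, "label": 1},
    {"INTERNET": 0, "READ_SMS": 0, "CUSTOM_PERM_B": 1, "label": 0},
]

features, X, y = hybrid_training_set(domain_a, domain_b)
print(features)
print(X, y)
```

The resulting `X` and `y` could then be passed to any classifier; the dataset-specific columns never reach the model, which is the point of the strategy.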

If this is right

  • Models trained on full noisy feature sets will generalize poorly across domains for most ensemble classifiers.
  • Feature importance rankings shift substantially between domains, so explanations from one dataset do not transfer.
  • Restricting to shared permissions improves cross-domain robustness without new data collection.
  • Intra-domain accuracy above 92 percent does not ensure reliable deployment on new sources.
  • Ablation on feature noise can diagnose whether artifacts or incompleteness drives the shift.
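
One way to make the claim about shifting importance rankings concrete is to compare mean absolute attribution per permission between two domains. The sketch below assumes per-sample attribution values (for example SHAP values) are already available as plain lists; the numbers are invented for illustration and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def mean_abs_importance(attributions: Dict[str, List[float]]) -> Dict[str, float]:
    """Mean absolute attribution per feature across samples."""
    return {f: sum(abs(v) for v in vals) / len(vals) for f, vals in attributions.items()}


def importance_shift(
    imp_src: Dict[str, float], imp_tgt: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Change in importance (target minus source), largest absolute change first."""
    feats = set(imp_src) | set(imp_tgt)
    deltas = [(f, imp_tgt.get(f, 0.0) - imp_src.get(f, 0.0)) for f in feats]
    # Sort by magnitude, then by name so ties are deterministic.
    return sorted(deltas, key=lambda p: (-abs(p[1]), p[0]))


# Invented attribution values for two domains (illustration only).
src_attr = {"READ_SMS": [0.5, -0.5], "INTERNET": [0.25, 0.25], "SEND_SMS": [0.0, 0.5]}
tgt_attr = {"READ_SMS": [0.125, -0.125], "INTERNET": [0.25, -0.25], "SEND_SMS": [0.5, 0.5]}

shift = importance_shift(mean_abs_importance(src_attr), mean_abs_importance(tgt_attr))
for name, delta in shift:
    print(f"{name}: {delta:+.3f}")
```

Permissions at the top of the list are the ones whose influence changes most between domains, so an explanation built on them in one dataset would be the least likely to transfer.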

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practical detectors should filter to permission features that remain stable across multiple independent sources.
  • The observed asymmetry implies one dataset may contain more broadly representative permission patterns.
  • The intersection method could apply to other security tasks where feature distributions shift over time.
  • Automatic discovery of stable feature intersections might further reduce the need for manual dataset curation.

Load-bearing premise

The performance gaps between the two datasets arise chiefly from genuine differences in permission distributions across real-world app sources rather than from collection methods or labeling differences.

What would settle it

Train the intersection-based hybrid model on a third independent Android dataset collected separately and check whether it sustains at least 85 percent accuracy while single-domain models continue to show large drops.

Figures

Figures reproduced from arXiv: 2605.09028 by Md Rafid Islam.

Figure 1: Training machine learning models to detect Android malware using …
Figure 2: Overview of the experimental workflow showing intra-domain, cross …
Figure 3: Comparison of model accuracy across intra-domain and cross-domain …
Figure 4: Feature intersection analysis showing the overlap and unique …
Figure 5: SHAP violin plots showing the global feature impact distributions for the top 15 most important permissions in intra-domain evaluations.
Figure 6: Quantification of cross-domain feature importance shifts using mean absolute SHAP values. (a) Pd …
Figure 7: SHAP waterfall plots for local interpretation of four instances from the PerMalDroid test set. (a) True positive, (b) true negative, (c) false positive, …

Abstract

Machine learning-based Android malware detectors often fail in real-world deployment due to domain shift, where models trained on one data source perform poorly on applications from another. This paper presents a comprehensive study on the generalizability and interpretability of permission-based detectors under cross-domain conditions. Using two complementary datasets (PerMalDroid and NATICUSdroid) and five ensemble classifiers, we first establish an intra-domain baseline, where models achieve over 92% accuracy, and then quantify a severe asymmetric performance drop. While models trained on PerMalDroid generalize well to NATICUSdroid (86% accuracy), the reverse direction sees a drastic drop to 73% accuracy. Explainable AI analysis reveals bimodal feature distributions and shows that feature importance is highly unstable, with key permissions losing or gaining influence across domains. The predictive feature sets for different domains are fundamentally mismatched, as models rely on different, dataset-specific permissions. Most importantly, an ablation study demonstrates that for most models, training on a noisy feature set leads to poor generalization, confirming that domain-specific artifacts are a greater obstacle than missing features. To mitigate this, we validate a hybrid training strategy based on the intersection of common features and successfully recover cross-domain performance, achieving 88% accuracy on PerMalDroid and maintaining 97% on NATICUSdroid. These findings highlight the importance of explainable, cross-domain-robust malware detection systems and provide a practical pathway toward improving real-world deployment of permission-based Android malware detectors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that domain shift causes asymmetric performance degradation in permission-based Android malware detectors across two datasets (PerMalDroid and NATICUSdroid). Using five ensemble classifiers, it shows intra-domain accuracies >92%, cross-domain drops to 86% and 73%, unstable feature importances via XAI, and that a hybrid model trained on intersecting common features mitigates the shift to achieve 88% and 97% accuracy. An ablation confirms noisy features are more problematic than missing ones.

Significance. If the results hold, this work is significant for demonstrating the practical impact of domain shift on malware detection systems and validating a simple yet effective mitigation strategy. The XAI-driven diagnosis of feature instability and the ablation results provide valuable insights into why cross-domain generalization fails. The concrete accuracy figures on named datasets and the hybrid intersection approach are strengths that could guide more robust, interpretable Android malware detectors for real-world deployment.

major comments (3)
  1. [§3] The central claim that the observed performance differences stem from domain shift rather than dataset construction artifacts requires stronger support. Details on how PerMalDroid and NATICUSdroid were collected, including time windows, app sources, and labeling processes, are essential to validate that the asymmetric drop (86% vs 73%) and the ablation results isolate domain-specific permission artifacts.
  2. [Experimental Setup] Exact specifications of the five ensemble classifiers, including their types, hyperparameters, and the cross-validation procedure, are missing. The accuracy figures are reported without error bars or statistical tests, weakening the evidence for the hybrid strategy's superiority (88% on PerMalDroid, 97% on NATICUSdroid).
  3. [Ablation Study] In the ablation study, the construction of the 'noisy feature set' and its distinction from missing features should be explicitly defined, perhaps with reference to specific permission lists or overlap metrics, to substantiate that domain-specific artifacts are the greater obstacle.
minor comments (2)
  1. [Abstract] Specify the exact intra-domain accuracy values instead of 'over 92%' for precision.
  2. [Tables] Ensure all tables reporting accuracies include standard deviations or confidence intervals for better assessment of result reliability.
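
The error bars and statistical tests requested above could be produced along the following lines. This is a sketch using only the Python standard library; the per-fold accuracies are placeholders rather than results from the paper, and the paired t statistic is one common choice, assumed here for illustration.

```python
import math
import statistics
from typing import List, Tuple


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-fold accuracies."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(a: List[float], b: List[float]) -> float:
    """t statistic for paired samples: mean(d) / (stdev(d) / sqrt(n))."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 folds")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Placeholder per-fold accuracies for two models on the same folds (not real results).
hybrid_folds = [0.88, 0.90, 0.86, 0.89, 0.87]
baseline_folds = [0.74, 0.72, 0.75, 0.71, 0.73]

hybrid_mean, hybrid_sd = summarize(hybrid_folds)
base_mean, base_sd = summarize(baseline_folds)
t_stat = paired_t_statistic(hybrid_folds, baseline_folds)

print(f"hybrid:   {hybrid_mean:.3f} +/- {hybrid_sd:.3f}")
print(f"baseline: {base_mean:.3f} +/- {base_sd:.3f}")
print(f"paired t (df={len(hybrid_folds) - 1}): {t_stat:.2f}")
```

Turning the statistic into a p-value requires the t distribution with n - 1 degrees of freedom, which the standard library does not provide. Cross-validation folds also share training data, so a fold-wise t-test should be read as indicative rather than exact.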

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments that highlight areas for improving clarity and rigor. We address each major point below and will revise the manuscript to incorporate additional details and specifications where feasible.

Point-by-point responses
  1. Referee: [§3] The central claim that the observed performance differences stem from domain shift rather than dataset construction artifacts requires stronger support. Details on how PerMalDroid and NATICUSdroid were collected, including time windows, app sources, and labeling processes, are essential to validate that the asymmetric drop (86% vs 73%) and the ablation results isolate domain-specific permission artifacts.

    Authors: We agree that expanded details on dataset provenance would strengthen the isolation of domain shift effects. In the revised manuscript, we will augment Section 3 with summaries of collection time windows, app sources (e.g., official and third-party markets), and labeling procedures (e.g., multi-scanner thresholds), drawing from the original dataset papers while adding comparative statistics on permission distributions to better support that the asymmetric degradation arises from domain-specific artifacts rather than construction biases. revision: yes

  2. Referee: [Experimental Setup] Exact specifications of the five ensemble classifiers, including their types, hyperparameters, and the cross-validation procedure, are missing. The accuracy figures are reported without error bars or statistical tests, weakening the evidence for the hybrid strategy's superiority (88% on PerMalDroid, 97% on NATICUSdroid).

    Authors: We acknowledge the need for precise experimental details. The revised version will specify the five ensemble classifiers (Random Forest, Gradient Boosting, AdaBoost, XGBoost, LightGBM), list key hyperparameters (e.g., n_estimators, max_depth), describe the 5-fold stratified cross-validation, and report accuracies with standard deviations plus paired statistical tests (e.g., t-tests) to quantify the hybrid model's superiority over baselines. revision: yes

  3. Referee: [Ablation Study] In the ablation study, the construction of the 'noisy feature set' and its distinction from missing features should be explicitly defined, perhaps with reference to specific permission lists or overlap metrics, to substantiate that domain-specific artifacts are the greater obstacle.

    Authors: We will explicitly define the noisy feature set in the revision as the symmetric difference of permission sets across domains (permissions unique to one dataset), contrasted with missing features (those absent in the source but present in the target). We will include Jaccard overlap metrics between feature sets, list example domain-specific permissions (e.g., READ_SMS variants), and reference the ablation results showing greater performance degradation from noisy features to substantiate the claim. revision: yes
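
The definitions proposed in the third response map directly onto set operations. The sketch below follows that wording (noisy features as the symmetric difference, missing features as those present only in the target) and computes the Jaccard overlap; the permission sets are small hypothetical examples, not the datasets' actual feature lists.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap |A ∩ B| / |A ∪ B|; defined as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def partition_features(source: Set[str], target: Set[str]) -> Dict[str, Set[str]]:
    """Split permissions by where they occur, following the rebuttal's wording."""
    return {
        "shared": source & target,   # basis for intersection-based training
        "noisy": source ^ target,    # symmetric difference: unique to one dataset
        "missing": target - source,  # absent in the source, present in the target
    }


# Hypothetical permission sets (VENDOR_PERM_A and CUSTOM_PERM_B are made up).
source_perms = {"INTERNET", "READ_SMS", "SEND_SMS", "VENDOR_PERM_A"}
target_perms = {"INTERNET", "READ_SMS", "CUSTOM_PERM_B"}

parts = partition_features(source_perms, target_perms)
overlap = jaccard(source_perms, target_perms)

print(f"Jaccard overlap: {overlap:.2f}")
for name, perms in parts.items():
    print(name, sorted(perms))
```

Under these definitions the missing set is always a subset of the noisy set, which is worth keeping in mind when reading any ablation that contrasts the two.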

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or self-referential predictions

The paper reports direct experimental results from training five ensemble classifiers on two fixed datasets (PerMalDroid and NATICUSdroid): it measures intra- and cross-domain accuracies (over 92% intra-domain; 86% and 73% in the two cross-domain directions), runs ablation studies on noisy vs. missing features, and evaluates a simple intersection-based hybrid feature set that reaches 88% on PerMalDroid and 97% on NATICUSdroid. No equations, fitted parameters, uniqueness theorems, or self-citations appear as load-bearing steps; all claims reduce to explicit accuracy and feature-importance observations on the chosen data. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Claims rest on standard supervised classification assumptions and the premise that the two named datasets capture distinct domains; no new entities or heavily fitted parameters are introduced in the abstract.

free parameters (1)
  • choice of five ensemble classifiers
    Specific models and any hyperparameter tuning are not detailed; treated as standard choices.
axioms (1)
  • domain assumption: Permission features are sufficient and representative inputs for malware detection
    The entire study is built on permission vectors; no other signals are considered.
