arxiv: 2606.16072 · v2 · pith:IECCTJOMnew · submitted 2026-06-15 · 💻 cs.CR · cs.AI

MASCOT-Android: A Curated Dataset and Automated Collection Pipeline for Android Malware Source Code Specimens

Pith reviewed 2026-06-27 04:11 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords Android malware · source code dataset · GitHub collection · README classification · TF-IDF features · LinearSVC classifier · malware source code

The pith

Repository README documentation alone can identify Android malware source code repositories at 96.28% accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that text from GitHub repository README files carries enough distinctive patterns to separate Android malware source code collections from benign ones. It extracts character-level TF-IDF features from thousands of labeled README documents and trains a LinearSVC model that reaches 96.28 percent accuracy with a 1.06 percent false-positive rate. This signal supports an automated pipeline for discovering and curating malware source code without exhaustive manual inspection. The resulting datasets more directly expose attacker intent than binaries or decompiled forms, addressing the scarcity that has limited prior research.
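
To put the two headline numbers in context, the short Python sketch below works out what they would imply for recall on the malware class. It assumes the evaluation split keeps the class ratio of the full corpus (8,772 malware to 25,747 benign), which the abstract does not state; the helper name `implied_recall` is illustrative.

```python
# Counts and metrics as reported in the abstract.
N_MALWARE = 8_772
N_BENIGN = 25_747
ACCURACY = 0.9628
FPR = 0.0106


def implied_recall(accuracy: float, fpr: float, pos_share: float) -> float:
    """Solve accuracy = pos_share*TPR + (1 - pos_share)*(1 - FPR) for TPR."""
    return (accuracy - (1.0 - pos_share) * (1.0 - fpr)) / pos_share


# ASSUMPTION: the evaluation split has the same class ratio as the full corpus.
pos_share = N_MALWARE / (N_MALWARE + N_BENIGN)
majority_baseline = 1.0 - pos_share  # accuracy of always predicting "benign"
recall = implied_recall(ACCURACY, FPR, pos_share)

print(f"malware share:     {pos_share:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
print(f"implied recall:    {recall:.4f}")
```

Under that assumption, always predicting "benign" would already score about 74.6% accuracy, and the reported figures correspond to a malware recall of roughly 88%. The accuracy number is therefore best read together with the FPR rather than on its own.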

Core claim

Repository-level documentation alone provides a strong signal for malware source code collection. Our model extracts character-level TF-IDF features from 8,772 malware and 25,747 benign README documents and trains a LinearSVC classifier to distinguish malware repositories. This README-only model achieves an accuracy of 96.28% and an FPR of 1.06% in local evaluation. In addition, the model outputs confidence scores, allowing users to adjust the decision threshold to balance FPR and coverage, which is practical in real-world malware source code collection.
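
A minimal sketch of this kind of README-only classifier, written with scikit-learn, might look as follows. The toy README strings, the n-gram range, `sublinear_tf`, `C`, and `class_weight` are all illustrative assumptions; the paper's actual preprocessing and hyperparameters are not given in the abstract.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical stand-ins for README texts; 1 = malware repository, 0 = benign.
readmes = [
    "Android RAT with SMS stealer, keylogger and hidden C2 panel",
    "Banking trojan source: overlay injection and OTP interception",
    "Ransomware PoC for Android, locks the screen and encrypts files",
    "Simple weather app built with Kotlin and Jetpack Compose",
    "Open-source podcast player for Android with offline downloads",
    "Material Design note-taking app, available on F-Droid",
]
labels = [1, 1, 1, 0, 0, 0]

# ASSUMPTION: these settings are illustrative, not the paper's configuration.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5), sublinear_tf=True),
    LinearSVC(C=1.0, class_weight="balanced"),
)
model.fit(readmes, labels)

new_readmes = [
    "Android spyware that uploads contacts and SMS to a remote server",
    "Calculator app for Android with a dark theme",
]
# Signed distance to the separating hyperplane; larger means more malware-like.
scores = model.decision_function(new_readmes)
preds = model.predict(new_readmes)
for text, score, pred in zip(new_readmes, scores, preds):
    print(f"{score:+.3f}  pred={pred}  {text}")
```

`LinearSVC` does not produce probabilities, so the signed margin from `decision_function` plays the role of the confidence score here; the default prediction corresponds to a threshold of zero on that margin.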

What carries the argument

The README-only classifier that uses character-level TF-IDF features from repository documentation and a LinearSVC model to flag malware repositories, with confidence scores for threshold tuning.

If this is right

  • An automated pipeline can collect and maintain large-scale Android malware source code datasets from GitHub.
  • Confidence scores let practitioners raise or lower the decision threshold to trade coverage against false positives.
  • Source-code datasets built this way reflect attacker intent more directly than binary or decompiled artifacts.
  • The same documentation-based approach can reduce the cost of curating malware specimens over time.
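
The threshold trade-off in the second bullet can be sketched in plain Python. The scores and labels below are made up for illustration, and "coverage" is assumed here to mean the share of true malware repositories that get flagged; the paper may define it differently.

```python
from typing import List, Tuple


def sweep_thresholds(
    scores: List[float], labels: List[int], thresholds: List[float]
) -> List[Tuple[float, float, float]]:
    """Return (threshold, FPR, coverage) for each threshold.

    A repository is flagged when score >= threshold. FPR is the share of
    benign repositories flagged; coverage is the share of malware flagged.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rows = []
    for t in thresholds:
        flagged = [s >= t for s in scores]
        tp = sum(f and y == 1 for f, y in zip(flagged, labels))
        fp = sum(f and y == 0 for f, y in zip(flagged, labels))
        rows.append((t, fp / n_neg, tp / n_pos))
    return rows


# Hypothetical confidence scores and labels (1 = malware, 0 = benign).
scores = [2.1, 1.4, 0.6, 0.2, -0.1, 0.3, -0.4, -0.9, -1.5, -2.0]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

table = sweep_thresholds(scores, labels, [-0.5, 0.0, 0.5])
for t, fpr, cov in table:
    print(f"threshold={t:+.1f}  FPR={fpr:.2f}  coverage={cov:.2f}")
```

On this toy data, raising the threshold from -0.5 to 0.5 lowers the FPR from 0.40 to 0.00 while coverage falls from 1.00 to 0.60, which is the shape of trade-off a curator would tune against their tolerance for manual review.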

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be tested on repositories for other platforms such as Windows or Linux malware to check generality.
  • Pairing the README classifier with static code analysis might add a second validation layer before inclusion in the dataset.
  • Large source-code collections produced by the pipeline could serve as training data for improved malware detection models.

Load-bearing premise

The ground-truth labels assigned to the 8,772 malware and 25,747 benign repositories used for training and testing are accurate and free of selection bias.

What would settle it

Manually inspecting a random sample of the repositories that the model labels as malware would settle it: a high rate of clearly benign projects in that sample would show that the README signal is not reliable.

Figures

Figures reproduced from arXiv: 2606.16072 by Bojing Li, Charles Maxa, Charles Nicholas, Duo Zhong, Prajna Bhandary, Raguvir S, Robert J Joyce.

Figure 1: Yearly distribution of Android malware source code
Figure 2: Overall architecture of our automated malware …
Figure 3: Processed README files and preserved structure
Figure 4: Confidence-threshold trade-off among FPR, accu…
Figure 5: LLM-annotated code reuse network of Android mal…
Figure 6: Code reuse lineage of an Android malware, with LLM traces observed in the 2024 variant.
Figure 7: Different symbolic information in JAVA code
Original abstract

Compared with binaries and decompiled code, malware source code more directly reflects the attackers' original intent. However, the scarcity of source code and the high cost of manual review make such datasets difficult to build and maintain. We propose MASCOT-Android, a curated dataset of Android malware source code and an automated collection framework for scalable malware source code discovery on GitHub. A key finding of our work is that repository-level documentation alone provides a strong signal for malware source code collection. Our model extracts character-level TF-IDF features from 8,772 malware and 25,747 benign README documents and trains a LinearSVC classifier to distinguish malware repositories. This README-only model achieves an accuracy of 96.28% and an FPR of 1.06% in local evaluation. In addition, the model outputs confidence scores, allowing users to adjust the decision threshold to balance FPR and coverage, which is practical in real-world malware source code collection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents MASCOT-Android, a curated dataset of Android malware source code specimens collected from GitHub together with an automated pipeline for scalable discovery. Its central empirical claim is that repository-level README documentation alone supplies a strong signal: a LinearSVC trained on character-level TF-IDF features from 8,772 malware and 25,747 benign README documents achieves 96.28% accuracy and 1.06% FPR in local evaluation, with confidence scores permitting threshold adjustment for FPR-coverage trade-offs.

Significance. If the ground-truth labels prove reliable and independent of the README text, the result would offer a low-cost, documentation-driven method for expanding malware source-code corpora, directly addressing the scarcity problem highlighted in the abstract and enabling practical deployment via adjustable thresholds.

major comments (1)
  1. [Abstract / Dataset Construction] Abstract and dataset-construction section: the ground-truth labels for the 8,772 malware repositories are stated as given counts but their provenance is never described (no mention of GitHub search heuristics, topic tags, external AV reports, manual review protocol, or any validation step). Because the reported 96.28% accuracy and 1.06% FPR are obtained by training and evaluating on these labels, the absence of labeling methodology renders the central performance claim uninterpretable and potentially circular.
minor comments (1)
  1. [Abstract] The abstract supplies performance numbers but omits any reference to cross-validation procedure, label-noise controls, or feature-extraction hyperparameters; these details should be added even if the labeling method is the primary gap.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thorough review and for highlighting the need for greater transparency in dataset construction. We address the major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract / Dataset Construction] Abstract and dataset-construction section: the ground-truth labels for the 8,772 malware repositories are stated as given counts but their provenance is never described (no mention of GitHub search heuristics, topic tags, external AV reports, manual review protocol, or any validation step). Because the reported 96.28% accuracy and 1.06% FPR are obtained by training and evaluating on these labels, the absence of labeling methodology renders the central performance claim uninterpretable and potentially circular.

    Authors: We agree that the current manuscript lacks a clear description of how the ground-truth labels for the 8,772 malware repositories were obtained. This omission makes it difficult for readers to assess label independence from the README features and to evaluate potential circularity. In the revised manuscript we will insert a new subsection under Dataset Construction that explicitly details: (1) the GitHub search queries and heuristics used to identify candidate malware repositories, (2) any topic tags, repository metadata, or external signals (e.g., references to known malware families or AV scan results) employed for initial labeling, (3) the manual review protocol applied to a sample of repositories, and (4) any quantitative validation steps performed to confirm label quality. These additions will demonstrate that the labels are derived from sources independent of the character-level TF-IDF features extracted from README files, thereby removing the circularity concern and allowing the reported accuracy and FPR figures to be properly interpreted.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper trains a LinearSVC on character-level TF-IDF features extracted from externally labeled README documents (8,772 malware, 25,747 benign) and reports standard accuracy/FPR metrics. No equations, procedures, or self-citations reduce the reported performance to a quantity defined by the fitted parameters themselves, nor do any of the enumerated circularity patterns (self-definitional, fitted-input-as-prediction, load-bearing self-citation, etc.) appear. The central claim rests on the (undescribed) labeling process being independent, but the derivation itself is self-contained against external benchmarks and does not collapse by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the correctness of the ground-truth labels used to train the LinearSVC and on the assumption that README text is a stable, representative signal independent of the actual source code.

free parameters (1)
  • LinearSVC regularization and class-weight parameters
    Standard hyperparameters of the classifier are fitted to the collected README corpus.
axioms (1)
  • [domain assumption] Ground-truth labels for the 8,772 malware and 25,747 benign repositories are accurate and unbiased
    The model is trained and evaluated directly on these labels; any systematic error in labeling would invalidate the reported accuracy.

