Self-Supervised Learning for Android Malware Detection on a Time-Stamped Dataset

Annan Fu; Hao Pei; Maryam Tanha

arxiv: 2604.23025 · v2 · pith:QFB4EV6Bnew · submitted 2026-04-24 · 💻 cs.CR · cs.LG


Pith reviewed 2026-05-08 11:20 UTC · model grok-4.3


keywords Android malware detection · self-supervised learning · BYOL · time-stamped dataset · temporal bias · obfuscation · malware classification


The pith

Self-supervised pre-training on a time-verified Android app dataset delivers 98% accuracy and 89% F1 for malware detection under realistic time constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Machine-learning detectors for Android malware frequently overstate their performance because they are trained and evaluated on splits that ignore release order, so the training set can contain apps released after the apps in the test set. This paper builds a dataset in which each app's release time is verified to prevent such leakage. It then uses the BYOL self-supervised method to pre-train on app features before fine-tuning a classifier to separate malware from benign apps. The resulting model reaches 98% accuracy and 89% F1 under strict time-aware splits. Readers should care because a deployed detector never sees future data, so an evaluation that ignores time order does not predict performance in practice.
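The time-aware split described above can be sketched as follows. This is a minimal illustration, not the paper's code: the `App` record, its field names, the sample hashes, and the cutoff date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class App:
    # Hypothetical record; field names are illustrative, not the paper's schema.
    sha256: str
    released: date
    is_malware: bool


def time_aware_split(apps, cutoff):
    """Train on apps released before `cutoff`, test on apps released on or after it."""
    train = [a for a in apps if a.released < cutoff]
    test = [a for a in apps if a.released >= cutoff]
    return train, test


def has_temporal_leakage(train, test):
    """True if any training app is as new as, or newer than, the oldest test app."""
    if not train or not test:
        return False
    return max(a.released for a in train) >= min(a.released for a in test)


# Made-up sample data for illustration only.
apps = [
    App("a1", date(2021, 3, 1), False),
    App("b2", date(2021, 9, 15), True),
    App("c3", date(2022, 2, 10), False),
    App("d4", date(2022, 7, 4), True),
]
train, test = time_aware_split(apps, cutoff=date(2022, 1, 1))
```

A random shuffle of the same four apps could place `d4` in training and `a1` in testing, which is exactly the condition `has_temporal_leakage` is meant to flag.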

Core claim

We address this by constructing a time-stamped dataset of benign and malicious Android apps and introducing a timestamp-verification procedure to ensure temporal accuracy. We then propose a detection framework that uses Bootstrap Your Own Latent (BYOL) for self-supervised pre-training to learn obfuscation-resilient representations, followed by supervised classification. Under time-aware evaluation, the method attains 98% accuracy and 89% F1. We further characterize malware behavior by analyzing true positives and false negatives using VirusTotal and the MITRE ATT&CK framework.

What carries the argument

Bootstrap Your Own Latent (BYOL) self-supervised pre-training applied to features from a timestamp-verified Android app dataset, which generates obfuscation-resilient representations for downstream binary classification of benign and malicious apps.
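For readers unfamiliar with BYOL, its two core operations are a regression loss between an online network's prediction and a target network's projection of a different view of the same input, and an exponential-moving-average update of the target weights. The sketch below shows both in plain Python on toy vectors. The function names and the value of `tau` are illustrative assumptions; this summary does not state the paper's encoder, augmentations, or hyperparameters.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def byol_loss(online_prediction, target_projection):
    """Normalized MSE, equal to 2 - 2 * cosine similarity.

    Ranges from 0 (vectors aligned) to 4 (vectors opposite). In training,
    no gradient flows through the target branch.
    """
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(p, z))


def ema_update(target_weights, online_weights, tau=0.99):
    """Target tracks the online network: target <- tau * target + (1 - tau) * online."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_weights, online_weights)]
```

In the full method the loss is computed symmetrically, with the two views swapped between the branches and the results summed. Only the online network is trained by gradient descent.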

If this is right

Detectors can be deployed with greater confidence that performance will not degrade as new apps are released over time.
The released dataset and code enable other researchers to develop and compare methods under consistent temporal constraints.
Behavioral analysis with MITRE ATT&CK identifies specific tactics that cause false negatives and informs targeted improvements.
Self-supervised pre-training reduces the need for large labeled sets while improving robustness to common obfuscation techniques.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The timestamp verification approach could be adapted to improve evaluations in other security domains where data arrives over time, such as network traffic analysis.
The learned representations may generalize to detect emerging malware families by capturing broad patterns rather than specific signatures.
Combining the pre-training with continual learning could allow the detector to update efficiently as fresh apps appear in the wild.

Load-bearing premise

The argument rests on two assumptions: that the timestamp-verification procedure produces a dataset whose temporal distribution matches real-world app release patterns, and that BYOL pre-training yields representations resilient enough to obfuscation for the downstream classification task.

What would settle it

A new evaluation set consisting only of apps released after the latest date in the training data, where the model fails to maintain accuracy near 98% and F1 near 89%, would show the time-aware performance claim does not hold.

Figures

Figures reproduced from arXiv: 2604.23025 by Annan Fu, Hao Pei, Maryam Tanha.

**Figure 1.** Overview of Our Methodology.

Original abstract

Android malware detectors built with machine learning often suffer from temporal bias: models are trained and evaluated without respecting apps' actual release times, inflating accuracy and weakening real-world robustness. We address this by constructing a time-stamped dataset of benign and malicious Android apps and introducing a timestamp-verification procedure to ensure temporal accuracy. We then propose a detection framework that uses Bootstrap Your Own Latent (BYOL) for self-supervised pre-training to learn obfuscation-resilient representations, followed by supervised classification. Under time-aware evaluation, the method attains 98% accuracy and 89% F1. We further characterize malware behavior by analyzing true positives and false negatives using VirusTotal and the MITRE ATT&CK framework. To support reproducibility and further innovation, we release our dataset and source code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper supplies a time-stamped Android malware dataset with verification plus BYOL pre-training for time-aware detection, but the verification step needs close checking to confirm no leakage.


The authors built a dataset of Android apps with verified release timestamps, applied BYOL self-supervised pre-training to learn features that hold up under obfuscation, then fine-tuned a classifier and evaluated it strictly on time-ordered splits. They report 98% accuracy and 89% F1, release the data and code, and use VirusTotal and the MITRE ATT&CK framework to analyze true positives and false negatives. That combination is the concrete output worth noting.

Referee Report

2 major / 2 minor

Summary. The paper claims to address temporal bias in Android malware detection by constructing a time-stamped dataset with a timestamp-verification procedure, applying BYOL self-supervised pre-training for obfuscation-resilient representations, and achieving 98% accuracy and 89% F1 under time-aware evaluation. It includes behavioral analysis of malware using VirusTotal and MITRE ATT&CK, and releases the dataset and code for reproducibility.

Significance. If the temporal separation is rigorously enforced without leakage and the performance is validated, this would meaningfully advance robust Android malware detection by tackling a known source of inflated results in the field. Releasing the dataset and code is a clear strength that supports reproducibility and further work on time-aware protocols.

major comments (2)

[§3] §3 (timestamp-verification procedure): The description does not specify handling of inconsistent metadata sources (e.g., first-seen vs. last-update dates, Google Play vs. third-party), multiple release timestamps per app, or explicit verification against post-release information; this is load-bearing for the central claim that the time-aware split supports the 98% accuracy / 89% F1 result without leakage.
[§5] §5 (evaluation): The reported 98% accuracy and 89% F1 under time-aware evaluation are given without baseline comparisons, ablation results isolating the BYOL pre-training contribution, dataset size, temporal distribution statistics, or split ratios; these omissions prevent verification that the protocol and method deliver the claimed gains.
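The missing dataset statistics matter because accuracy and F1 can diverge sharply when classes are imbalanced. One plausible reading of the gap between 98% accuracy and 89% F1 is that malware is a minority of the test set, but this summary does not report the class ratio. The arithmetic can be checked with a small sketch. The confusion-matrix counts below are hypothetical and are not taken from the paper.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Accuracy over all samples and F1 for the positive (malware) class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Hypothetical counts, NOT from the paper: 1,000 test apps, 100 of them malware.
acc, f1 = accuracy_and_f1(tp=89, fp=11, fn=11, tn=889)
print(f"accuracy={acc:.3f}  f1={f1:.3f}")
```

With these illustrative counts, 22 errors out of 1,000 apps barely move accuracy, yet they amount to roughly one in nine malware samples missed and one in nine alarms being false. This is why the referee's request for class counts and split ratios bears directly on how to read the headline numbers.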

minor comments (2)

[Abstract] Abstract: Lacks dataset size, temporal span, and any mention of baselines or ablations, which would help readers immediately assess the scale and strength of the results.
[§4] §4 (method): The transition from BYOL representations to the supervised classifier could clarify the exact fine-tuning protocol and any freezing of the encoder.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments correctly identify areas where additional methodological detail and experimental context are needed to fully substantiate the central claims regarding temporal integrity and performance gains. We address each point below and will revise the manuscript accordingly.


Referee: [§3] §3 (timestamp-verification procedure): The description does not specify handling of inconsistent metadata sources (e.g., first-seen vs. last-update dates, Google Play vs. third-party), multiple release timestamps per app, or explicit verification against post-release information; this is load-bearing for the central claim that the time-aware split supports the 98% accuracy / 89% F1 result without leakage.

Authors: We agree that the current description of the timestamp-verification procedure in §3 is insufficiently detailed. In the revised manuscript we will expand this section to explicitly describe: (i) our policy for reconciling inconsistent sources by selecting the earliest timestamp that can be corroborated across at least two independent sources (Google Play, VirusTotal first-seen, and third-party archives); (ii) the rule for apps with multiple release timestamps, where we adopt the first-seen date as the canonical release time; and (iii) our post-release verification step, which cross-checks the chosen timestamp against any subsequent VirusTotal or market re-uploads to confirm no later information was used in labeling. We will also add a flowchart and pseudocode for the procedure. These clarifications directly support the no-leakage guarantee of the time-aware split. revision: yes
Referee: [§5] §5 (evaluation): The reported 98% accuracy and 89% F1 under time-aware evaluation are given without baseline comparisons, ablation results isolating the BYOL pre-training contribution, dataset size, temporal distribution statistics, or split ratios; these omissions prevent verification that the protocol and method deliver the claimed gains.

Authors: We acknowledge that §5 currently lacks the supporting statistics and comparisons required for independent verification. In the revision we will add: (1) baseline results using a supervised ResNet without BYOL pre-training and at least one additional SSL method; (2) an ablation table isolating the contribution of the BYOL pre-training stage; (3) dataset summary statistics (total benign/malicious counts, temporal histogram by year, and exact train/validation/test split ratios under the time-aware protocol). These additions will be placed in §5 together with a new results table. Because the dataset and code have already been released, the referee (and readers) will be able to reproduce the exact splits and statistics. revision: yes
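The reconciliation policy in the authors' first response (earliest timestamp corroborated by at least two independent sources) might look roughly like the sketch below. The rebuttal does not define "corroborated", so this sketch assumes two sources agree when their dates fall within a tolerance window. The 30-day default, the source names, and the function name are all hypothetical.

```python
from datetime import date, timedelta


def canonical_release(timestamps, tolerance=timedelta(days=30), min_sources=2):
    """Earliest date backed by at least `min_sources` sources within `tolerance`.

    `timestamps` maps a source name to the date that source reports.
    Returns None when no date is corroborated.
    """
    for candidate in sorted(timestamps.values()):
        # Count sources (including the candidate's own) that agree with it.
        agreeing = [s for s, d in timestamps.items() if abs(d - candidate) <= tolerance]
        if len(agreeing) >= min_sources:
            return candidate
    return None


# Hypothetical source names and dates, for illustration only.
reported = {
    "google_play": date(2021, 6, 20),
    "virustotal_first_seen": date(2021, 6, 1),
    "third_party_archive": date(2019, 1, 1),
}
release = canonical_release(reported)
```

Here the 2019 archive date is the earliest but stands alone, so the function falls through to the VirusTotal date, which the Google Play date supports. Returning `None` for uncorroborated apps leaves the caller to decide whether to drop them, one conservative way to protect a time-aware split.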

Circularity Check

0 steps flagged

No circularity in derivation chain


The paper follows a standard pipeline: construct a timestamped dataset via a verification procedure, apply BYOL self-supervised pre-training (a known external method), then perform supervised fine-tuning and report accuracy/F1 under time-aware splits. No equations, parameters, or predictions are shown to reduce to fitted inputs by construction, and no load-bearing self-citations or uniqueness theorems are invoked. The timestamp-verification step is presented as an empirical safeguard rather than a definitional tautology. The reported 98% accuracy and 89% F1 are externally measured outcomes, not forced by the method's internal definitions. This is a normal non-circular empirical ML paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the work rests on standard machine-learning assumptions about representation learning and data labeling that are not further detailed here.

