
arxiv: 2604.05642 · v1 · submitted 2026-04-07 · 💻 cs.CR

T2T: Captioning Smartphone Activities Using Mobile Traffic

Pith reviewed 2026-05-10 19:01 UTC · model grok-4.3

classification 💻 cs.CR
keywords smartphone activity captioning · encrypted traffic analysis · traffic-to-text · encoder-decoder model · vision-language model · mobile app usage · cross-modal training

The pith

T2T turns encrypted mobile traffic into readable captions for smartphone activities

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes generating textual descriptions of smartphone activities and user interactions directly from encrypted mobile traffic, moving beyond traditional classification methods that offer limited scalability and poor readability. The core challenges are the semantic gap between raw traffic features and high-level activity descriptions plus the absence of text-annotated traffic datasets. T2T addresses these with a flow feature encoder that maps low-level traffic into latent representations and a caption decoder that produces natural language transcripts, while automatically creating training labels by running synchronized screen videos through a vision-language model and applying multi-stage losses. On a dataset of 40,000 real-world pairs collected from 8 users and 20 apps, the system delivers strong scores across standard captioning metrics and produces outputs comparable in semantic accuracy to direct visual analysis.

Core claim

T2T is a traffic-to-text system built from a flow feature encoder that converts encrypted traffic characteristics into latent features and a caption decoder that outputs readable activity descriptions. Automatic annotation of training data is performed by feeding synchronized screen-capture videos into the Qwen-VL-Max vision-language model and training with multi-stage losses for cross-modal alignment. Evaluated on 40,000 traffic-description pairs gathered in two real-world environments, T2T records BLEU-4 of 58.1, METEOR of 38.3, ROUGE-L of 70.5 and CIDEr of 108.7, generating semantically accurate captions that match the quality of the vision-language model itself.
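The automatic annotation step depends on the packet capture and the screen recording sharing a clock, so that each VLM caption can be attached to the traffic observed during the same clip. A minimal sketch of that pairing follows. The record types `Packet` and `Caption`, the function `pair_windows`, and the 5-second clip length are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Packet:
    ts: float       # capture timestamp in seconds (assumed shared clock)
    size: int       # packet length in bytes
    outbound: bool  # True if sent by the phone

@dataclass
class Caption:
    start: float    # start of the screen-video clip in seconds
    end: float      # end of the clip (exclusive)
    text: str       # caption produced by the vision-language model

def pair_windows(packets: List[Packet],
                 captions: List[Caption]) -> List[Tuple[List[Packet], str]]:
    """Attach to each caption the packets captured during its clip."""
    pairs = []
    for cap in captions:
        window = [p for p in packets if cap.start <= p.ts < cap.end]
        if window:  # skip clips with no observed traffic
            pairs.append((window, cap.text))
    return pairs

# Hypothetical toy data: three clips, traffic present in only two of them.
packets = [Packet(0.2, 1400, False), Packet(0.9, 80, True), Packet(5.1, 600, False)]
captions = [Caption(0.0, 5.0, "user scrolls a video feed"),
            Caption(5.0, 10.0, "user opens a product page"),
            Caption(10.0, 15.0, "user idles on the home screen")]
pairs = pair_windows(packets, captions)
```

In this toy run the third clip is dropped because no packet falls inside it. A real pipeline would also have to decide how to handle clock drift between the two recordings, which this sketch ignores.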

What carries the argument

The T2T encoder-decoder architecture in which the flow feature encoder transforms low-level encrypted traffic into meaningful latent features and the caption decoder produces readable activity transcripts, trained with multi-stage losses on labels from a vision-language model.
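The summary does not list which traffic characteristics the flow feature encoder consumes. As a hedged illustration of what "low-level traffic" input might look like before any learned encoding, the sketch below reduces one flow to a fixed-length vector of simple statistics. The chosen statistics, the tuple layout, and the name `flow_features` are assumptions made for this example only.

```python
from statistics import mean
from typing import List, Tuple

# Hypothetical packet record: (timestamp_seconds, size_bytes, outbound)
PacketRecord = Tuple[float, int, bool]

def flow_features(flow: List[PacketRecord]) -> List[float]:
    """Summarize one flow as a fixed-length vector of simple statistics."""
    if not flow:
        return [0.0] * 5
    times = sorted(ts for ts, _, _ in flow)
    gaps = [b - a for a, b in zip(times, times[1:])]
    bytes_out = sum(size for _, size, out in flow if out)
    bytes_in = sum(size for _, size, out in flow if not out)
    return [
        float(len(flow)),                          # packet count
        float(bytes_out),                          # bytes sent by the phone
        float(bytes_in),                           # bytes received by the phone
        float(mean(size for _, size, _ in flow)),  # mean packet size
        float(mean(gaps)) if gaps else 0.0,        # mean inter-arrival time
    ]

flow = [(0.0, 100, True), (0.5, 1400, False), (1.5, 1400, False)]
features = flow_features(flow)
```

None of these statistics require decrypting payloads, which is the property the traffic-to-text setting relies on. A learned encoder would map vectors like this one (or richer per-packet sequences) into the latent space the caption decoder reads.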

If this is right

  • Enables scalable activity monitoring across many apps without installing per-app classifiers or accessing device screens directly.
  • Produces human-readable text outputs that describe detailed user interactions rather than coarse category labels.
  • Allows automatic creation of large-scale annotated traffic datasets by leveraging existing vision-language models on screen recordings.
  • Shows that encrypted network flows alone can convey sufficient information for high-level semantic reconstruction of smartphone usage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could support real-time activity logging for accessibility or parental-control applications.
  • It surfaces privacy risks because detailed user behavior can be inferred from traffic patterns without breaking encryption.
  • Similar traffic-to-text pipelines might extend to other encrypted channels such as IoT device communications or web browsing.

Load-bearing premise

The encoder-decoder architecture can reliably bridge the semantic gap between low-level encrypted traffic features and high-level activity descriptions when trained on automatically generated labels from a vision-language model applied to screen videos.

What would settle it

A substantial drop in all captioning metrics when T2T is tested on traffic collected from new users, different smartphone hardware, or apps outside the original set of 20 without retraining or additional screen-video labels.
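One way to run that test is a leave-one-group-out protocol, holding out each user (or each app) in turn and comparing metrics against the in-distribution scores. The sketch below shows only the splitting logic. The record layout and the name `leave_one_group_out` are hypothetical, and nothing here reflects how the paper actually partitioned its data.

```python
from typing import Dict, Iterator, List, Tuple

Sample = Dict[str, str]  # hypothetical record with "user", "app", "caption" keys

def leave_one_group_out(
    samples: List[Sample], key: str
) -> Iterator[Tuple[str, List[Sample], List[Sample]]]:
    """Yield (held_out_value, train, test) with one user or app held out per fold."""
    for held_out in sorted({s[key] for s in samples}):
        train = [s for s in samples if s[key] != held_out]
        test = [s for s in samples if s[key] == held_out]
        yield held_out, train, test

samples = [
    {"user": "u1", "app": "video", "caption": "user watches a clip"},
    {"user": "u1", "app": "shop", "caption": "user browses items"},
    {"user": "u2", "app": "video", "caption": "user pauses a clip"},
    {"user": "u3", "app": "shop", "caption": "user adds item to cart"},
]
folds = list(leave_one_group_out(samples, key="user"))
```

Passing `key="app"` instead gives the unseen-app variant of the same test. Because every fold keeps the held-out group entirely out of training, a large gap between these scores and the reported ones would indicate memorization of user- or app-specific traffic patterns.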

Figures

Figures reproduced from arXiv: 2604.05642 by Jiyu Liu, Wanqing Tu, Yanzhao Lu, Yong Huang, Yun Tie.

Figure 1: Illustration of a “Traffic-to-Text” system.
Figure 2: Overview of T2T. It contains a flow feature encoder, a caption decoder, and a cross-modal annotation and training scheme.
Figure 3: Experimental setups during data collection.
Figure 5: Impact of the flow number. [Plot: metrics B@4, M, R, C on the x-axis; scale 0 to 100; legend Prototype = 1, 3, 5, 7, 9]
Figure 7: Illustrations of captions generated by Qwen-VL-Max, the baseline model, and T2T. They cover types of mobile apps, including video, shopping, and
Original abstract

This paper studies the creation of textual descriptions of user activities and interactions on smartphones. Our approach of referring to encrypted mobile traffic exceeds traditional smartphone activity classification methods in terms of model scalability and output readability. The paper addresses two obstacles to the realization of this idea: the semantic gap between traffic features and smartphone activity captions, and the lack of textually annotated traffic data. To overcome these challenges, we introduce a novel smartphone activity captioning system, called T2T (Traffic-to-Text). T2T consists of a flow feature encoder that converts low-level traffic characteristics into meaningful latent features and a caption decoder to yield readable transcripts of smartphone activities. In addition, T2T achieves the automatic textual annotation of mobile traffic by feeding synchronized screen capture videos into the Qwen-VL-Max vision-language model, and proposing multi-stage losses for effective cross-model training. We evaluate T2T on 40,000 traffic-description pairs collected in two real-world environments, involving 8 smartphone users and 20 mobile apps. T2T achieves a BLEU-4 score of 58.1, a METEOR score of 38.3, a ROUGE-L score of 70.5, and a CIDEr score of 108.7. The quantitative and qualitative analyses show that T2T can generate semantically accurate captions that are comparable to the vision-language model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes T2T, an encoder-decoder architecture that generates natural-language captions describing smartphone user activities and app interactions directly from encrypted mobile traffic flows. It addresses the semantic gap and lack of labeled data by using Qwen-VL-Max to automatically produce textual labels from synchronized screen-capture videos, trains on a 40,000-pair dataset collected from 8 real users across 20 apps in two environments, and reports BLEU-4=58.1, METEOR=38.3, ROUGE-L=70.5, CIDEr=108.7, claiming the resulting captions are semantically accurate and comparable to the vision-language model.

Significance. If independently validated, the approach would offer a scalable, privacy-preserving alternative to screen-based or app-specific activity logging by producing readable textual descriptions from traffic alone. The automatic labeling strategy using a VLM is a practical contribution for dataset creation in this domain. However, the current results primarily quantify fidelity to the VLM's pseudo-labels rather than verified semantic correctness of the activity descriptions.

major comments (3)
  1. [Abstract / Evaluation] Abstract and evaluation description: the reported BLEU/METEOR/ROUGE-L/CIDEr scores are computed against labels automatically generated by Qwen-VL-Max with no human annotation, inter-rater reliability, or error analysis provided for those pseudo-labels. Consequently the metrics demonstrate how well the traffic encoder reproduces the VLM outputs rather than establishing that the captions correctly describe the underlying smartphone activities.
  2. [Evaluation] Evaluation section: no baseline models (e.g., traffic-only classifiers, simpler sequence models, or direct VLM comparison on traffic features) or statistical significance tests are reported for the metric scores on the 40,000-pair dataset. This leaves the improvement claim unanchored.
  3. [Data collection / Experiments] Data collection and experimental setup: the manuscript does not discuss potential data leakage between train and test splits or user/app overlap in the 40k pairs collected from real users, which is critical given the synchronized video-traffic pairing and the central claim of generalizable captioning.
minor comments (1)
  1. [Abstract] The abstract could more explicitly distinguish between fidelity to VLM-generated labels and independent ground-truth accuracy of the activity descriptions.
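The distinction between fidelity and accuracy can be made concrete with clipped n-gram precision, the building block of BLEU. The sketch below scores one generated caption against two different references. All three captions are invented for illustration, and the function omits BLEU's brevity penalty and its geometric mean over n-gram orders, so the numbers are not comparable to the scores reported in the paper.

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision of a candidate against a single reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram counts at most as often as it appears in the reference.
    matched = sum(min(count, ref_ngrams[g]) for g, count in cand_ngrams.items())
    return matched / total

model_caption = "the user scrolls a short video feed"
vlm_label = "the user scrolls a short video feed"      # pseudo-label (hypothetical)
human_label = "the user reads comments under a video"  # human annotation (hypothetical)

fidelity = ngram_precision(model_caption, vlm_label, 2)    # agreement with the VLM
accuracy = ngram_precision(model_caption, human_label, 2)  # agreement with a human
```

Here the caption reproduces the pseudo-label exactly yet shares only one bigram with the human description. If the VLM label is wrong, a high score against it says nothing about whether the activity was described correctly.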

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our paper. We have carefully considered each major comment and provide point-by-point responses below, along with plans for revisions to address the concerns raised.

Point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and evaluation description: the reported BLEU/METEOR/ROUGE-L/CIDEr scores are computed against labels automatically generated by Qwen-VL-Max with no human annotation, inter-rater reliability, or error analysis provided for those pseudo-labels. Consequently the metrics demonstrate how well the traffic encoder reproduces the VLM outputs rather than establishing that the captions correctly describe the underlying smartphone activities.

    Authors: We appreciate the referee pointing out this key aspect of our evaluation. The metrics are indeed computed against the pseudo-labels generated by Qwen-VL-Max, as this VLM serves as our source of automatic annotations for the large-scale dataset. This approach allows us to scale the training data significantly. We did include qualitative examples in the manuscript demonstrating that the captions align with the actual activities observed in the screen captures. However, we agree that a more rigorous validation of the pseudo-label quality and human assessment of T2T outputs would enhance the paper. In the revised version, we will add an error analysis of the VLM labels and a small-scale human evaluation study to verify semantic accuracy. revision: yes

  2. Referee: [Evaluation] Evaluation section: no baseline models (e.g., traffic-only classifiers, simpler sequence models, or direct VLM comparison on traffic features) or statistical significance tests are reported for the metric scores on the 40,000-pair dataset. This leaves the improvement claim unanchored.

    Authors: We acknowledge that the current manuscript lacks explicit baseline comparisons and statistical significance testing. To address this, we will incorporate several baseline models, including a simple traffic feature classifier and a basic LSTM-based sequence model, and report their performance on the same dataset. Additionally, we will include statistical significance tests, such as bootstrap resampling or paired tests, to validate the improvements in our metrics. These additions will be detailed in the revised evaluation section. revision: yes

  3. Referee: [Data collection / Experiments] Data collection and experimental setup: the manuscript does not discuss potential data leakage between train and test splits or user/app overlap in the 40k pairs collected from real users, which is critical given the synchronized video-traffic pairing and the central claim of generalizability.

    Authors: The referee correctly notes that the manuscript does not explicitly discuss data leakage prevention. In our data collection, we collected data from 8 distinct users across two environments, and the 40,000 pairs were split such that no user's data appears in both training and test sets to avoid leakage. App overlaps were handled by ensuring diversity in the splits. We will add a dedicated subsection in the experimental setup describing the splitting strategy, user/app separation, and measures taken to prevent leakage, thereby supporting the generalizability claims. revision: yes
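The bootstrap resampling mentioned in the second response could take the form of a paired bootstrap over per-sample scores. The sketch below is one hedged way to do it. The scores are made-up numbers, `paired_bootstrap` is a hypothetical helper, and corpus-level metrics such as BLEU would need to be recomputed on each resample rather than averaged per sample as done here.

```python
import random
from typing import List

def paired_bootstrap(scores_a: List[float], scores_b: List[float],
                     n_resamples: int = 2000, seed: int = 0) -> float:
    """Fraction of resamples in which system A's total score does not exceed B's."""
    assert len(scores_a) == len(scores_b) and scores_a
    rng = random.Random(seed)
    n = len(scores_a)
    not_better = 0
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping the A/B pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(scores_a[i] - scores_b[i] for i in idx)
        if diff <= 0:
            not_better += 1
    return not_better / n_resamples

# Hypothetical per-sample metric scores for two systems on the same test items.
t2t_scores = [0.62, 0.58, 0.71, 0.66, 0.59, 0.64, 0.70, 0.61]
baseline_scores = [0.41, 0.45, 0.52, 0.39, 0.47, 0.44, 0.50, 0.43]
p_value = paired_bootstrap(t2t_scores, baseline_scores)
```

The returned fraction acts as an approximate one-sided p-value: a small value means the advantage of the first system survives resampling of the test set. Resampling whole users instead of individual items would be the more conservative choice when samples from the same user are correlated.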

Circularity Check

0 steps flagged

No circularity in derivation chain; encoder-decoder trained on external VLM pseudo-labels and evaluated with standard independent metrics

Full rationale

The paper describes collecting 40,000 traffic-description pairs where textual labels are produced by feeding synchronized screen-capture videos into the external Qwen-VL-Max vision-language model, then training a flow feature encoder plus caption decoder with multi-stage losses. Reported BLEU-4, METEOR, ROUGE-L and CIDEr scores quantify agreement with those fixed pseudo-labels using standard NLP metrics on held-out pairs from real-world user sessions. No mathematical derivations, equations, or parameter-fitting steps are shown that reduce by construction to the inputs. No self-citations are invoked as load-bearing uniqueness theorems, ansatzes, or imported results. The architecture and training procedure are presented as self-contained solutions to the semantic-gap and data-annotation problems, with no renaming of known patterns or self-definitional loops. The central claim therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that traffic statistics encode sufficient semantic information for natural-language activity description and on the reliability of the external vision-language model for label generation; no free parameters or invented entities are introduced beyond standard neural network training.

axioms (2)
  • domain assumption: Encrypted mobile traffic features contain enough information to support generation of semantically accurate activity captions.
    Invoked as the core premise enabling the encoder-decoder approach.
  • domain assumption: Qwen-VL-Max produces sufficiently accurate textual annotations from screen capture videos to serve as training targets.
    Required for the multi-stage training procedure described.

pith-pipeline@v0.9.0 · 5551 in / 1445 out tokens · 49648 ms · 2026-05-10T19:01:19.032353+00:00 · methodology

