T2T: Captioning Smartphone Activities Using Mobile Traffic
Pith reviewed 2026-05-10 19:01 UTC · model grok-4.3
The pith
T2T turns encrypted mobile traffic into readable captions for smartphone activities
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
T2T is a traffic-to-text system built from a flow feature encoder that converts encrypted traffic characteristics into latent features and a caption decoder that outputs readable activity descriptions. Training data are annotated automatically by feeding synchronized screen-capture videos into the Qwen-VL-Max vision-language model, and the encoder and decoder are trained with multi-stage losses for cross-modal alignment. Evaluated on 40,000 traffic-description pairs gathered in two real-world environments from 8 smartphone users and 20 mobile apps, T2T records a BLEU-4 of 58.1, METEOR of 38.3, ROUGE-L of 70.5 and CIDEr of 108.7. The paper reports that the generated captions are semantically accurate and comparable to those of the vision-language model.
What carries the argument
The T2T encoder-decoder architecture in which the flow feature encoder transforms low-level encrypted traffic into meaningful latent features and the caption decoder produces readable activity transcripts, trained with multi-stage losses on labels from a vision-language model.
If this is right
- Enables scalable activity monitoring across many apps without installing per-app classifiers or accessing device screens directly.
- Produces human-readable text outputs that describe detailed user interactions rather than coarse category labels.
- Allows automatic creation of large-scale annotated traffic datasets by leveraging existing vision-language models on screen recordings.
- Shows that encrypted network flows alone can convey sufficient information for high-level semantic reconstruction of smartphone usage.
Where Pith is reading between the lines
- The approach could support real-time activity logging for accessibility or parental-control applications.
- It surfaces privacy risks because detailed user behavior can be inferred from traffic patterns without breaking encryption.
- Similar traffic-to-text pipelines might extend to other encrypted channels such as IoT device communications or web browsing.
Load-bearing premise
The encoder-decoder architecture can reliably bridge the semantic gap between low-level encrypted traffic features and high-level activity descriptions when trained on automatically generated labels from a vision-language model applied to screen videos.
What would settle it
A substantial drop in all captioning metrics when T2T is tested on traffic collected from new users, different smartphone hardware, or apps outside the original set of 20 without retraining or additional screen-video labels.
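A held-out test of this kind could be set up roughly as follows. This is a minimal sketch, assuming each traffic-description pair carries a user identifier; the record fields, the sample values, and the helper `split_by_group` are hypothetical and are not taken from the paper.

```python
def split_by_group(samples, group_key, held_out):
    """Send every sample whose group is in `held_out` to the test set."""
    train, test = [], []
    for sample in samples:
        (test if sample[group_key] in held_out else train).append(sample)
    return train, test


# Hypothetical records: field names and values are made up for illustration.
samples = [
    {"user": "u1", "app": "maps", "caption": "searching for a route"},
    {"user": "u2", "app": "maps", "caption": "zooming the map"},
    {"user": "u3", "app": "chat", "caption": "sending a message"},
    {"user": "u1", "app": "chat", "caption": "reading a conversation"},
]

# Hold out one user entirely, so no traffic from that user is seen in training.
train, test = split_by_group(samples, "user", held_out={"u3"})
train_users = {s["user"] for s in train}
test_users = {s["user"] for s in test}
```

Passing `"app"` as the group key gives the analogous unseen-app split. Comparing metrics on such a split against a random split would show how much of the reported score depends on seeing the same users or apps during training.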
Original abstract
This paper studies the creation of textual descriptions of user activities and interactions on smartphones. Our approach of referring to encrypted mobile traffic exceeds traditional smartphone activity classification methods in terms of model scalability and output readability. The paper addresses two obstacles to the realization of this idea: the semantic gap between traffic features and smartphone activity captions, and the lack of textually annotated traffic data. To overcome these challenges, we introduce a novel smartphone activity captioning system, called T2T (Traffic-to-Text). T2T consists of a flow feature encoder that converts low-level traffic characteristics into meaningful latent features and a caption decoder to yield readable transcripts of smartphone activities. In addition, T2T achieves the automatic textual annotation of mobile traffic by feeding synchronized screen capture videos into the Qwen-VL-Max vision-language model, and proposing multi-stage losses for effective cross-model training. We evaluate T2T on 40,000 traffic-description pairs collected in two real-world environments, involving 8 smartphone users and 20 mobile apps. T2T achieves a BLEU-4 score of 58.1, a METEOR score of 38.3, a ROUGE-L score of 70.5, and a CIDEr score of 108.7. The quantitative and qualitative analyses show that T2T can generate semantically accurate captions that are comparable to the vision-language model.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes T2T, an encoder-decoder architecture that generates natural-language captions describing smartphone user activities and app interactions directly from encrypted mobile traffic flows. It addresses the semantic gap and lack of labeled data by using Qwen-VL-Max to automatically produce textual labels from synchronized screen-capture videos, trains on a 40,000-pair dataset collected from 8 real users across 20 apps in two environments, and reports BLEU-4=58.1, METEOR=38.3, ROUGE-L=70.5, CIDEr=108.7, claiming the resulting captions are semantically accurate and comparable to the vision-language model.
Significance. If independently validated, the approach would offer a scalable, privacy-preserving alternative to screen-based or app-specific activity logging by producing readable textual descriptions from traffic alone. The automatic labeling strategy using a VLM is a practical contribution for dataset creation in this domain. However, the current results primarily quantify fidelity to the VLM's pseudo-labels rather than verified semantic correctness of the activity descriptions.
major comments (3)
- [Abstract / Evaluation] Abstract and evaluation description: the reported BLEU/METEOR/ROUGE-L/CIDEr scores are computed against labels automatically generated by Qwen-VL-Max with no human annotation, inter-rater reliability, or error analysis provided for those pseudo-labels. Consequently the metrics demonstrate how well the traffic encoder reproduces the VLM outputs rather than establishing that the captions correctly describe the underlying smartphone activities.
- [Evaluation] Evaluation section: no baseline models (e.g., traffic-only classifiers, simpler sequence models, or direct VLM comparison on traffic features) or statistical significance tests are reported for the metric scores on the 40,000-pair dataset. This leaves the improvement claim unanchored.
- [Data collection / Experiments] Data collection and experimental setup: the manuscript does not discuss potential data leakage between train and test splits or user/app overlap in the 40k pairs collected from real users, which is critical given the synchronized video-traffic pairing and the central claim of generalizable captioning.
minor comments (1)
- [Abstract] The abstract could more explicitly distinguish between fidelity to VLM-generated labels and independent ground-truth accuracy of the activity descriptions.
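The distinction between fidelity and accuracy can be illustrated with a toy calculation. The sketch below implements only clipped n-gram precision, the building block of BLEU, and not the full metric (it has no brevity penalty and no averaging over n-gram orders). All three captions are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference,
    with each n-gram's count clipped to its count in the reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Hypothetical captions, not taken from the paper's dataset.
pseudo_label = "the user scrolls a video feed"   # what the VLM wrote
model_caption = "the user scrolls a video feed"  # what the traffic model wrote
true_activity = "the user scrolls a news feed"   # what a human would write

fidelity = clipped_precision(model_caption, pseudo_label, 2)
accuracy = clipped_precision(model_caption, true_activity, 2)
```

Here the caption reproduces the pseudo-label exactly, so its score against that label is perfect, while its score against the hypothetical human label is lower. When the VLM label is wrong, a high score against it says nothing about correctness.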
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our paper. We have carefully considered each major comment and provide point-by-point responses below, along with plans for revisions to address the concerns raised.
- Referee: [Abstract / Evaluation] Abstract and evaluation description: the reported BLEU/METEOR/ROUGE-L/CIDEr scores are computed against labels automatically generated by Qwen-VL-Max with no human annotation, inter-rater reliability, or error analysis provided for those pseudo-labels. Consequently the metrics demonstrate how well the traffic encoder reproduces the VLM outputs rather than establishing that the captions correctly describe the underlying smartphone activities.
  Authors: We appreciate the referee pointing out this key aspect of our evaluation. The metrics are indeed computed against the pseudo-labels generated by Qwen-VL-Max, as this VLM serves as our source of automatic annotations for the large-scale dataset. This approach allows us to scale the training data significantly. We did include qualitative examples in the manuscript demonstrating that the captions align with the actual activities observed in the screen captures. However, we agree that a more rigorous validation of the pseudo-label quality and human assessment of T2T outputs would enhance the paper. In the revised version, we will add an error analysis of the VLM labels and a small-scale human evaluation study to verify semantic accuracy.
  revision: yes
- Referee: [Evaluation] Evaluation section: no baseline models (e.g., traffic-only classifiers, simpler sequence models, or direct VLM comparison on traffic features) or statistical significance tests are reported for the metric scores on the 40,000-pair dataset. This leaves the improvement claim unanchored.
  Authors: We acknowledge that the current manuscript lacks explicit baseline comparisons and statistical significance testing. To address this, we will incorporate several baseline models, including a simple traffic feature classifier and a basic LSTM-based sequence model, and report their performance on the same dataset. Additionally, we will include statistical significance tests, such as bootstrap resampling or paired tests, to validate the improvements in our metrics. These additions will be detailed in the revised evaluation section.
  revision: yes
- Referee: [Data collection / Experiments] Data collection and experimental setup: the manuscript does not discuss potential data leakage between train and test splits or user/app overlap in the 40k pairs collected from real users, which is critical given the synchronized video-traffic pairing and the central claim of generalizability.
  Authors: The referee correctly notes that the manuscript does not explicitly discuss data leakage prevention. In our data collection, we collected data from 8 distinct users across two environments, and the 40,000 pairs were split such that no user's data appears in both training and test sets to avoid leakage. App overlaps were handled by ensuring diversity in the splits. We will add a dedicated subsection in the experimental setup describing the splitting strategy, user/app separation, and measures taken to prevent leakage, thereby supporting the generalizability claims.
  revision: yes
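The bootstrap resampling mentioned in the second response could look roughly like the sketch below. It assumes per-caption scores are available for both T2T and a baseline on the same test items; the function name `paired_bootstrap`, the score lists, and the number of resamples are illustrative assumptions, not the authors' procedure.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Return the fraction of resamples in which system A's mean score
    exceeds system B's, resampling test items with replacement."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw the same item indices for both systems to keep the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples


# Made-up per-caption scores for eight test items.
t2t_scores = [0.62, 0.58, 0.71, 0.55, 0.66, 0.60, 0.69, 0.57]
baseline_scores = [0.41, 0.47, 0.52, 0.39, 0.50, 0.44, 0.49, 0.42]

win_rate = paired_bootstrap(t2t_scores, baseline_scores)
```

A win rate close to 1 suggests the difference is unlikely to be an artifact of which test items were sampled. Corpus-level metrics such as BLEU would need to be recomputed on each resample rather than averaged per item.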
Circularity Check
No circularity in derivation chain; encoder-decoder trained on external VLM pseudo-labels and evaluated with standard independent metrics
The paper describes collecting 40,000 traffic-description pairs where textual labels are produced by feeding synchronized screen-capture videos into the external Qwen-VL-Max vision-language model, then training a flow feature encoder plus caption decoder with multi-stage losses. Reported BLEU-4, METEOR, ROUGE-L and CIDEr scores quantify agreement with those fixed pseudo-labels using standard NLP metrics on held-out pairs from real-world user sessions. No mathematical derivations, equations, or parameter-fitting steps are shown that reduce by construction to the inputs. No self-citations are invoked as load-bearing uniqueness theorems, ansatzes, or imported results. The architecture and training procedure are presented as self-contained solutions to the semantic-gap and data-annotation problems, with no renaming of known patterns or self-definitional loops. The central claim therefore remains independent of its own outputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Encrypted mobile traffic features contain enough information to support generation of semantically accurate activity captions
- domain assumption: Qwen-VL-Max produces sufficiently accurate textual annotations from screen capture videos to serve as training targets