From Exploration to Revelation: Detecting Dark Patterns in Mobile Apps
Pith reviewed 2026-05-23 17:03 UTC · model grok-4.3
The pith
AppRay combines LLM-guided exploration and contrastive learning to detect both intra-page and inter-page deceptive patterns in mobile apps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AppRay operates in two stages. First, LLM-guided task-oriented exploration combined with random exploration captures diverse UI states. Second, a contrastive learning-based multi-label classifier augmented with a rule-based refiner detects both intra-page and inter-page deceptive patterns. The system reports macro/micro F1 scores of 0.89/0.85 and enables detection of previously unexplored patterns.
What carries the argument
Two-stage pipeline of LLM-guided plus random UI exploration feeding a contrastive learning multi-label classifier with rule-based refiner for context-aware detection of deceptive patterns.
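The two-stage structure can be made concrete with a toy sketch. Everything in it is an illustrative assumption rather than AppRay's code: the app is a hand-written screen graph, `llm_pick` stands in for the LLM's task-oriented choice, `classify` is a keyword stub instead of the contrastive classifier, and the single "nagging" rule in `refine` is a hypothetical example of an inter-page rule, not one of the paper's actual rules.

```python
import random
from typing import Callable

# Hypothetical toy app model: screen name -> {action name: next screen name}.
App = dict[str, dict[str, str]]
Trace = list[tuple[str, str, str]]  # (source screen, action, destination screen)


def explore(app: App, start: str, llm_pick: Callable[[str, list[str]], str],
            steps: int = 10, p_random: float = 0.3, seed: int = 0) -> Trace:
    """Stage 1: mostly follow the task-oriented choice, sometimes act randomly."""
    rng = random.Random(seed)
    screen, trace = start, []
    for _ in range(steps):
        actions = sorted(app.get(screen, {}))
        if not actions:
            break  # dead end: no clickable actions on this screen
        if rng.random() < p_random:
            action = rng.choice(actions)        # random exploration step
        else:
            action = llm_pick(screen, actions)  # stand-in for the LLM's choice
        nxt = app[screen][action]
        trace.append((screen, action, nxt))
        screen = nxt
    return trace


def classify(screen: str) -> set[str]:
    """Stage 2a: keyword stub standing in for the per-page multi-label classifier."""
    return {"popup_ad"} if "popup" in screen else set()


def refine(trace: Trace, page_labels: dict[str, set[str]]) -> set[str]:
    """Stage 2b: keep the per-page labels, then apply one illustrative inter-page
    rule: a popup that reappears after the user closed it is reported as 'nagging'."""
    found = set().union(*page_labels.values())
    dismissed: set[str] = set()
    for src, action, dst in trace:
        if dst in dismissed:
            found.add("nagging")
        if action == "close" and "popup_ad" in page_labels.get(src, set()):
            dismissed.add(src)
    return found


if __name__ == "__main__":
    app = {"home": {"open_feed": "feed"}, "feed": {"next": "popup"}, "popup": {"close": "home"}}
    trace = explore(app, "home", llm_pick=lambda s, acts: acts[0], steps=6, p_random=0.0)
    labels = {s: classify(s) for t in trace for s in (t[0], t[2])}
    print(sorted(refine(trace, labels)))  # ['nagging', 'popup_ad']
```

The point of the example is that "nagging" only becomes visible once transitions are recorded: no single screen in the trace carries it, which is why the exploration stage has to preserve the relationships between UIs.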
If this is right
- Detection now covers inter-page patterns that prior automated methods left out.
- Manual effort drops: the LLM component directs the exploration, and the classifier replaces manual pattern labeling.
- The two contributed datasets supply 2,185 labeled deceptive pattern instances from 876 deceptive UIs, plus 871 benign UIs, for training or benchmarking.
- Performance gains range from 27.14 percent to 1200 percent relative to earlier methods on the same tasks.
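The 27.14% to 1200% range is easier to read with the usual definition of relative improvement, (new - baseline) / baseline × 100. The report does not say which metric or baseline each figure refers to, so the values below are hypothetical; the point is that a 1200% gain means the new score is 13 times the baseline, which usually signals a baseline that barely registers on that task.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (new - baseline) / baseline * 100.0


def implied_baseline(new: float, gain_pct: float) -> float:
    """Baseline score implied by a new score and a reported percentage gain."""
    return new / (1.0 + gain_pct / 100.0)


if __name__ == "__main__":
    # Hypothetical new score of 0.65 with a 1200% gain: baseline was 0.65 / 13.
    print(round(implied_baseline(0.65, 1200), 4))  # 0.05
    print(round(relative_improvement(1.2714, 1.0), 2))  # 27.14
```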
Where Pith is reading between the lines
- The same exploration-plus-classifier structure could be tested on web or desktop interfaces where deceptive patterns also appear across multiple screens.
- App stores could integrate the refiner stage to flag submissions before release, provided the rule set is updated for new pattern variants.
- If the LLM component is replaced with a lighter model, its runtime cost on resource-limited devices could be measured and compared directly with the current results.
Load-bearing premise
The mix of LLM-guided task-oriented exploration and random exploration produces a diverse enough set of UI states to let the classifier reliably find both single-page and multi-page deceptive patterns.
What would settle it
Run AppRay on a fresh set of apps containing known inter-page deceptive patterns that require specific task sequences the LLM guidance does not generate; if the system misses most of those patterns while human reviewers find them, the central claim does not hold.
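The test above reduces to a recall measurement against human-found instances. A minimal sketch, assuming each instance is identified by a hypothetical (app id, pattern type) pair and reading "misses most" as recall below 0.5:

```python
Instance = tuple[str, str]  # hypothetical (app id, pattern type) pair


def inter_page_recall(detected: set[Instance], human_found: set[Instance]) -> float:
    """Fraction of human-found inter-page instances that the system also reports."""
    if not human_found:
        raise ValueError("need at least one human-labelled instance")
    return len(detected & human_found) / len(human_found)


def central_claim_holds(detected: set[Instance], human_found: set[Instance]) -> bool:
    """The claim fails if the system misses most (more than half) of the instances."""
    return inter_page_recall(detected, human_found) >= 0.5


if __name__ == "__main__":
    human = {("app1", "nagging"), ("app2", "forced_continuity"),
             ("app3", "nagging"), ("app4", "roach_motel")}
    system = {("app1", "nagging"), ("app5", "nagging")}
    print(inter_page_recall(system, human), central_claim_holds(system, human))  # 0.25 False
```

Detections that humans did not confirm (here `("app5", "nagging")`) do not affect this check; they would show up as false positives in a precision measurement instead.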
Original abstract
Mobile apps are essential in daily life but frequently employ deceptive patterns, such as visual emphasis or linguistic nudging, to manipulate user behavior. Existing research largely relies on manual detection, which is time-consuming and cannot keep pace with rapidly evolving apps. Although recent work has explored automated approaches, these methods are limited to intra-page patterns, depend on manual app exploration, and lack flexibility. To address these limitations, we present AppRay, a system that integrates task-oriented app exploration with automated deceptive pattern detection to reduce manual effort, expand detection coverage, and improve performance. AppRay operates in two stages. First, it combines large language model-guided task-oriented exploration with random exploration to capture diverse user interface (UI) states. Second, it detects both intra-page and inter-page deceptive patterns using a contrastive learning-based multi-label classifier augmented with a rule-based refiner for context-aware detection. We contribute two datasets, AppRay-Tainted-UIs and AppRay-Benign-UIs, comprising 2,185 deceptive pattern instances, including 149 intra-page cases, spanning 16 types across 876 deceptive and 871 benign UIs, while preserving UI relationships. Experimental results show that AppRay achieves macro/micro averaged precision of 0.92/0.85, recall of 0.86/0.88, and F1 scores of 0.89/0.85, yielding 27.14% to 1200% improvements over prior methods and enabling effective detection of previously unexplored deceptive patterns.
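Because the classifier is multi-label, the abstract reports both macro and micro averages. Macro averaging computes precision, recall, and F1 per pattern type and then takes their unweighted mean, so rare types count as much as common ones. Micro averaging pools the true-positive, false-positive, and false-negative counts across all types first. The counts below are made up for illustration.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def macro_micro(counts: dict[str, tuple[int, int, int]]):
    """counts maps each pattern type to its (tp, fp, fn) over the test set."""
    per_label = [prf(*c) for c in counts.values()]
    n = len(per_label)
    # Macro: unweighted mean of the per-label scores.
    macro = tuple(sum(x[i] for x in per_label) / n for i in range(3))
    # Micro: pool the counts across labels, then compute the scores once.
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    micro = prf(tp, fp, fn)
    return macro, micro


if __name__ == "__main__":
    macro, micro = macro_micro({"nagging": (8, 2, 2), "disguised_ad": (1, 0, 3)})
    print([round(x, 3) for x in macro])  # [0.9, 0.525, 0.6]
    print([round(x, 3) for x in micro])  # [0.818, 0.643, 0.72]
```

In this common convention (it is what scikit-learn's `average="macro"` does), macro F1 is the mean of per-label F1 scores, so it need not equal the harmonic mean of macro precision and macro recall; micro F1, by contrast, is always the harmonic mean of micro precision and micro recall.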
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents AppRay, a two-stage system for automated detection of deceptive (dark) patterns in mobile apps. Stage 1 combines LLM-guided task-oriented exploration with random exploration to capture diverse UI states while preserving relationships; Stage 2 applies a contrastive learning multi-label classifier plus rule-based refiner to identify both intra-page and inter-page patterns across 16 types. The authors contribute two new datasets (AppRay-Tainted-UIs and AppRay-Benign-UIs) totaling 2,185 deceptive instances over 876 deceptive and 871 benign UIs, and report macro/micro precision 0.92/0.85, recall 0.86/0.88, and F1 0.89/0.85, claiming 27.14%–1200% gains over prior methods and the ability to detect previously unexplored inter-page patterns.
Significance. If the exploration coverage and dataset labeling can be validated, the work would meaningfully extend automated dark-pattern detection beyond the intra-page/manual-exploration limits of prior art, with the contributed datasets (preserving UI relationships) providing a reusable resource for the community. The contrastive-learning approach for multi-label context-aware detection is a plausible technical step forward.
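The report does not specify which contrastive objective AppRay uses. As an illustration only, the sketch below implements a loss in the style of the supervised contrastive (SupCon) formulation, adapted to multi-label data by an assumption of ours: two UIs form a positive pair when they share at least one pattern label, and benign UIs (empty label sets) serve only as negatives. Embeddings are plain lists of floats to keep the example dependency-free.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def supcon_loss(embeddings: list[list[float]], labels: list[set[str]],
                temperature: float = 0.1) -> float:
    """Mean over anchors of -1/|P| * sum_p log(exp(s_ip / t) / sum_{a != i} exp(s_ia / t)).
    Anchors without any positive (e.g. benign UIs) are skipped."""
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[i] & labels[j]]
        if not positives:
            continue
        logits = {j: cosine(embeddings[i], embeddings[j]) / temperature
                  for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(z) for z in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


if __name__ == "__main__":
    labels = [{"nagging"}, {"nagging"}, {"disguised_ad"}, {"disguised_ad"}]
    aligned = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]   # same-label UIs close
    shuffled = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]  # same-label UIs apart
    print(supcon_loss(aligned, labels) < supcon_loss(shuffled, labels))  # True
```

Minimizing such a loss pulls UIs that share a deceptive pattern together in embedding space, which is the general motivation for using contrastive pre-training ahead of a multi-label classification head.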
major comments (3)
- [Abstract] Abstract: the headline performance metrics and the claim of detecting 'previously unexplored' inter-page patterns rest on the unverified assumption that LLM-guided + random exploration produces sufficiently diverse UI states and transitions; no state-coverage metrics, transition-graph statistics, or ablation removing the LLM component are reported to show that the LLM guidance reaches deceptive inter-page flows beyond random walks.
- [Abstract] Abstract (dataset description): the 2,185 deceptive pattern instances (including 149 intra-page cases) are presented without any information on the labeling process, inter-annotator agreement, or how deceptive vs. benign UIs were identified, which is load-bearing for trusting the reported precision/recall/F1 values and the claimed improvements.
- [Abstract] Abstract: the stated 27.14%–1200% improvements over prior methods are given without baseline implementation details, ablation studies, or clarification whether gains derive from the new exploration technique versus simply the new dataset construction, leaving the central methodological contribution unsupported.
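The first major comment asks for state-coverage metrics, transition-graph statistics, and an ablation without the LLM. A minimal sketch of such statistics, assuming each run logs (source screen, action, destination screen) triples and, optionally, that the set of reachable screens is known (in practice it often is not, in which case raw state and edge counts are compared). The screen names and runs are hypothetical.

```python
from collections import Counter
from typing import Optional

Trace = list[tuple[str, str, str]]  # (source screen, action, destination screen)


def exploration_stats(trace: Trace, reachable: Optional[set[str]] = None) -> dict:
    """Distinct states, distinct screen-to-screen edges, and optional coverage."""
    states = {s for s, _, _ in trace} | {d for _, _, d in trace}
    edges = {(s, d) for s, _, d in trace}
    out_degree = Counter(s for s, _ in edges)
    stats = {"states": len(states), "edges": len(edges),
             "max_out_degree": max(out_degree.values(), default=0)}
    if reachable:
        stats["coverage"] = len(states & reachable) / len(reachable)
    return stats


def reaches_flow(trace: Trace, flow: tuple[str, ...]) -> bool:
    """True if one contiguous run visits the screens in `flow` back to back."""
    path = [trace[0][0]] + [d for _, _, d in trace] if trace else []
    k = len(flow)
    return any(path[i:i + k] == list(flow) for i in range(len(path) - k + 1))


if __name__ == "__main__":
    llm_run = [("home", "buy", "cart"), ("cart", "checkout", "pay"),
               ("pay", "cancel", "retention_offer")]
    random_run = [("home", "settings", "settings"), ("settings", "back", "home"),
                  ("home", "buy", "cart")]
    flow = ("pay", "retention_offer")  # a hypothetical inter-page deceptive flow
    print(reaches_flow(llm_run, flow), reaches_flow(random_run, flow))  # True False
```

An ablation in this spirit would run both configurations on the same apps and compare these statistics, plus how many known inter-page flows each configuration traverses.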
minor comments (2)
- [Abstract] Abstract: the exact number of apps explored and the distribution of the 16 pattern types across the 876/871 UIs should be stated explicitly for reproducibility.
- The free parameters (LLM prompt choices and contrastive-learning hyperparameters/thresholds) are not listed; adding an explicit enumeration would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting areas where the abstract could better support its claims. We address each point below and commit to revisions that add the requested details and evidence.
Point-by-point responses
- Referee: [Abstract] Abstract: the headline performance metrics and the claim of detecting 'previously unexplored' inter-page patterns rest on the unverified assumption that LLM-guided + random exploration produces sufficiently diverse UI states and transitions; no state-coverage metrics, transition-graph statistics, or ablation removing the LLM component are reported to show that the LLM guidance reaches deceptive inter-page flows beyond random walks.
Authors: We agree that the abstract lacks quantitative support for the exploration component. The full paper describes the LLM-guided plus random strategy in Section 4, but does not report coverage metrics or ablations. In the revision we will add state-coverage metrics, transition-graph statistics, and an ablation removing the LLM component to demonstrate its contribution to inter-page pattern detection. revision: yes
- Referee: [Abstract] Abstract (dataset description): the 2,185 deceptive pattern instances (including 149 intra-page cases) are presented without any information on the labeling process, inter-annotator agreement, or how deceptive vs. benign UIs were identified, which is load-bearing for trusting the reported precision/recall/F1 values and the claimed improvements.
Authors: We acknowledge that the abstract omits labeling methodology. The datasets were created via automated flagging followed by multi-annotator verification. In the revision we will expand the abstract and add a methods subsection detailing the labeling protocol, inter-annotator agreement, and criteria used to classify deceptive versus benign UIs. revision: yes
- Referee: [Abstract] Abstract: the stated 27.14%–1200% improvements over prior methods are given without baseline implementation details, ablation studies, or clarification whether gains derive from the new exploration technique versus simply the new dataset construction, leaving the central methodological contribution unsupported.
Authors: The gains arise from the joint effect of the exploration method and the contrastive classifier on the new datasets. We agree more transparency is required. In the revision we will supply baseline re-implementation details, ablation studies isolating the exploration versus classifier/dataset contributions, and explicit attribution of the performance deltas. revision: yes
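The responses commit to reporting inter-annotator agreement. For two annotators, Cohen's kappa is the standard statistic: it compares the observed agreement p_o with the agreement p_e expected by chance from each annotator's label frequencies, kappa = (p_o - p_e) / (1 - p_e). The sketch assumes binary deceptive/benign labels per UI, with made-up data; in the multi-label setting it could be computed once per pattern type, and with more than two annotators Fleiss' kappa is the usual generalization.

```python
from collections import Counter


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)  # chance agreement
    if expected == 1.0:
        return 1.0  # both annotators used one identical category; kappa is undefined
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    annotator_1 = list("DDDDDBBBBB")  # D = deceptive, B = benign (hypothetical)
    annotator_2 = list("DDDDBBBBBD")
    print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.6
```

Here the annotators agree on 80% of the UIs, but because each labels half the items deceptive, 50% agreement is expected by chance, so kappa is 0.6 rather than 0.8.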
Circularity Check
No significant circularity; empirical evaluation on held-out data
The paper presents AppRay as an empirical system whose performance numbers (precision, recall, F1) are reported as direct measurements on two newly contributed datasets (AppRay-Tainted-UIs and AppRay-Benign-UIs) containing 2,185 deceptive pattern instances. No equations, parameters, or predictions are shown to reduce by construction to fitted inputs or self-definitions. The contrastive-learning classifier and rule-based refiner are standard ML components applied to the collected UI data, and the exploration stage is a data-collection procedure rather than a derived quantity. No self-citation chain is invoked as a uniqueness theorem or load-bearing premise. The reported metrics are therefore empirical measurements rather than quantities that hold by construction.
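Held-out evaluation matters here because the datasets preserve UI relationships, so screens from the same app are strongly correlated. This section does not state how AppRay split its data. A minimal sketch of a leakage-safe split, assuming each UI is grouped under an app identifier (the `appN/uiM` naming is hypothetical), assigns whole apps to one side and uses a hash so the assignment is deterministic.

```python
import hashlib


def app_level_split(ui_ids_by_app: dict[str, list[str]],
                    test_fraction: float = 0.2) -> tuple[list[str], list[str]]:
    """Assign whole apps (not individual UIs) to train or test, so screens
    from one app never appear on both sides."""
    train, test = [], []
    for app_id, ui_ids in sorted(ui_ids_by_app.items()):
        # Stable pseudo-random bucket in [0, 100) derived from the app id.
        bucket = int(hashlib.sha256(app_id.encode()).hexdigest(), 16) % 100
        (test if bucket < test_fraction * 100 else train).extend(ui_ids)
    return train, test


if __name__ == "__main__":
    data = {f"app{i}": [f"app{i}/ui{j}" for j in range(3)] for i in range(50)}
    train, test = app_level_split(data)
    print(len(train) + len(test))  # 150
```

A UI-level random split would instead let near-duplicate screens from one app land in both training and test sets, which tends to inflate precision and recall.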
Axiom & Free-Parameter Ledger
free parameters (2)
- LLM prompt choices for task guidance
- Contrastive learning hyperparameters and thresholds
axioms (1)
- Domain assumption: the labels in AppRay-Tainted-UIs (2,185 deceptive pattern instances) and AppRay-Benign-UIs accurately reflect real-world dark patterns across apps.