
arXiv: 2411.18084 · v2 · submitted 2024-11-27 · 💻 cs.SE · cs.AI · cs.HC

From Exploration to Revelation: Detecting Dark Patterns in Mobile Apps

Pith reviewed 2026-05-23 17:03 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.HC
keywords dark patterns · deceptive patterns · mobile apps · automated detection · LLM-guided exploration · contrastive learning · multi-label classification · UI states

The pith

AppRay combines LLM-guided exploration and contrastive learning to detect both intra-page and inter-page deceptive patterns in mobile apps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AppRay to automate the detection of deceptive patterns in mobile apps, such as visual emphasis or linguistic nudging, which manual inspection cannot cover at scale. It first combines large language model-guided task-oriented exploration with random exploration to gather diverse UI states, then applies a contrastive learning-based multi-label classifier plus a rule-based refiner to identify patterns within single pages and across sequences of pages. The paper reports macro/micro averaged precision of 0.92/0.85, recall of 0.86/0.88, and F1 scores of 0.89/0.85 on new datasets covering 16 pattern types. A sympathetic reader would care because the approach cuts manual effort while expanding coverage beyond the intra-page limits of earlier automated tools.
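The macro and micro averages quoted above can differ because they weight labels differently. The sketch below shows the two conventions on invented counts; the label names and numbers are hypothetical, and the macro F1 here is the mean of per-label F1 scores, which may or may not match the convention the paper uses.

```python
from typing import Dict, Tuple

# Hypothetical per-label counts: (true positives, false positives, false negatives).
# Invented for illustration; not taken from the paper.
COUNTS: Dict[str, Tuple[int, int, int]] = {
    "frequent_label": (80, 20, 10),
    "rare_label": (9, 1, 1),
}


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def macro_average(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, ...]:
    """Average the per-label scores, so every label weighs the same."""
    scores = [prf(*c) for c in counts.values()]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


def micro_average(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Pool the counts over all labels first, so every instance weighs the same."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return prf(tp, fp, fn)


macro_p, macro_r, macro_f1 = macro_average(COUNTS)
micro_p, micro_r, micro_f1 = micro_average(COUNTS)
print(f"macro P/R/F1: {macro_p:.3f} {macro_r:.3f} {macro_f1:.3f}")
print(f"micro P/R/F1: {micro_p:.3f} {micro_r:.3f} {micro_f1:.3f}")
```

In this toy case the rare label scores well, which lifts the macro precision above the micro precision; a gap in that direction is at least consistent with rare pattern types being classified more precisely than frequent ones.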

Core claim

AppRay operates in two stages: LLM-guided task-oriented exploration combined with random exploration captures diverse UI states, after which a contrastive learning-based multi-label classifier augmented with a rule-based refiner detects both intra-page and inter-page deceptive patterns, achieving the reported performance metrics and enabling detection of previously unexplored patterns.

What carries the argument

Two-stage pipeline of LLM-guided plus random UI exploration feeding a contrastive learning multi-label classifier with rule-based refiner for context-aware detection of deceptive patterns.
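As a rough illustration of the exploration stage, the sketch below merges a task-oriented trace and a random trace into one UI transition graph so that relationships between pages are preserved for the detector. The trace format, state names, and `merge_traces` helper are assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# A trace is a list of (source_state, action, target_state) steps.
# This representation is hypothetical; the paper's own format is not given here.
Trace = List[Tuple[str, str, str]]


def merge_traces(*traces: Trace) -> Dict[str, Set[Tuple[str, str]]]:
    """Union several exploration traces into one state -> {(action, target)} graph."""
    graph: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for trace in traces:
        for src, action, dst in trace:
            graph[src].add((action, dst))
            graph.setdefault(dst, set())  # keep states that have no outgoing edges
    return dict(graph)


# Invented example traces.
llm_trace: Trace = [
    ("home", "tap_settings", "settings"),
    ("settings", "tap_notifications", "notifications"),
]
random_trace: Trace = [
    ("home", "tap_banner", "promo"),
    ("home", "tap_settings", "settings"),  # duplicate step, deduplicated by the set
]

graph = merge_traces(llm_trace, random_trace)
for state, edges in sorted(graph.items()):
    print(state, sorted(edges))
```

Keeping the edges, not just the set of screens, is what would let a later stage reason about patterns that only show up across a sequence of pages.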

If this is right

  • Detection now covers inter-page patterns that prior automated methods left out.
  • Manual exploration and labeling effort drops because the LLM component directs the search.
  • The two contributed datasets supply 2,185 labeled instances across 876 deceptive and 871 benign UIs for training or benchmarking.
  • The reported improvements over prior methods range from 27.14% to 1200%.
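To make the size of such percentages concrete, the sketch below computes a relative improvement over a baseline. The baseline values are hypothetical and chosen only to reproduce the two endpoints; the abstract does not state which metric or baseline each figure refers to.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0


# Hypothetical baselines, used only to illustrate the arithmetic.
small_gain = relative_improvement(0.89, 0.70)
large_gain = relative_improvement(0.65, 0.05)

print(f"{small_gain:.2f}%")
print(f"{large_gain:.2f}%")
```

A 1200% improvement means the new score is 13 times the baseline, so a figure that large typically arises when the baseline score is very low.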

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same exploration-plus-classifier structure could be tested on web or desktop interfaces where deceptive patterns also appear across multiple screens.
  • App stores could integrate the refiner stage to flag submissions before release, provided the rule set is updated for new pattern variants.
  • If the LLM component is replaced with a lighter model, runtime cost on resource-limited devices becomes measurable and could be compared directly to the current results.

Load-bearing premise

The mix of LLM-guided task-oriented exploration and random exploration produces a diverse enough set of UI states to let the classifier reliably find both single-page and multi-page deceptive patterns.

What would settle it

Run AppRay on a fresh set of apps containing known inter-page deceptive patterns that require specific task sequences the LLM guidance does not generate; if the system misses most of those patterns while human reviewers find them, the central claim does not hold.
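One way to score the test described above is to check, for each known inter-page flow, whether every transition in it was reached during exploration, and then report the fraction of flows missed. The sketch below assumes flows are lists of state names and explored transitions are (source, target) pairs; all names and data are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def is_covered(flow: List[str], explored: Set[Edge]) -> bool:
    """A flow counts as covered only if each consecutive transition was explored."""
    return all(edge in explored for edge in zip(flow, flow[1:]))


def miss_rate(flows: Iterable[List[str]], explored: Set[Edge]) -> float:
    """Fraction of known flows that exploration failed to cover."""
    flows = list(flows)
    if not flows:
        return 0.0
    missed = sum(1 for flow in flows if not is_covered(flow, explored))
    return missed / len(flows)


# Invented data: transitions reached by exploration and flows found by human reviewers.
EXPLORED: Set[Edge] = {("home", "cart"), ("cart", "checkout"), ("home", "settings")}
KNOWN_FLOWS = [
    ["home", "cart", "checkout"],
    ["home", "settings", "account", "cancel_subscription"],
]

rate = miss_rate(KNOWN_FLOWS, EXPLORED)
print(f"missed {rate:.0%} of known inter-page flows")
```

A flow that exploration never reaches cannot be detected by any downstream classifier, so this miss rate gives a lower bound on the share of those patterns the whole pipeline misses.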

Figures

Figures reproduced from arXiv: 2411.18084 by Jiamou Sun, Jieshan Chen, Liming Zhu, Qinghua Lu, Qing Huang, Xiwei Xu, Zhenchang Xing, Zhen Wang.

Figure 1. Dark pattern taxonomy. It consists of five main strategies. We employ …
Figure 2. Examples of deceptive patterns.
Figure 3. Overview of AppRay. It consists of two modules: app exploration and dark pattern detection.
Figure 4. Our LLM-based task-oriented app navigator employs a trial-and-error …
Figure 5. The Venn diagrams for Expert 1, Expert 2 and …

Original abstract

Mobile apps are essential in daily life but frequently employ deceptive patterns, such as visual emphasis or linguistic nudging, to manipulate user behavior. Existing research largely relies on manual detection, which is time-consuming and cannot keep pace with rapidly evolving apps. Although recent work has explored automated approaches, these methods are limited to intra-page patterns, depend on manual app exploration, and lack flexibility. To address these limitations, we present AppRay, a system that integrates task-oriented app exploration with automated deceptive pattern detection to reduce manual effort, expand detection coverage, and improve performance. AppRay operates in two stages. First, it combines large language model-guided task-oriented exploration with random exploration to capture diverse user interface (UI) states. Second, it detects both intra-page and inter-page deceptive patterns using a contrastive learning-based multi-label classifier augmented with a rule-based refiner for context-aware detection. We contribute two datasets, AppRay-Tainted-UIs and AppRay-Benign-UIs, comprising 2,185 deceptive pattern instances, including 149 intra-page cases, spanning 16 types across 876 deceptive and 871 benign UIs, while preserving UI relationships. Experimental results show that AppRay achieves macro/micro averaged precision of 0.92/0.85, recall of 0.86/0.88, and F1 scores of 0.89/0.85, yielding 27.14% to 1200% improvements over prior methods and enabling effective detection of previously unexplored deceptive patterns.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents AppRay, a two-stage system for automated detection of deceptive (dark) patterns in mobile apps. Stage 1 combines LLM-guided task-oriented exploration with random exploration to capture diverse UI states while preserving relationships; Stage 2 applies a contrastive learning multi-label classifier plus rule-based refiner to identify both intra-page and inter-page patterns across 16 types. The authors contribute two new datasets (AppRay-Tainted-UIs and AppRay-Benign-UIs) totaling 2,185 deceptive instances over 876 deceptive and 871 benign UIs, and report macro/micro precision 0.92/0.85, recall 0.86/0.88, and F1 0.89/0.85, claiming 27.14%–1200% gains over prior methods and the ability to detect previously unexplored inter-page patterns.

Significance. If the exploration coverage and dataset labeling can be validated, the work would meaningfully extend automated dark-pattern detection beyond the intra-page/manual-exploration limits of prior art, with the contributed datasets (preserving UI relationships) providing a reusable resource for the community. The contrastive-learning approach for multi-label context-aware detection is a plausible technical step forward.

major comments (3)
  1. [Abstract] The headline performance metrics and the claim of detecting 'previously unexplored' inter-page patterns rest on the unverified assumption that LLM-guided + random exploration produces sufficiently diverse UI states and transitions; no state-coverage metrics, transition-graph statistics, or ablation removing the LLM component are reported to show that the LLM guidance reaches deceptive inter-page flows beyond random walks.
  2. [Abstract, dataset description] The 2,185 deceptive pattern instances (including 149 intra-page cases) are presented without any information on the labeling process, inter-annotator agreement, or how deceptive vs. benign UIs were identified, which is load-bearing for trusting the reported precision/recall/F1 values and the claimed improvements.
  3. [Abstract] The stated 27.14%–1200% improvements over prior methods are given without baseline implementation details, ablation studies, or clarification whether gains derive from the new exploration technique versus simply the new dataset construction, leaving the central methodological contribution unsupported.
minor comments (2)
  1. [Abstract] The exact number of apps explored and the distribution of the 16 pattern types across the 876/871 UIs should be stated explicitly for reproducibility.
  2. The free parameters (LLM prompt choices and contrastive-learning hyperparameters/thresholds) are not listed; adding an explicit enumeration would improve clarity.
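Major comment 2 asks for inter-annotator agreement. One common statistic for two annotators is Cohen's kappa, sketched below on invented binary labels (1 = deceptive, 0 = benign). This is a generic illustration; the source does not say which agreement measure, if any, the authors used.

```python
from typing import Sequence


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    labels = set(a) | set(b)
    expected = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented annotations for eight UIs.
expert_1 = [1, 1, 0, 0, 1, 0, 1, 1]
expert_2 = [1, 0, 0, 0, 1, 0, 1, 1]

kappa = cohens_kappa(expert_1, expert_2)
print(f"kappa = {kappa:.2f}")
```

For a multi-label task such as this one, the statistic would normally be computed per pattern type, since agreement can vary widely between common and rare labels.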

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where the abstract could better support its claims. We address each point below and commit to revisions that add the requested details and evidence.

  1. Referee: [Abstract] The headline performance metrics and the claim of detecting 'previously unexplored' inter-page patterns rest on the unverified assumption that LLM-guided + random exploration produces sufficiently diverse UI states and transitions; no state-coverage metrics, transition-graph statistics, or ablation removing the LLM component are reported to show that the LLM guidance reaches deceptive inter-page flows beyond random walks.

    Authors: We agree that the abstract lacks quantitative support for the exploration component. The full paper describes the LLM-guided plus random strategy in Section 4, but does not report coverage metrics or ablations. In the revision we will add state-coverage metrics, transition-graph statistics, and an ablation removing the LLM component to demonstrate its contribution to inter-page pattern detection.

    Revision: yes

  2. Referee: [Abstract, dataset description] The 2,185 deceptive pattern instances (including 149 intra-page cases) are presented without any information on the labeling process, inter-annotator agreement, or how deceptive vs. benign UIs were identified, which is load-bearing for trusting the reported precision/recall/F1 values and the claimed improvements.

    Authors: We acknowledge that the abstract omits labeling methodology. The datasets were created via automated flagging followed by multi-annotator verification. In the revision we will expand the abstract and add a methods subsection detailing the labeling protocol, inter-annotator agreement, and criteria used to classify deceptive versus benign UIs.

    Revision: yes

  3. Referee: [Abstract] The stated 27.14%–1200% improvements over prior methods are given without baseline implementation details, ablation studies, or clarification whether gains derive from the new exploration technique versus simply the new dataset construction, leaving the central methodological contribution unsupported.

    Authors: The gains arise from the joint effect of the exploration method and the contrastive classifier on the new datasets. We agree more transparency is required. In the revision we will supply baseline re-implementation details, ablation studies isolating the exploration versus classifier/dataset contributions, and explicit attribution of the performance deltas.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on held-out data


The paper presents AppRay as an empirical system whose performance numbers (precision, recall, F1) are reported as direct measurements on two newly contributed datasets (AppRay-Tainted-UIs and AppRay-Benign-UIs) containing 2,185 instances. No equations, parameters, or predictions are shown to reduce by construction to fitted inputs or self-definitions. The contrastive-learning classifier and rule-based refiner are standard ML components applied to the collected UI data; the exploration stage is described as a data-collection procedure rather than a derived quantity. No self-citation chain is invoked as a uniqueness theorem or load-bearing premise. The reported metrics are therefore direct measurements rather than quantities that follow by construction from the system's own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The performance claims rest on the representativeness of the new datasets and on how well the LLM exploration and contrastive classifier generalize; both involve implicit domain assumptions about UI coverage and label quality that the abstract does not detail. The abstract names no free parameters or invented entities explicitly; the two free parameters listed below are inferred from the described method.

free parameters (2)
  • LLM prompt choices for task guidance
    Exploration stage depends on prompts to the language model, which are typically tuned or selected to achieve coverage.
  • Contrastive learning hyperparameters and thresholds
    The multi-label classifier is trained on the contributed data, implying fitted parameters whose values affect the reported precision and recall.
axioms (1)
  • domain assumption: The 2,185 labeled deceptive pattern instances in AppRay-Tainted-UIs and AppRay-Benign-UIs accurately reflect real-world dark patterns across apps
    All reported metrics and improvement claims depend on the correctness and representativeness of these ground-truth labels.

pith-pipeline@v0.9.0 · 5828 in / 1514 out tokens · 71074 ms · 2026-05-23T17:03:17.906890+00:00 · methodology

