pith. machine review for the scientific record.

arxiv: 2603.24160 · v2 · submitted 2026-03-25 · 💻 cs.SE

Recognition: no theorem link

Towards Automated Crowdsourced Testing via Personified-LLM

Pith reviewed 2026-05-15 00:48 UTC · model grok-4.3

classification 💻 cs.SE
keywords PersonaTester · crowdsourced testing · GUI testing · LLM agents · personified LLMs · automated software testing · behavioral simulation

The pith

PersonaTester injects three orthogonal dimensions of human tester personas into LLMs to simulate diverse crowdworker behaviors for automated GUI testing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Crowdsourced GUI testing draws value from the varied behaviors of real human testers across devices and scenarios, yet scaling it remains costly and inconsistent. Purely automated LLM-based testing offers control and speed but typically produces uniform outputs that miss that diversity. PersonaTester defines personas along testing mindset, exploration strategy, and interaction habit, then embeds them into LLM agents so each agent generates distinct, repeatable testing traces. Experiments show these agents reproduce real crowdworker patterns with high intra-persona consistency and inter-persona differences, while producing more effective test events and uncovering additional crashes and functional bugs compared with non-persona baselines.

Core claim

PersonaTester is a framework that automates crowdsourced GUI testing by injecting representative personas—defined along the three orthogonal dimensions of testing mindset, exploration strategy, and interaction habit—into LLM-based agents. This injection produces controllable, repeatable simulations of human-like testing behaviors. The resulting agents exhibit strong behavioral fidelity to real crowdworkers and generate more effective test events, triggering 100+ crashes and 11 functional bugs, more than non-persona baselines achieve.

What carries the argument

Persona injection into LLM agents along three orthogonal dimensions (testing mindset, exploration strategy, interaction habit) that drive distinct action sequences and coverage patterns.
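What such injection could look like mechanically is easy to sketch. A minimal sketch, assuming a plain prompt-template mechanism: the three dimension names come from the paper, but the Persona dataclass, the template wording, and the example values here are hypothetical, not the authors' actual prompts (Figure 3 shows theirs).

```python
# Minimal sketch of persona injection: a persona is a point in three orthogonal
# dimensions, rendered into the agent's system prompt. Field names mirror the
# paper's dimensions; the values and template are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Persona:
    mindset: str   # testing mindset, e.g. "destructive edge-case hunter"
    strategy: str  # exploration strategy, e.g. "depth-first into nested menus"
    habit: str     # interaction habit, e.g. "rapid taps, rarely scrolls"

def persona_system_prompt(p: Persona) -> str:
    """Render a persona into a system prompt for a GUI-testing LLM agent."""
    return (
        "You are a crowdsourced GUI tester exploring a mobile app.\n"
        f"Testing mindset: {p.mindset}.\n"
        f"Exploration strategy: {p.strategy}.\n"
        f"Interaction habit: {p.habit}.\n"
        "At each step, given the current screen's widgets, choose one action "
        "consistent with this persona and explain it briefly."
    )

# Varying only the persona yields distinct agents on the same backbone:
cautious = Persona("verify documented flows",
                   "breadth-first over top-level tabs",
                   "slow, deliberate taps with frequent back-navigation")
print(persona_system_prompt(cautious))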

If this is right

  • GUI testing coverage can be scaled with controlled behavioral diversity without recruiting new human testers for every session.
  • Test runs become fully reproducible while still spanning multiple distinct user profiles.
  • Higher rates of crash and functional-bug detection become achievable in automated pipelines.
  • Testing across varied devices and environments can be simulated by varying only the persona parameters rather than the underlying agent code.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same persona framework could be applied to other interactive testing domains such as web or desktop applications.
  • Periodic recalibration of persona parameters against fresh crowdworker data might maintain fidelity as LLMs evolve.
  • Combining persona-driven agents with lightweight human oversight could further reduce the cost of crowdsourced campaigns while preserving coverage quality.

Load-bearing premise

The three chosen persona dimensions, when injected into current LLMs, produce faithful and generalizable simulations of real human crowdworker behavior rather than LLM-specific artifacts.

What would settle it

Running the same persona agents on a new set of unseen mobile apps and comparing their generated test-event distributions and bug-trigger rates against fresh logs from actual crowdworkers on those same apps.
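One concrete form such a comparison could take is a divergence check over action-type distributions. A minimal sketch, assuming traces are logged as sequences of action-type labels; the vocabulary, smoothing constant, and sample traces below are hypothetical, not data from the paper.

```python
# Minimal sketch of one fidelity check: KL divergence between the action-type
# distributions of persona-agent traces and fresh crowdworker logs on the same
# apps. Labels and the smoothing constant are illustrative assumptions.
from collections import Counter
import math

def action_distribution(traces, vocab, eps=1e-6):
    """Relative frequency of each action type across all traces, smoothed."""
    counts = Counter(a for trace in traces for a in trace)
    total = sum(counts.values()) + eps * len(vocab)
    return {a: (counts[a] + eps) / total for a in vocab}

def kl_divergence(p, q):
    """D_KL(p || q) in nats; lower means the agent better matches humans."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p)

VOCAB = ["tap", "long_press", "scroll", "swipe", "text_input", "back"]
human_logs = [["tap", "scroll", "tap", "back"], ["tap", "text_input", "tap"]]
agent_logs = [["tap", "tap", "scroll", "back"], ["tap", "text_input", "swipe"]]

p_human = action_distribution(human_logs, VOCAB)
p_agent = action_distribution(agent_logs, VOCAB)
print(f"D_KL(agent || human) = {kl_divergence(p_agent, p_human):.4f}")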

Figures

Figures reproduced from arXiv: 2603.24160 by Chunrong Fang, Chunyang Chen, Shengcheng Yu, Yuchen Ling, Zhenyu Chen.

Figure 1. Motivating Example: Exploration Trace of Personified-LLM Agents with Different Personas
Figure 2. Overview of PersonaTester Workflow
Figure 3. Example of Prompt for LLM Personification
Figure 4. RQ1.1: Intra-Cluster Cohesion
Figure 5. RQ1.2: Inter-Cluster Separation
Figure 6. RQ2: Test Generation Effectiveness (persona-guided agents outperform the non-personified baseline by 33%–47% on average)
Figure 7. RQ3.1: Crash Bug Triggering Capability
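The RQ1.1 cohesion measurement behind Figure 4 encodes each test path as a vector and scores similarity as cosine similarity between path vectors. A minimal sketch of that idea, substituting a simple bag-of-actions encoding for the paper's multi-step semantic vectorization; the action vocabulary and paths are hypothetical.

```python
# Minimal sketch of intra-cluster cohesion: encode each test path as a vector,
# then average pairwise cosine similarity among one persona's paths. The
# bag-of-actions encoding is a stand-in, not the authors' semantic pipeline.
from itertools import combinations
import math

def encode(path, vocab):
    """Bag-of-actions vector over a fixed action vocabulary."""
    return [path.count(a) for a in vocab]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def intra_cluster_cohesion(paths, vocab):
    """Mean pairwise cosine similarity among one persona's test paths."""
    vecs = [encode(p, vocab) for p in paths]
    pairs = list(combinations(vecs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

VOCAB = ["tap", "scroll", "swipe", "text_input", "back"]
persona_paths = [["tap", "scroll", "tap", "back"],
                 ["tap", "scroll", "scroll", "back"],
                 ["tap", "tap", "scroll", "back"]]
print(f"cohesion = {intra_cluster_cohesion(persona_paths, VOCAB):.3f}")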
Original abstract

The rapid proliferation and increasing complexity of software demand robust quality assurance, with graphical user interface (GUI) testing playing a pivotal role. Crowdsourced testing has proven effective in this context by leveraging the diversity of human testers to achieve rich, scenario-based coverage across varied devices, user behaviors, and usage environments. In parallel, automated testing, particularly with the advent of large language models (LLMs), offers significant advantages in controllability, reproducibility, and efficiency, enabling scalable and systematic exploration. However, automated approaches often lack the behavioral diversity characteristic of human testers, limiting their capability to fully simulate real-world testing dynamics. To address this gap, we present PersonaTester, a novel personified-LLM-based framework designed to automate crowdsourced GUI testing. By injecting representative personas, defined along three orthogonal dimensions: testing mindset, exploration strategy, and interaction habit, into LLM-based agents, PersonaTester enables the simulation of diverse human-like testing behaviors in a controllable and repeatable manner. Experimental results demonstrate that PersonaTester faithfully reproduces the behavioral patterns of real crowdworkers, exhibiting strong intra-persona consistency and clear inter-persona variability (117.86%–126.23% improvement over the baseline). Moreover, persona-guided testing agents consistently generate more effective test events and trigger more crashes (100+) and functional bugs (11) than the baseline without persona, thus substantially advancing the realism and effectiveness of automated crowdsourced GUI testing.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PersonaTester, a framework that injects LLMs with personas defined along three orthogonal dimensions (testing mindset, exploration strategy, and interaction habit) to automate crowdsourced GUI testing. It claims this produces human-like behavioral diversity, with strong intra-persona consistency, inter-persona variability, 117.86%–126.23% improvement over a non-persona baseline, and higher effectiveness in generating test events that trigger 100+ crashes and 11 functional bugs.

Significance. If the central empirical claims are substantiated with direct evidence, the work could meaningfully advance automated GUI testing by offering a controllable way to simulate diverse human tester behaviors at scale, potentially reducing costs of crowdsourcing while preserving realism. The orthogonal persona design is a clear strength for reproducibility.

major comments (2)
  1. [Abstract and Experimental Results] The headline claim that PersonaTester 'faithfully reproduces the behavioral patterns of real crowdworkers' is supported only by indirect evidence (baseline improvements and higher crash/bug counts); no direct quantitative fidelity metrics (e.g., sequence edit distance, action-type KL divergence, or coverage overlap; the edit-distance variant is sketched after these comment lists) comparing generated traces to real crowdworker logs on the same apps are reported. This leaves the reproduction claim dependent on the untested assumption that performance gains equal human fidelity rather than LLM artifacts.
  2. [Experimental Results] The manuscript provides no details on experimental setup, including app selection criteria, number of runs, statistical significance tests, or how behavioral fidelity and consistency were measured against real human data. Without these, the reported 117–126% gains and crash/bug counts cannot be properly evaluated for robustness or generalizability.
minor comments (2)
  1. [Methodology] Clarify the exact prompting mechanism used to inject the three persona dimensions into the LLM agents, including any example prompts or templates, to aid reproducibility.
  2. [Experimental Setup] The baseline description should explicitly state whether it uses the same LLM backbone and prompting structure as PersonaTester (minus personas) to ensure a fair comparison.
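The sequence edit distance named in major comment 1 is simple to state precisely. A minimal sketch over action-type tokens; the traces below are hypothetical, and the paper reports no such metric.

```python
# Minimal sketch of the sequence edit distance the referee asks for: Levenshtein
# distance between an agent trace and a crowdworker trace over action tokens.
def edit_distance(a, b):
    """Levenshtein distance between two action sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

human = ["tap", "scroll", "text_input", "tap", "back"]
agent = ["tap", "scroll", "tap", "back"]
d = edit_distance(agent, human)
print(f"edit distance = {d}, normalized = {d / max(len(agent), len(human)):.2f}")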

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the empirical claims and experimental transparency. We address each point below and will revise the manuscript to incorporate additional details and clarifications where appropriate.

Point-by-point responses
  1. Referee: [Abstract and Experimental Results] The headline claim that PersonaTester 'faithfully reproduces the behavioral patterns of real crowdworkers' is supported only by indirect evidence (baseline improvements and higher crash/bug counts); no direct quantitative fidelity metrics (e.g., sequence edit distance, action-type KL divergence, or coverage overlap) comparing generated traces to real crowdworker logs on the same apps are reported. This leaves the reproduction claim dependent on the untested assumption that performance gains equal human fidelity rather than LLM artifacts.

    Authors: We acknowledge that the current manuscript relies on indirect evidence for the reproduction claim, specifically the quantified intra-persona consistency and inter-persona variability (via the 117.86%–126.23% gains) together with superior crash and bug detection rates. These metrics were chosen because they directly capture the expected properties of crowdworker behavior—consistent patterns within a tester style and diversity across styles—rather than assuming performance gains alone imply fidelity. We agree that direct metrics would provide stronger substantiation. In the revision we will add action-type KL divergence, sequence similarity measures, and coverage overlap analyses using the real crowdworker logs collected during our experiments, and we will tone down the abstract wording to 'reproduces key behavioral patterns' pending those results. revision: partial

  2. Referee: [Experimental Results] The manuscript provides no details on experimental setup, including app selection criteria, number of runs, statistical significance tests, or how behavioral fidelity and consistency were measured against real human data. Without these, the reported 117–126% gains and crash/bug counts cannot be properly evaluated for robustness or generalizability.

    Authors: We agree that the Experimental Results section lacks sufficient methodological detail. In the revised manuscript we will expand this section to specify: (i) the criteria and rationale for selecting the evaluated apps, (ii) the exact number of independent runs per condition, (iii) the statistical tests (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values) used to assess significance of the reported gains, and (iv) the precise operational definitions and formulas for measuring intra-persona consistency (variance within persona runs) and inter-persona variability (divergence across personas), including how these were validated against the real human traces. revision: yes
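For concreteness, a minimal sketch of the Wilcoxon signed-rank test the response proposes, assuming one paired effectiveness score per app per condition; the scores below are placeholders, not the paper's data.

```python
# Minimal sketch of the paired significance test promised in the rebuttal: a
# Wilcoxon signed-rank test on per-app scores for persona vs. baseline agents.
from scipy.stats import wilcoxon

# One score per app, paired across conditions (hypothetical values).
persona_scores  = [0.71, 0.64, 0.82, 0.58, 0.77, 0.69, 0.74, 0.61]
baseline_scores = [0.55, 0.60, 0.66, 0.52, 0.70, 0.57, 0.68, 0.59]

stat, p_value = wilcoxon(persona_scores, baseline_scores, alternative="greater")
print(f"W = {stat}, p = {p_value:.4f}")  # persona > baseline if p < 0.05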

Circularity Check

0 steps flagged

No significant circularity; empirical results stand independently

Full rationale

The paper presents PersonaTester as a framework that injects three defined persona dimensions into LLM agents and evaluates it via direct experimental comparison against a non-persona baseline. Reported gains (117–126% improvement, more crashes and bugs) are measured outcomes from testing runs on apps, not quantities derived from or forced by the persona definitions themselves. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would reduce the fidelity or effectiveness claims to tautologies. The central assumption about human-like behavior is tested (or claimed tested) through observable metrics rather than presupposed by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that hand-defined personas along three dimensions can be injected into LLMs to produce human-like behavior; no free parameters are explicitly fitted in the abstract, but the persona definitions themselves function as constructed inputs.

axioms (1)
  • domain assumption Personas defined along testing mindset, exploration strategy, and interaction habit dimensions can be faithfully represented and followed by LLMs.
    Invoked in the description of how personas are injected into agents.

pith-pipeline@v0.9.0 · 5560 in / 1217 out tokens · 24845 ms · 2026-05-15T00:48:39.085068+00:00 · methodology


Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 4 internal anchors

  [1] Domenico Amalfitano, Anna Rita Fasolino, Porfirio Tramontana, Salvatore De Carmine, and Atif M Memon. 2012. Using GUI ripping for automated testing of Android applications. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering. 258–261
  [2] Shaoheng Cao, Minxue Pan, Yuanhong Lan, and Xuandong Li. 2025. Intention-Based GUI Test Migration for Mobile Apps using Large Language Models. Proceedings of the ACM on Software Engineering 2, ISSTA (2025), 2296–2318
  [3] Hao Chen, Song Huang, Yuchan Liu, Run Luo, and Yifei Xie. 2021. An effective crowdsourced test report clustering model based on sentence embedding. In 2021 IEEE 21st International Conference on Software Quality, Reliability and Security (QRS). IEEE, 888–899
  [4] Mengzhuo Chen, Zhe Liu, Chunyang Chen, Junjie Wang, Boyu Wu, Jun Hu, and Qing Wang. 2025. Standing on the Shoulders of Giants: Bug-Aware Automated GUI Testing via Retrieval Augmentation. Proceedings of the ACM on Software Engineering 2, FSE (2025), 825–846
  [5] Qiang Cui, Junjie Wang, Guowei Yang, Miao Xie, Qing Wang, and Mingshu Li. 2017. Who should be selected to perform a task in crowdsourced testing? In 2017 IEEE 41st Annual Computer Software and Applications Conference (COMPSAC), Vol. 1. IEEE, 75–84
  [6] Chunrong Fang, Shengcheng Yu, Quanjun Zhang, Xin Li, Yulei Liu, and Zhenyu Chen. 2024. Enhanced Crowdsourced Test Report Prioritization via Image-and-Text Semantic Understanding and Feature Integration. IEEE Transactions on Software Engineering (2024)
  [7] Sidong Feng and Chunyang Chen. 2024. Prompting is all you need: Automated android bug replay with large language models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–13
  [8] Ruizhi Gao, Yabin Wang, Yang Feng, Zhenyu Chen, and W Eric Wong. 2019. Successes, challenges, and rethinking – an industrial investigation on crowdsourced mobile application testing. Empirical Software Engineering 24, 2 (2019), 537–561
  [9] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. 2013. Hybrid speech recognition with deep bidirectional LSTM. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, 273–278
  [10] Jonathan Grudin and John Pruitt. 2002. Personas, participatory design and product development: An infrastructure for engagement. In PDC. 144–152
  [11] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)
  [12] Rui Hao, Yang Feng, James A Jones, Yuying Li, and Zhenyu Chen. 2019. CTRAS: Crowdsourced test report aggregation and summarization. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 900–911
  [13] Tiancheng Hu and Nigel Collier. 2024. Quantifying the Persona Effect in LLM Simulations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 10289–10307
  [14] Yuchao Huang, Junjie Wang, Zhe Liu, Mingyang Li, Song Wang, Chunyang Chen, Yuanzhe Hu, and Qing Wang. 2025. One Sentence Can Kill the Bug: Auto-replay Mobile App Crashes from One-sentence Overviews. IEEE Transactions on Software Engineering (2025)
  [15] Yuekai Huang, Junjie Wang, Song Wang, Zhe Liu, Yuanzhe Hu, and Qing Wang. 2020. Quest for the golden approach: An experimental evaluation of duplicate crowdtesting reports detection. In Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 1–12
  [16] Ankur Joshi, Saket Kale, Satish Chandel, and D Kumar Pal. 2015. Likert scale: Explored and explained. British Journal of Applied Science & Technology 7, 4 (2015), 396
  [17] Taemin Kim and Geunseok Yang. 2022. Predicting duplicate in bug report using topic-based duplicate learning with fine tuning-based bert algorithm. IEEE Access 10 (2022), 129666–129675
  [18] Yuanchun Li, Ziyue Yang, Yao Guo, and Xiangqun Chen. 2017. Droidbot: a lightweight ui-guided test input generator for android. In 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C). IEEE, 23–26
  [19] Yuanchun Li, Ziyue Yang, Yao Guo, and Xiangqun Chen. 2019. Humanoid: A deep learning-based approach to automated black-box android app testing. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 1070–1073
  [20] Yuchen Ling, Shengcheng Yu, Chunrong Fang, Guobin Pan, Jun Wang, and Jia Liu. 2025. Redefining crowdsourced test report prioritization: An innovative approach with large language model. Information and Software Technology 179 (2025), 107629
  [21] Di Liu, Yang Feng, Xiaofang Zhang, James A Jones, and Zhenyu Chen. 2020. Clustering crowdsourced test reports of mobile applications using image understanding. IEEE Transactions on Software Engineering 48, 4 (2020), 1290–1308
  [22] Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. 2024. Make llm a testing expert: Bringing human-like interaction to mobile gui testing via functionality-aware decisions. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13
  [23] Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Yuekai Huang, Jun Hu, and Qing Wang. 2024. Unblind text inputs: predicting hint-text of text input in mobile apps via LLM. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–20
  [24] Jing Luo, Run Luo, Longze Chen, Liang Zhu, Chang Ao, Jiaming Li, Yukun Chen, Xin Cheng, Wen Yang, Jiayuan Su, et al. 2024. PersonaMath: Enhancing Math Reasoning through Persona-Driven Data Augmentation. arXiv preprint arXiv:2410.01504 (2024)
  [25] Mostafa Mohammed, Haipeng Cai, and Na Meng. 2019. An empirical comparison between monkey testing and human testing (wip paper). In Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems. 188–192
  [26] Changhai Nie and Hareton Leung. 2011. A survey of combinatorial testing. ACM Computing Surveys (CSUR) 43, 2 (2011), 1–29
  [27] Minxue Pan, An Huang, Guoxin Wang, Tian Zhang, and Xuandong Li. 2020. Reinforcement learning based curiosity-driven testing of android applications. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. 153–164
  [28] Joon Sung Park, Joseph C O'Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, Michael S Bernstein, et al. 2023. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442 (2023)
  [29] Chen Qian and Xin Cong. 2023. Communicative agents for software development. arXiv preprint arXiv:2307.07924 (2023)
  [30] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Hong Kong, China, 3982–3992. doi:10.18653/v1/D19-1410
  [31] Iflaah Salman, Ayse Tosun Misirli, and Natalia Juristo. 2015. Are students representatives of professionals in software engineering experiments? In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. IEEE, 666–676
  [32–33] Ting Su, Guozhu Meng, Yuting Chen, Ke Wu, Weiming Yang, Yao Yao, Geguang Pu, Yang Liu, and Zhendong Su. 2017. Guided, stochastic model-based GUI testing of Android apps. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. 245–256
  [34] Yanqi Su, Zhenchang Xing, Chong Wang, Chunyang Chen, Sherry Xu, Qinghua Lu, and Liming Zhu. 2025. Automated Soap Opera Testing Directed by LLMs and Scenario Knowledge: Feasibility, Challenges, and Road Ahead. Proceedings of the ACM on Software Engineering 2, FSE (2025), 757–778
  [35] Lipeipei Sun, Tianzi Qin, Anran Hu, Jiale Zhang, Shuojia Lin, Jianyan Chen, Mona Ali, and Mirjana Prpa. 2025. Persona-L has Entered the Chat: Leveraging LLMs and Ability-based Framework for Personas of People with Complex Needs. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–31
  [36] Yao Tong and Xiaofang Zhang. 2021. Crowdsourced test report prioritization considering bug severity. Information and Software Technology 139 (2021), 106668
  [37–38] Yu-Min Tseng, Yu-Chao Huang, Teng-Yun Hsiao, Wei-Lin Chen, Chao-Wei Huang, Yu Meng, and Yun-Nung Chen. 2024. Two tales of persona in llms: A survey of role-playing and personalization. arXiv preprint arXiv:2406.01171 (2024)
  [39] Chenxu Wang, Tianming Liu, Yanjie Zhao, Minghui Yang, and Haoyu Wang. 2025. LLMDroid: Enhancing Automated Mobile App GUI Testing Coverage with Large Language Model Guidance. Proceedings of the ACM on Software Engineering 2, FSE (2025), 1001–1022
  [40] Dingbang Wang, Yu Zhao, Sidong Feng, Zhaoxu Zhang, William GJ Halfond, Chunyang Chen, Xiaoxia Sun, Jiangfan Shi, and Tingting Yu. 2024. Feedback-driven automated whole bug report reproduction for android apps. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1048–1060
  [41] Jue Wang, Yanyan Jiang, Chang Xu, Chun Cao, Xiaoxing Ma, and Jian Lu. 2020. Combodroid: generating high-quality test inputs for android apps via use case combinations. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 469–480
  [42] Junjie Wang, Mingyang Li, Song Wang, Tim Menzies, and Qing Wang. 2019. Images don't lie: Duplicate crowdtesting reports detection with screenshot information. Information and Software Technology 110 (2019), 139–155
  [43] Junjie Wang, Song Wang, Jianfeng Chen, Tim Menzies, Qiang Cui, Miao Xie, and Qing Wang. 2019. Characterizing crowds to better optimize worker recommendation in crowdsourced testing. IEEE Transactions on Software Engineering 47, 6 (2019), 1259–1276
  [44] Junjie Wang, Ye Yang, Song Wang, Jun Hu, and Qing Wang. 2022. Context- and fairness-aware in-process crowdworker recommendation. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 3 (2022), 1–31
  [45] Junjie Wang, Ye Yang, Song Wang, Yuanzhe Hu, Dandan Wang, and Qing Wang. 2020. Context-aware in-process crowdworker recommendation. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering
  [46] Qing Wang, Zhenyu Chen, Junjie Wang, and Yang Feng. 2022. Intelligent Crowdsourced Testing. Springer
  [47] Xiaoxue Wu, Wenjing Shan, Wei Zheng, Zhiguo Chen, Tao Ren, and Xiaobing Sun. 2023. An intelligent duplicate bug report detection method based on technical term extraction. In 2023 IEEE/ACM International Conference on Automation of Software Test (AST). IEEE, 1–12
  [48] Miao Xie, Qing Wang, Guowei Yang, and Mingshu Li. 2017. Cocoon: Crowdsourced testing quality maximization under context coverage constraint. In 2017 IEEE 28th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 316–327
  [49] Yiheng Xiong, Ting Su, Jue Wang, Jingling Sun, Geguang Pu, and Zhendong Su. 2024. General and practical property-based testing for android apps. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 53–64
  [50] Yuxuan Yang and Xin Chen. 2021. Crowdsourced test report prioritization based on text classification. IEEE Access 10 (2021), 92692–92705
  [51] Juyeon Yoon, Robert Feldt, and Shin Yoo. 2024. Intent-driven mobile gui testing with autonomous large language model agents. In 2024 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 129–139
  [52] Shengcheng Yu. 2019. Crowdsourced report generation via bug screenshot understanding. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering. IEEE, 1277–1279
  [53] Shengcheng Yu, Chunrong Fang, Zhenfei Cao, Xu Wang, Tongyu Li, and Zhenyu Chen. 2021. Prioritize crowdsourced test reports via deep screenshot understanding. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 946–956
  [54] Shengcheng Yu, Chunrong Fang, Mingzhe Du, Zimin Ding, Zhenyu Chen, and Zhendong Su. 2024. Practical, Automated Scenario-based Mobile App Testing. IEEE Transactions on Software Engineering (2024)
  [55] Shengcheng Yu, Chunrong Fang, Xin Li, Yuchen Ling, Zhenyu Chen, and Zhendong Su. 2024. Effective, Platform-Independent GUI Testing via Image Embedding and Reinforcement Learning. ACM Transactions on Software Engineering and Methodology 33, 7 (2024), 1–27
  [56] Shengcheng Yu, Chunrong Fang, Yuchen Ling, Chentian Wu, and Zhenyu Chen. 2023. Llm for test script generation and migration: Challenges, capabilities, and opportunities. In 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security. IEEE, 206–217
  [57–58] Shengcheng Yu, Chunrong Fang, Ziyuan Tuo, Quanjun Zhang, Chunyang Chen, Zhenyu Chen, and Zhendong Su. 2023. Vision-based mobile app gui testing: A survey. arXiv preprint arXiv:2310.13518 (2023)
  [59–60] Shengcheng Yu, Chunrong Fang, Quanjun Zhang, Zhihao Cao, Yexiao Yun, Zhenfei Cao, Kai Mei, and Zhenyu Chen. 2023. Mobile app crowdsourced test report consistency detection via deep image-and-text fusion understanding. IEEE Transactions on Software Engineering 49, 8 (2023), 4115–4134
  [61] Shengcheng Yu, Chunrong Fang, Quanjun Zhang, Mingzhe Du, Jia Liu, and Zhenyu Chen. 2024. Semi-supervised Crowdsourced Test Report Clustering via Screenshot-Text Binding Rules. Proceedings of the ACM on Software Engineering 1, FSE (2024), 1540–1563
  [62] Shengcheng Yu, Yuchen Ling, Chunrong Fang, Quan Zhou, Chunyang Chen, Shaomin Zhu, and Zhenyu Chen. 2025. LLM-Guided Scenario-based GUI Testing. arXiv preprint arXiv:2506.05079 (2025)
  [63] Yakun Zhang, Chen Liu, Xiaofei Xie, Yun Lin, Jin Song Dong, Dan Hao, and Lu Zhang. 2024. LLM-based Abstraction and Concretization for GUI Test Migration. arXiv preprint arXiv:2409.05028 (2024)
  [64] Yakun Zhang, Qihao Zhu, Jiwei Yan, Chen Liu, Wenjie Zhang, Yifan Zhao, Dan Hao, and Lu Zhang. 2024. Synthesis-Based Enhancement for GUI Test Case Migration. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 869–881