pith. machine review for the scientific record.

arxiv: 2603.24160 · v2 · submitted 2026-03-25 · 💻 cs.SE

Recognition: no theorem link

Towards Automated Crowdsourced Testing via Personified-LLM

Pith reviewed 2026-05-15 00:48 UTC · model grok-4.3

classification 💻 cs.SE
keywords PersonaTester · crowdsourced testing · GUI testing · LLM agents · personified LLMs · automated software testing · behavioral simulation

The pith

PersonaTester injects three orthogonal dimensions of human tester personas into LLMs to simulate diverse crowdworker behaviors for automated GUI testing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Crowdsourced GUI testing draws value from the varied behaviors of real human testers across devices and scenarios, yet scaling it remains costly and inconsistent. Purely automated LLM-based testing offers control and speed but typically produces uniform outputs that miss that diversity. PersonaTester defines personas along testing mindset, exploration strategy, and interaction habit, then embeds them into LLM agents so each agent generates distinct, repeatable testing traces. Experiments show these agents reproduce real crowdworker patterns with high intra-persona consistency and inter-persona differences, while producing more effective test events and uncovering additional crashes and functional bugs compared with non-persona baselines.

Core claim

PersonaTester is a framework that automates crowdsourced GUI testing by injecting representative personas—defined along the three orthogonal dimensions of testing mindset, exploration strategy, and interaction habit—into LLM-based agents. This injection produces controllable, repeatable simulations of human-like testing behaviors. The resulting agents exhibit strong behavioral fidelity to real crowdworkers and generate more effective test events, triggering 100+ crashes and 11 functional bugs, more than non-persona baselines achieve.

What carries the argument

Persona injection into LLM agents along three orthogonal dimensions (testing mindset, exploration strategy, interaction habit) that drive distinct action sequences and coverage patterns.
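What such injection could look like mechanically is easy to sketch. A minimal sketch, assuming a plain prompt-template mechanism: the three dimension names come from the paper, but the Persona dataclass, the template wording, and the example values here are hypothetical, not the authors' actual prompts (Figure 3 shows theirs).

```python
# Minimal sketch of persona injection: a persona is a point in three orthogonal
# dimensions, rendered into the agent's system prompt. Field names mirror the
# paper's dimensions; the values and template are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Persona:
    mindset: str   # testing mindset, e.g. "destructive edge-case hunter"
    strategy: str  # exploration strategy, e.g. "depth-first into nested menus"
    habit: str     # interaction habit, e.g. "rapid taps, rarely scrolls"

def persona_system_prompt(p: Persona) -> str:
    """Render a persona into a system prompt for a GUI-testing LLM agent."""
    return (
        "You are a crowdsourced GUI tester exploring a mobile app.\n"
        f"Testing mindset: {p.mindset}.\n"
        f"Exploration strategy: {p.strategy}.\n"
        f"Interaction habit: {p.habit}.\n"
        "At each step, given the current screen's widgets, choose one action "
        "consistent with this persona and explain it briefly."
    )

# Varying only the persona yields distinct agents on the same backbone:
cautious = Persona("verify documented flows",
                   "breadth-first over top-level tabs",
                   "slow, deliberate taps with frequent back-navigation")
print(persona_system_prompt(cautious))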

If this is right

  • GUI testing coverage can be scaled with controlled behavioral diversity without recruiting new human testers for every session.
  • Test runs become fully reproducible while still spanning multiple distinct user profiles.
  • Higher rates of crash and functional-bug detection become achievable in automated pipelines.
  • Testing across varied devices and environments can be simulated by varying only the persona parameters rather than the underlying agent code.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same persona framework could be applied to other interactive testing domains such as web or desktop applications.
  • Periodic recalibration of persona parameters against fresh crowdworker data might maintain fidelity as LLMs evolve.
  • Combining persona-driven agents with lightweight human oversight could further reduce the cost of crowdsourced campaigns while preserving coverage quality.

Load-bearing premise

The three chosen persona dimensions, when injected into current LLMs, produce faithful and generalizable simulations of real human crowdworker behavior rather than LLM-specific artifacts.

What would settle it

Running the same persona agents on a new set of unseen mobile apps and comparing their generated test-event distributions and bug-trigger rates against fresh logs from actual crowdworkers on those same apps.
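One concrete form such a comparison could take is a divergence check over action-type distributions. A minimal sketch, assuming traces are logged as sequences of action-type labels; the vocabulary, smoothing constant, and sample traces below are hypothetical, not data from the paper.

```python
# Minimal sketch of one fidelity check: KL divergence between the action-type
# distributions of persona-agent traces and fresh crowdworker logs on the same
# apps. Labels and the smoothing constant are illustrative assumptions.
from collections import Counter
import math

def action_distribution(traces, vocab, eps=1e-6):
    """Relative frequency of each action type across all traces, smoothed."""
    counts = Counter(a for trace in traces for a in trace)
    total = sum(counts.values()) + eps * len(vocab)
    return {a: (counts[a] + eps) / total for a in vocab}

def kl_divergence(p, q):
    """D_KL(p || q) in nats; lower means the agent better matches humans."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p)

VOCAB = ["tap", "long_press", "scroll", "swipe", "text_input", "back"]
human_logs = [["tap", "scroll", "tap", "back"], ["tap", "text_input", "tap"]]
agent_logs = [["tap", "tap", "scroll", "back"], ["tap", "text_input", "swipe"]]

p_human = action_distribution(human_logs, VOCAB)
p_agent = action_distribution(agent_logs, VOCAB)
print(f"D_KL(agent || human) = {kl_divergence(p_agent, p_human):.4f}")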

Figures

Figures reproduced from arXiv: 2603.24160 by Chunrong Fang, Chunyang Chen, Shengcheng Yu, Yuchen Ling, Zhenyu Chen.

Figure 1. Motivating Example: Exploration Trace of Personified-LLM Agents with Different Personas
Figure 2. Overview of PersonaTester Workflow
Figure 3. Example of Prompt for LLM Personification
Figure 4. RQ1.1: Intra-Cluster Cohesion
Figure 5. RQ1.2: Inter-Cluster Separation
Figure 6. RQ2: Test Generation Effectiveness (persona-guided agents outperform the non-personified baseline by 33%–47% on average)
Figure 7. RQ3.1: Crash Bug Triggering Capability
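The RQ1.1 cohesion measurement behind Figure 4 encodes each test path as a vector and scores similarity as cosine similarity between path vectors. A minimal sketch of that idea, substituting a simple bag-of-actions encoding for the paper's multi-step semantic vectorization; the action vocabulary and paths are hypothetical.

```python
# Minimal sketch of intra-cluster cohesion: encode each test path as a vector,
# then average pairwise cosine similarity among one persona's paths. The
# bag-of-actions encoding is a stand-in, not the authors' semantic pipeline.
from itertools import combinations
import math

def encode(path, vocab):
    """Bag-of-actions vector over a fixed action vocabulary."""
    return [path.count(a) for a in vocab]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def intra_cluster_cohesion(paths, vocab):
    """Mean pairwise cosine similarity among one persona's test paths."""
    vecs = [encode(p, vocab) for p in paths]
    pairs = list(combinations(vecs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

VOCAB = ["tap", "scroll", "swipe", "text_input", "back"]
persona_paths = [["tap", "scroll", "tap", "back"],
                 ["tap", "scroll", "scroll", "back"],
                 ["tap", "tap", "scroll", "back"]]
print(f"cohesion = {intra_cluster_cohesion(persona_paths, VOCAB):.3f}")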
Original abstract

The rapid proliferation and increasing complexity of software demand robust quality assurance, with graphical user interface (GUI) testing playing a pivotal role. Crowdsourced testing has proven effective in this context by leveraging the diversity of human testers to achieve rich, scenario-based coverage across varied devices, user behaviors, and usage environments. In parallel, automated testing, particularly with the advent of large language models (LLMs), offers significant advantages in controllability, reproducibility, and efficiency, enabling scalable and systematic exploration. However, automated approaches often lack the behavioral diversity characteristic of human testers, limiting their capability to fully simulate real-world testing dynamics. To address this gap, we present PersonaTester, a novel personified-LLM-based framework designed to automate crowdsourced GUI testing. By injecting representative personas, defined along three orthogonal dimensions: testing mindset, exploration strategy, and interaction habit, into LLM-based agents, PersonaTester enables the simulation of diverse human-like testing behaviors in a controllable and repeatable manner. Experimental results demonstrate that PersonaTester faithfully reproduces the behavioral patterns of real crowdworkers, exhibiting strong intra-persona consistency and clear inter-persona variability (117.86%–126.23% improvement over the baseline). Moreover, persona-guided testing agents consistently generate more effective test events and trigger more crashes (100+) and functional bugs (11) than the baseline without persona, thus substantially advancing the realism and effectiveness of automated crowdsourced GUI testing.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PersonaTester, a framework that injects LLMs with personas defined along three orthogonal dimensions (testing mindset, exploration strategy, and interaction habit) to automate crowdsourced GUI testing. It claims this produces human-like behavioral diversity, with strong intra-persona consistency, inter-persona variability, 117.86%–126.23% improvement over a non-persona baseline, and higher effectiveness in generating test events that trigger 100+ crashes and 11 functional bugs.

Significance. If the central empirical claims are substantiated with direct evidence, the work could meaningfully advance automated GUI testing by offering a controllable way to simulate diverse human tester behaviors at scale, potentially reducing costs of crowdsourcing while preserving realism. The orthogonal persona design is a clear strength for reproducibility.

major comments (2)
  1. [Abstract and Experimental Results] The headline claim that PersonaTester 'faithfully reproduces the behavioral patterns of real crowdworkers' is supported only by indirect evidence (baseline improvements and higher crash/bug counts); no direct quantitative fidelity metrics (e.g., sequence edit distance, action-type KL divergence, or coverage overlap; the edit-distance variant is sketched after these comment lists) comparing generated traces to real crowdworker logs on the same apps are reported. This leaves the reproduction claim dependent on the untested assumption that performance gains equal human fidelity rather than LLM artifacts.
  2. [Experimental Results] The manuscript provides no details on experimental setup, including app selection criteria, number of runs, statistical significance tests, or how behavioral fidelity and consistency were measured against real human data. Without these, the reported 117–126% gains and crash/bug counts cannot be properly evaluated for robustness or generalizability.
minor comments (2)
  1. [Methodology] Clarify the exact prompting mechanism used to inject the three persona dimensions into the LLM agents, including any example prompts or templates, to aid reproducibility.
  2. [Experimental Setup] The baseline description should explicitly state whether it uses the same LLM backbone and prompting structure as PersonaTester (minus personas) to ensure a fair comparison.
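The sequence edit distance named in major comment 1 is simple to state precisely. A minimal sketch over action-type tokens; the traces below are hypothetical, and the paper reports no such metric.

```python
# Minimal sketch of the sequence edit distance the referee asks for: Levenshtein
# distance between an agent trace and a crowdworker trace over action tokens.
def edit_distance(a, b):
    """Levenshtein distance between two action sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

human = ["tap", "scroll", "text_input", "tap", "back"]
agent = ["tap", "scroll", "tap", "back"]
d = edit_distance(agent, human)
print(f"edit distance = {d}, normalized = {d / max(len(agent), len(human)):.2f}")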

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the empirical claims and experimental transparency. We address each point below and will revise the manuscript to incorporate additional details and clarifications where appropriate.

Point-by-point responses
  1. Referee: [Abstract and Experimental Results] The headline claim that PersonaTester 'faithfully reproduces the behavioral patterns of real crowdworkers' is supported only by indirect evidence (baseline improvements and higher crash/bug counts); no direct quantitative fidelity metrics (e.g., sequence edit distance, action-type KL divergence, or coverage overlap) comparing generated traces to real crowdworker logs on the same apps are reported. This leaves the reproduction claim dependent on the untested assumption that performance gains equal human fidelity rather than LLM artifacts.

    Authors: We acknowledge that the current manuscript relies on indirect evidence for the reproduction claim, specifically the quantified intra-persona consistency and inter-persona variability (via the 117.86%–126.23% gains) together with superior crash and bug detection rates. These metrics were chosen because they directly capture the expected properties of crowdworker behavior—consistent patterns within a tester style and diversity across styles—rather than assuming performance gains alone imply fidelity. We agree that direct metrics would provide stronger substantiation. In the revision we will add action-type KL divergence, sequence similarity measures, and coverage overlap analyses using the real crowdworker logs collected during our experiments, and we will tone down the abstract wording to 'reproduces key behavioral patterns' pending those results. revision: partial

  2. Referee: [Experimental Results] The manuscript provides no details on experimental setup, including app selection criteria, number of runs, statistical significance tests, or how behavioral fidelity and consistency were measured against real human data. Without these, the reported 117–126% gains and crash/bug counts cannot be properly evaluated for robustness or generalizability.

    Authors: We agree that the Experimental Results section lacks sufficient methodological detail. In the revised manuscript we will expand this section to specify: (i) the criteria and rationale for selecting the evaluated apps, (ii) the exact number of independent runs per condition, (iii) the statistical tests (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values) used to assess significance of the reported gains, and (iv) the precise operational definitions and formulas for measuring intra-persona consistency (variance within persona runs) and inter-persona variability (divergence across personas), including how these were validated against the real human traces. revision: yes
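For concreteness, a minimal sketch of the Wilcoxon signed-rank test the response proposes, assuming one paired effectiveness score per app per condition; the scores below are placeholders, not the paper's data.

```python
# Minimal sketch of the paired significance test promised in the rebuttal: a
# Wilcoxon signed-rank test on per-app scores for persona vs. baseline agents.
from scipy.stats import wilcoxon

# One score per app, paired across conditions (hypothetical values).
persona_scores  = [0.71, 0.64, 0.82, 0.58, 0.77, 0.69, 0.74, 0.61]
baseline_scores = [0.55, 0.60, 0.66, 0.52, 0.70, 0.57, 0.68, 0.59]

stat, p_value = wilcoxon(persona_scores, baseline_scores, alternative="greater")
print(f"W = {stat}, p = {p_value:.4f}")  # persona > baseline if p < 0.05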

Circularity Check

0 steps flagged

No significant circularity; empirical results stand independently

Full rationale

The paper presents PersonaTester as a framework that injects three defined persona dimensions into LLM agents and evaluates it via direct experimental comparison against a non-persona baseline. Reported gains (117–126% improvement, more crashes and bugs) are measured outcomes from testing runs on apps, not quantities derived from or forced by the persona definitions themselves. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would reduce the fidelity or effectiveness claims to tautologies. The central assumption about human-like behavior is tested (or claimed tested) through observable metrics rather than presupposed by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that hand-defined personas along three dimensions can be injected into LLMs to produce human-like behavior; no free parameters are explicitly fitted in the abstract, but the persona definitions themselves function as constructed inputs.

axioms (1)
  • domain assumption Personas defined along testing mindset, exploration strategy, and interaction habit dimensions can be faithfully represented and followed by LLMs.
    Invoked in the description of how personas are injected into agents.

pith-pipeline@v0.9.0 · 5560 in / 1217 out tokens · 24845 ms · 2026-05-15T00:48:39.085068+00:00 · methodology


Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 4 internal anchors

  [1] Domenico Amalfitano, Anna Rita Fasolino, Porfirio Tramontana, Salvatore De Carmine, and Atif M Memon. 2012. Using GUI ripping for automated testing of Android applications. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering. 258–261
  [2] Shaoheng Cao, Minxue Pan, Yuanhong Lan, and Xuandong Li. 2025. Intention-Based GUI Test Migration for Mobile Apps using Large Language Models. Proceedings of the ACM on Software Engineering 2, ISSTA (2025), 2296–2318
  [3] Hao Chen, Song Huang, Yuchan Liu, Run Luo, and Yifei Xie. 2021. An effective crowdsourced test report clustering model based on sentence embedding. In 2021 IEEE 21st International Conference on Software Quality, Reliability and Security (QRS). IEEE, 888–899
  [4] Mengzhuo Chen, Zhe Liu, Chunyang Chen, Junjie Wang, Boyu Wu, Jun Hu, and Qing Wang. 2025. Standing on the Shoulders of Giants: Bug-Aware Automated GUI Testing via Retrieval Augmentation. Proceedings of the ACM on Software Engineering 2, FSE (2025), 825–846
  [5] Qiang Cui, Junjie Wang, Guowei Yang, Miao Xie, Qing Wang, and Mingshu Li. 2017. Who should be selected to perform a task in crowdsourced testing? In 2017 IEEE 41st Annual Computer Software and Applications Conference (COMPSAC), Vol. 1. IEEE, 75–84
  [6] Chunrong Fang, Shengcheng Yu, Quanjun Zhang, Xin Li, Yulei Liu, and Zhenyu Chen. 2024. Enhanced Crowdsourced Test Report Prioritization via Image-and-Text Semantic Understanding and Feature Integration. IEEE Transactions on Software Engineering (2024)
  [7] Sidong Feng and Chunyang Chen. 2024. Prompting is all you need: Automated android bug replay with large language models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–13
  [8] Ruizhi Gao, Yabin Wang, Yang Feng, Zhenyu Chen, and W Eric Wong. 2019. Successes, challenges, and rethinking – an industrial investigation on crowdsourced mobile application testing. Empirical Software Engineering 24, 2 (2019), 537–561
  [9] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. 2013. Hybrid speech recognition with deep bidirectional LSTM. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, 273–278
  [10] Jonathan Grudin and John Pruitt. 2002. Personas, participatory design and product development: An infrastructure for engagement. In PDC. 144–152
  [11] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)
  [12] Rui Hao, Yang Feng, James A Jones, Yuying Li, and Zhenyu Chen. 2019. CTRAS: Crowdsourced test report aggregation and summarization. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 900–911
  [13] Tiancheng Hu and Nigel Collier. 2024. Quantifying the Persona Effect in LLM Simulations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 10289–10307
  [14] Yuchao Huang, Junjie Wang, Zhe Liu, Mingyang Li, Song Wang, Chunyang Chen, Yuanzhe Hu, and Qing Wang. 2025. One Sentence Can Kill the Bug: Auto-replay Mobile App Crashes from One-sentence Overviews. IEEE Transactions on Software Engineering (2025)
  [15] Yuekai Huang, Junjie Wang, Song Wang, Zhe Liu, Yuanzhe Hu, and Qing Wang. 2020. Quest for the golden approach: An experimental evaluation of duplicate crowdtesting reports detection. In Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 1–12
  [16] Ankur Joshi, Saket Kale, Satish Chandel, and D Kumar Pal. 2015. Likert scale: Explored and explained. British Journal of Applied Science & Technology 7, 4 (2015), 396
  [17] Taemin Kim and Geunseok Yang. 2022. Predicting duplicate in bug report using topic-based duplicate learning with fine tuning-based bert algorithm. IEEE Access 10 (2022), 129666–129675
  [18] Yuanchun Li, Ziyue Yang, Yao Guo, and Xiangqun Chen. 2017. Droidbot: a lightweight ui-guided test input generator for android. In 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C). IEEE, 23–26
  [19] Yuanchun Li, Ziyue Yang, Yao Guo, and Xiangqun Chen. 2019. Humanoid: A deep learning-based approach to automated black-box android app testing. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 1070–1073
  [20] Yuchen Ling, Shengcheng Yu, Chunrong Fang, Guobin Pan, Jun Wang, and Jia Liu. 2025. Redefining crowdsourced test report prioritization: An innovative approach with large language model. Information and Software Technology 179 (2025), 107629
  [21] Di Liu, Yang Feng, Xiaofang Zhang, James A Jones, and Zhenyu Chen. 2020. Clustering crowdsourced test reports of mobile applications using image understanding. IEEE Transactions on Software Engineering 48, 4 (2020), 1290–1308
  [22] Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. 2024. Make llm a testing expert: Bringing human-like interaction to mobile gui testing via functionality-aware decisions. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13
  [23] Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Yuekai Huang, Jun Hu, and Qing Wang. 2024. Unblind text inputs: predicting hint-text of text input in mobile apps via LLM. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–20
  [24] Jing Luo, Run Luo, Longze Chen, Liang Zhu, Chang Ao, Jiaming Li, Yukun Chen, Xin Cheng, Wen Yang, Jiayuan Su, et al. 2024. PersonaMath: Enhancing Math Reasoning through Persona-Driven Data Augmentation. arXiv preprint arXiv:2410.01504 (2024)
  [25] Mostafa Mohammed, Haipeng Cai, and Na Meng. 2019. An empirical comparison between monkey testing and human testing (wip paper). In Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems. 188–192
  [26] Changhai Nie and Hareton Leung. 2011. A survey of combinatorial testing. ACM Computing Surveys (CSUR) 43, 2 (2011), 1–29
  [27] Minxue Pan, An Huang, Guoxin Wang, Tian Zhang, and Xuandong Li. 2020. Reinforcement learning based curiosity-driven testing of android applications. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. 153–164
  [28] Joon Sung Park, Joseph C O'Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, Michael S Bernstein, et al. 2023. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442 (2023)
  [29] Chen Qian and Xin Cong. 2023. Communicative agents for software development. arXiv preprint arXiv:2307.07924 (2023)
  [30] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Hong Kong, China, 3982–3992. doi:10.18653/v1/D19-1410
  [31] Iflaah Salman, Ayse Tosun Misirli, and Natalia Juristo. 2015. Are students representatives of professionals in software engineering experiments? In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. IEEE, 666–676
  [32–33] Ting Su, Guozhu Meng, Yuting Chen, Ke Wu, Weiming Yang, Yao Yao, Geguang Pu, Yang Liu, and Zhendong Su. 2017. Guided, stochastic model-based GUI testing of Android apps. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. 245–256
  [34] Yanqi Su, Zhenchang Xing, Chong Wang, Chunyang Chen, Sherry Xu, Qinghua Lu, and Liming Zhu. 2025. Automated Soap Opera Testing Directed by LLMs and Scenario Knowledge: Feasibility, Challenges, and Road Ahead. Proceedings of the ACM on Software Engineering 2, FSE (2025), 757–778
  [35] Lipeipei Sun, Tianzi Qin, Anran Hu, Jiale Zhang, Shuojia Lin, Jianyan Chen, Mona Ali, and Mirjana Prpa. 2025. Persona-L has Entered the Chat: Leveraging LLMs and Ability-based Framework for Personas of People with Complex Needs. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–31
  [36] Yao Tong and Xiaofang Zhang. 2021. Crowdsourced test report prioritization considering bug severity. Information and Software Technology 139 (2021), 106668
  [37–38] Yu-Min Tseng, Yu-Chao Huang, Teng-Yun Hsiao, Wei-Lin Chen, Chao-Wei Huang, Yu Meng, and Yun-Nung Chen. 2024. Two tales of persona in llms: A survey of role-playing and personalization. arXiv preprint arXiv:2406.01171 (2024)
  [39] Chenxu Wang, Tianming Liu, Yanjie Zhao, Minghui Yang, and Haoyu Wang. 2025. LLMDroid: Enhancing Automated Mobile App GUI Testing Coverage with Large Language Model Guidance. Proceedings of the ACM on Software Engineering 2, FSE (2025), 1001–1022
  [40] Dingbang Wang, Yu Zhao, Sidong Feng, Zhaoxu Zhang, William GJ Halfond, Chunyang Chen, Xiaoxia Sun, Jiangfan Shi, and Tingting Yu. 2024. Feedback-driven automated whole bug report reproduction for android apps. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1048–1060
  [41] Jue Wang, Yanyan Jiang, Chang Xu, Chun Cao, Xiaoxing Ma, and Jian Lu. 2020. Combodroid: generating high-quality test inputs for android apps via use case combinations. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 469–480
  [42] Junjie Wang, Mingyang Li, Song Wang, Tim Menzies, and Qing Wang. 2019. Images don't lie: Duplicate crowdtesting reports detection with screenshot information. Information and Software Technology 110 (2019), 139–155
  [43] Junjie Wang, Song Wang, Jianfeng Chen, Tim Menzies, Qiang Cui, Miao Xie, and Qing Wang. 2019. Characterizing crowds to better optimize worker recommendation in crowdsourced testing. IEEE Transactions on Software Engineering 47, 6 (2019), 1259–1276
  [44] Junjie Wang, Ye Yang, Song Wang, Jun Hu, and Qing Wang. 2022. Context- and fairness-aware in-process crowdworker recommendation. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 3 (2022), 1–31
  [45] Junjie Wang, Ye Yang, Song Wang, Yuanzhe Hu, Dandan Wang, and Qing Wang. 2020. Context-aware in-process crowdworker recommendation. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering
  [46] Qing Wang, Zhenyu Chen, Junjie Wang, and Yang Feng. 2022. Intelligent Crowdsourced Testing. Springer
  [47] Xiaoxue Wu, Wenjing Shan, Wei Zheng, Zhiguo Chen, Tao Ren, and Xiaobing Sun. 2023. An intelligent duplicate bug report detection method based on technical term extraction. In 2023 IEEE/ACM International Conference on Automation of Software Test (AST). IEEE, 1–12
  [48] Miao Xie, Qing Wang, Guowei Yang, and Mingshu Li. 2017. Cocoon: Crowdsourced testing quality maximization under context coverage constraint. In 2017 IEEE 28th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 316–327
  [49] Yiheng Xiong, Ting Su, Jue Wang, Jingling Sun, Geguang Pu, and Zhendong Su. 2024. General and practical property-based testing for android apps. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 53–64
  [50] Yuxuan Yang and Xin Chen. 2021. Crowdsourced test report prioritization based on text classification. IEEE Access 10 (2021), 92692–92705
  [51] Juyeon Yoon, Robert Feldt, and Shin Yoo. 2024. Intent-driven mobile gui testing with autonomous large language model agents. In 2024 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 129–139
  [52] Shengcheng Yu. 2019. Crowdsourced report generation via bug screenshot understanding. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering. IEEE, 1277–1279
  [53] Shengcheng Yu, Chunrong Fang, Zhenfei Cao, Xu Wang, Tongyu Li, and Zhenyu Chen. 2021. Prioritize crowdsourced test reports via deep screenshot understanding. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 946–956
  [54] Shengcheng Yu, Chunrong Fang, Mingzhe Du, Zimin Ding, Zhenyu Chen, and Zhendong Su. 2024. Practical, Automated Scenario-based Mobile App Testing. IEEE Transactions on Software Engineering (2024)
  [55] Shengcheng Yu, Chunrong Fang, Xin Li, Yuchen Ling, Zhenyu Chen, and Zhendong Su. 2024. Effective, Platform-Independent GUI Testing via Image Embedding and Reinforcement Learning. ACM Transactions on Software Engineering and Methodology 33, 7 (2024), 1–27
  [56] Shengcheng Yu, Chunrong Fang, Yuchen Ling, Chentian Wu, and Zhenyu Chen. 2023. Llm for test script generation and migration: Challenges, capabilities, and opportunities. In 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security. IEEE, 206–217
  [57–58] Shengcheng Yu, Chunrong Fang, Ziyuan Tuo, Quanjun Zhang, Chunyang Chen, Zhenyu Chen, and Zhendong Su. 2023. Vision-based mobile app gui testing: A survey. arXiv preprint arXiv:2310.13518 (2023)
  [59–60] Shengcheng Yu, Chunrong Fang, Quanjun Zhang, Zhihao Cao, Yexiao Yun, Zhenfei Cao, Kai Mei, and Zhenyu Chen. 2023. Mobile app crowdsourced test report consistency detection via deep image-and-text fusion understanding. IEEE Transactions on Software Engineering 49, 8 (2023), 4115–4134
  [61] Shengcheng Yu, Chunrong Fang, Quanjun Zhang, Mingzhe Du, Jia Liu, and Zhenyu Chen. 2024. Semi-supervised Crowdsourced Test Report Clustering via Screenshot-Text Binding Rules. Proceedings of the ACM on Software Engineering 1, FSE (2024), 1540–1563
  [62] Shengcheng Yu, Yuchen Ling, Chunrong Fang, Quan Zhou, Chunyang Chen, Shaomin Zhu, and Zhenyu Chen. 2025. LLM-Guided Scenario-based GUI Testing. arXiv preprint arXiv:2506.05079 (2025)
  [63] Yakun Zhang, Chen Liu, Xiaofei Xie, Yun Lin, Jin Song Dong, Dan Hao, and Lu Zhang. 2024. LLM-based Abstraction and Concretization for GUI Test Migration. arXiv preprint arXiv:2409.05028 (2024)
  [64] Yakun Zhang, Qihao Zhu, Jiwei Yan, Chen Liu, Wenjie Zhang, Yifan Zhao, Dan Hao, and Lu Zhang. 2024. Synthesis-Based Enhancement for GUI Test Case Migration. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 869–881