Scenario-Guided LLM-based Mobile App GUI Testing

Chunrong Fang; Chunyang Chen; Quan Zhou; Shaomin Zhu; Shengcheng Yu; Yi Zhao; Yuchen Ling; Zhenyu Chen

arxiv: 2506.05079 · v4 · submitted 2025-06-05 · 💻 cs.SE

Shengcheng Yu, Yuchen Ling, Chunrong Fang, Quan Zhou, Yi Zhao, Chunyang Chen, Shaomin Zhu, Zhenyu Chen

Pith reviewed 2026-05-19 11:01 UTC · model grok-4.3

classification 💻 cs.SE

keywords GUI testing · large language models · multi-agent systems · mobile applications · scenario-guided testing · automated software testing

The pith

A multi-agent system of LLMs can automate mobile app GUI testing by pursuing specific business scenarios instead of random exploration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Manual testers focus on completing concrete scenarios tied to app business logic, while most automated tools simply wander through the interface and leave key paths untested. This paper shows how LLMs can be organized into five cooperating agents that read the current GUI, choose the next widget and action based on the target scenario, carry out the step, check whether the scenario goal has been met, and keep a running record of what happened. The approach therefore tries to close the gap between automated runs and the functions that actually matter to users. If the agents make reliable decisions, test suites would cover critical functionality more thoroughly without requiring exhaustive manual scripting.

Core claim

ScenGen is a scenario-guided LLM-based GUI testing framework that employs a multi-agent collaboration mechanism to simulate and automate the phases of manual testing. It integrates an Observer that extracts and structures GUI widgets and layouts to interpret semantic information, a Decider that uses LLMs to identify target widgets and actions aligned with a given testing scenario, an Executor that performs the operations on the app, a Supervisor that verifies whether results match the intended scenario completion, and a Recorder that logs operations into context memory while monitoring for runtime bugs.

What carries the argument

Five-agent collaboration in which the Observer supplies structured semantic GUI state, the Decider applies LLM reasoning to scenario context for widget and action selection, the Executor applies the chosen operation, the Supervisor confirms scenario fulfillment, and the Recorder maintains memory and bug detection.
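
The collaboration described above can be pictured as a simple control loop. The following is a minimal sketch, not the authors' implementation: `FakeApp`, `Action`, `Memory`, and every function name are hypothetical, and the Decider is a rule-based stub standing in for the LLM call the paper describes.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str       # hypothetical action vocabulary, e.g. "tap"
    target: str     # widget identifier chosen by the Decider


@dataclass
class Memory:
    steps: list = field(default_factory=list)  # Recorder's context memory
    bugs: list = field(default_factory=list)   # runtime errors seen so far


class FakeApp:
    """Toy stand-in for a device: a login screen that leads to a home screen."""

    def __init__(self):
        self.screen = "login"

    def widgets(self):
        if self.screen == "login":
            return ["username", "password", "login_button"]
        return ["welcome_label"]

    def perform(self, action):
        if self.screen == "login" and action.target == "login_button":
            self.screen = "home"


def observe(app):
    # Observer: return a structured view of the current GUI state.
    return {"screen": app.screen, "widgets": app.widgets()}


def decide(state, scenario, memory):
    # Decider: a real system would prompt an LLM here; this stub is rule-based.
    if scenario == "log in" and "login_button" in state["widgets"]:
        return Action(kind="tap", target="login_button")
    return None


def supervise(state, scenario):
    # Supervisor: check whether the scenario goal has been reached.
    return scenario == "log in" and state["screen"] == "home"


def run_scenario(app, scenario, max_steps=10):
    memory = Memory()
    for _ in range(max_steps):
        state = observe(app)
        if supervise(state, scenario):
            return True, memory
        action = decide(state, scenario, memory)
        if action is None:
            return False, memory
        try:
            app.perform(action)  # Executor: apply the chosen operation.
        except Exception as exc:
            memory.bugs.append(repr(exc))  # Recorder: note the runtime failure.
            return False, memory
        # Recorder: log the step so later decisions can consult it.
        memory.steps.append((state["screen"], action.kind, action.target))
    return False, memory


done, memory = run_scenario(FakeApp(), "log in")
print(done, memory.steps)
```

The step cap (`max_steps`) is an assumption of this sketch; it keeps a wrong decision from looping forever, which is the failure mode the load-bearing premise below is concerned with.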

If this is right

Testing effort concentrates on business-critical paths rather than uniform coverage of every screen.
Each test run produces traceable decisions and execution logs that link back to the original scenario.
Runtime monitoring occurs continuously as part of the scenario flow instead of as a separate step.
Context memory accumulated by the Recorder can be reused to improve later decisions within the same test session.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same agent structure might be adapted to web or desktop interfaces where semantic understanding of UI elements is also available.
If the method proves reliable, teams could shift from writing many individual test scripts toward describing high-level scenarios and letting the agents handle the details.
Hybrid setups could combine this LLM guidance with existing model-based or symbolic testing tools to handle both scenario goals and low-level constraints.

Load-bearing premise

Large language models can reliably interpret semantic information from GUI states and make correct scenario-driven decisions for widget identification and action selection without significant hallucinations or errors that derail test completion.

What would settle it

A head-to-head run on the same set of apps and scenarios in which ScenGen completes fewer targeted scenarios or detects fewer known scenario-specific defects than a standard random-exploration tester.
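
A comparison of this kind reduces to counting completed scenarios per tool over the same (app, scenario) pairs. The sketch below assumes each tool's outcomes are available as a mapping from pair to a boolean; the function names and the sample entries are made-up placeholders, not results from the paper.

```python
def completion_rate(outcomes):
    """Fraction of scenarios marked completed (True) in a list of booleans."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def compare(tool_a, tool_b):
    """Both inputs map (app, scenario) -> bool; only shared pairs are compared."""
    shared = sorted(set(tool_a) & set(tool_b))
    rate_a = completion_rate([tool_a[k] for k in shared])
    rate_b = completion_rate([tool_b[k] for k in shared])
    return {"n": len(shared), "a": rate_a, "b": rate_b, "delta": rate_a - rate_b}


# Placeholder outcomes for illustration only.
scenario_guided = {
    ("app1", "log in"): True,
    ("app1", "search"): True,
    ("app2", "log in"): True,
    ("app2", "checkout"): False,
}
random_explorer = {
    ("app1", "log in"): True,
    ("app1", "search"): False,
    ("app2", "log in"): False,
    ("app2", "checkout"): False,
}

summary = compare(scenario_guided, random_explorer)
print(summary)
```

Restricting the comparison to shared pairs keeps the two rates on the same denominator; a negative `delta` on real data would be the outcome that counts against the paper's claim.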

Figures

Figures reproduced from arXiv: 2506.05079 by Chunrong Fang, Chunyang Chen, Quan Zhou, Shaomin Zhu, Shengcheng Yu, Yi Zhao, Yuchen Ling, Zhenyu Chen.

**Figure 1.** Motivating Examples

**Figure 2.** S…

**Figure 3.** LLM Interaction of Decider

**Figure 4.** LLM Interaction of Supervisor

Original abstract

The assurance of mobile app GUI has become increasingly important, as the GUI serves as the primary medium of interaction between users and apps. Although numerous automated GUI testing approaches have been developed with diverse strategies, a substantial gap remains between these approaches and the underlying app business logic. Most existing approaches focus on general exploration rather than the completion of specific testing scenarios, often resulting in missed coverage of critical functionalities. Inspired by the manual testing process, which treats business logic-driven testing scenarios as the fundamental unit of testing, this paper introduces an approach that leverages large language models (LLMs) to comprehend the semantics expressed in app GUIs and their contextual relevance to given testing scenarios. Building upon this capability, we propose ScenGen, a novel scenario-guided LLM-based GUI testing framework that employs a multi-agent collaboration mechanism to simulate and automate the phases of manual testing. ScenGen integrates five agents. The Observer perceives the app GUI state by extracting and structuring GUI widgets and layouts, thereby interpreting the semantic information presented in the GUI. This information is then passed to the Decider, which makes scenario-driven decisions with the guidance of LLMs to identify target widgets and determine appropriate actions toward fulfilling specific testing goals. The Executor executes the decided operations on the app, while the Supervisor verifies whether the execution results align with the intended testing scenario completion, ensuring traceability and consistency in test generation and execution. Finally, the Recorder records the corresponding GUI operations into the context memory as a knowledge base for subsequent decision-making and concurrently monitors runtime bug occurrences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ScenGen sketches a five-agent LLM setup to drive GUI tests from business scenarios, but the abstract gives no results or error handling so the practical payoff stays unclear.

The core idea is a multi-agent system that tries to make automated mobile GUI testing follow specific business scenarios instead of broad random walks. Observer pulls structured widget info from the screen, Decider uses an LLM to pick the next widget and action based on the scenario, Executor runs it, Supervisor checks if the goal was met, and Recorder logs everything for later use. That split is the main new piece; it directly copies the manual tester's loop of look-decide-act-verify while adding traceability through the supervisor step.

The framing around business logic coverage is sensible and matches a real pain point in current tools that just explore without finishing key flows. The architecture itself looks workable on paper and avoids some of the usual black-box problems by keeping each agent's role narrow.

The soft spot is the complete absence of any numbers. The description never shows error rates on widget selection, how often the LLM picks the wrong action, or whether the supervisor actually catches drift. A single bad Decider call can break the scenario, yet the description says nothing about few-shot examples, output validation, or fallbacks. Without experiments or even a small case study, it's impossible to tell if the loop holds together in practice.

The paper is aimed at testing researchers and mobile QA teams who already use LLMs and want a scenario-focused alternative to pure exploration tools. It would be worth sending to referees if the full version includes runs on real apps with coverage and bug-finding metrics; right now the design is clear enough to review but the claims rest on untested assumptions about LLM reliability.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ScenGen, a scenario-guided LLM-based GUI testing framework for mobile apps that uses a multi-agent collaboration mechanism (Observer, Decider, Executor, Supervisor, Recorder) to simulate manual testing phases. The approach leverages LLMs to interpret GUI semantics and contextual relevance to given testing scenarios, aiming to close the gap between general exploration-based automated testing and coverage of specific business logic functionalities.

Significance. If the framework can be shown through rigorous evaluation to reliably complete scenario-driven tests while mitigating LLM errors, it would advance automated GUI testing by aligning test generation more closely with app business logic rather than undirected exploration, potentially improving coverage of critical functionalities.

major comments (2)

[Abstract] The central claim that ScenGen automates manual testing phases via LLM-guided decisions rests on the Decider agent's ability to correctly map GUI semantics to scenario goals and select valid widgets/actions, yet the description supplies no details on prompt construction, few-shot examples, output parsing, or fallback mechanisms. A single misidentification would break scenario completion, and the Supervisor's verification is described only at a high level with similar LLM dependence, leaving error rates and drift risks unaddressed.
[Abstract] The manuscript supplies no experimental results, validation data, error analysis, or ablation studies on decision accuracy. Without these, the claim that the multi-agent loop fulfills testing goals remains unverified; the soundness assessment likewise notes the absence of any empirical grounding for the framework's performance.

minor comments (1)

The abstract is clearly written but could explicitly state the intended evaluation methodology (e.g., metrics for scenario completion rate or comparison baselines) to help readers assess the proposal's scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments. We address each major comment below and commit to a major revision that incorporates the requested details and empirical evaluation to strengthen the manuscript.

Referee: [Abstract] The central claim that ScenGen automates manual testing phases via LLM-guided decisions rests on the Decider agent's ability to correctly map GUI semantics to scenario goals and select valid widgets/actions, yet the description supplies no details on prompt construction, few-shot examples, output parsing, or fallback mechanisms. A single misidentification would break scenario completion, and the Supervisor's verification is described only at a high level with similar LLM dependence, leaving error rates and drift risks unaddressed.

Authors: We agree that the current high-level description leaves important implementation details unspecified. In the revised manuscript we will add a dedicated subsection under Methodology that specifies the prompt templates for the Decider and Supervisor, the few-shot examples used, the structured output parsing logic (including JSON schema enforcement), and the fallback strategies (e.g., re-prompting with error feedback or conservative default actions). We will also include a quantitative error analysis of decision accuracy and discuss mechanisms to detect and mitigate drift across the multi-agent loop. Revision: yes.

Referee: [Abstract] The manuscript supplies no experimental results, validation data, error analysis, or ablation studies on decision accuracy. Without these, the claim that the multi-agent loop fulfills testing goals remains unverified; the soundness assessment likewise notes the absence of any empirical grounding for the framework's performance.

Authors: The initial submission focused on presenting the ScenGen architecture and its alignment with manual testing phases. To address the lack of empirical grounding, the revised version will include a new Evaluation section reporting results on multiple open-source Android apps. This will contain scenario-completion rates, decision-accuracy metrics, error analysis of LLM misidentifications, and ablation studies that isolate the contribution of each agent and the memory mechanism. Revision: yes.
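
The output validation and re-prompting that the simulated rebuttal promises could look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's mechanism: the JSON shape (`action`, `target`), the `ALLOWED_ACTIONS` vocabulary, and both function names are hypothetical, and the LLM is replaced by a scripted callable.

```python
import json

ALLOWED_ACTIONS = {"tap", "input", "scroll"}  # hypothetical action vocabulary


def parse_decision(raw, visible_widgets):
    """Return (decision, error); exactly one of the two is None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "expected a JSON object"
    action, target = data.get("action"), data.get("target")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None, f"unknown action: {action!r}"
    if target not in visible_widgets:
        return None, f"target not on screen: {target!r}"
    return {"action": action, "target": target}, None


def decide_with_retry(ask_llm, prompt, visible_widgets, max_attempts=3):
    """Re-prompt with the rejection reason until a reply validates."""
    error = None
    for _ in range(max_attempts):
        if error is None:
            full_prompt = prompt
        else:
            full_prompt = f"{prompt}\nPrevious reply was rejected: {error}"
        decision, error = parse_decision(ask_llm(full_prompt), visible_widgets)
        if decision is not None:
            return decision
    return None  # caller falls back to a conservative default action


# Scripted stand-in for an LLM: the first reply is malformed, the second is valid.
replies = iter([
    "tap the login button",
    '{"action": "tap", "target": "login_button"}',
])
result = decide_with_retry(
    lambda prompt: next(replies),
    "Scenario: log in",
    ["username", "login_button"],
)
print(result)
```

Checking the target against the widgets the Observer actually reported is the cheapest guard against a hallucinated widget; returning `None` after the attempt budget leaves the choice of fallback to the caller.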

Circularity Check

0 steps flagged

No circularity: self-contained engineering framework proposal

The paper introduces ScenGen as a multi-agent LLM framework for scenario-guided mobile GUI testing, describing agent roles (Observer, Decider, Executor, Supervisor, Recorder) and their interactions at a high level. No equations, derivations, fitted parameters, or self-referential definitions appear. The central claim rests on LLM semantic comprehension and multi-agent collaboration inspired by manual testing processes, without reducing any result to its own inputs by construction or via load-bearing self-citations. The framework is presented as an independent engineering contribution with external benchmarks in mind (e.g., comparison to existing GUI testing approaches).

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the assumption that LLMs possess sufficient semantic understanding of GUIs for reliable decision-making in testing contexts; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption: Large language models can comprehend the semantics expressed in app GUIs and their contextual relevance to given testing scenarios.
This capability is invoked as the basis for the Observer and Decider agents to interpret GUI states and make scenario-driven decisions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear
Paper passage: ScenGen integrates five agents. The Observer perceives the app GUI state... The Decider... makes scenario-driven decisions with the guidance of LLMs...

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: We use vision-based approaches to identify GUI widgets... computer vision algorithms, i.e., edge detection... OCR algorithms...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Towards Automated Crowdsourced Testing via Personified-LLM
cs.SE · 2026-03 · unverdicted · novelty 6.0
PersonaTester uses LLMs guided by three-dimensional personas to replicate crowdworker testing patterns, yielding higher behavioral consistency, variability, and more bug detections than baseline LLM agents.

WebMAC: A Multi-Agent Collaborative Framework for Scenario Testing of Web Systems
cs.SE · 2026-04 · unverdicted · novelty 5.0
WebMAC uses three specialized multi-agent modules to clarify test scenarios, partition them for adequacy, and generate executable scripts, yielding 30-60% higher success rates and 29% better efficiency than SOTA on fo...

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 2 Pith papers

[1]

Effective, platform- independent gui testing via image embedding and reinforcement learn- ing,

S. Yu, C. Fang, X. Li, Y . Ling, Z. Chen, and Z. Su, “Effective, platform- independent gui testing via image embedding and reinforcement learn- ing,” ACM Transactions on Software Engineering and Methodology , 2024

work page 2024
[2]

Practical, automated scenario-based mobile app testing,

S. Yu, C. Fang, M. Du, Z. Ding, Z. Chen, and Z. Su, “Practical, automated scenario-based mobile app testing,” IEEE Transactions on Software Engineering, vol. 50, no. 7, pp. 1949 – 1966, 2024

work page 1949
[3]

Improving random gui testing with image-based widget detection,

T. D. White, G. Fraser, and G. J. Brown, “Improving random gui testing with image-based widget detection,” in Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis , ser. ISSTA 2019. New York, NY , USA: Association for Computing Machinery, 2019, p. 307–317

work page 2019
[4]

Guided, stochastic model-based gui testing of android apps,

T. Su, G. Meng, Y . Chen, K. Wu, W. Yang, Y . Yao, G. Pu, Y . Liu, and Z. Su, “Guided, stochastic model-based gui testing of android apps,” in Proceedings of the 2017 11th joint meeting on foundations of software engineering, 2017, pp. 245–256

work page 2017
[5]

Ui test migration across mobile platforms,

S. Talebipour, Y . Zhao, L. Dojcilovi ´c, C. Li, and N. Medvidovi ´c, “Ui test migration across mobile platforms,” in 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE) . IEEE, 2021, pp. 756–767

work page 2021
[6]

Test migration between mobile apps with similar functionality,

F. Behrang and A. Orso, “Test migration between mobile apps with similar functionality,” in2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE) . IEEE, 2019, pp. 54–65

work page 2019
[7]

Appium, “Appium,” https://appium.io/, accessed: 2024-10-31

work page 2024
[8]

Repairing fragile gui test cases using word and layout embedding,

J. Yoon, S. Chung, K. Shin, J. Kim, S. Hong, and S. Yoo, “Repairing fragile gui test cases using word and layout embedding,” in 2022 IEEE Conference on Software Testing, Verification and Validation (ICST) , 2022, pp. 291–301

work page 2022
[9]

[Online]

“Monkey,” 2024, accessed: 2024-10-31. [Online]. Available: https: //developer.android.com/studio/test/monkeyrunner

work page 2024
[10]

Dynodroid: An input generation system for android apps,

A. Machiry, R. Tahiliani, and M. Naik, “Dynodroid: An input generation system for android apps,” in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering , 2013, pp. 224–234

work page 2013
[11]

Sapienz: Multi-objective automated testing for android applications,

K. Mao, M. Harman, and Y . Jia, “Sapienz: Multi-objective automated testing for android applications,” inProceedings of the 25th international symposium on software testing and analysis , 2016, pp. 94–105

work page 2016
[12]

Reinforcement learning based curiosity-driven testing of android applications,

M. Pan, A. Huang, G. Wang, T. Zhang, and X. Li, “Reinforcement learning based curiosity-driven testing of android applications,” in Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis , 2020, pp. 153–164

work page 2020
[13]

Reinforcement learning for android gui testing,

D. Adamo, M. K. Khan, S. Koppula, and R. Bryce, “Reinforcement learning for android gui testing,” in Proceedings of the 9th ACM SIGSOFT International Workshop on Automating TEST Case Design, Selection, and Evaluation , 2018, pp. 2–8

work page 2018
[14]

Testing the limits: Unusual text inputs generation for mobile app crash detection with large language model,

Z. Liu, C. Chen, J. Wang, M. Chen, B. Wu, Z. Tian, Y . Huang, J. Hu, and Q. Wang, “Testing the limits: Unusual text inputs generation for mobile app crash detection with large language model,” in2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE) , ser. ICSE ’24. New York, NY , USA: Association for Computing Machinery, 2024

work page 2024
[15]

Llm for test script generation and migration: Challenges, capabilities, and opportunities,

S. Yu, C. Fang, Y . Ling, C. Wu, and Z. Chen, “Llm for test script generation and migration: Challenges, capabilities, and opportunities,” in 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security (QRS) , 2023, pp. 206–217

work page 2023
[16]

2025 , publisher =

S. Yu, C. Fang, Z. Tuo, Q. Zhang, C. Chen, Z. Chen, and Z. Su, “Vision-based mobile app gui testing: A survey,” arXiv preprint arXiv:2310.13518, 2023

work page arXiv 2023
[17]

Boyd, Destruction and creation

J. Boyd, Destruction and creation . US Army Command and General Staff College Leavenworth, W A, 1987

work page 1987
[18]

Working memory,

A. Baddeley, “Working memory,” Science, vol. 255, no. 5044, pp. 556– 559, 1992

work page 1992
[19]

Gpt-4 technical report,

O. et al., “Gpt-4 technical report,” 2024

work page 2024
[20]

Owl eyes: Spotting ui display issues via visual understanding,

Z. Liu, C. Chen, J. Wang, Y . Huang, J. Hu, and Q. Wang, “Owl eyes: Spotting ui display issues via visual understanding,” in 2020 35th IEEE/ACM International Conference on Automated Software Engineer- ing (ASE), 2020, pp. 398–409

work page 2020
[21]

Uied: a hybrid tool for gui element detection,

M. Xie, S. Feng, Z. Xing, J. Chen, and C. Chen, “Uied: a hybrid tool for gui element detection,” in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2020. New York, NY , USA: Association for Computing Machinery, 2020, p. 1655–1659

work page 2020
[22]

Screen recognition: Creating accessibility metadata for mobile appli- cations from pixels,

X. Zhang, L. de Greef, A. Swearngin, S. White, K. Murray, L. Yu, Q. Shan, J. Nichols, J. Wu, C. Fleizach, A. Everitt, and J. P. Bigham, “Screen recognition: Creating accessibility metadata for mobile appli- cations from pixels,” in Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems , ser. CHI ’21. New York, NY , USA: Association ...

work page 2021
[23]

Aidui: Toward automated recognition of dark patterns in user interfaces,

S. M. Hasan Mansur, S. Salma, D. Awofisayo, and K. Moran, “Aidui: Toward automated recognition of dark patterns in user interfaces,” pp. 1958–1970, 2023

work page 1958
[24]

Automating gui-based test oracles for mobile apps,

K. Baral, J. Johnson, J. Mahmud, S. Salma, M. Fazzini, J. Rubin, J. Offutt, and K. Moran, “Automating gui-based test oracles for mobile apps,” in Proceedings of the 21st International Conference on Mining IEEE TRANSACTIONS ON SOFTW ARE ENGINEERING 13 Software Repositories, ser. MSR ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 309–321

work page 2024
[25]

Deep gui: Black-box gui input generation with deep learning,

F. YazdaniBanafsheDaragh and S. Malek, “Deep gui: Black-box gui input generation with deep learning,” in 2021 36th IEEE/ACM Interna- tional Conference on Automated Software Engineering (ASE) , 2021, pp. 905–916

work page 2021
[26]

Resplay: Improving cross-platform record-and-replay with gui sequence matching,

S. Zhang, L. Wu, Y . Li, Z. Zhang, H. Lei, D. Li, Y . Guo, and X. Chen, “Resplay: Improving cross-platform record-and-replay with gui sequence matching,” in 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE) , 2023, pp. 439–450

work page 2023
[27]

Vision-based widget mapping for test migration across mobile platforms: Are we there yet?

R. Ji, T. Zhu, X. Zhu, C. Chen, M. Pan, and T. Zhang, “Vision-based widget mapping for test migration across mobile platforms: Are we there yet?” in 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE) , 2023, pp. 1416–1428

work page 2023
[28]

Automated cross-platform inconsistency detection for mobile apps,

M. Fazzini and A. Orso, “Automated cross-platform inconsistency detection for mobile apps,” in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering , ser. ASE ’17. IEEE Press, 2017, p. 308–318

work page 2017
[29]

Images don’t lie: Duplicate crowdtesting reports detection with screenshot information,

J. Wang, M. Li, S. Wang, T. Menzies, and Q. Wang, “Images don’t lie: Duplicate crowdtesting reports detection with screenshot information,” Information and Software Technology , vol. 110, pp. 139–155, 2019

work page 2019
[30]

Guider: Gui structure and vision co-guided test script repair for android apps,

T. Xu, M. Pan, Y . Pei, G. Li, X. Zeng, T. Zhang, Y . Deng, and X. Li, “Guider: Gui structure and vision co-guided test script repair for android apps,” in Proceedings of the 30th ACM SIGSOFT International Sympo- sium on Software Testing and Analysis , ser. ISSTA 2021. New York, NY , USA: Association for Computing Machinery, 2021, p. 191–203

work page 2021
[31]

Automatic bug inference via deep image understanding,

S. Yu, W. Huang, J. Zhang, and H. Zheng, “Automatic bug inference via deep image understanding,” in 2022 9th International Conference on Dependable Systems and Their Applications (DSA) , 2022, pp. 330–334

work page 2022
[32]

Seman- tic gui scene learning and video alignment for detecting duplicate video- based bug reports,

Y . Yan, N. Cooper, O. Chaparro, K. Moran, and D. Poshyvanyk, “Seman- tic gui scene learning and video alignment for detecting duplicate video- based bug reports,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering , ser. ICSE ’24. New York, NY , USA: Association for Computing Machinery, 2024

work page 2024
[33]

BERT: Pre- training of deep bidirectional transformers for language understanding,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre- training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) . Minneapolis, Minnesota: Association for ...

work page 2019
[34]

Training language models to follow instructions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kel- ton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” in Proceedings of the 36th International Conference on ...

work page 2024
[35]

Large language models are zero-shot reasoners,

T. Kojima, S. S. Gu, M. Reid, Y . Matsuo, and Y . Iwasawa, “Large language models are zero-shot reasoners,” in Advances in Neural In- formation Processing Systems , S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 22 199–22 213

work page 2022
[36]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Proceedings of the 36th International Conference on Neural Information Processing Systems , ser. NIPS ’22. Red Hook, NY , USA: Curran Associates Inc., 2024

work page 2024
[37]

Language models are few-shot learners,

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert- V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Am...

work page 2020
[38]

The dawn of lmms: Preliminary explorations with gpt-4v(ision),

Z. Yang, L. Li, K. Lin, J. Wang, C.-C. Lin, Z. Liu, and L. Wang, “The dawn of lmms: Preliminary explorations with gpt-4v(ision),” 2023

work page 2023
[39]

Chatgpt and soft- ware testing education: Promises & perils,

S. Jalil, S. Rafi, T. D. LaToza, K. Moran, and W. Lam, “Chatgpt and soft- ware testing education: Promises & perils,” in 2023 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2023, pp. 4130–4137

work page 2023
[40]

Vulre- pair: a t5-based automated software vulnerability repair,

M. Fu, C. Tantithamthavorn, T. Le, V . Nguyen, and D. Phung, “Vulre- pair: a t5-based automated software vulnerability repair,” in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE

work page
[41]

New York, NY , USA: Association for Computing Machinery, 2022, p. 935–947

work page 2022
[42]

Effective test generation using pre-trained large language models and mutation testing,

A. M. Dakhel, A. Nikanjam, V . Majdinasab, F. Khomh, and M. C. Desmarais, “Effective test generation using pre-trained large language models and mutation testing,”Information and Software Technology, vol. 171, p. 107468, 2024

work page 2024
[43]

Chatunitest: A framework for llm-based test generation,

Y . Chen, Z. Hu, C. Zhi, J. Han, S. Deng, and J. Yin, “Chatunitest: A framework for llm-based test generation,” 2024

work page 2024
[44]

An empirical evaluation of using large language models for automated unit test generation,

M. Sch ¨afer, S. Nadi, A. Eghbali, and F. Tip, “An empirical evaluation of using large language models for automated unit test generation,” IEEE Transactions on Software Engineering, vol. 50, no. 1, pp. 85–105, 2024

work page 2024
[45]

Fill in the blank: Context-aware automated text input generation for mobile gui testing,

Z. Liu, C. Chen, J. Wang, X. Che, Y . Huang, J. Hu, and Q. Wang, “Fill in the blank: Context-aware automated text input generation for mobile gui testing,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) , 2023, pp. 1355–1367

[46] Z. Liu, C. Chen, J. Wang, M. Chen, B. Wu, X. Che, D. Wang, and Q. Wang, “Make llm a testing expert: Bringing human-like interaction to mobile gui testing via functionality-aware decisions,” in 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE), 2024, pp. 1222–1234

[47] Y. Huang, J. Wang, Z. Liu, Y. Wang, S. Wang, C. Chen, Y. Hu, and Q. Wang, “Crashtranslator: Automatically reproducing mobile application crashes directly from stack trace,” in 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE), 2024, pp. 190–202

[48] S. Feng and C. Chen, “Prompting is all you need: Automated android bug replay with large language models,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ser. ICSE ’24. New York, NY, USA: Association for Computing Machinery, 2024

[49] M. Jin, S. Shahriar, M. Tufano, X. Shi, S. Lu, N. Sundaresan, and A. Svyatkovskiy, “Inferfix: End-to-end program repair with llms,” in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2023. New York, NY, USA: Association for Computing Machinery, 2023, p. 1646–1656

[50] F. Ribeiro, R. Abreu, and J. Saraiva, “Framing program repair as code completion,” in 2022 IEEE/ACM International Workshop on Automated Program Repair (APR), 2022, pp. 38–45

[51] H. Pearce, B. Tan, B. Ahmad, R. Karri, and B. Dolan-Gavitt, “Examining zero-shot vulnerability repair with large language models,” in 2023 IEEE Symposium on Security and Privacy (SP), 2023, pp. 2339–2356
