arxiv: 2506.05079 · v4 · submitted 2025-06-05 · 💻 cs.SE

Scenario-Guided LLM-based Mobile App GUI Testing

Pith reviewed 2026-05-19 11:01 UTC · model grok-4.3

classification 💻 cs.SE
keywords GUI testing · large language models · multi-agent systems · mobile applications · scenario-guided testing · automated software testing

The pith

A multi-agent system of LLMs can automate mobile app GUI testing by pursuing specific business scenarios instead of random exploration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Manual testers focus on completing concrete scenarios tied to app business logic, while most automated tools simply wander through the interface and leave key paths untested. This paper shows how LLMs can be organized into five cooperating agents that read the current GUI, choose the next widget and action based on the target scenario, carry out the step, check whether the scenario goal has been met, and keep a running record of what happened. The approach therefore tries to close the gap between automated runs and the functions that actually matter to users. If the agents make reliable decisions, test suites would cover critical functionality more thoroughly without requiring exhaustive manual scripting.

Core claim

ScenGen is a scenario-guided LLM-based GUI testing framework that employs a multi-agent collaboration mechanism to simulate and automate the phases of manual testing. It integrates an Observer that extracts and structures GUI widgets and layouts to interpret semantic information, a Decider that uses LLMs to identify target widgets and actions aligned with a given testing scenario, an Executor that performs the operations on the app, a Supervisor that verifies whether results match the intended scenario completion, and a Recorder that logs operations into context memory while monitoring for runtime bugs.

What carries the argument

Five-agent collaboration in which the Observer supplies structured semantic GUI state, the Decider applies LLM reasoning to scenario context for widget and action selection, the Executor applies the chosen operation, the Supervisor confirms scenario fulfillment, and the Recorder maintains memory and bug detection.
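
The control flow implied by this division of labor can be sketched as a loop. The following is a minimal illustration, not the authors' implementation: the `Step` record, the `run_scenario` function, the callable signatures, and the toy counter app are all hypothetical, and the LLM-backed Decider and Supervisor are replaced by plain functions. Runtime bug monitoring is left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One recorded iteration: what was seen, what was done, and the verdict."""
    screen: str
    action: str
    done: bool


def run_scenario(
    scenario: str,
    observe: Callable[[], str],                      # Observer: GUI -> structured state
    decide: Callable[[str, str, List[Step]], str],   # Decider: (scenario, state, memory) -> action
    execute: Callable[[str], None],                  # Executor: performs the action
    verify: Callable[[str, str], bool],              # Supervisor: (scenario, new state) -> goal met?
    max_steps: int = 10,
) -> List[Step]:
    """Drive the roles in order until the Supervisor reports completion."""
    memory: List[Step] = []                          # Recorder: context memory
    for _ in range(max_steps):
        state = observe()
        action = decide(scenario, state, memory)
        execute(action)
        done = verify(scenario, observe())
        memory.append(Step(screen=state, action=action, done=done))
        if done:
            break
    return memory


# Toy stand-in for an app (hypothetical): a counter that must reach 3.
counter = {"value": 0}

trace = run_scenario(
    scenario="raise the counter to 3",
    observe=lambda: f"counter={counter['value']}",
    decide=lambda scenario, state, memory: "tap +",
    execute=lambda action: counter.update(value=counter["value"] + 1),
    verify=lambda scenario, state: state == "counter=3",
)
print([(s.screen, s.action, s.done) for s in trace])
```

Passing the roles in as callables keeps the loop independent of any particular GUI driver or model, so each role can be swapped or stubbed on its own.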

If this is right

  • Testing effort concentrates on business-critical paths rather than uniform coverage of every screen.
  • Each test run produces traceable decisions and execution logs that link back to the original scenario.
  • Runtime monitoring occurs continuously as part of the scenario flow instead of as a separate step.
  • Context memory accumulated by the Recorder can be reused to improve later decisions within the same test session.
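
One way to picture the traceability and memory-reuse points is a store that keeps each step alongside its scenario and renders recent history back into text for the next decision. This is an assumed design for illustration only; the `ContextMemory` class, its fields, and the rendered format are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Record:
    scenario: str
    widget: str
    action: str
    outcome: str


@dataclass
class ContextMemory:
    records: List[Record] = field(default_factory=list)

    def log(self, scenario: str, widget: str, action: str, outcome: str) -> None:
        """Append one executed step, tagged with the scenario it belongs to."""
        self.records.append(Record(scenario, widget, action, outcome))

    def as_prompt_context(self, scenario: str, last_n: int = 3) -> str:
        """Render the most recent steps for one scenario as numbered lines."""
        relevant = [r for r in self.records if r.scenario == scenario][-last_n:]
        lines = [
            f"{i}. {r.action} '{r.widget}' -> {r.outcome}"
            for i, r in enumerate(relevant, start=1)
        ]
        return "\n".join(lines) if lines else "(no previous steps)"


memory = ContextMemory()
memory.log("login", "Username", "input", "text entered")
memory.log("login", "Sign in", "touch", "home screen shown")
print(memory.as_prompt_context("login"))
```

Because every record carries its scenario, the same store serves both as an audit trail and as input to later decisions.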

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same agent structure might be adapted to web or desktop interfaces where semantic understanding of UI elements is also available.
  • If the method proves reliable, teams could shift from writing many individual test scripts toward describing high-level scenarios and letting the agents handle the details.
  • Hybrid setups could combine this LLM guidance with existing model-based or symbolic testing tools to handle both scenario goals and low-level constraints.

Load-bearing premise

Large language models can reliably interpret semantic information from GUI states and make correct scenario-driven decisions for widget identification and action selection without significant hallucinations or errors that derail test completion.

What would settle it

A head-to-head run on the same set of apps and scenarios in which ScenGen completes fewer targeted scenarios or detects fewer known scenario-specific defects than a standard random-exploration tester.
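
The comparison described here reduces to counting completed scenarios per tool over the same app and scenario pairs. A minimal sketch of that bookkeeping follows; the records are made-up placeholders, not data from the paper, and the tool names are labels only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (tool, app, scenario, completed): placeholder values, not measured results.
Result = Tuple[str, str, str, bool]


def completion_rates(results: Iterable[Result]) -> Dict[str, float]:
    """Fraction of (app, scenario) pairs that each tool completed."""
    done = defaultdict(int)
    total = defaultdict(int)
    for tool, _app, _scenario, completed in results:
        total[tool] += 1
        done[tool] += int(completed)
    return {tool: done[tool] / total[tool] for tool in total}


example = [
    ("scenario_guided", "app_a", "login", True),
    ("scenario_guided", "app_a", "checkout", True),
    ("random_explorer", "app_a", "login", True),
    ("random_explorer", "app_a", "checkout", False),
]
rates = completion_rates(example)
print(rates)
```

The same tally could be kept for known scenario-specific defects detected, which is the second quantity the test above calls for.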

Figures

Figures reproduced from arXiv: 2506.05079 by Chunrong Fang, Chunyang Chen, Quan Zhou, Shaomin Zhu, Shengcheng Yu, Yi Zhao, Yuchen Ling, Zhenyu Chen.

Figure 1. Motivating Examples
Figure 2. S…
Figure 3. LLM Interaction of Decider
Figure 4. LLM Interaction of Supervisor
Original abstract

The assurance of mobile app GUI has become increasingly important, as the GUI serves as the primary medium of interaction between users and apps. Although numerous automated GUI testing approaches have been developed with diverse strategies, a substantial gap remains between these approaches and the underlying app business logic. Most existing approaches focus on general exploration rather than the completion of specific testing scenarios, often resulting in missed coverage of critical functionalities. Inspired by the manual testing process, which treats business-logic-driven testing scenarios as the fundamental unit of testing, this paper introduces an approach that leverages large language models (LLMs) to comprehend the semantics expressed in app GUIs and their contextual relevance to given testing scenarios. Building upon this capability, we propose ScenGen, a novel scenario-guided LLM-based GUI testing framework that employs a multi-agent collaboration mechanism to simulate and automate the phases of manual testing. ScenGen integrates five agents. The Observer perceives the app GUI state by extracting and structuring GUI widgets and layouts, thereby interpreting the semantic information presented in the GUI. This information is then passed to the Decider, which makes scenario-driven decisions with the guidance of LLMs to identify target widgets and determine appropriate actions toward fulfilling specific testing goals. The Executor executes the decided operations on the app, while the Supervisor verifies whether the execution results align with the intended testing scenario completion, ensuring traceability and consistency in test generation and execution. Finally, the Recorder records the corresponding GUI operations into the context memory as a knowledge base for subsequent decision-making and concurrently monitors runtime bug occurrences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ScenGen, a scenario-guided LLM-based GUI testing framework for mobile apps that uses a multi-agent collaboration mechanism (Observer, Decider, Executor, Supervisor, Recorder) to simulate manual testing phases. The approach leverages LLMs to interpret GUI semantics and contextual relevance to given testing scenarios, aiming to close the gap between general exploration-based automated testing and coverage of specific business logic functionalities.

Significance. If the framework can be shown through rigorous evaluation to reliably complete scenario-driven tests while mitigating LLM errors, it would advance automated GUI testing by aligning test generation more closely with app business logic rather than undirected exploration, potentially improving coverage of critical functionalities.

major comments (2)
  1. [Abstract] Abstract: The central claim that ScenGen automates manual testing phases via LLM-guided decisions rests on the Decider agent's ability to correctly map GUI semantics to scenario goals and select valid widgets/actions, yet the description supplies no details on prompt construction, few-shot examples, output parsing, or fallback mechanisms. A single misidentification would break scenario completion, and the Supervisor's verification is described only at high level with similar LLM dependence, leaving error rates and drift risks unaddressed.
  2. [Abstract] Abstract: The manuscript supplies no experimental results, validation data, error analysis, or ablation studies on decision accuracy. Without these, the claim that the multi-agent loop fulfills testing goals remains unverified, as the soundness assessment notes the absence of any empirical grounding for the framework's performance.
minor comments (1)
  1. The abstract is clearly written but could explicitly state the intended evaluation methodology (e.g., metrics for scenario completion rate or comparison baselines) to help readers assess the proposal's scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments. We address each major comment below and commit to a major revision that incorporates the requested details and empirical evaluation to strengthen the manuscript.

  1. Referee: [Abstract] Abstract: The central claim that ScenGen automates manual testing phases via LLM-guided decisions rests on the Decider agent's ability to correctly map GUI semantics to scenario goals and select valid widgets/actions, yet the description supplies no details on prompt construction, few-shot examples, output parsing, or fallback mechanisms. A single misidentification would break scenario completion, and the Supervisor's verification is described only at high level with similar LLM dependence, leaving error rates and drift risks unaddressed.

    Authors: We agree that the current high-level description leaves important implementation details unspecified. In the revised manuscript we will add a dedicated subsection under Methodology that specifies the prompt templates for the Decider and Supervisor, the few-shot examples used, the structured output parsing logic (including JSON schema enforcement), and the fallback strategies (e.g., re-prompting with error feedback or conservative default actions). We will also include a quantitative error analysis of decision accuracy and discuss mechanisms to detect and mitigate drift across the multi-agent loop. revision: yes

  2. Referee: [Abstract] Abstract: The manuscript supplies no experimental results, validation data, error analysis, or ablation studies on decision accuracy. Without these, the claim that the multi-agent loop fulfills testing goals remains unverified, as the soundness assessment notes the absence of any empirical grounding for the framework's performance.

    Authors: The initial submission focused on presenting the ScenGen architecture and its alignment with manual testing phases. To address the lack of empirical grounding, the revised version will include a new Evaluation section reporting results on multiple open-source Android apps. This will contain scenario-completion rates, decision-accuracy metrics, error analysis of LLM misidentifications, and ablation studies that isolate the contribution of each agent and the memory mechanism. revision: yes
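
The rebuttal names JSON schema enforcement, re-prompting with error feedback, and conservative default actions without giving details. A minimal sketch of how those three pieces could fit together, under stated assumptions: the field names `widget_id` and `action`, the allowed action list, the `wait` default, and the `ask_llm` callable are all hypothetical, and a scripted fake stands in for a real model.

```python
import json
from typing import Callable, Dict, List

ALLOWED_ACTIONS = ("input", "scroll", "touch")           # assumed action vocabulary
DEFAULT_ACTION = {"widget_id": None, "action": "wait"}   # assumed conservative fallback


def validate(raw: str, widget_ids: List[str]) -> Dict:
    """Parse a model reply; raise ValueError if it breaks the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    if data.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"action must be one of {list(ALLOWED_ACTIONS)}")
    if data.get("widget_id") not in widget_ids:
        raise ValueError("widget_id is not on the current screen")
    return data


def decide_with_fallback(ask_llm: Callable[[str], str], prompt: str,
                         widget_ids: List[str], retries: int = 2) -> Dict:
    """Re-prompt with the validation error appended; fall back once retries run out."""
    message = prompt
    for _ in range(retries + 1):
        try:
            return validate(ask_llm(message), widget_ids)
        except ValueError as err:
            message = f"{prompt}\nYour previous reply was rejected: {err}"
    return dict(DEFAULT_ACTION)


# Scripted fake model: the first reply is malformed, the second is valid.
replies = iter(['touch the plus button',
                '{"widget_id": "btn_plus", "action": "touch"}'])
decision = decide_with_fallback(lambda _msg: next(replies),
                                "Scenario: Calculation",
                                ["btn_plus", "btn_minus"])
print(decision)
```

Checking the widget against the current screen, and not only the JSON shape, is what catches a reply that is well-formed but points at something that does not exist.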

Circularity Check

0 steps flagged

No circularity: self-contained engineering framework proposal

The paper introduces ScenGen as a multi-agent LLM framework for scenario-guided mobile GUI testing, describing agent roles (Observer, Decider, Executor, Supervisor, Recorder) and their interactions at a high level. No equations, derivations, fitted parameters, or self-referential definitions appear. The central claim rests on LLM semantic comprehension and multi-agent collaboration inspired by manual testing processes, without reducing any result to its own inputs by construction or via load-bearing self-citations. The framework is presented as an independent engineering contribution with external benchmarks in mind (e.g., comparison to existing GUI testing approaches).

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that LLMs possess sufficient semantic understanding of GUIs for reliable decision-making in testing contexts; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • Domain assumption: Large language models can comprehend the semantics expressed in app GUIs and their contextual relevance to given testing scenarios.
    This capability is invoked as the basis for the Observer and Decider agents to interpret GUI states and make scenario-driven decisions.

pith-pipeline@v0.9.0 · 5821 in / 1189 out tokens · 66979 ms · 2026-05-19T11:01:39.506426+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Towards Automated Crowdsourced Testing via Personified-LLM

    cs.SE · 2026-03 · unverdicted · novelty 6.0

    PersonaTester uses LLMs guided by three-dimensional personas to replicate crowdworker testing patterns, yielding higher behavioral consistency, variability, and more bug detections than baseline LLM agents.

  2. WebMAC: A Multi-Agent Collaborative Framework for Scenario Testing of Web Systems

    cs.SE · 2026-04 · unverdicted · novelty 5.0

    WebMAC uses three specialized multi-agent modules to clarify test scenarios, partition them for adequacy, and generate executable scripts, yielding 30-60% higher success rates and 29% better efficiency than SOTA on fo...
