Scenario-Guided LLM-based Mobile App GUI Testing
Pith reviewed 2026-05-19 11:01 UTC · model grok-4.3
The pith
A multi-agent system of LLMs can automate mobile app GUI testing by pursuing specific business scenarios instead of random exploration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ScenGen is a scenario-guided LLM-based GUI testing framework that employs a multi-agent collaboration mechanism to simulate and automate the phases of manual testing. It integrates an Observer that extracts and structures GUI widgets and layouts to interpret semantic information, a Decider that uses LLMs to identify target widgets and actions aligned with a given testing scenario, an Executor that performs the operations on the app, a Supervisor that verifies whether results match the intended scenario completion, and a Recorder that logs operations into context memory while monitoring for runtime bugs.
What carries the argument
Five-agent collaboration in which the Observer supplies structured semantic GUI state, the Decider applies LLM reasoning to scenario context for widget and action selection, the Executor applies the chosen operation, the Supervisor confirms scenario fulfillment, and the Recorder maintains memory and bug detection.
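The control flow described above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the `ToyApp` class, the `Action` and `Memory` types, and the rule-based `decide` function (a stand-in for the LLM call) are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    widget: str
    kind: str          # "tap" or "input" in this toy example
    text: str = ""


@dataclass
class Memory:
    steps: list = field(default_factory=list)


def run_scenario(scenario, observe, decide, execute, supervise, record, max_steps=10):
    """Run Observer -> Decider -> Executor -> Supervisor -> Recorder until done."""
    memory = Memory()
    for _ in range(max_steps):
        state = observe()                          # Observer: structured GUI state
        action = decide(scenario, state, memory)   # Decider: widget + action choice
        result = execute(action)                   # Executor: apply the operation
        done = supervise(scenario, result)         # Supervisor: scenario fulfilled?
        record(memory, action, result)             # Recorder: log into context memory
        if done:
            return True, memory
    return False, memory


class ToyApp:
    """Hypothetical two-screen app used only to exercise the loop."""

    def __init__(self):
        self.screen = "login"
        self.fields = {}

    def widgets(self):
        if self.screen == "login":
            return ["username", "password", "sign_in"]
        return ["welcome_banner"]

    def apply(self, action):
        if action.kind == "input":
            self.fields[action.widget] = action.text
        elif action.kind == "tap" and action.widget == "sign_in":
            if self.fields.get("username") and self.fields.get("password"):
                self.screen = "home"
        return self.screen


def make_agents(app):
    def observe():
        return {"screen": app.screen, "widgets": app.widgets()}

    def decide(scenario, state, memory):
        # Rule-based stand-in for the LLM: fill untouched fields, then sign in.
        touched = {step_action.widget for step_action, _ in memory.steps}
        for name in ("username", "password"):
            if name in state["widgets"] and name not in touched:
                return Action(name, "input", "demo")
        return Action("sign_in", "tap")

    def execute(action):
        return app.apply(action)

    def supervise(scenario, result):
        return result == "home"

    def record(memory, action, result):
        memory.steps.append((action, result))

    return observe, decide, execute, supervise, record


done, memory = run_scenario("log in", *make_agents(ToyApp()))
```

In this toy run the loop fills both fields, taps the sign-in button, and stops once the Supervisor stand-in sees the home screen. A real Decider would replace the rule-based function with an LLM prompt built from the scenario, the observed state, and the memory.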
If this is right
- Testing effort concentrates on business-critical paths rather than uniform coverage of every screen.
- Each test run produces traceable decisions and execution logs that link back to the original scenario.
- Runtime monitoring occurs continuously as part of the scenario flow instead of as a separate step.
- Context memory accumulated by the Recorder can be reused to improve later decisions within the same test session.
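The last two points, continuous runtime monitoring and reusable context memory, can be illustrated with a small Recorder sketch. Everything here is an assumption for illustration: the class layout, the `context` helper, and the choice of log markers (the strings Android typically prints for crashes and ANRs) are not taken from the paper.

```python
from dataclasses import dataclass, field

# Assumed log markers for runtime bugs; the paper does not specify them.
CRASH_MARKERS = ("FATAL EXCEPTION", "ANR in")


@dataclass
class Recorder:
    operations: list = field(default_factory=list)
    bugs: list = field(default_factory=list)

    def record(self, step, action, log_lines):
        """Store the operation and flag any log line that looks like a runtime bug."""
        self.operations.append({"step": step, "action": action})
        for line in log_lines:
            if any(marker in line for marker in CRASH_MARKERS):
                self.bugs.append({"step": step, "log": line})

    def context(self, last_n=3):
        """Render the most recent operations as text for a later decision prompt."""
        recent = self.operations[-last_n:]
        return "\n".join(f"{op['step']}. {op['action']}" for op in recent)


rec = Recorder()
rec.record(1, "input username", [])
rec.record(2, "tap sign_in", ["E AndroidRuntime: FATAL EXCEPTION: main"])
```

Because each bug entry carries the step number, a detected crash can be traced back to the operation and, through it, to the scenario that triggered it.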
Where Pith is reading between the lines
- The same agent structure might be adapted to web or desktop interfaces where semantic understanding of UI elements is also available.
- If the method proves reliable, teams could shift from writing many individual test scripts toward describing high-level scenarios and letting the agents handle the details.
- Hybrid setups could combine this LLM guidance with existing model-based or symbolic testing tools to handle both scenario goals and low-level constraints.
Load-bearing premise
Large language models can reliably interpret semantic information from GUI states and make correct scenario-driven decisions for widget identification and action selection without significant hallucinations or errors that derail test completion.
What would settle it
A head-to-head run on the same set of apps and scenarios in which ScenGen completes fewer targeted scenarios or detects fewer known scenario-specific defects than a standard random-exploration tester.
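A comparison of this kind reduces to per-scenario outcomes for both tools on an identical scenario set. The sketch below shows one way to tabulate it; the function names are hypothetical and the outcome values are placeholders chosen to exercise the code, not measurements of ScenGen or any baseline.

```python
def completion_rate(outcomes):
    """Fraction of scenarios completed; `outcomes` maps scenario name -> bool."""
    if not outcomes:
        raise ValueError("no scenarios to evaluate")
    return sum(outcomes.values()) / len(outcomes)


def head_to_head(guided, baseline):
    """Compare two tools that were run on exactly the same scenarios."""
    if guided.keys() != baseline.keys():
        raise ValueError("both tools must be run on the same scenarios")
    return {
        "guided_rate": completion_rate(guided),
        "baseline_rate": completion_rate(baseline),
        # Scenarios that only one of the two tools completed.
        "guided_only": sorted(s for s in guided if guided[s] and not baseline[s]),
        "baseline_only": sorted(s for s in guided if baseline[s] and not guided[s]),
    }


# Placeholder outcomes, NOT experimental results.
guided = {"login": True, "search": True, "checkout": False, "share": True}
baseline = {"login": True, "search": False, "checkout": True, "share": False}
report = head_to_head(guided, baseline)
```

The `baseline_only` list is the part that would count against the scenario-guided approach: any scenario a random explorer completes and the guided tool does not.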
Original abstract
The assurance of mobile app GUI has become increasingly important, as the GUI serves as the primary medium of interaction between users and apps. Although numerous automated GUI testing approaches have been developed with diverse strategies, a substantial gap remains between these approaches and the underlying app business logic. Most existing approaches focus on general exploration rather than the completion of specific testing scenarios, often resulting in missed coverage of critical functionalities. Inspired by the manual testing process, which treats business-logic-driven testing scenarios as the fundamental unit of testing, this paper introduces an approach that leverages large language models (LLMs) to comprehend the semantics expressed in app GUIs and their contextual relevance to given testing scenarios. Building upon this capability, we propose ScenGen, a novel scenario-guided LLM-based GUI testing framework that employs a multi-agent collaboration mechanism to simulate and automate the phases of manual testing. ScenGen integrates five agents. The Observer perceives the app GUI state by extracting and structuring GUI widgets and layouts, thereby interpreting the semantic information presented in the GUI. This information is then passed to the Decider, which makes scenario-driven decisions with the guidance of LLMs to identify target widgets and determine appropriate actions toward fulfilling specific testing goals. The Executor executes the decided operations on the app, while the Supervisor verifies whether the execution results align with the intended testing scenario completion, ensuring traceability and consistency in test generation and execution. Finally, the Recorder records the corresponding GUI operations into the context memory as a knowledge base for subsequent decision-making and concurrently monitors runtime bug occurrences.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ScenGen, a scenario-guided LLM-based GUI testing framework for mobile apps that uses a multi-agent collaboration mechanism (Observer, Decider, Executor, Supervisor, Recorder) to simulate manual testing phases. The approach leverages LLMs to interpret GUI semantics and contextual relevance to given testing scenarios, aiming to close the gap between general exploration-based automated testing and coverage of specific business logic functionalities.
Significance. If the framework can be shown through rigorous evaluation to reliably complete scenario-driven tests while mitigating LLM errors, it would advance automated GUI testing by aligning test generation more closely with app business logic rather than undirected exploration, potentially improving coverage of critical functionalities.
major comments (2)
- [Abstract] Abstract: The central claim that ScenGen automates manual testing phases via LLM-guided decisions rests on the Decider agent's ability to correctly map GUI semantics to scenario goals and select valid widgets/actions, yet the description supplies no details on prompt construction, few-shot examples, output parsing, or fallback mechanisms. A single misidentification would break scenario completion, and the Supervisor's verification is described only at high level with similar LLM dependence, leaving error rates and drift risks unaddressed.
- [Abstract] Abstract: The manuscript supplies no experimental results, validation data, error analysis, or ablation studies on decision accuracy. Without these, the claim that the multi-agent loop fulfills testing goals remains unverified, as the soundness assessment notes the absence of any empirical grounding for the framework's performance.
minor comments (1)
- The abstract is clearly written but could explicitly state the intended evaluation methodology (e.g., metrics for scenario completion rate or comparison baselines) to help readers assess the proposal's scope.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed comments. We address each major comment below and commit to a major revision that incorporates the requested details and empirical evaluation to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] Abstract: The central claim that ScenGen automates manual testing phases via LLM-guided decisions rests on the Decider agent's ability to correctly map GUI semantics to scenario goals and select valid widgets/actions, yet the description supplies no details on prompt construction, few-shot examples, output parsing, or fallback mechanisms. A single misidentification would break scenario completion, and the Supervisor's verification is described only at high level with similar LLM dependence, leaving error rates and drift risks unaddressed.
  Authors: We agree that the current high-level description leaves important implementation details unspecified. In the revised manuscript we will add a dedicated subsection under Methodology that specifies the prompt templates for the Decider and Supervisor, the few-shot examples used, the structured output parsing logic (including JSON schema enforcement), and the fallback strategies (e.g., re-prompting with error feedback or conservative default actions). We will also include a quantitative error analysis of decision accuracy and discuss mechanisms to detect and mitigate drift across the multi-agent loop.
  Revision: yes
- Referee: [Abstract] Abstract: The manuscript supplies no experimental results, validation data, error analysis, or ablation studies on decision accuracy. Without these, the claim that the multi-agent loop fulfills testing goals remains unverified, as the soundness assessment notes the absence of any empirical grounding for the framework's performance.
  Authors: The initial submission focused on presenting the ScenGen architecture and its alignment with manual testing phases. To address the lack of empirical grounding, the revised version will include a new Evaluation section reporting results on multiple open-source Android apps. This will contain scenario-completion rates, decision-accuracy metrics, error analysis of LLM misidentifications, and ablation studies that isolate the contribution of each agent and the memory mechanism.
  Revision: yes
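The structured output parsing and fallback strategies mentioned in the first response could look roughly like the sketch below. It is an assumption-laden illustration, not the authors' design: the `ask_llm` callable, the allowed action set, the JSON field names, and the choice of `back` as the conservative default are all hypothetical.

```python
import json

# Assumed action vocabulary; the paper does not list one.
ALLOWED_ACTIONS = {"tap", "input", "scroll", "back"}


def parse_decision(raw, visible_widgets):
    """Validate an LLM reply against a minimal hand-written schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("decision must be a JSON object")
    if data.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {data.get('action')!r}")
    if data.get("widget") not in visible_widgets:
        raise ValueError(f"widget not on screen: {data.get('widget')!r}")
    return data


def decide_with_fallback(ask_llm, prompt, visible_widgets, max_retries=2):
    """Re-prompt with error feedback, then fall back to a conservative default."""
    feedback = ""
    for _ in range(max_retries + 1):
        raw = ask_llm(prompt + feedback)
        try:
            return parse_decision(raw, visible_widgets)
        except ValueError as err:
            feedback = f"\nPrevious reply was rejected: {err}. Reply with JSON only."
    # Conservative default (an assumption): navigate back instead of guessing.
    return {"action": "back", "widget": None}


# Fake LLM for demonstration: first reply is free text, second is valid JSON.
replies = iter(["tap the sign in button", '{"action": "tap", "widget": "sign_in"}'])
decision = decide_with_fallback(
    lambda prompt: next(replies),
    "Scenario: log in",
    ["username", "password", "sign_in"],
)
```

Checking the chosen widget against the widgets the Observer actually reported is what catches a hallucinated target before the Executor acts on it.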
Circularity Check
No circularity: self-contained engineering framework proposal
The paper introduces ScenGen as a multi-agent LLM framework for scenario-guided mobile GUI testing, describing agent roles (Observer, Decider, Executor, Supervisor, Recorder) and their interactions at a high level. No equations, derivations, fitted parameters, or self-referential definitions appear. The central claim rests on LLM semantic comprehension and multi-agent collaboration inspired by manual testing processes, without reducing any result to its own inputs by construction or via load-bearing self-citations. The framework is presented as an independent engineering contribution with external benchmarks in mind (e.g., comparison to existing GUI testing approaches).
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Large language models can comprehend the semantics expressed in app GUIs and their contextual relevance to given testing scenarios.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: ScenGen integrates five agents. The Observer perceives the app GUI state... The Decider... makes scenario-driven decisions with the guidance of LLMs...
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: We use vision-based approaches to identify GUI widgets... computer vision algorithms, i.e., edge detection... OCR algorithms...
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Towards Automated Crowdsourced Testing via Personified-LLM
  PersonaTester uses LLMs guided by three-dimensional personas to replicate crowdworker testing patterns, yielding higher behavioral consistency, variability, and more bug detections than baseline LLM agents.
- WebMAC: A Multi-Agent Collaborative Framework for Scenario Testing of Web Systems
  WebMAC uses three specialized multi-agent modules to clarify test scenarios, partition them for adequacy, and generate executable scripts, yielding 30-60% higher success rates and 29% better efficiency than SOTA on fo...