PlayCoder: Making LLM-Generated GUI Code Playable
Pith reviewed 2026-05-10 01:58 UTC · model grok-4.3
The pith
LLMs generate GUI code that compiles, but almost none of it survives an interactive playthrough without logic errors
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
State-of-the-art LLMs produce GUI application code that compiles at high rates but contains logic errors that prevent complete interactive playthroughs across user action sequences. PlayEval and Play@k expose this gap by requiring end-to-end execution without state-transition or event-handling violations. PlayCoder addresses it through a repository-aware multi-agent loop that generates candidates, evaluates them via PlayTester for logic issues, and performs targeted repairs, yielding up to 20.3% Play@3 on the benchmark.
What carries the argument
PlayCoder, a multi-agent repository-aware framework that generates GUI code, evaluates it through PlayTester's task-oriented playthroughs, and iteratively repairs detected logic violations in a closed loop.
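The closed loop described above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: `generate` and `playtest` are stand-ins for PlayCoder's generator agents and for PlayTester, and their interfaces here are assumptions.

```python
def playcoder_loop(spec, generate, playtest, max_rounds=3):
    """Generate a candidate, playtest it, and feed detected violations
    back for targeted repair until the app is playable or rounds run out.
    `generate(spec, feedback)` and `playtest(code)` are hypothetical
    stand-ins for the paper's agents."""
    code, feedback = None, None
    for _ in range(max_rounds):
        code = generate(spec, feedback)        # fresh or repaired candidate
        playable, violations = playtest(code)  # task-oriented playthrough
        if playable:
            return code, True
        feedback = violations                  # repair signal for next round
    return code, False
```

The key design point carried by the paper is that the repair signal comes from executed playthroughs rather than from static tests, so silent state bugs can enter the feedback loop.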
If this is right
- GUI code generation requires evaluation on full interaction sequences and state transitions rather than compilation or static tests alone.
- Iterative multi-agent repair can fix silent logic bugs that traditional metrics overlook in generated applications.
- The approach improves both functional execution rates and semantic playability across open-source and closed-source models.
- Benchmarks focused on playable end-to-end flows better reflect real usability for event-driven software.
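To make the distinction concrete, here is a hypothetical toy example (not drawn from the paper's benchmark) of the failure class these points describe: every handler compiles and runs, yet a multi-event playthrough exposes a state-transition bug.

```python
class CounterGame:
    """Toy event-driven app. Each handler executes without errors, so
    compilation and single-event tests pass, but full playthroughs fail."""

    def __init__(self):
        self.state, self.score = "menu", 0

    def handle(self, event):
        if event == "start" and self.state == "menu":
            self.state = "playing"
        elif event == "point" and self.state == "playing":
            self.score += 1
        elif event == "restart":
            self.state = "menu"
            # BUG: score is never reset, so a restarted game carries
            # stale state -- only an end-to-end action sequence sees it.

def play(app, events):
    """Drive the app through a user action sequence."""
    for e in events:
        app.handle(e)
    return app
```

A sequence like start, point, restart, start leaves the new game with a nonzero score, which is exactly the kind of violation a playability metric catches and a pass/fail unit test typically does not.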
Where Pith is reading between the lines
- Similar interactive evaluation and repair loops could expose comparable hidden failures when LLMs generate other event-driven systems such as web services or simulations.
- Integrating automated playtesting agents into code-generation pipelines may become necessary for any user-facing interactive output.
- The near-zero baseline suggests training data for code LLMs contains few complete examples of bug-free GUI logic flows.
Load-bearing premise
PlayTester can reliably detect all relevant logic violations in GUI interaction flows without missing subtle state bugs or introducing its own errors.
What would settle it
A study in which human testers play through the same set of generated GUI samples, with their identified logical errors compared against PlayTester's automated detections to measure the agreement rate.
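Such a study reduces to standard agreement metrics over per-sample labels. A minimal sketch, assuming one boolean label per sample ("logic violation found") from each side:

```python
def agreement_metrics(human, auto):
    """Agreement rate plus precision/recall of automated detections
    against human labels. `human` and `auto` are parallel lists of
    booleans, True meaning a logic violation was found for that sample."""
    tp = sum(h and a for h, a in zip(human, auto))
    fp = sum((not h) and a for h, a in zip(human, auto))
    fn = sum(h and (not a) for h, a in zip(human, auto))
    agree = sum(h == a for h, a in zip(human, auto)) / len(human)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return agree, precision, recall
```

Low recall here would mean PlayTester misses real bugs (inflating Play@k); low precision would mean it flags spurious ones (deflating it).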
Original abstract
Large language models (LLMs) have achieved strong results in code generation, but their ability to generate GUI applications, especially games, remains insufficiently studied. Existing benchmarks mainly evaluate correctness through test cases, which are inadequate for GUI applications because these systems are interactive, event-driven, and require correct state transitions across sequences of user actions. Their evaluation therefore should consider interaction flows and UI logic rather than only pass/fail outcomes. To study this problem, we introduce PlayEval, a repository-aware benchmark built from 43 multilingual GUI applications in Python, TypeScript, and JavaScript. Unlike prior GUI benchmarks that are difficult to adapt to desktop environments, PlayEval covers six major GUI application categories and directly supports code-generation evaluation. We further propose Play@k, a metric that measures whether at least one of *k* generated candidates can be played end-to-end without logical errors. To support reliable evaluation, we develop PlayTester, an LLM-based agent that performs task-oriented GUI playthroughs and detects logic violations automatically. Experiments on 10 state-of-the-art code LLMs show that, despite high compilation rates, they achieve near-zero Play@3, revealing major weaknesses in generating logically correct GUI applications. To address this limitation, we present PlayCoder, a multi-agent, repository-aware framework that generates, evaluates, and iteratively repairs GUI application code in a closed loop. PlayCoder substantially improves both functional correctness and semantic alignment for open-source and closed-source models, reaching up to 38.1% Exec@3 and 20.3% Play@3. Case studies further show that it can uncover silent logic bugs missed by traditional metrics and fix them through targeted edits.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PlayEval, a benchmark of 43 multilingual GUI applications across six categories, along with the Play@k metric for end-to-end playability without logical errors. It shows that 10 state-of-the-art LLMs achieve high compilation rates but near-zero Play@3 scores. The authors propose PlayCoder, a multi-agent repository-aware framework that iteratively generates, evaluates, and repairs GUI code, reporting gains to 38.1% Exec@3 and 20.3% Play@3. PlayTester, an LLM-based agent, is used to automatically detect logic violations during task-oriented playthroughs.
Significance. If the evaluation holds, the work usefully highlights that compilation success is a poor proxy for functional correctness in interactive, stateful GUI applications and provides a concrete multi-agent repair loop that measurably improves semantic alignment. The repository-aware setting and the distinction between Exec@k and Play@k are valuable contributions to GUI code-generation research.
major comments (3)
- [PlayTester description and experimental evaluation] The headline Play@3 results (near-zero baseline, up to 20.3% with PlayCoder) are measured entirely by PlayTester, yet the manuscript reports no human agreement study, no precision/recall figures against manual playtesting, and no error analysis on state-transition detection. This directly affects the reliability of both the baseline failure rates and the claimed improvements.
- [PlayEval benchmark construction] The selection criteria and construction details for the 43 applications in PlayEval are not specified (e.g., how representativeness across the six categories was ensured or whether any filtering for complexity was applied). This makes it difficult to assess the generalizability of the near-zero Play@3 finding.
- [PlayCoder framework and iterative repair loop] The closed-loop repair process in PlayCoder depends on PlayTester feedback; without quantified reliability of that feedback, it is unclear whether the reported gains reflect genuine logic fixes or merely PlayTester's own biases or blind spots.
minor comments (2)
- [Metrics definition] Clarify the exact definition and implementation of Play@k (e.g., whether it requires successful completion of all tasks or allows partial success) and how it differs from Exec@k in the reported tables.
- [PlayCoder architecture] Provide more detail on the multi-agent roles and prompting strategies inside PlayCoder so that the framework can be reproduced.
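On the first minor comment, the abstract's description ("at least one of k generated candidates can be played end-to-end without logical errors") admits a simple reading that can be sketched directly. This is an assumed simplification; the paper may use an unbiased pass@k-style estimator instead.

```python
def at_k(results, k):
    """Fraction of apps where at least one of the first k candidates
    succeeds. `results` maps app name -> list of per-candidate booleans
    (True = candidate passed). With execution success as the criterion
    this reads as Exec@k; with full playthrough success, as Play@k."""
    hits = sum(any(cands[:k]) for cands in results.values())
    return hits / len(results)
```

Under this reading, Play@k <= Exec@k for the same candidates, since a playable candidate must also execute.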
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of evaluation reliability and benchmark transparency that we will address in the revision. We respond to each major comment below.
Point-by-point responses
Referee: [PlayTester description and experimental evaluation] The headline Play@3 results (near-zero baseline, up to 20.3% with PlayCoder) are measured entirely by PlayTester, yet the manuscript reports no human agreement study, no precision/recall figures against manual playtesting, and no error analysis on state-transition detection. This directly affects the reliability of both the baseline failure rates and the claimed improvements.
Authors: We agree that the absence of a human validation study for PlayTester is a limitation that affects confidence in the Play@3 metric. In the revised manuscript, we will add a human agreement study performed on a random subset of 10 applications from PlayEval. This study will report inter-rater agreement, precision, and recall of PlayTester's logic violation detections against manual playthroughs by two independent human evaluators. We will also include an error analysis categorizing false positives and negatives in state-transition detection. These additions will directly support the reliability of both baseline and PlayCoder results. revision: yes
Referee: [PlayEval benchmark construction] The selection criteria and construction details for the 43 applications in PlayEval are not specified (e.g., how representativeness across the six categories was ensured or whether any filtering for complexity was applied). This makes it difficult to assess the generalizability of the near-zero Play@3 finding.
Authors: We acknowledge the need for greater transparency in benchmark construction. PlayEval was assembled by curating publicly available open-source GUI repositories in Python, TypeScript, and JavaScript, stratified across the six categories to balance diversity in interaction patterns and domain. In the revision, we will add a new subsection (Section 3.1) that explicitly describes the selection process, inclusion criteria, steps taken to ensure category representativeness, and any complexity-based filtering applied. This will allow readers to better evaluate the generalizability of the near-zero baseline Play@3 results. revision: yes
Referee: [PlayCoder framework and iterative repair loop] The closed-loop repair process in PlayCoder depends on PlayTester feedback; without quantified reliability of that feedback, it is unclear whether the reported gains reflect genuine logic fixes or merely PlayTester's own biases or blind spots.
Authors: This concern is well-founded, as PlayCoder's iterative improvements rely on PlayTester signals. We will address it by incorporating the human agreement study and error analysis described in our response to the first comment; these will quantify PlayTester's reliability and help rule out systematic biases. In addition, the revision will expand the case studies to include manually verified examples of logic bugs that were correctly identified and repaired by the closed loop, providing concrete evidence that gains correspond to genuine fixes rather than artifacts of the tester. We believe these changes will clarify that the reported Play@3 improvements are substantive. revision: partial
Circularity Check
No significant circularity in the derivation chain
Full rationale
The paper defines PlayEval as an independent repository-aware benchmark from 43 GUI apps, introduces Play@k as a metric for end-to-end playable candidates, and develops PlayTester as an LLM agent for detection. PlayCoder is presented as a separate multi-agent repair framework whose outputs are then evaluated on these external definitions. No equations, self-definitions, or fitted parameters reduce the reported Play@3 gains to quantities constructed inside the same loop; the improvements are measured against benchmarks defined independently of the PlayCoder process itself.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption LLM-generated code can be iteratively improved by feeding execution feedback from an LLM-based tester back into the generator.
- domain assumption An LLM agent can perform reliable task-oriented GUI playthroughs and detect logic violations automatically.
invented entities (2)
- PlayTester: no independent evidence
- PlayCoder: no independent evidence