pith. machine review for the scientific record.

arxiv: 2604.19742 · v1 · submitted 2026-04-21 · 💻 cs.SE

Recognition: unknown

PlayCoder: Making LLM-Generated GUI Code Playable

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 01:58 UTC · model grok-4.3

classification 💻 cs.SE
keywords GUI code generation · LLM code repair · interactive applications · PlayEval benchmark · Play@k metric · PlayTester agent · logic errors · multi-agent framework

The pith

LLM-generated GUI code compiles at high rates, but almost none of it plays through interactively without logic errors

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard benchmarks for LLM code generation overlook the interactive demands of GUI applications, where correctness depends on proper sequences of events and state changes rather than isolated pass/fail tests. It introduces PlayEval, a benchmark of 43 real multilingual GUI apps, and the Play@k metric that requires at least one generated sample to complete an end-to-end playthrough without logical violations. Experiments across ten leading models show high compilation rates yet near-zero Play@3 scores. The authors then present PlayCoder, a multi-agent system that generates code, uses an LLM agent called PlayTester to simulate play and find bugs, and iteratively repairs the code in a closed loop, lifting performance to 38.1 percent Exec@3 and 20.3 percent Play@3.
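
The abstract states only that Play@k requires at least one of k candidates to be playable end-to-end; it does not spell out the estimator. A minimal sketch, assuming Play@k follows the standard unbiased pass@k-style estimator with "passes the tests" replaced by "completes a playthrough without logic violations" (the function name and estimator choice here are illustrative, not the paper's stated formula):

```python
from math import comb

def play_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k-style estimator applied to playability.

    n: total candidates generated for a task
    c: candidates that complete an end-to-end playthrough
       without logic violations (as judged by PlayTester)
    k: budget of samples credited to the model

    Returns the probability that at least one of k randomly drawn
    candidates is playable. This is an assumption about the metric's
    form; the paper only requires one of k candidates to be playable.
    """
    if n - c < k:
        return 1.0  # fewer non-playable candidates than draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 candidates generated, 1 playable -> play_at_k(3, 1, 3) == 1.0
```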

Core claim

State-of-the-art LLMs produce GUI application code that compiles at high rates but contains logic errors that prevent complete interactive playthroughs across user action sequences. PlayEval and Play@k expose this gap by requiring end-to-end execution without state-transition or event-handling violations. PlayCoder addresses it through a repository-aware multi-agent loop that generates candidates, evaluates them via PlayTester for logic issues, and performs targeted repairs, yielding up to 20.3 percent Play@3 on the benchmark.

What carries the argument

PlayCoder, a multi-agent repository-aware framework that generates GUI code, evaluates it through PlayTester's task-oriented playthroughs, and iteratively repairs detected logic violations in a closed loop.
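
As a reading aid, the closed loop described above can be sketched in pseudocode. The agent interfaces below (generator, playtester, refiner) are hypothetical stand-ins for the paper's components, not its actual API; the real system is described as repository-aware and applies compilation/runtime checks before each re-evaluation.

```python
from dataclasses import dataclass

@dataclass
class PlaytestReport:
    playable: bool          # did the candidate survive an end-to-end playthrough?
    violations: list[str]   # e.g. "collision does not end the game"
    trace: str              # execution / interaction trace fed back to the repairer

def playcoder_loop(spec: str, generator, playtester, refiner,
                   max_rounds: int = 3) -> str:
    """Hedged sketch of the generate -> playtest -> repair loop
    attributed to PlayCoder; callables are illustrative stand-ins."""
    code = generator(spec)                               # (1) candidate generation
    for _ in range(max_rounds):
        report: PlaytestReport = playtester(code, spec)  # (2) task-oriented playthrough
        if report.playable:                              # no logic violations detected
            break
        # (3) diagnosis & repair using the tester's feedback and traces
        code = refiner(code, report.violations, report.trace)
        # (4) iterative feedback: the patched application is re-evaluated next round
    return code
```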

If this is right

  • GUI code generation requires evaluation on full interaction sequences and state transitions rather than compilation or static tests alone.
  • Iterative multi-agent repair can fix silent logic bugs that traditional metrics overlook in generated applications.
  • The approach improves both functional execution rates and semantic playability across open-source and closed-source models.
  • Benchmarks focused on playable end-to-end flows better reflect real usability for event-driven software.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar interactive evaluation and repair loops could expose comparable hidden failures when LLMs generate other event-driven systems such as web services or simulations.
  • Integrating automated playtesting agents into code-generation pipelines may become necessary for any user-facing interactive output.
  • The near-zero baseline suggests training data for code LLMs contains few complete examples of bug-free GUI logic flows.

Load-bearing premise

PlayTester can reliably detect all relevant logic violations in GUI interaction flows without missing subtle state bugs or introducing its own errors.

What would settle it

Human testers playing through the same set of generated GUI samples and comparing their identified logical errors against PlayTester's automated detections for agreement rate.
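
A minimal sketch of how such a study could be scored, assuming binary per-sample verdicts ("contains a logic violation") from both the human testers and PlayTester; the protocol, labels, and metrics here are illustrative, not taken from the paper.

```python
def agreement_metrics(human: list[bool], tester: list[bool]) -> dict:
    """Compare human verdicts against PlayTester's automated verdicts.

    Binary per-sample labels are an assumption; the paper does not
    specify a validation protocol for PlayTester.
    """
    assert len(human) == len(tester)
    tp = sum(h and t for h, t in zip(human, tester))          # both flag a violation
    fp = sum((not h) and t for h, t in zip(human, tester))    # tester flags, human does not
    fn = sum(h and (not t) for h, t in zip(human, tester))    # human flags, tester misses
    tn = sum((not h) and (not t) for h, t in zip(human, tester))
    n = len(human)
    agreement = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Cohen's kappa corrects raw agreement for chance-level agreement
    p_e = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    kappa = (agreement - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return {"agreement": agreement, "precision": precision,
            "recall": recall, "kappa": kappa}
```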

Figures

Figures reproduced from arXiv: 2604.19742 by Chenhao Ying, Wei Tao, Xin Yin, Yiwen Guo, Yuan Luo, Zhiyuan Peng.

Figure 1
Figure 1: Flappy Bird generated by GPT-4o-mini and a human programmer. Top-right: code written by a human programmer, where collision correctly kills the bird and ends the game. Bottom-right: code generated by GPT-4o-mini, where the bird can pass through the pipe, which is a critical logic flaw. Challenge 1: Testing Dilemma for GUI Application Code Generation. Traditional evaluation of code emphasizes compilation … view at source ↗
Figure 2
Figure 2: The structure of PlayEval Data. …reflecting focused algorithmic implementations, while the overall distribution ranges from simple utilities to sophisticated emulation systems. Nesting and Control Flow: The benchmark exhibits an average nesting depth of 11.0 levels, with 107 files whose per-file maximum nesting depth exceeds 20 levels, creating substantial structural complexity. Control-flow analysis reveal… view at source ↗
Figure 3
Figure 3: Testing a 2048 implementation. Left: the rendered 4 × 4 grid shows tiles at (3,1), (3,4), and (4,4) with values 2, 2, and 4. Right: the agent's structured reasoning process including state analysis, coverage assessment, verification protocols, strategy selection, and exception-aware considerations that lead to a recommended rightward swipe. Key mechanics: swipe responsiveness, merge algorithm correctness, … view at source ↗
Figure 4
Figure 4: The overview of PlayCoder. (3) Diagnosis & Repair. PlayRefiner analyzes execution traces and testing feedback from the behavioral testing modules, synthesizes patches with repository context, and applies fixes with compilation/runtime checks. (4) Iterative Feedback. The updated application is re-evaluated through automated behavioral testing, checking behavior against specifications, including interactive … view at source ↗
Figure 5
Figure 5: Case study of PlayCoder testing a 2048 game. In the depicted scenario, the system recognizes that a rightward swipe (→) operation serves dual objectives: (1) implementing corner strategy optimization by merging the two 2-tiles into a 4-tile at position r3c4, and (2) validating critical game mechanics including swipe responsiveness, tile merger algorithms, and score calculation accuracy. This approach ensu… view at source ↗
read the original abstract

Large language models (LLMs) have achieved strong results in code generation, but their ability to generate GUI applications, especially games, remains insufficiently studied. Existing benchmarks mainly evaluate correctness through test cases, which are inadequate for GUI applications because these systems are interactive, event-driven, and require correct state transitions across sequences of user actions. Their evaluation therefore should consider interaction flows and UI logic rather than only pass/fail outcomes. To study this problem, we introduce PlayEval, a repository-aware benchmark built from 43 multilingual GUI applications in Python, TypeScript, and JavaScript. Unlike prior GUI benchmarks that are difficult to adapt to desktop environments, PlayEval covers six major GUI application categories and directly supports code-generation evaluation. We further propose Play@k, a metric that measures whether at least one of *k* generated candidates can be played end-to-end without logical errors. To support reliable evaluation, we develop PlayTester, an LLM-based agent that performs task-oriented GUI playthroughs and detects logic violations automatically. Experiments on 10 state-of-the-art code LLMs show that, despite high compilation rates, they achieve near-zero Play@3, revealing major weaknesses in generating logically correct GUI applications. To address this limitation, we present PlayCoder, a multi-agent, repository-aware framework that generates, evaluates, and iteratively repairs GUI application code in a closed loop. PlayCoder substantially improves both functional correctness and semantic alignment for open-source and closed-source models, reaching up to 38.1% Exec@3 and 20.3% Play@3. Case studies further show that it can uncover silent logic bugs missed by traditional metrics and fix them through targeted edits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces PlayEval, a benchmark of 43 multilingual GUI applications across six categories, along with the Play@k metric for end-to-end playability without logical errors. It shows that 10 state-of-the-art LLMs achieve high compilation rates but near-zero Play@3 scores. The authors propose PlayCoder, a multi-agent repository-aware framework that iteratively generates, evaluates, and repairs GUI code, reporting gains to 38.1% Exec@3 and 20.3% Play@3. PlayTester, an LLM-based agent, is used to automatically detect logic violations during task-oriented playthroughs.

Significance. If the evaluation holds, the work usefully highlights that compilation success is a poor proxy for functional correctness in interactive, stateful GUI applications and provides a concrete multi-agent repair loop that measurably improves semantic alignment. The repository-aware setting and the distinction between Exec@k and Play@k are valuable contributions to GUI code-generation research.

major comments (3)
  1. [PlayTester description and experimental evaluation] The headline Play@3 results (near-zero baseline, up to 20.3% with PlayCoder) are measured entirely by PlayTester, yet the manuscript reports no human agreement study, no precision/recall figures against manual playtesting, and no error analysis on state-transition detection. This directly affects the reliability of both the baseline failure rates and the claimed improvements.
  2. [PlayEval benchmark construction] The selection criteria and construction details for the 43 applications in PlayEval are not specified (e.g., how representativeness across the six categories was ensured or whether any filtering for complexity was applied). This makes it difficult to assess the generalizability of the near-zero Play@3 finding.
  3. [PlayCoder framework and iterative repair loop] The closed-loop repair process in PlayCoder depends on PlayTester feedback; without quantified reliability of that feedback, it is unclear whether the reported gains reflect genuine logic fixes or merely PlayTester's own biases or blind spots.
minor comments (2)
  1. [Metrics definition] Clarify the exact definition and implementation of Play@k (e.g., whether it requires successful completion of all tasks or allows partial success) and how it differs from Exec@k in the reported tables.
  2. [PlayCoder architecture] Provide more detail on the multi-agent roles and prompting strategies inside PlayCoder so that the framework can be reproduced.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of evaluation reliability and benchmark transparency that we will address in the revision. We respond to each major comment below.

read point-by-point responses
  1. Referee: [PlayTester description and experimental evaluation] The headline Play@3 results (near-zero baseline, up to 20.3% with PlayCoder) are measured entirely by PlayTester, yet the manuscript reports no human agreement study, no precision/recall figures against manual playtesting, and no error analysis on state-transition detection. This directly affects the reliability of both the baseline failure rates and the claimed improvements.

    Authors: We agree that the absence of a human validation study for PlayTester is a limitation that affects confidence in the Play@3 metric. In the revised manuscript, we will add a human agreement study performed on a random subset of 10 applications from PlayEval. This study will report inter-rater agreement, precision, and recall of PlayTester's logic violation detections against manual playthroughs by two independent human evaluators. We will also include an error analysis categorizing false positives and negatives in state-transition detection. These additions will directly support the reliability of both baseline and PlayCoder results. revision: yes

  2. Referee: [PlayEval benchmark construction] The selection criteria and construction details for the 43 applications in PlayEval are not specified (e.g., how representativeness across the six categories was ensured or whether any filtering for complexity was applied). This makes it difficult to assess the generalizability of the near-zero Play@3 finding.

    Authors: We acknowledge the need for greater transparency in benchmark construction. PlayEval was assembled by curating publicly available open-source GUI repositories in Python, TypeScript, and JavaScript, stratified across the six categories to balance diversity in interaction patterns and domain. In the revision, we will add a new subsection (Section 3.1) that explicitly describes the selection process, inclusion criteria, steps taken to ensure category representativeness, and any complexity-based filtering applied. This will allow readers to better evaluate the generalizability of the near-zero baseline Play@3 results. revision: yes

  3. Referee: [PlayCoder framework and iterative repair loop] The closed-loop repair process in PlayCoder depends on PlayTester feedback; without quantified reliability of that feedback, it is unclear whether the reported gains reflect genuine logic fixes or merely PlayTester's own biases or blind spots.

    Authors: This concern is well-founded, as PlayCoder's iterative improvements rely on PlayTester signals. We will address it by incorporating the human agreement study and error analysis described in our response to the first comment; these will quantify PlayTester's reliability and help rule out systematic biases. In addition, the revision will expand the case studies to include manually verified examples of logic bugs that were correctly identified and repaired by the closed loop, providing concrete evidence that gains correspond to genuine fixes rather than artifacts of the tester. We believe these changes will clarify that the reported Play@3 improvements are substantive. revision: partial

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper defines PlayEval as an independent repository-aware benchmark from 43 GUI apps, introduces Play@k as a metric for end-to-end playable candidates, and develops PlayTester as an LLM agent for detection. PlayCoder is presented as a separate multi-agent repair framework whose outputs are then evaluated on these external definitions. No equations, self-definitions, or fitted parameters reduce the reported Play@3 gains to quantities constructed inside the same loop; the improvements are measured against benchmarks defined independently of the PlayCoder process itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The work rests on standard assumptions about LLM capabilities and the feasibility of agent-based testing rather than new physical or mathematical axioms. No free parameters are explicitly fitted in the abstract; the invented components are engineering artifacts rather than postulated entities.

axioms (2)
  • domain assumption LLM-generated code can be iteratively improved by feeding execution feedback from an LLM-based tester back into the generator.
    Invoked in the description of the PlayCoder closed loop.
  • domain assumption An LLM agent can perform reliable task-oriented GUI playthroughs and detect logic violations automatically.
    Central to PlayTester and the evaluation claims.
invented entities (2)
  • PlayTester no independent evidence
    purpose: LLM-based agent that executes interactive playthroughs and detects logic errors in GUI code.
    New component introduced to enable the Play@k metric; no independent falsifiable prediction outside the paper.
  • PlayCoder no independent evidence
    purpose: Multi-agent framework for generation, evaluation, and repair of GUI code.
    Core proposed system; engineering construct without external validation handle.

pith-pipeline@v0.9.0 · 5613 in / 1719 out tokens · 30373 ms · 2026-05-10T01:58:42.896534+00:00 · methodology

discussion (0)

